
arXiv:2005.10855v1 [cs.NI] 21 May 2020

Modeling and Optimization of Latency in Erasure-coded Storage Systems

Vaneet Aggarwal, Purdue University
Tian Lan, George Washington University
Contents

1 Introduction
  1.1 Erasure Coding in Distributed Storage
  1.2 Key Challenges in Latency Characterization
  1.3 Problem Taxonomy
  1.4 Outline of the Monograph
  1.5 Notes

2 MDS-Reservation Scheduling Approach
  2.1 MDS-Reservation Queue
  2.2 Characterization of Latency Upper Bound via MDS-Reservation Scheduling
  2.3 Characterization of Latency Lower Bound
  2.4 Extension to Redundant Requests
  2.5 Simulations
  2.6 Notes and Open Problems

3 Fork-Join Scheduling Approach
  3.1 Fork-Join Scheduling
  3.2 Characterization of Latency
  3.3 Extension to General Service Time Distributions
  3.4 Extension to Heterogeneous Systems
  3.5 Simulations
  3.6 Notes and Open Problems

4 Probabilistic Scheduling Approach
  4.1 Probabilistic Scheduling
  4.2 Characterization of Mean Latency
  4.3 Characterization of Tail Latency
  4.4 Characterization of Asymptotic Latency
  4.5 Proof of Asymptotic Optimality for Heavy Tailed Service Rates
  4.6 Simulations
  4.7 Notes and Open Problems

5 Delayed-Relaunch Scheduling Approach
  5.1 Delayed-Relaunch Scheduling
  5.2 Characterization of Inter-Service Times of Different Chunks for Single Job
  5.3 Characterization of Mean Service Completion Time and Mean Server Utilization Cost for Single Job
  5.4 Simulations
  5.5 Notes and Open Problems

6 Analyzing Latency for Video Content
  6.1 Modeling Stall Duration for Video Requests
  6.2 Modeling Download and Play Times
  6.3 Characterization of Mean Stall Duration
  6.4 Characterization of Tail Stall Duration
  6.5 Simulations
  6.6 Notes and Open Problems

7 Lessons from prototype implementation
  7.1 Exemplary implementation of erasure-coded storage
  7.2 Illuminating key design tradeoffs
  7.3 Applications in Caching and Content Distribution

References
Modeling and Optimization of
Latency in Erasure-coded Storage
Systems
Vaneet Aggarwal (Purdue University) and Tian Lan (George Washington University)

ABSTRACT
As consumers are increasingly engaged in social networking and E-commerce activities, businesses grow to rely on Big Data analytics for intelligence, and traditional IT infrastructures continue to migrate to the cloud and edge, these trends cause distributed data storage demand to rise at an unprecedented speed. Erasure coding has quickly emerged as a promising technique to reduce storage cost while providing reliability similar to replicated systems, and has been widely adopted by companies like Facebook, Microsoft, and Google. However, it also brings new challenges in characterizing and optimizing the access latency when erasure codes are used in distributed storage. The aim of this monograph is to provide a review of recent progress (both theoretical and practical) on systems that employ erasure codes for distributed storage.
In this monograph, we will first identify the key challenges and taxonomy of the research problems, and then give an overview of different approaches that have been developed to quantify and model latency of erasure-coded storage. This includes recent work leveraging MDS-Reservation, Fork-Join, Probabilistic, and Delayed-Relaunch scheduling policies, as well as their applications to characterize access latency (e.g., mean, tail, and asymptotic latency) of erasure-coded distributed storage systems. We will also extend the problem to the case where users stream videos from erasure-coded distributed storage systems. Next, we bridge the gap between theory and practice, and discuss lessons learned from prototype implementation. In particular, we will discuss exemplary implementations of erasure-coded storage, illuminate key design degrees of freedom and tradeoffs, and summarize remaining challenges in real-world storage systems such as content delivery and caching. Open problems for future research are discussed at the end of each chapter.
1 Introduction

In this chapter, we introduce the problem in Section 1.1, followed by the key challenges in Section 1.2. Section 1.3 explains the different approaches to the problem considered in this monograph. Section 1.4 gives the outline of the remaining chapters, and Section 1.5 provides additional notes.

1.1 Erasure Coding in Distributed Storage

Distributed systems such as Hadoop, AT&T Cloud Storage, Google File System and Windows Azure have evolved to support different
types of erasure codes, in order to achieve the benefits of improved
storage efficiency while providing the same reliability as replication-
based schemes (Balaji et al., 2018). Various erasure code plug-ins and
libraries have been developed in storage systems like Ceph (Weil et al.,
2006; Aggarwal et al., 2017a), Tahoe (Xiang et al., 2016), Quantcast
(QFS) (Ovsiannikov et al., 2013), and Hadoop (HDFS) (Rashmi et al.,
2014).
We consider a data center consisting of m heterogeneous servers, denoted by M = {1, 2, . . . , m}, called storage nodes (we will use storage nodes and storage servers interchangeably throughout this monograph). To distributively
store a set of r files, indexed by i = 1, . . . , r, we partition each file i
into ki fixed-size chunks2 and then encode it using an (ni , ki ) MDS
erasure code to generate ni distinct chunks of the same size for file i.
The encoded chunks are assigned to and stored on ni distinct storage
nodes to store file i. Therefore, each chunk is placed on a different node
to provide high reliability in the event of node or network failures.
The use of (ni , ki ) MDS erasure code allows the file to be recon-
structed from any subset of ki -out-of-ni chunks, whereas it also intro-
duces a redundancy factor of ni /ki . For known erasure coding and chunk
placement, we shall now describe a queueing model of the distributed
storage system. We assume that the arrival of client requests for each
file i form an independent Poisson process with a known rate λi . We
consider chunk service time Xj of node j with arbitrary distributions,
whose statistics can be obtained inferred from existing work on network
delay (Abdelkefi and Jiang, 2011; Weatherspoon and Kubiatowicz, 2002)
and file-size distribution (Downey, 2001; Paganini et al., 2012). We note
that even though exponential service time distribution is common, real-
istic implementation in storage systems show that this is not a practical
assumption (Chen et al., 2014b; Xiang et al., 2016), where Amazon
S3 and Tahoe storage systems are considered. Both these works points
towards shifted exponential service times being a better approximation
(an example service time distribution from realistic system is depicted
in Fig. 7.2), while otherhdistributions
i may be used for better approx-
imation. Let Zj (τ ) = E e τ Xj be the moment generating function of
Xj . Under MDS codes, each file i can be retrieved from any ki distinct
nodes that store the file chunks. We model this by treating each file
request as a batch of ki chunk requests, so that a file request is served
when all ki chunk requests in the batch are processed by distinct storage
nodes. All requests are buffered in a common queue of infinite capacity.
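For concreteness, the following short Python sketch (with hypothetical parameter values; here β denotes the shift and α the rate of the exponential tail, an assumed reading of the Sexp(β, α) notation used later) samples shifted-exponential service times and checks the moment generating function Zj(τ) = E[e^{τ Xj}] = e^{τβ} α/(α − τ), valid for τ < α, against an empirical average.

import numpy as np

def sample_shifted_exp(beta, alpha, size, rng):
    # Shifted exponential service time: constant shift beta plus an Exponential(alpha) tail.
    return beta + rng.exponential(scale=1.0 / alpha, size=size)

def mgf_shifted_exp(tau, beta, alpha):
    # Z(tau) = E[exp(tau * X)] = exp(tau * beta) * alpha / (alpha - tau), finite only for tau < alpha.
    assert tau < alpha, "MGF diverges for tau >= alpha"
    return np.exp(tau * beta) * alpha / (alpha - tau)

rng = np.random.default_rng(0)
beta, alpha = 0.1, 2.0                 # hypothetical shift (seconds) and rate (1/second)
x = sample_shifted_exp(beta, alpha, size=200_000, rng=rng)
tau = 0.5
print("empirical E[exp(tau X)]:", np.exp(tau * x).mean())
print("closed-form Z(tau)     :", mgf_shifted_exp(tau, beta, alpha))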
We now introduce the definition of MDS queues according to the system model.

2 While we make the assumption of fixed chunk size here to simplify the problem formulation, the results can be easily extended to variable chunk sizes. Nevertheless, fixed chunk sizes are indeed used by many existing storage systems (Dimakis et al., 2004; Aguilera et al., 2005; Lv et al., 2002).

Definition 1.1. An MDS queue is associated with four sets of parameters {m, r}, {(ni, ki) : i = 1, 2, . . . , r}, {λi : i = 1, 2, . . . , r}, and {µj : j = 1, 2, . . . , m}, satisfying: i) there are m servers and r files; ii) file-i requests arrive in batches of ki chunk requests each; iii) each batch of ki chunk requests can be processed by any subset of ki out of ni distinct servers; iv) these batches arrive as a Poisson process with rate λi; and v) the service time for a chunk request at server j is random, follows some known distribution with rate µj, and is independent of the arrival and service times of all other requests.
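As a minimal illustration of assumptions ii) and iv) of Definition 1.1, the sketch below (Python, with made-up rates) generates the batch arrival process: for each file i, inter-arrival times are i.i.d. exponential with rate λi, and every arrival brings a batch of ki chunk requests.

import numpy as np

def poisson_batch_arrivals(lambdas, ks, horizon, rng):
    # Return a time-sorted list of (arrival_time, file_index, batch_size) events.
    events = []
    for i, (lam, k) in enumerate(zip(lambdas, ks)):
        t = 0.0
        while True:
            t += rng.exponential(1.0 / lam)   # Poisson process for file i: Exp(lambda_i) gaps
            if t > horizon:
                break
            events.append((t, i, k))          # each arrival is a batch of k_i chunk requests
    return sorted(events)

rng = np.random.default_rng(1)
# Hypothetical example: r = 2 files with rates 0.5 and 0.3 requests/second and k_i = 2.
arrivals = poisson_batch_arrivals([0.5, 0.3], [2, 2], horizon=100.0, rng=rng)
print(len(arrivals), "batches; first three:", arrivals[:3])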

1.2 Key Challenges in Latency Characterization

An exact analysis of this MDS queue model is known to be an open problem. The main challenge comes from the fact that, since each file request needs to be served by k distinct servers, a Markov-chain representation of the MDS queue must encapsulate not only the number of file and chunk requests waiting in the shared buffer, but also the processing history of each active file request, in order to meet this requirement in future schedules. Let b be the number of current file requests in the system, which can take values in {0, 1, 2, . . .}. The Markov-chain representation could have Ω(b^k) states, which becomes infinite in at least k dimensions (Lee et al., 2017). This is extremely difficult to analyze as the transitions along different dimensions are tightly coupled in MDS queues.
The challenge can be illustrated by an abstracted example shown
in Fig. 1.1. We consider two files, each partitioned into k = 2 blocks of
equal size and encoded using maximum distance separable (MDS) codes.
Under an (n, k) MDS code, a file is encoded and stored in n storage
nodes such that the chunks stored in any k of these n nodes suffice to
recover the entire file. There is a centralized scheduler that buffers and
schedules all incoming requests. For instance, a request to retrieve file
A can be completed after it is successfully processed by 2 distinct nodes
chosen from {1, 2, 3, 4} where desired chunks of A are available. Due to
shared storage nodes and joint request scheduling, the delay performances of the files are highly correlated and are collectively determined by control variables of both files over three dimensions: (i) the scheduling policy that decides which request in the buffer to process when a node becomes available, (ii) the placement of file chunks over distributed storage nodes, and (iii) erasure coding parameters that decide how many chunks are created. The latency performances of different files are tightly entangled. While increasing the erasure code length of file B allows it to be placed on more storage nodes, potentially leading to smaller latency (because of improved load-balancing) at the price of higher storage cost, it inevitably affects the service latency of file A due to the resulting contention and interference on more shared nodes.

Figure 1.1: An erasure-coded storage of 2 files, which are partitioned into 2 blocks each and encoded using (4, 2) and (3, 2) MDS codes, respectively. The resulting file chunks are spread over 5 storage nodes. Any file request must be processed by 2 distinct nodes that have the desired chunks. Nodes 3 and 4 are shared and can process requests for both files.
In Figure 1.1, files A and B are encoded using (4, 2) and (3, 2) MDS codes, respectively, so file A has chunks A1, A2, A3 and A4, and file B has chunks B1, B2 and B3. As depicted in Fig. 1.2, each file request comes in as a batch of ki = 2 chunk requests, e.g., (R_1^{A,1}, R_1^{A,2}), (R_2^{A,1}, R_2^{A,2}), and (R_1^{B,1}, R_1^{B,2}), where R_i^{A,j} denotes the ith request of file A, and j = 1, 2 denotes the first or second chunk request of this file request. Denote the five nodes (from left to right) as servers 1, 2, 3, 4, and 5, and initialize 4 file requests for file A and 3 file requests for file B, i.e., requests for the different files have different arrival rates. The two chunks of one file request can be any two different chunks from A1, A2, A3 and A4 for file A and B1, B2 and B3 for file B. Due to the chunk placement in the example, any 2 chunk requests in file A's batch must be processed by 2 distinct nodes from {1, 2, 3, 4}, while 2 chunk requests in file B's batch must be served by 2 distinct nodes from {3, 4, 5}. Suppose that the system is now in a state depicted by Fig. 1.2, wherein the chunk requests R_1^{A,1}, R_2^{A,1}, R_1^{A,2}, R_1^{B,1}, and R_2^{B,2} are served by the 5 storage nodes, and there are 9 more chunk requests buffered in the queue. Suppose that node 2 completes serving chunk request R_2^{A,1} and is now free to serve another request waiting in the queue. Since node 2 has already served a chunk request of batch (R_2^{A,1}, R_2^{A,2}) and node 2 does not host any chunk for file B, it is not allowed to serve either R_2^{A,2} or R_2^{B,j}, R_3^{B,j} where j = 1, 2 in the queue. One of the valid requests, R_3^{A,j} or R_4^{A,j}, will be selected by a scheduling algorithm and assigned to node 2. We denote the scheduling policy that minimizes average expected latency in such a queuing model as optimal scheduling.
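The any-k-of-n recovery property behind this example can be mimicked with a toy (4, 2) code over the reals, whose coded chunks a1, a2, a1 + a2, a1 + 2a2 match Fig. 1.1; any 2 of the 4 chunks recover (a1, a2) because every 2 × 2 submatrix of the generator below is invertible. This is only an illustrative sketch; production systems use Reed-Solomon codes over finite fields.

import itertools
import numpy as np

# Generator of the toy (4, 2) code of Fig. 1.1: rows produce a1, a2, a1 + a2, a1 + 2*a2.
G = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, 2.0]])

data = np.array([3.0, 5.0])          # (a1, a2): the two data chunks (scalars for simplicity)
coded = G @ data                     # four coded chunks, stored on four distinct nodes

for nodes in itertools.combinations(range(4), 2):
    # Any k = 2 surviving chunks suffice: solve the 2x2 system G[nodes] x = coded[nodes].
    recovered = np.linalg.solve(G[list(nodes)], coded[list(nodes)])
    assert np.allclose(recovered, data)
print("(a1, a2) recovered from every 2-subset of the 4 coded chunks")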
Figure 1.2: Functioning of an optimal scheduling policy: (a) MDS scheduling; (b) Probabilistic scheduling.

Definition 1.2. (Optimal scheduling) An optimal scheduling policy (i) buffers all requests in a queue of infinite capacity; (ii) assigns at most 1 chunk request from a batch to each appropriate node; and (iii) schedules requests to minimize average latency if multiple choices are available.

An exact analysis of optimal scheduling is extremely difficult. Even for given erasure codes and chunk placement, it is unclear what scheduling policy leads to the minimum average latency of multiple heterogeneous files. For example, when a shared storage node becomes free, one could schedule either the earliest valid request in the queue or the request with the scarcest availability, leading to different implications on average latency.

1.3 Problem Taxonomy

Given that optimal scheduling is hard to evaluate, many scheduling strategies that aim to provide latency bounds have been explored. This monograph aims to provide a detailed description of these strategies and of the latency analysis for each of them. The different scheduling strategies considered in this monograph are:

1. MDS-Reservation Scheduling: In this approach, for a given parameter t, arriving file requests are added to the buffer. Each server, on finishing its current task, goes through the buffer in order to find a task it can process (i.e., a batch from which it has not yet processed a task), considering only the first t waiting requests. A file request beyond the first t (i.e., request i ≥ t + 1) can move ahead in the buffer only when all k of its chunk requests can move forward together, and the request is then assigned to the servers where they can move forward together.

2. Fork-Join Scheduling: In this approach, the file request for file i is sent to all ni servers, and the request is complete when ki coded chunks are received. The remaining ni − ki requests are then cancelled.

3. Probabilistic Scheduling: In this approach, each file request for file i is randomly dispatched to ki out of the ni storage nodes that store the coded chunks of the file.

4. Delayed Relaunch Scheduling: In this approach, the request for file i is first sent to ni,0 servers using probabilistic scheduling; when the request has been completed by ℓi,0 of these servers, the request is sent to the remaining ni − ni,0 servers. After ki chunks are received, the request is cancelled at the remaining ni − ki servers.

Even though the implementation of these scheduling strategies in simulators and practical systems may be straightforward, the analysis and optimization of latency are quite the opposite. In many cases, only bounds and asymptotic guarantees can be provided using these queue models. Further, a number of assumptions are commonly considered in the literature to ensure tractability of the analysis. These assumptions include homogeneous files (i.e., all files are of equal size and encoded using the same (n, k) MDS code); homogeneous placement (i.e., there are n servers and each file has exactly one chunk placed on each server); homogeneous servers (i.e., all servers have i.i.d. service times with mean µ); and exponential service time distribution (i.e., all servers have exponentially distributed service times). We summarize the assumptions made by the different strategies considered in each chapter in Table 1.1.

                            MDS-Reservation   Fork-Join   Probabilistic   Delayed Relaunch
Homogeneous Files                 Yes             No           No              Yes
Homogeneous Placement             Yes             Yes          No              Yes
Homogeneous Servers               Yes             Yes          No              Yes
Exponential Service Time          Yes             No           No              No3

3 Queueing analysis is not applicable to delayed relaunch.

Table 1.1: Assumptions considered in the analysis of different scheduling strategies.

We compare the different strategies for different service distributions at high arrival rates. We consider a single file request, r = 1, and thus the index i for files is suppressed. Also, we assume that all m = n servers are homogeneous. If the service times are deterministic, fork-join scheduling sends the request to all n servers, which finish at the same time. Thus, the strategy wastes unnecessary time at the n − k servers, leading to a non-optimal stability region. In contrast, probabilistic scheduling can use uniform probabilities for selecting the different servers, and can be shown to achieve the optimal stability region. Further, delayed relaunch scheduling has probabilistic scheduling as a special case with n0 = ℓ0 = k, and thus can achieve the optimal stability region. For MDS-Reservation scheduling, unless n is a multiple of k, there will be wasted time at some servers, and thus it will not have the optimal stability region. For exponential service times, MDS-Reservation scheduling would hold back request scheduling at certain servers, and thus not achieve the optimal stability region. The other strategies will have the optimal stability region. Thus, probabilistic scheduling and delayed relaunch scheduling are optimal in terms of stability region for both service distributions, and can indeed be shown to achieve the optimal stability region for general service distributions.
We further note that delayed relaunch scheduling has fork-join scheduling as a special case when ni,0 = ni and ℓi,0 = ki, and probabilistic scheduling as a special case when ni,0 = ℓi,0 = ki, and thus gives a more tunable approach than the two scheduling approaches.

                                        MDS-Reservation   Fork-Join     Probabilistic   Delayed Relaunch
Optimal Stability Region (homogeneous)        No          Exponential    General          General
Queueing Analysis                             Yes         Yes            Yes              No
Analysis for General Distribution             No          Yes            Yes              No
Closed-Form Expressions                       No          Yes            Yes              N/A4
Asymptotic Optimality                         No          No             Yes              Yes
Tail Characterization                         No          No             Yes              No

4 Queueing analysis is not applicable to delayed relaunch.

Table 1.2: The different regimes for the known results of the different scheduling algorithms.

In Table 1.2, we describe the different cases in which the analysis of these algorithms has been studied. The first line is for a single file and homogeneous servers. As mentioned earlier, MDS-Reservation scheduling does not achieve the optimal stability region for either deterministic or exponential service times. Fork-Join scheduling has been shown to achieve the optimal stability region only for exponential service times, while uniform probabilistic scheduling achieves the optimal stability region for general service distributions. The second line indicates that the queueing analysis to find an upper bound on latency has been studied for the first three schemes, while for delayed relaunch no non-trivial queueing analysis exists; it has only been studied for a single file request in the absence of a queue. The next line considers whether there is a latency analysis (upper bounds) for general service-time distributions, which is not available for MDS-Reservation and Delayed Relaunch scheduling. The fourth line indicates whether closed-form expressions for the latency bounds exist in the queueing analysis, which is not the case for MDS-Reservation scheduling. Since there is no queueing analysis for delayed relaunch, N/A is marked. In the next line, we note that there are asymptotic guarantees for probabilistic scheduling with exponential service times in two regimes. The first is the case of homogeneous servers with m = n and a single file, where n grows large. The second is where the file sizes are heavy-tailed. Since delayed relaunch scheduling is a generalization of probabilistic scheduling, it inherits these guarantees. The last line indicates whether an analysis exists for tail latency, i.e., the probability that the latency is greater than a threshold, which exists for probabilistic scheduling.
As an example, we consider a shifted exponential distribution for the service times, Sexp(β, α), as defined in (3.14). The parameters are different for different servers, and for m = 12 servers are given in Table 4.1. We consider homogeneous files with ki = 7 and ni = 12. In order to run the simulations, requests arrive for 10^4 seconds, and their service times are used to calculate the latency under the different scheduling strategies. For varying arrival rates, we compare three strategies: MDS-Reservation(1000), Fork-Join scheduling, and Probabilistic scheduling. Since optimized Delayed Relaunch scheduling includes the Fork-Join and Probabilistic scheduling approaches as special cases, we do not compare it. Since Probabilistic scheduling leaves the probabilities of choosing the servers as free parameters, we draw ten choices of the probability vector (each entry uniform between zero and one, then normalized) and choose the best one among these.
Figure 1.3: Comparison of the different strategies in simulation. We note that probabilistic scheduling outperforms the other strategies with the considered parameters.

Note that even though Fork-Join queues have not been analyzed for heterogeneous servers, the results indicate the simulated performance. The simulation results are provided in Fig. 1.3. We note that MDS-Reservation and the fork-join queue do not achieve the optimal stable throughput region, and it can be seen that their mean latency starts diverging at lower arrival rates. Further, we note that probabilistic scheduling performs better than Fork-Join scheduling for all arrival rates in this system.
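A minimal sketch of the randomization step just described: draw each server-selection weight uniformly at random, normalize into a probability vector, repeat ten times, and keep the best vector. The latency evaluation is abstracted into a placeholder function here (a simple imbalance penalty), since the actual evaluation is the queueing simulation described above.

import numpy as np

rng = np.random.default_rng(2)
m = 12                                  # number of servers, as in the example above

def simulated_mean_latency(p):
    # Placeholder for the discrete-event simulation described in the text; this stand-in
    # simply penalizes imbalanced probability vectors so the example runs end to end.
    return float(np.sum(p ** 2))

best_p, best_latency = None, np.inf
for _ in range(10):                     # ten random choices of the probability terms
    p = rng.uniform(0.0, 1.0, size=m)   # uniform random weights between zero and one
    p /= p.sum()                        # normalization: weights become a probability vector
    latency = simulated_mean_latency(p)
    if latency < best_latency:
        best_p, best_latency = p, latency
print("best probability vector:", np.round(best_p, 3))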

1.4 Outline of the Monograph

In the remainder of the monograph, we describe the four considered scheduling strategies in Chapters 2, 3, 4, and 5, respectively. The corresponding results for the latency characterization, approximations, generalizations, and asymptotic optimality guarantees will be provided. In Chapter 6, we demonstrate an extension of the approach from erasure-coded file storage to erasure-coded video storage, where the metric of importance for video is the stall duration rather than the latency. The analysis of an upper bound on the stall duration using probabilistic scheduling is provided. Finally, Chapter 7 demonstrates the insights of the approaches on prototype evaluation in realistic systems. We also discuss open problems at the end of each chapter to illuminate directions for future work.

1.5 Notes

In this monograph, we consider using a maximum distance separable (MDS) code for the erasure-coded storage system. The scheduling algorithms and the corresponding latency analysis will be discussed in the following chapters, together with a discussion of related papers.
In addition to latency, a key benefit of the use of erasure codes is dealing with node failures. With node failures, one important aspect is the reliability of the storage system, which is measured through the mean time to data loss. Different models of failure and repair have been considered in (Angus, 1988; Chen et al., 1994).
Another aspect of node failure is minimizing the amount of data transferred to repair a failed node. In order to address this problem, a novel approach of regenerating codes was proposed in (Dimakis et al., 2010). Functional repair has been proposed, which aims to repair a failed node with content that satisfies similar properties as the original code. The problem has connections with network coding, and such frameworks have been used to provide efficient code designs. Regenerating codes allow a tradeoff between the storage capacity at each node and the amount of data transferred from the other d ≥ k nodes in an (n, k) erasure-coded system. In many realistic systems, exact repair, where the failed node is repaired with the exact same content, is needed. The construction of erasure codes with exact repair guarantees has been widely studied (Suh and Ramchandran, 2011; Rashmi et al., 2011; Papailiopoulos et al., 2013; Tian et al., 2015; Goparaju et al., 2014). Regenerating codes have also been used to evaluate the mean time to data loss of storage systems (Aggarwal et al., 2014). The regenerating codes include a minimum storage regenerating (MSR) point, where 1/k of the file is placed at each node, which is best for latency. However, any point on the tradeoff curve of the code can be used for the analysis in this work, with a correspondingly increased amount of data downloaded from each of the k nodes due to the increased storage.
2 MDS-Reservation Scheduling Approach

In this chapter, we introduce the model of MDS-Reservation(t) queues in Section 2.1, which was first proposed in (Lee et al., 2017). It allows us to develop an upper bound on the latency of MDS queues and characterize the stability region in Section 2.2. We will also develop a lower bound for MDS-Reservation(t) queues using M/M/n queues in Section 2.3 and investigate the impact of redundant requests in Section 2.4. Sections 2.5 and 2.6 contain simulation results and notes on future directions.

2.1 MDS-Reservation Queue

For this chapter, we consider a homogeneous erasure-coded storage system, where the files have identical sizes and are encoded using the same MDS code with parameters (n, k). There are n identical storage servers, each storing exactly one chunk of each file. Incoming file requests follow a Poisson process and are independent of the state of the system, and the chunk service times at the storage nodes have an i.i.d. exponential distribution. Under an MDS code, a file can be retrieved by downloading chunks from any k of the n servers. Thus, a file request is considered served when all k of its chunk requests have been scheduled and have completed service.

Algorithm 1 MDS-Reservation(t) Scheduling Policy

On arrival of a batch:
  If the buffer has strictly fewer than t batches:
    Assign jobs of the new batch to idle servers
  Append the remaining jobs of the batch to the end of the buffer
On departure of a job from a server (say, server s):
  Find î = min{i ≥ 1 : s has not served a job of the ith waiting batch}
  Let b_{t+1} be the (t + 1)th waiting batch (if any)
  If î exists and î ≤ t:
    Assign a job of the îth waiting batch to s
  If î = 1, the first waiting batch had only one job in the buffer, and b_{t+1} exists:
    Assign a job from batch b_{t+1} to every remaining idle server

To derive a latency upper bound, a class of MDS-Reservation(t) scheduling policies was proposed in (Lee et al., 2017). This class of scheduling policies is indexed by an integer parameter t. When a file is requested, a set of k tasks is created. As many of these tasks as there are idle servers (up to k) are scheduled immediately, and the remaining tasks are kept as a batch in the buffer. On the departure of any task from a server, the buffer is searched in order for a batch from which no job has been served by this server, and a task from that batch is served. For the parameter t, an additional restriction is imposed: any file request i ≥ t + 1 (i.e., the ith batch of chunk requests) can move forward in the buffer only when all k of its chunk requests can move forward together (i.e., when one of the t head-of-line file requests is completed). The basic pseudo-code for MDS-Reservation(t) is described in Algorithm 1. It is easy to see that by blocking file requests with i ≥ t + 1, MDS-Reservation(t) scheduling policies provide an upper bound on the file access latency of MDS queues, with a larger t leading to a tighter bound, yet at the cost of more states to maintain and analyze.
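The buffer-scan step of this policy can be sketched as follows (Python; the Batch fields and helper are our own modelling choices rather than code from (Lee et al., 2017), and the special dispatch of batch t + 1 when the head-of-line batch clears is omitted). A freed server takes a job from the first of the t head-of-line waiting batches that it has not yet served; batches beyond position t remain blocked.

from dataclasses import dataclass, field

@dataclass
class Batch:
    jobs_waiting: int                              # chunk requests of this batch still in the buffer
    served_by: set = field(default_factory=set)    # servers that already took a job of this batch

def pick_batch_for_server(buffer, server, t):
    # Departure rule of Algorithm 1: scan waiting batches in order, but only the first t are eligible.
    for i, batch in enumerate(buffer):
        if i >= t:                                 # reservation restriction: batch t+1 onward is blocked
            return None
        if batch.jobs_waiting > 0 and server not in batch.served_by:
            return i                               # i_hat: first waiting batch this server has not served
    return None

# Toy state with t = 1 and k = 2, loosely mirroring the MDS-Reservation(1) case of Figure 2.1:
buffer = [Batch(jobs_waiting=1, served_by={2}), Batch(jobs_waiting=2)]
print(pick_batch_for_server(buffer, server=2, t=1))   # None: server 2 already served batch 0, so it idles
print(pick_batch_for_server(buffer, server=5, t=1))   # 0: server 5 may take the waiting job of batch 0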
An example of the MDS-Reservation(t) scheduling policy (for t = 1, 2, respectively) and the corresponding queueing behavior is illustrated in Figure 2.1 for an (n, k) = (5, 2) code. Since chunk request R_3^{A,1} in batch 3 has already been processed by server 2, the second chunk request R_3^{A,2} in the same batch cannot be processed by the same server. Under MDS-Reservation(1), a batch not at the head of the line can proceed only if all chunk requests in the batch can move forward together. Since this condition is not satisfied, server 2 must enter an idle state next, leading to resource underutilization and thus higher latency. On the other hand, an MDS-Reservation(2) policy allows any chunk requests in the first two batches to move forward individually. Chunk request R_4^{A,1} moves into server 2 for processing. It is easy to see that as t grows, the MDS-Reservation(t) scheduling policy becomes closer to the optimal scheduling policy.

Figure 2.1: An illustration of MDS-Reservation(1) (left) and MDS-Reservation(2) (right) scheduling.

2.2 Characterization of Latency Upper Bound via MDS-Reservation Scheduling

The key idea in analyzing these MDS queues is to show that the
corresponding Markov chains belong to a class of processes known
as Quasi-Birth-Death (QBD) processes (Lee et al., 2017). Thus, the
steady-state distribution can be obtained by exploiting the properties
of QBD processes. More precisely, a birth-death process is defined as
a continuous-time Markov process on discrete states {0, 1, 2, . . .}, with
transition rate λ from state i to i + 1, transition rate µ from state i + 1
to i, and rates µ0 , λ0 to and from the boundary state i = 0, respectively.
A QBD process is a generalization of such birth-death processes whose
states i are each replaced by a set of states, known as a level. Thus a
QBD process could have transitions both within a level and between
adjacent levels. It has a block-tridiagonal transition probability matrix of the form

\begin{pmatrix}
B_1 & B_2 & 0 & 0 & \cdots \\
B_0 & A_1 & A_2 & 0 & \cdots \\
0 & A_0 & A_1 & A_2 & \cdots \\
0 & 0 & A_0 & A_1 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}    (2.1)

where the matrices B_1, B_2, B_0 are the transition probabilities within, from, and to the boundary states, and A_1, A_2, A_0 are the transition probabilities within each level, entering the next level, and entering the previous level, respectively.
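To give a sense of how steady-state distributions of such QBD processes are computed in practice, the sketch below implements the standard matrix-geometric fixed-point iteration for the minimal solution R of A_2 + R A_1 + R^2 A_0 = 0 (using the generator-block convention above, in which the stationary probabilities satisfy π_{level+1} = π_{level} R), and checks it on an M/M/1 queue viewed as a QBD with 1 × 1 blocks, where R must equal λ/µ. This is a generic illustration, not the specific boundary analysis of (Lee et al., 2017).

import numpy as np

def qbd_rate_matrix(A0, A1, A2, tol=1e-12, max_iter=10_000):
    # Minimal nonnegative solution R of A2 + R A1 + R^2 A0 = 0, where (as generator blocks)
    # A2 moves up one level, A1 stays within a level, and A0 moves down one level.
    R = np.zeros_like(A1)
    A1_inv = np.linalg.inv(A1)
    for _ in range(max_iter):
        R_next = -(A2 + R @ R @ A0) @ A1_inv
        if np.max(np.abs(R_next - R)) < tol:
            return R_next
        R = R_next
    raise RuntimeError("R iteration did not converge")

# Sanity check: M/M/1 queue as a degenerate QBD with 1x1 blocks, for which pi_{i+1} = (lam/mu) pi_i.
lam, mu = 0.7, 1.0
A2 = np.array([[lam]])              # arrival: up one level
A0 = np.array([[mu]])               # departure: down one level
A1 = np.array([[-(lam + mu)]])      # within-level (diagonal) generator entry
R = qbd_rate_matrix(A0, A1, A2)
print("R =", R[0, 0], " expected lam/mu =", lam / mu)
# For a general QBD, positive recurrence holds iff v A2 1 < v A0 1, where v is the
# stationary vector of A0 + A1 + A2 (not computed in this sketch).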

Theorem 2.1 ((Lee et al., 2017)). The Markovian representation of the MDS-Reservation(t) queue has a state space {0, 1, . . . , k}^t × {0, 1, . . . , ∞}. It is a QBD process with boundary states {0, 1, . . . , k}^t × {0, 1, . . . , n − k + tk} and levels {0, 1, . . . , k}^t × {0, 1, . . . , n + jk} for j = {t, t + 1, . . . , ∞}.

Proof. We briefly summarize the proof in (Lee et al., 2017). For any state of the system (w_1, w_2, . . . , w_t, m) ∈ {0, 1, . . . , k}^t × {0, 1, . . . , ∞}, define

q = \begin{cases} 0 & \text{if } w_1 = 0 \\ t & \text{else if } w_t \neq 0 \\ \arg\max\{\tau : w_\tau \neq 0,\ 1 \le \tau \le t\} & \text{otherwise.} \end{cases}    (2.2)

Then we can find the number of waiting request batches (b), the number of idle servers in the system (z), the number of jobs of the ith waiting batch in the servers (s_i), and the number of jobs of the ith waiting batch in the buffer (w_i) as follows:

b = \begin{cases} 0 & \text{if } q = 0 \\ q & \text{else if } 0 < q < t \\ t + \frac{m - \sum_j w_j - n}{k} & \text{otherwise,} \end{cases}    (2.3)

z = n - \Big( m - \sum_j w_j - (b - t)^+ k \Big),    (2.4)

s_i = \begin{cases} w_{i+1} - w_i & \text{if } i \in \{1, \ldots, q - 1\} \\ k - z - w_i & \text{if } i = q \\ 0 & \text{if } i \in \{q + 1, \ldots, b\} \end{cases} \quad \text{for } i \in \{1, \ldots, b\},    (2.5)

w_i = k, \quad \text{for } i \in \{t + 1, \ldots, b\}.    (2.6)

These equations characterize the full state transitions. It is easy to verify that the MDS-Reservation(t) queue has the following two key properties: i) any transition changes the value of m by at most k; and ii) for m ≥ n − k + 1 + tk, the transition from any state (w_1, m) to any other state (w_1', m' ≥ n − k + 1 + tk) depends on m mod k and not on the actual value of m. It is then straightforward to show that this satisfies the boundary and level conditions of a QBD process, with the boundary and level transitions specified in the theorem.

The proof of Theorem 2.1 provides a procedure to obtain the configuration of the entire queueing system under MDS-Reservation(t) scheduling policies. Further, as t goes to infinity, the system approaches an MDS queue, thus resulting in a tighter upper bound at the cost of a more complicated queueing analysis. This is because the MDS-Reservation(t) scheduling policy follows the MDS scheduling policy when the number of file requests in the buffer is less than or equal to t. Thus, it is identical to the MDS queue when t goes to infinity.

Theorem 2.2. The MDS-Reservation(t) queue, when t = ∞, is precisely the MDS queue for homogeneous files, homogeneous servers, exponential service times, and n = m.

These results allow us to employ any standard solver to obtain the steady-state distribution of the QBD process, enabling latency analysis
under MDS-Reservation(t) scheduling policies. In particular, for t = 0,
the MDS-Reservation(0) policy is rather simple, as the file request
(consisting of a batch of chunk requests) at the head of the line may
move forward and enter service only if there are at least k idle servers.
When n = k, this becomes identical to a split-merge queue (Harrison
and Zertal, 2003). For t = 1, the MDS-Reservation(1) policy is identical
to the block-one scheduling policy proposed in (Huang et al., 2012b).

Using this queue model, we can also find the stability region of the MDS-Reservation(t) scheduling policy. While an exact characterization is intractable in general, bounds on the maximum stability region, defined as the maximum possible number of requests that can be served by the system per unit time (without resulting in infinite queue length), are given in (Lee et al., 2017).

Theorem 2.3 ((Lee et al., 2017)). For any given (n, k) and t > 1, the maximum throughput λ^*_{Resv(t)} in the stability region satisfies the following inequalities when k is treated as a constant:

(1 - O(n^{-2})) \frac{n}{k}\mu \;\le\; \lambda^*_{Resv(t)} \;\le\; \frac{n}{k}\mu.    (2.7)
Proof. First, we note that for t ≥ 2, the latency of each of the MDS-Reservation(t) queues is upper bounded by that of MDS-Reservation(1), since fewer batches of chunk requests are blocked and prevented from moving forward into the servers as t increases.
Next, we evaluate the maximum throughput in the stability region of MDS-Reservation(1) by exploiting properties of QBD systems. We follow the proof in (Lee et al., 2017). Using the QBD process representation in (2.1), the maximum throughput λ^*_{Resv(t)} of any QBD system is the value of λ such that there exists a vector v satisfying v^T (A_0 + A_1 + A_2) = 0 and v^T A_0 1 = v^T A_2 1, where 1 is the all-one vector. For fixed values of µ and k, it is easy to verify that the matrices A_0, A_1, and A_2 are affine transformations of the arrival rate λ. Plugging the values of A_0, A_1 and A_2 into the QBD representation of MDS-Reservation(1) queues, we can show that such a vector v exists if λ^*_{Resv(1)} ≥ (1 − O(n^{−2})) \frac{n}{k}µ.
It then follows that λ^*_{Resv(t)} ≥ λ^*_{Resv(1)} ≥ (1 − O(n^{−2})) \frac{n}{k}µ for any t ≥ 2. The upper bound on λ^*_{Resv(t)} is straightforward: since each batch consists of k chunk requests, the rate at which batches exit the system (for all n servers combined) is at most nµ/k.

2.3 Characterization of Latency Lower Bound

To derive a lower bound on service latency for MDS queues, we leverage a class of M^k/M/n(t) scheduling policies proposed in (Lee et al., 2017), which relax the requirement that the k chunk requests belonging to the same file request must be processed by distinct servers after the first t requests. It applies the MDS scheduling policy whenever there are t or fewer file requests (i.e., t or fewer batches of chunk requests) in the system, while ignoring the requirement of distinct servers when there are more than t file requests.
Theorem 2.4 ((Lee et al., 2017)). The Markovian representation of the M^k/M/n(t) queue has a state space {0, 1, . . . , k}^t × {0, 1, . . . , ∞}. It is a QBD process with boundary states {0, 1, . . . , k}^t × {0, 1, . . . , n + tk} and levels {0, 1, . . . , k}^t × {n − k + 1 + jk, . . . , n + jk} for j = {t + 1, . . . , ∞}.
Proof. We again define q for any system state (w_1, w_2, . . . , w_t, m) ∈ {0, 1, . . . , k}^t × {0, 1, . . . , ∞} as in Equation (2.2). The values of b, z, s_i, and w_i can be derived accordingly and are identical to those in Section 2.2. These equations capture the entire state transitions. It is then easy to see that the M^k/M/n(t) queue satisfies the following two properties: i) any transition changes the value of m by at most k; and ii) for m ≥ n − k + 1 + tk, the transition from any state (w_1, m) to any other state (w_1', m' ≥ n − k + 1 + tk) depends on m mod k and not on the actual value of m. This results in a QBD process with the boundary states and levels described in the theorem.

Similar to the case of MDS-Reservation(t) scheduling policies, as t goes to infinity in M^k/M/n(t) scheduling policies, the resulting system approaches an MDS queue, thus providing a tighter lower bound at the cost of a more complicated queueing analysis.
Theorem 2.5. The M^k/M/n(t) queue, when t = ∞, is precisely the MDS queue for homogeneous files, homogeneous servers, exponential service times, and n = m.
Again, these results allow us to obtain the steady-state distribution of the QBD process, which enables latency analysis under M^k/M/n(t) scheduling policies.

2.4 Extension to Redundant Requests

In erasure-coded storage systems, access latency can be further reduced by sending redundant requests to storage servers. Consider a scheme under an (n, k) MDS code which sends each file request (redundantly) to v > k servers. Clearly, upon completion of any k out of the v chunk requests, the file request is considered served, and the remaining v − k active chunk requests can be canceled and removed from the system. It is easy to see that redundant requests allow a reduction in the latency of individual requests at the expense of an increase in overall queueing delay due to the use of additional resources on the v − k straggler requests. We note that when k = 1, the redundant-request policy reduces to a replication-based scheme.
Formally, an MDS queue with redundant requests is associated to
five parameters (n, k), [λ, µ], and the redundant level v ≥ k, satisfying
the following modified assumptions: i) File requests arrive in batches of
v chunk requests each; ii) Each of the v chunk requests in a batch can
be served by an arbitrary set of v distinct servers; iii) Each batch of v
chunk requests is served when any k of the v requests are served.
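Ignoring queueing, waiting for the fastest k out of v i.i.d. Exponential(µ) chunk services takes (1/µ) Σ_{j=v−k+1}^{v} 1/j in expectation, which decreases with the redundancy level v; the small sketch below (our illustration, not part of the queueing analysis) tabulates this for (n, k) = (10, 5) and cross-checks it by simulation.

import numpy as np

def expected_kth_of_v_exponentials(k, v, mu):
    # E[k-th order statistic of v i.i.d. Exp(mu)] = (1/mu) * sum_{j=v-k+1}^{v} 1/j,
    # i.e. H^1_{v-k, v} / mu in the notation used in Chapter 3.
    return sum(1.0 / j for j in range(v - k + 1, v + 1)) / mu

n, k, mu = 10, 5, 1.0
for v in range(k, n + 1):              # redundancy level between v = k (none) and v = n (maximum)
    t = expected_kth_of_v_exponentials(k, v, mu)
    print(f"v = {v:2d}: expected download time without queueing = {t:.3f}")

# Monte-Carlo cross-check for v = n.
rng = np.random.default_rng(3)
samples = np.sort(rng.exponential(1.0 / mu, size=(100_000, n)), axis=1)[:, k - 1]
print("simulated value for v = n:", samples.mean())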
While empirical results in (Ananthanarayanan et al., Submitted;
Liang and Kozat, 2013; Vulimiri et al., 2012) demonstrated that the use
of redundant requests can lead to smaller latency under various settings,
a theoretical analysis of the latency - and thus a quantification of the
benefits - is still an open problem. Nevertheless, structural results have
been obtained in (Ananthanarayanan et al., Submitted) using MDS-
queue models, e.g., to show that request flooding can indeed reduce
latency in certain special cases.
Theorem 2.6. Consider a homogeneous MDS(n, k) queue with Poisson
arrivals, exponential service time, and identical service rates. If the
system is stable in the absence of redundant requests, a system with
the maximum number v = n of redundant requests achieves a strictly
smaller average latency than any other redundant request policies,
including no redundancy v = k and time-varying redundancy.
Proof. Consider two systems, system S1 with redundant level v < n and system S2 with redundant level n. We need to prove that, under the same sequence of events (i.e., arrivals and server completions), the number of batches remaining in system S1 is at least as large as that in S2 at any given time. To this end, we use the notion "time z" to denote the time immediately following the zth arrival/departure event. Assume both systems, S1 and S2, are empty at time z = 0. Let bi(z) denote the number of batches remaining in system Si at time z. We use induction to complete this proof.
At any time z, we consider the induction hypothesis: i) b1(z) ≥ b2(z), and ii) if there are no further arrivals from time z onwards, then at any time z′ > z, b1(z′) ≥ b2(z′). It is easy to see that both conditions hold at z = 0. Next, we show that they also hold for z + 1.
First suppose that the event at z + 1 is the completion of a chunk request at one of the n servers. Then the hypothesis at time z implies the satisfaction of both hypotheses at time z + 1, due to the second hypothesis, as well as the fact that there are no further arrivals from z to z + 1.
Now suppose the event at z + 1 is the arrival of a new batch. Let a1(z′) and a2(z′) be the number of batches remaining in the two systems at time z′ if the new batch had not arrived. From the second hypothesis, we have a1(z′) ≥ a2(z′). Since the MDS scheduling policy processes all batches sequentially, under any sequence of departures, we have b1(z′) = a1(z′) if the new batch has been served in S1, and b1(z′) = a1(z′) + 1 otherwise. When b1(z′) = a1(z′) + 1, it is easy to see that b1(z′) = a1(z′) + 1 ≥ a2(z′) + 1 ≥ b2(z′). Thus, we only need to consider the case b1(z′) = a1(z′), which implies that the new batch has been served in S1 at or before time z′.
Let z_1, . . . , z_k be the events at which the k chunk requests of the new batch are served in S1. At these times, the corresponding servers would have been idle in S1 if the new batch had not arrived, implying a1(z′) = c1(z′), where c1(z′) is the number of batches remaining at time z′ excluding the events at z_1, . . . , z_k. From the second hypothesis, we also have c1(z′) ≥ c2(z′). Then, it is sufficient to prove b2(z′) = c2(z′) next.
Note that if at time z all b2(z) batches present in the system had all of their k chunk requests remaining to be served, and there were no further arrivals, then the total number of batches served between times z and z′ > z can be counted by assigning n independent exponential timers to the servers. Since the events z_1, . . . , z_k must correspond to the completion of exponential timers at k distinct servers, it must be that b2(z′) = c2(z′). Putting the pieces together, it means b1(z′) ≥ b2(z′), which completes the induction.

Finally, when S1 employs redundant level v < n, we need to show that for a fraction of time that is strictly bounded away from zero, the number of batches remaining in S2 is strictly smaller than that in S1. We start with the state where both systems S1 and S2 are empty. Consider the arrival of a single batch followed by a sequence of completions of the exponential timers at different servers. It is easy to see that the systems can be taken to states b1(z′) = 1 and b2(z′) = 0, respectively, with a probability strictly bounded away from zero, when v < n (since the batch in S2 is served when any k out of n timers expire, rather than k out of v as in S1). Thus, the fraction of time in which b1(z′) − b2(z′) = 1 is strictly bounded away from zero. This completes the proof.

2.5 Simulations

Figure 2.2: A comparison of average file access latency and its upper/lower bounds through MDS-Reservation(t) and M^k/M/n(t) scheduling policies.
The proposed latency bounds, obtained using MDS-Reservation(t) and M^k/M/n(t) scheduling policies respectively, have been compared in (Lee et al., 2017) through numerical examples. For an MDS system with (n, k) = (10, 5) and µ = 1, Figure 2.2 plots the average file access latency for various scheduling policies. Here, the average latency bounds under MDS-Reservation(t) and M^k/M/n(t) scheduling policies are computed by applying Little's Law to the stationary distribution. A Monte-Carlo simulation is employed to numerically find the exact latency of MDS
24 MDS-Reservation Scheduling Approach

queues. The results are also compared to a simple policy, “Replication-


II", in which the n servers are partitioned into a k sets of n/k servers
each, and each of the k chunk requests is served by a separate set
of servers. We note that tail latency performance cannot be analyzed
through MDS-Reservation(t) and M k /M/n(t) scheduling policies.

Figure 2.3: Simulation results showing the reduction of average latency with an
increase in the redundant level v for an MDS(10,5) queue.
For a homogeneous MDS(n, k) queue with redundant requests, Figure 2.3 shows the simulated file access latency for varying redundancy levels v. It corroborates the analysis that when the service times are i.i.d. exponential, the average latency is minimized by v = n redundant requests. Further, the average latency appears to strictly decrease with an increase in the redundancy level v, but it is unclear whether this property carries over to general service time distributions.

2.6 Notes and Open Problems

The study of latency using MDS queues was initiated by (Huang et al., 2012b), which considered a special case, the "block-one-scheduling" policy, to obtain an upper bound on service latency. For arbitrary service time distributions, an analysis of the blocking probability was presented in (Ferner et al., 2012) in the absence of a shared request buffer. Later, these results were extended in (Lee et al., 2017) to general MDS-Reservation(t) queues, and a tighter upper bound on request latency was provided. There are a number of open problems that can be considered in future work.
1. General service time: The proposed analysis of MDS queues assumes an exponential service time distribution. More accurate modeling of the service time distribution based on the underlying properties of the storage devices, and the corresponding latency analysis, are open problems.

2. Heterogeneous queues: The analysis in this chapter is limited to servers with equal service times and files of equal chunk sizes, which is not practical. Further, when files are stored on different sets of n servers, it is unclear whether the analysis in this chapter using MDS queues can be extended.

3. Redundant requests: The quantification of service-time distributions under redundant requests is an open problem. Even in the case of exponentially distributed service times, the precise amount of latency improvement due to allowing redundant requests is unknown.
3 Fork-Join Scheduling Approach

In this chapter, we introduce the model of Fork-Join scheduling in Section 3.1, which was first proposed in (Joshi et al., 2014). We first consider homogeneous files and exponential service times to derive upper and lower bounds on the latency in Section 3.2. Further, an approximate characterization of the latency will also be provided. Section 3.3 extends the upper and lower bounds on latency to general service time distributions. Section 3.4 extends the results of Section 3.2 to heterogeneous files, where the parameter k can be different for different files, while each file is placed on all n servers. Sections 3.5 and 3.6 contain simulation results and notes on future directions, respectively.

3.1 Fork-Join Scheduling

We consider a data center consisting of n homogeneous servers, denoted by M = {1, 2, . . . , n}, called storage nodes. We consider a single file, whose requests arrive according to a Poisson process with rate λ. We partition the file into k fixed-size chunks, and then encode it using an (n, k) MDS erasure code to generate n distinct chunks of the same size. The encoded chunks are assigned to and stored on n distinct storage nodes. Therefore, each chunk is placed on a different node to provide high reliability in the event of node or network failures. We assume that the service time at each server is exponentially distributed with rate µ. Even though we assume one file, multiple homogeneous files (with the same size) can be easily incorporated.
We first introduce the Fork-Join system in the following definition.

Definition 3.1. An (n, k) fork-join system consists of n nodes. Every arriving job is divided into n tasks, which enter first-come first-serve queues at each of the n nodes. The job departs the system when any k out of the n tasks are served by their respective nodes. The remaining n − k tasks abandon their queues and exit the system before completion of service.

The (n, n) fork-join system, known in the literature as the fork-join queue, has been extensively studied in, e.g., (Kim and Agrawala, 1989; Nelson and Tantawi, 1988; Varki et al., 2008). The (n, k) generalization was first studied in (Joshi et al., 2014), and has been followed by multiple works in the distributed storage literature (Gardner et al., 2015; Fidler and Jiang, 2016; Kumar et al., 2017; Parag et al., 2017; Badita et al., 2019).
It can be shown that for the (n, k) fork-join system to be stable, the effective rate of tasks that must be served, kλ, should be less than nµ. Thus, λ < nµ/k is the stability region. We now use the fork-join system as a scheduling strategy, where the k tasks are encoded into n tasks, and the scheduler starts the job on all n servers. The job departs the system when any k out of the n tasks are served by their respective servers. The remaining n − k tasks abandon their queues and exit the system before completion of service. In the following section, we will provide bounds on the latency under Fork-Join scheduling.
An example of a fork-join queue for (n, k) = (5, 2) is illustrated in Figure 3.1. Each batch of chunk requests is mapped to all 5 servers. They depart the system together as soon as any 2 chunk requests are served, and the remaining 3 requests abandon processing. No other batch in the queue can move forward before all servers become available. This leads to underutilization of server capacity and thus provides an upper bound on the optimal scheduling policy.
Figure 3.1: An illustration of the (n, k) = (5, 2) fork-join system.

3.2 Characterization of Latency

We now provide bounds on the expected file latency, which is the mean response time T(n,k) of the (n, k) fork-join system. It is the expected time that a job spends in the system, from its arrival until k out of n of its tasks are served by their respective nodes.
Since the n tasks are served by independent M/M/1 queues, intuition suggests that T(n,k) is given by the kth order statistic of n exponential service times. However, this is not true, which makes the analysis of T(n,k) challenging. The reason the order-statistics approach does not work is the cancellation of jobs in the queues, whose abandonment has to be taken into account.
Let H^z_{x,y} be a generalized harmonic number of order z, defined by

H^z_{x,y} = \sum_{j=x+1}^{y} \frac{1}{j^z},    (3.1)

for some positive integers x, y and z. The following result provides an upper bound on the expected download time.
Theorem 3.1 ((Joshi et al., 2014)). The expected file latency, T(n,k), satisfies

T_{(n,k)} \le \frac{H^1_{n-k,n}}{\mu} + \frac{\lambda \left[ H^2_{n-k,n} + (H^1_{n-k,n})^2 \right]}{2\mu^2 \left( 1 - \rho H^1_{n-k,n} \right)},    (3.2)

where λ is the request arrival rate, µ is the service rate at each queue, and ρ = λ/µ is the load factor. We note that the bound is valid only when ρH^1_{n−k,n} < 1.

Proof. To find this upper bound, we use a model called the split-merge
system, which is similar but easier to analyze than the fork-join system.
In the (n, k) fork-join queueing model, after a node serves a task, it
can start serving the next task in its queue. On the contrary, in the
split-merge model, the n nodes are blocked until k of them finish service.
Thus, the job departs all the queues at the same time. Due to this
blocking of nodes, the mean response time of the (n, k) split-merge
model is an upper bound on (and a pessimistic estimate of) T(n,k) for
the (n, k) fork-join system.
The (n, k) split-merge system is equivalent to an M/G/1 queue where arrivals are Poisson with rate λ and the service time is a random variable S distributed as the kth order statistic of n i.i.d. exponential random variables with rate µ.
The mean and variance of S are given as

E[S] = \frac{H^1_{n-k,n}}{\mu} \quad \text{and} \quad \mathrm{Var}[S] = \frac{H^2_{n-k,n}}{\mu^2}.    (3.3)

The Pollaczek-Khinchin formula (Zwart and Boxma, 2000) gives the mean response time T of an M/G/1 queue in terms of the mean and variance of S as

T = E[S] + \frac{\lambda \left( E[S]^2 + \mathrm{Var}[S] \right)}{2\left( 1 - \lambda E[S] \right)}.    (3.4)

Substituting the values of E[S] and Var[S] given by (3.3), we get the upper bound (3.2). Note that the Pollaczek-Khinchin formula is valid only when 1/λ > E[S], the stability condition of the M/G/1 queue. Since E[S] increases with k, there exists a k_0 such that the M/G/1 queue is unstable for all k ≥ k_0. The inequality 1/λ > E[S] can be simplified to ρH^1_{n−k,n} < 1, which is the condition for validity of the upper bound given in Theorem 3.1.
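The split-merge argument above can be checked numerically. The sketch below simulates the equivalent M/G/1 queue by drawing the service time S as the kth order statistic of n Exponential(µ) variables and applying the Lindley recursion for the waiting time, then compares the simulated mean response time with the closed-form value in (3.2); a rough sanity check under the stated stability condition, with parameters chosen for illustration.

import numpy as np

def split_merge_mean_response(n, k, lam, mu, num_jobs=200_000, seed=0):
    # Simulate the (n, k) split-merge system as an M/G/1 queue via the Lindley recursion.
    rng = np.random.default_rng(seed)
    S = np.sort(rng.exponential(1.0 / mu, size=(num_jobs, n)), axis=1)[:, k - 1]  # k-th order statistic
    A = rng.exponential(1.0 / lam, size=num_jobs)        # Poisson arrivals: Exp(lam) inter-arrival gaps
    W = np.empty(num_jobs)
    W[0] = 0.0
    for i in range(1, num_jobs):
        W[i] = max(0.0, W[i - 1] + S[i - 1] - A[i])       # Lindley recursion for the waiting time
    return float((W + S).mean())                          # response time = waiting time + service time

def pollaczek_khinchin_value(n, k, lam, mu):
    # Closed-form expression (3.2) for the split-merge (upper-bound) model.
    H1 = sum(1.0 / j for j in range(n - k + 1, n + 1))
    H2 = sum(1.0 / j ** 2 for j in range(n - k + 1, n + 1))
    rho = lam / mu
    assert rho * H1 < 1, "(3.2) is valid only when rho * H^1_{n-k,n} < 1"
    return H1 / mu + lam * (H2 + H1 ** 2) / (2 * mu ** 2 * (1 - rho * H1))

n, k, lam, mu = 10, 5, 0.5, 1.0        # illustrative parameters
print("simulated split-merge response time:", split_merge_mean_response(n, k, lam, mu))
print("closed-form value of (3.2)         :", pollaczek_khinchin_value(n, k, lam, mu))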

We also note that the stability condition for the upper bound is ρH^1_{n−k,n} < 1, which is not the same as the stability condition of the fork-join queue, λ < nµ/k. This shows that the upper-bound technique is loose and does not yield an efficient bound in the region close to λ = nµ/k. We now derive a lower bound on the latency in the following theorem.

Theorem 3.2 ((Joshi et al., 2014)). The expected file latency, T(n,k) ,
satisfies
k−1
X 1
T(n,k)≥ , (3.5)
j=0
(n − j)µ − λ
where λ is the request arrival rate and µ is the service rate at each
queue.

Proof. The lower bound in (3.5) is a generalization of the bound for the
(n, n) fork-join system derived in (Varki et al., 2008). The bound for the
(n, n) system is derived by considering that a job goes through n stages
of processing. A job is said to be in the j th stage if j out of n tasks
have been served by their respective nodes for 0 ≤ j ≤ n − 1. The job
waits for the remaining n − j tasks to be served, after which it departs
the system. For the (n, k) fork-join system, since we only need k tasks
to finish service, each job now goes through k stages of processing. In
the j th stage, where 0 ≤ j ≤ k − 1, j tasks have been served and the
job will depart when k − j more tasks to finish service.
We now show that the service rate of a job in the j th stage of
processing is at most (n − j)µ. Consider two jobs B1 and B2 in the
ith and j th stages of processing respectively. Let i > j, that is, B1 has
completed more tasks than B2 . Job B2 moves to the (j + 1)th stage
when one of its n − j remaining tasks complete. If all these tasks are
at the heads of their respective queues, the service rate for job B2 is
exactly (n − j)µ. However since i > j, B1 ’s task could be ahead of B2 ’s
in one of the n − j pending queues, due to which that task of B2 cannot
be immediately served. Hence, we have shown that the service rate of
in the j th stage of processing is at most (n − j)µ.
3.2. Characterization of Latency 31

Thus, the time for a job to move from the j th to (j + 1)th stage
is lower bounded by 1/((n − j)µ − λ), the mean response time of an
M/M/1 queue with arrival rate λ and service rate (n − j)µ. The total
mean response time is the sum of the mean response times of each of
the k stages of processing and is bounded below as in the statement of
the theorem.

We note that the lower bound does not achieve the optimal stability
region, giving the threshold as λ < (n − k + 1)µ.
An approximate characterization of latency has also been studied
(Badita et al., 2019). The approach follows the structure of the lower
bound mentioned above, which goes in stages. A job is said to be in
the j th stage if j out of n tasks have been served by their respective
nodes for 0 ≤ j ≤ k − 1. Since the job goes from stage 0 to stage 1, all
the way to stage k − 1 and then get served when k chunks have been
serviced, the procedure is akin to a tandem queue where a service from
stage j leads to stage j + 1. Thus, we consider k − 1 tandem queues
for the approximation, which are assumed to be uncoupled, labeled as
queue j ∈ {0, · · · , k − 1}. The arrival rate at the tandem queue 0 is the
external arrivals which is Poisson at rate λ. Since it is tandem queue
and service time is assumed to be exponential, the arrival rate at each
queue will be λ (Ross, 2019). In the case of the lower bound, the service
rate for tandem queue j was taken as (n − j)µ, while this is where a
better approximation will be used. We let γj be the approximate service
rate of queue j and πj (r) be the probability that the queue length of
tandem queue j is r.
The service rate of queue k − 1 is γk−1 = (n − k + 1)µ as in the
lower bound. For the other queues, service rate includes µ and the
additional resources from the later queues, which for the lower bound
became (n − j)µ. However, the later queues are not always empty
and the resources cannot be used to serve the earlier queues. In the
approximation, we let the resources of the later queues help the earlier
queues only when they are empty. Using additional resources of tandem
queue k − 1 to serve requests at queue k − 2 when queue k − 1 is empty
gives γk−2 = µ + γk−1 πk−1 (0). Proceeding back with the same method,
32 Fork-Join Scheduling Approach

we have the recursion on γi as:



(n − k + 1)µ, i=k−1
γj = . (3.6)
µ + γj+1 πj+1 (0) j ∈ {0, 1, · · · , k − 2}
λ
Since each tandem queue is M/M/1, πj (0) = 1 − γj (Ross, 2019).
Thus, we can compute γj as

γj = (n − j)µ − (k − j − 1)λ, j ∈ {0, 1, · · · , k − 1}. (3.7)

The overall approximate latency is given as

k−1 k−1
X 1 X 1
T(n,k)≈ = . (3.8)
j=0
γj − λ j=0 (n − j)µ − (k − j)λ
This is summarized in the following lemma.

Lemma 3.3 ((Badita et al., 2019)). The expected file latency, T(n,k) can
be approximated as
k−1
X 1
T(n,k)≈ , (3.9)
j=0
(n − j)µ − (k − j)λ

where λ is the request arrival rate and µ is the service rate at each
queue.

We further note that the stability condition for the approximation


is kλ < nµ, and thus it satisfies the optimal stability condition.

3.3 Extension to General Service Time Distributions

The above results for exponential service distribution can be extended


to the case where each homogeneous server has a general service time
distribution. Let X1 , X2 , . . . , Xn be the i.i.d random variables represent-
ing the service times of the n nodes, with expectation E[Xi ] = µ1 and
variance Var[Xi ] = σ 2 for all i.

Theorem 3.4 ((Joshi et al., 2014)). The mean response time T(n,k) of an
(n, k) fork-join system with general service time X such that E[X] = µ1
3.3. Extension to General Service Time Distributions 33

and Var[X] = σ 2 satisfies


s
1 k−1
T(n,k) ≤ +σ
µ n−k+1
 q 2 
1 k−1
λ µ +σ n−k+1 + σ 2 C(n, k)
+ h  q i , (3.10)
1 k−1
2 1−λ µ +σ n−k+1

where C(n, k) is a constant depending on n and k, and is defined as


(Papadatos, 1995)
Ix (k, n + 1 − k)(1 − Ix (k, n + 1 − k))
C(n, k) = sup { }, (3.11)
0<x<1 x(1 − x)
where Ix (a, b) is the incomplete beta function. The values of constant
C(n, k) for certain n and k can be found in the table in (Papadatos,
1995), respectively.
Proof. The proof follows from Theorem 3.1 where the upper bound can
be calculated using (n, k) split-merge system and Pollaczek-Khinchin
formula (3.4). Unlike the exponential distribution, we do not have an
exact expression for S, i.e., the k th order statistic of the service times
X1 , X2 , · · · Xn . Instead, we use the following upper bounds on the
expectation and variance of S derived in (Arnold and Groeneveld, 1979)
and (Papadatos, 1995).
s
1 k−1
E[S] ≤ + σ , (3.12)
µ n−k+1
Var[S] ≤ C(n, k)σ 2 . (3.13)
The proof of (3.12) involves Jensen’s inequality and Cauchy-Schwarz
inequality. For details please refer to (Arnold and Groeneveld, 1979).
The proof of (3.13) can be found in (Papadatos, 1995).
Note that (3.4) strictly increases as either E[S] or Var[S] increases.
Thus, we can substitute the upper bounds in it to obtain the upper
bound on mean response time (3.10).

We
 noteqthat the
 stability condition for the upper bound of latency
1 k−1
is λ µ + σ n−k+1 < 1. For deterministic service times, σ = 0, and
34 Fork-Join Scheduling Approach

the condition becomes λ < µ. However, this is not optimal stability


condition for the best scheduling approach. For deterministic service
times, Fork-Join scheduling spends an additional time for the n − k
tasks which could have been saved and thus leads to non-optimality of
the stability region of Fork-Join queues.
Regarding the lower bound, we note that our proof in Theorem 3.2
cannot be extended to this general service time setting. The proof
requires memoryless property of the service time, which does not neces-
sary hold in the general service time case. However, the proof can be
extended directly to the shifted exponential service time distribution
easily. We let the shifted exponential distribution be Sexp(β, α), where
β is the shift and there is an exponential with rate α after the shift.
The probability density function of Sexp(β, α) is given as

αe−α(x−β) for x ≥ β
fX (x) = (3.14)
0 for x < β

The lower bound on latency is then given as

Theorem 3.5 ((Joshi et al., 2017)). The expected file latency, T(n,k) ,
satisfies
 2  2 
1 1
 
λ β+ nα + nα k−1 β+ 1
1 X nα
T(n,k)≥ β + +     + ,
nα 2 1−λ β+ 1
(n − j) − λ β + nα1
j=1

(3.15)
where λ is the request arrival rate and the the service distribution at
each queue is Sexp(β, α).

Proof. The proof is an adaptation of Theorem 3.2, where the latency


for the first stage is found using Pollaczek-Khinchin formula with
Sexp(β, nα) (as if all tasks are at head, this will be the distribution
of first job finishing). Using this as the completion latency of the first
task, the remaining task completions are similar as they still follow
exponential distributions.
3.4. Extension to Heterogeneous Systems 35

3.4 Extension to Heterogeneous Systems

We now extend the setup where there are r files, where each file i is
encoded using (n, ki ) MDS code. We assume that file i is of size li .
The arrival process for file i is assumed to be Poisson with rate λi .
The service time at each server is assumed to follow an exponential
distribution with service rate µ (per unit file size). The effective service
rate at any server for file i is µi = kliiµ since each server stores 1/ki
fraction of data. Let ρi = µλii be the server utilization factor for file i.
The following result describes the conditions for the queues to be stable
using Fork-Join queueing.

Lemma 3.6 ((Kumar et al., 2017)). For the system to be stable using
Fork-Join queueing system, we require
r r r
! !
X X λi li X
ki λi < nµ λi . (3.16)
i=1 i=1
ki i=1

Proof. Jobs of file i enter the queue with rate λi . Each file i is serviced by
the system when ki sub-tasks of that job are completed. The remaining
n − kr sub-tasks are then cleared from the system. Thus for each request
of file i, (n−k
n
i)
fraction of the sub-tasks aredeleted and 
hence the
n−ki ki λi
effective arrival rate of file i at any server is λi 1 − n = n . Thus
the overall arrival rate at any server, λeff , is
r
X ki λi
λeff = . (3.17)
i=1
n

Let S denote the service distribution for a single-server FCFS system


serving r files, with pi being the fraction of jobs of class i. Then, the
mean service time at a server is
r r
X X λi
E[S] = pi E[Si ] = r
P
, (3.18)
i=1 i=1 µi λi
i=1

where (3.18) follows the assumption that the service time for file i is
exponential with rate µi . To ensure stability, the net arrival rate should
be less than the average service rate at each server. Thus from (3.17)
36 Fork-Join Scheduling Approach

and (3.18) the stability condition of each queue is


 −1
r r
X ki λi X λi 
< r
 ,
i=1
n 
i=1 µi
P
λi

i=1
r
ki µ P
Since µi = li and the term λi is a constant, with simple algebraic
i=1
manipulations we arrive at the statement of the lemma.
j
Let Sj =
P 1
ρr Hn−k . We will now provide the lower and upper
i ,n
i=1
bounds for the mean latency extending the results for homogenous files.
The following results provides an upper bound on the latency.

Theorem 3.7 ((Kumar et al., 2017)). The average latency for job re-
quests of file i using Fork-Join queueing is upper-bounded as follows:

r 2
P 2
λi [Hn−k 1
+ (Hn−k ) ]/µi 2
1
Hn−k i ,n i ,n
i ,n
Ti ≤ + i=1 . (3.19)
µi 2 (1 − Sr )
| {z } | {z }
Service time Waiting time

The bound is valid only when Sr < 1.

Proof. The system can be modeled as a M/G/1 queuing system with


r
P
arrival rate λ = λi and a general service time distribution S. Then
i=1
the average latency for a job of class i is given as
r
λi Var[Si ] + E[Si ]2
P  
i=1
T i = E[Si ] +  r
 . (3.20)
2 1−
P
λi E[Si ]
i=1

To obtain an upper bound on the average latency, we degrade the


Fork-Join system in the following manner. For a request of file i, the
servers that have finished processing a sub-task of that request are
blocked and do not accept new jobs until ki sub-tasks of that request
have been completed. Then the sub-tasks at remaining n − ki servers
3.4. Extension to Heterogeneous Systems 37

exit the system immediately. Now this performance-degraded system


can be modeled as a M/G/1 system where the distribution of the service
process, Si , follows kith ordered statistics. For any file i, the service time
at each of the n servers is exponential with mean 1/µi . Hence from
(3.3), the mean and variance of Si are,
1
Hn−k 2
Hn−k
i ,n i ,n
E[Si ] = , V[Si ] = . (3.21)
µi µ2i
Substituting (3.21) in (3.20), we get the following upper bound on
average latency as in the statement of the theorem.

Without loss of generality, assume the files are relabeled such that
k1 ≤ k2 ≤ ... ≤ kr . The next theorem provides the lower bound of the
latency of file i.

Theorem 3.8 ((Kumar et al., 2017)). The average latency for file i is
lower-bounded as follows:
 

 Pr t2s,j


ki  λj
 
X t s,i j=c s,i +1

Ti ≥ + , (3.22)
 
 r

s=1 
λi 1 −
P
t

|{z} s,j 
service time
 j=cs,i +1 
| {z }
waiting time

λi
where ts,i = (n−s+1)µi , and cs,i is given as



0, 1 ≤ s ≤ k1



 1, k1 < s ≤ k2
cs,i = .. . (3.23)



 .

i − 1,

ki−1 < s ≤ ki

Proof. For the purpose of obtaining a lower bound on the average


latency of file i, using insights from proof of Theorem 3.2, we map
the parallel processing in the Fork-Join system to a sequential process
consisting of ki processing stages for ki sub-tasks of a request of file
i. The transition from one stage to the next occurs when one of the
38 Fork-Join Scheduling Approach

remaining servers finishes a sub-task of the file i. Note that cs,i in


the theorem statement denotes the number of classes of file i that
are finished before start of stage s. The processing in each stage s
corresponds to a single-server FCFS system with jobs of all but classes
1, 2, · · · , cs,i . Then, using Pollaczek-Khinchin formula at stage s, the
average latency for a sub-task of a job of class i in stage s is given by,

i λE[(S s )2 ]
TFCFS,s = E[Sis ] + , (3.24)
2(1 − λE[S s ]))
where S s is a r.v. denoting the service time for any sub-task in stage
s and Sis denotes the service time for a sub-task of class i in stage s,
which are given as
R R
pi E[Sis ], E[(S s )2 ] = pi E[(Sis )2 ],
X X
E[S s ] = (3.25)
i=cs,i +1 i=cs,i +1

where pi = Prλi λi
. Substituting (3.25) in (3.24), we get
i=1

r
λj E[(Sjs )2 ]
P
i j=cs,i +1
Ts,c s,i
= E[Sis ] + !. (3.26)
r
2 1− λj E[Sjs ]
P
j=cs,i +1

Now we note that at any stage s, the maximum possible service rate for
a request of file j that is not finished yet is (n − s + 1)µj . This happens
when all the remaining sub-tasks of request of file j are at the head of
their buffers. Thus, we can enhance the latency performance in each
stage s by approximating it with a M/G/1 system with service rate
(n − s + 1)µj for request of file j. Then, the average latency for sub-task
of request of file i in stage s is lower bounded as,
r
P λj
(n−s+1)µj 2
i 1 j=cs,i +1
Ts,c ≥ + r , (3.27)
s,i
(n − s + 1)µi 1 − P λj
(n−s+1)µj
j=cs,i +1

Finally, the average latency for file i in this enhanced system is simply
ki
P i
Ts,c . This gives us the result as in the statement of the theorem.
s,i
s=1
3.5. Simulations 39

3.5 Simulations

2
Fork Join Simulation
Fork Join Approximation
Fork Join Lower Bound
Mean Latency, T(n,k)

Fork Join Upper Bound


1.8

1.6

1.4

2 4 8 12 16 20
Number of Servers, n

Figure 3.2: This graph displays the latency as the number of servers n increases.
Throughout, the code rate is kept constant at k/n = 0.5, the arrival rate is set to
λ = 0.3, and the service rate of each server is µ = 0.5. The approximate result,
upper bound, and lower bound in Chapter 3.2 are depicted along with the simulation
results.

We evaluate the bounds in Chapter 3.2 for exponential service times,


and compare them with the simulation results. In Figures 3.2 and 3.3,
we consider different parameter regimes to compare the bounds and
the simulation results. We see that the approximation is close to the
simulation results. The upper and lower bounds capture efficient bounds
for the problem, while are still far from the actual latency. In Figure
3.2, we change the value of the number of servers n, and change k as
k = n/2. The rest of the parameters are λ = 0.3 and µ = 0.5. In Figure
3.3, we let n = 24, λ = 0.45, and µ = k/n. On increasing k, the latency
increases as is illustrated in the figure with the bounds.

3.6 Notes and Open Problems

The (n, k) fork-join system was first proposed in (Joshi et al., 2014)
to analyze content download latency from erasure coded distributed
storage for exponential service times. They consider that a content file
coded into n chunks can be recovered by accessing any k out of the
40 Fork-Join Scheduling Approach

8
Fork Join Simulation
Fork Join Approximation
Fork Join Lower Bound

Mean Latency, T(n,k)


Fork Join Upper Bound
6

1 4 8 12 16 20 24
Erasure Code Parameter k

Figure 3.3: This graph displays the latency as k increases. We let n = 24, λ = 0.45,
and µ = k/n. The approximate result, upper bound, and lower bound in Chapter
3.2 are depicted along with the simulation results.

n chunks, where the service time of each chunk is exponential. Even


with the exponential assumption analyzing the (n, k) fork-join system
is a hard problem. It is a generalization of the (n, n) fork-join system,
which was actively studied in queueing literature (Flatto and Hahn,
1984; Nelson and Tantawi, 1988; Varki et al., 2008). The results were
extended to general service time distributions in (Joshi et al., 2017). For
exponential service times, approximate latency was studied in (Badita
et al., 2019). These works assume homogeneous files, in the sense that
each file has the same arrival distributions, have the same erasure-coding
parameters, and run on the servers with same service distribution. For
exponential service times, the authors of (Kumar et al., 2017) studied
the case when the different files have different arrival rates and erasure-
code parameters. However, all these works assume that the number of
servers is the same as the n, which is the erasure coding parameter
representing the number of encoded chunks. Further, the servers are
assumed to be homogeneous, with the same service rate. Thus, the
following problems are of interest.

1. Tighter Upper Bounds: We note that even for the exponential


service times, the studied upper bounds do not meet the optimal
stability conditions. Thus, an efficient upper bound that satisfies
3.6. Notes and Open Problems 41

the stability conditions is open.

2. General File Placement: In the prior work, the number of


servers are the same as the erasure-coded encoded chunks. How-
ever, in general, the number of servers may be large and each file
i may be placed on a subset ni of the servers. The latency results
for a general parameter system has not been considered. A related
concept to the placement of the files is that it is not essential, in
general, to have only one chunk per node, some nodes may have
more chunks. In this case, the n requests are not in parallel and
the same analysis cannot be easily extended.

3. Heterogeneous Servers: In the prior works, the servers serving


the files are assumed to be homogeneous with the same service dis-
tribution. However, this is not the case in general, especially with
fog computing. Obtaining efficient analysis for such heterogeneous
server system is an open problem.

4. Approximation and its Guarantees: While an approximation


of latency has been proposed for exponential service times (Ba-
dita et al., 2019), such characterization for heterogenous files and
general service times is open. Further, the guarantees on approxi-
mation, in some asymptotic regime or bounding the gap between
the two by a constant (or within a multiplicative factor) has not
yet been considered.
4
Probabilistic Scheduling Approach

In this Chapter, we introduce the model of Probabilistic Scheduling in


Section 4.1, which was first proposed in (Xiang et al., 2016; Xiang et al.,
2014). We will evaluate an upper bound on the latency for heteogeneous
files, heterogenous servers, general service times, and general placemeny
of files in Section 4.2. A notion of tail latency is provided and an upper
bound on tail latency is characterized in Section 4.3. For homogenous
files, homogenous servers, Section 4.4 shows that the latency of uniform
probabilistic scheduling is upper bounded by assuming independence
across the servers. Further, asymptotic latency is considered as the
number of servers increase. Another asymptotic metric for heavy tailed
file sizes is provided and analyzed in Section 4.5. Sections 4.6 and 4.7
contain simulation results and notes on future directions, respectively.

4.1 Probabilistic Scheduling

We assume the model given in Section 1.1. Under (ni , ki ) MDS codes,
each file i can be retrieved by processing a batch of ki chunk requests at
distinct nodes that store the file chunks. Recall that each encoded file i
is spread over ni nodes, denoted by a set Si . Upon the arrival of a file
i request, in probabilistic scheduling we randomly dispatch the batch

42
4.1. Probabilistic Scheduling 43

of ki chunk requests to ki out of ni storage nodes in Si , denoted by a


subset Ai ⊆ Si (satisfying |Ai | = ki ) with predetermined probabilities.
Then, each storage node manages its local queue independently and
continues processing requests in order. A file request is completed if
all its chunk requests exit the system. An example of probabilistic
scheduling is depicted in Fig. 4.1 for the setup in Section 1.2, wherein 5
chunk requests are currently served by the 5 storage nodes, and there
are 9 more chunk requests that are randomly dispatched to and are
buffered in 5 local queues according to chunk placement, e.g., requests
B2 , B3 are only distributed to nodes {3, 4, 5}. Suppose that node 2
completes serving chunk request A2 . The next request in the node’s
local queue will move forward.
R A,1
2 R A,1
2

R 1A,1 R 1A,2 R 1B,1 R 1B,2 R 1A,1 R 1A,2 R 1B,1 R 1B,2

R A,2 R A,1 R B,1 R 3B,1 R B,2


R A,2
2
2 4 2 2

A,1 A,2
R R R A,2
R 3B,2
R B,1
2 R B,2
2
3 3 4

R 3B,1 R 3B,2
R 3A,1 R 3A,2 ……
R A,1
4 R A,2
4
Dispatch

(a) MDS scheduling (b) Probabilistic scheduling


Figure 4.1: Functioning of a probabilistic scheduling policy.

Definition 4.1. (Probabilistic scheduling) A Probabilistic scheduling


policy (i) dispatches each batch of chunk requests to appropriate nodes
with predetermined probabilities; (ii) each node buffers requests in a
local queue and processes in order.
It is easy to verify that such probabilistic scheduling ensures that at
most 1 chunk request from a batch to each appropriate node. It provides
an upper bound on average service latency for the optimal scheduling
since rebalancing and scheduling of local queues are not permitted. Let
P(Ai ) for all Ai ⊆ Si be the probability of selecting a set of nodes Ai
to process the |Ai | = ki distinct chunk requests1 .
1
It is easy to see that P(Ai ) = 0 for all Ai * Si and |Ai | = ki because such node
selections do not recover ki distinct chunks and thus are inadequate for successful
decode.
44 Probabilistic Scheduling Approach

Lemma 4.1. For given erasure codes and chunk placement, average
service latency of probabilistic scheduling with feasible probabilities
{P(Ai ) : ∀i, Ai } upper bounds the latency of optimal scheduling.

Clearly, the tightest upper bound can be obtained by minimizing


average latency of probabilistic scheduling over all feasible probabilities
P(Ai ) ∀Ai ⊆ Si and ∀i, which involves i nkii decision variables. We
P 

refer to this optimization as a scheduling subproblem. While it appears


prohibitive computationally, we will demonstrate next that the optimiza-
tion can be transformed into an equivalent form, which only requires
P
i ni variables. The key idea is to show that it is sufficient to consider
the conditional probability (denoted by πi,j ) of selecting a node j, given
that a batch of ki chunk requests of file i are dispatched. It is easy to
see that for given P(Ai ), we can derive πi,j by
X
πi,j = P(Ai ) · 1{j∈Ai } , ∀i (4.1)
Ai :Ai ⊆Si

where 1{j∈Ai } is an indicator function which equals to 1 if node j is


selected by Ai and 0 otherwise.
We first provide Farkas-Minkowski Theorem (Angell, 2002) that will
be used in this transformation.

Lemma 4.2. Farkas-Minkowski Theorem (Angell, 2002). Let A be an


m × n matrix with real entries, and x ∈ Rn and b ∈ Rm be 2 vectors. A
necessary and sufficient condition that A · x = b, x ≥ 0 has a solution
is that, for all y ∈ Rm with the property that AT · y ≥ 0, we have
hy, bi ≥ 0.

The next result formally shows that the optimization can be trans-
P
formed into an equivalent form, which only requires i ni variables.

Theorem 4.3. A probabilistic scheduling policy with feasible proba-


bilities {P(Ai ) : ∀i, Ai } exists if and only if there exists conditional
probabilities {πi,j ∈ [0, 1], ∀i, j} satisfying
m
X
πi,j = ki ∀i and πi,j = 0 if j ∈
/ Si . (4.2)
j=1
4.1. Probabilistic Scheduling 45

Proof. We first prove that the conditions m j=1 πi,j = ki ∀i and πi,j ∈
P

[0, 1] are necessary. πi,j ∈ [0, 1] for all i, j is obvious due to its definition.
Then, it is easy to show that
m
X m X
X
πi,j = 1{j∈Ai } P(Ai )
j=1 j=1 Ai ⊆Si
X X
= P(Ai )
Ai ⊆Si j∈Ai
X
= ki P(Ai ) = ki (4.3)
Ai ⊆Si

where the first step is due to (4.1), 1{j∈Ai } is an indicator function,


which is 1 if j ∈ Ai , and 0 otherwise. The second step changes the
order of summation, the last step uses the fact that each set Ai contain
P
exactly ki nodes and that Ai ⊆Si P(Ai ) = 1.
Next, we prove that for any set of πi,1 , . . . , πi,m (i.e., node selection
probabilities of file i) satisfying m j=1 πi,j = ki and πi,j ∈ [0, 1], there
P

exists a probabilistic scheduling scheme with feasible load balancing


probabilities P(Ai ) ∀Ai ⊆ Si to achieve the same node selection prob-
abilities. We start by constructing Si = {j : πi,j > 0}, which is a set
containing at least ki nodes, because there must be at least ki positive
probabilities πi,j to satisfy m
P
j=1 πi,j = ki . Then, we choose erasure code
length ni = |Si | and place chunks on nodes in Si . From (4.1), we only
need to show that when j∈Si πi,j = ki and πi,j ∈ [0, 1], the following
P

system of ni linear equations have a feasible solution P(Ai ) ∀Ai ⊆ Si :


X
1{j∈Ai } · P(Ai ) = πi,j , ∀j ∈ Si (4.4)
Ai ⊆Si

We prove the desired result using mathematical induction. It is easy to


show that the statement holds for ni = ki . In this case, we have a unique
solution Ai = Si and P(Ai ) = πi,j = 1 for the system of linear equations
(4.4), because all chunks must be selected to recover file i. Now assume
that the system of linear equations (4.4) has a feasible solution for
some ni ≥ ki . Consider the case with arbitrary |Si + {h}| = ni + 1 and
P
πi,h + j∈Si πi,j = ki . We have a system of linear equations:
X
1{j∈Ai } · P(Ai ) = πi,j , ∀j ∈ Si + {h} (4.5)
Ai ⊆Si +{h}
46 Probabilistic Scheduling Approach

Using the Farkas-Minkowski Theorem (Angell, 2002), a sufficient and


necessary condition that (4.5) has a non-negative solution is that, for
P
any y1 , . . . , ym and j yj πi,j < 0, we have
X
yj 1{j∈Ai } < 0 for some Ai ⊆ Si + {h}. (4.6)
j∈Si +{h}

Toward this end, we construct π̂i,j = πi,j + [u − πi,j ]+ for all j ∈ Si .


Here [x]+ = max(x, 0) is a truncating function and u is a proper water-
filling level satisfying
X
[u − πi,j ]+ = πi,h . (4.7)
j∈Si

It is easy to show that j∈Si π̂i,j = πi,h + j∈Si πi,j = ki and π̂i,j ∈
P P

[0, 1], because π̂i,j = max(u, πi,j ) ∈ [0, 1]. Here we used the fact that
u < 1 since ki = j∈Si π̂i,j ≥ j∈Si u ≥ ki u. Therefore, the system of
P P

linear equations in (4.4) with π̂i,j on the right hand side must have a
non-negative solution due to our induction assumption for ni = |Si |.
Furthermore, without loss of generality, we assume that yh ≥ yj for all
j ∈ Si (otherwise a different h can be chosen). It implies that
X X
yj π̂i,j = yj (πi,j + [u − πi,j ]+ )
j∈Si j∈Si
(a) X X
≤ yj πi,j + yh [u − πi,j ]+
j∈Si j∈Si
(b) X X
= yj πi,j + yh [u − πi,j ]+
j∈Si j∈Si

(c) X (d)
= yj πi,j + yh πi,h ≤ 0, (4.8)
j∈Si

where (a) uses yh ≥ yj , (b) uses that yh is independent of j, (c) follows


P
from (4.7) and the last step uses j yj πi,j < 0.
Applying the Farkas-Minkowski Theorem to the system of linear
equations in (4.4) with π̂i,j on the right hand side, the existence of a
non-negative solution (due to our induction assumption for ni ) implies
that j∈Si yj 1{j∈Ai } < 0 for some Âi ⊆ Si . It means that
P
X X
yj 1{j∈Âi } = yh 1{h∈Âi } + yj 1{j∈Âi } < 0. (4.9)
j∈Si +{h} j∈Si
4.2. Characterization of Mean Latency 47

/ Si and Âi ⊆ Si . This is exactly


The last step uses 1{h∈Âi } = 0 since h ∈
the desired inequality in (4.6). Thus, (4.5) has a non-negative solution
due to the Farkas-Minkowski Theorem. The induction statement holds
for ni + 1. Finally, the solution indeed gives a probability distribution
P P
since Ai ⊆Si +{h} P(Ai ) = j πi,j /ki = 1 due to (4.3). This completes
the proof.

The proof of Theorem 4.3 relies on Farkas-Minkowski Theorem


(Angell, 2002). Intuitively, m
P
j=1 πi,j = ki holds because each batch of
requests is dispatched to exact ki distinct nodes. Moreover, a node does
not host file i chunks should not be selected, meaning that πi,j = 0 if
j∈/ Si . Using this result, it is sufficient to study probabilistic scheduling
via conditional probabilities πi,j , which greatly simplifies our analysis. In
particular, it is easy to verify that under our model, the arrival of chunk
P
requests at node j form a Poisson Process with rate Λj = i λi πi,j ,
which is the superposition of r Poisson processes each with rate λi πi,j ,
µj is the service rate of node j. The resulting queuing system under
probabilistic scheduling is stable if all local queues are stable.

Corollary 4.4. The queuing system can be stabilized by a probabilistic


scheduling policy under request arrival rates λ1 , λ2 , . . . , λr if there exists
{πi,j ∈ [0, 1], ∀i, j} satisfying (4.2) and
X
Λj = λi πi,j < µj , ∀j. (4.10)
i

We let uniform probabilistic scheduling be defined as the case when


πi,j = ki /ni for j ∈ Si . If all ki = k, ni = n, m = n, r = 1, λ1 = λ,
µj = µ, we have the stable region as λ < nµ/k, which is the same as
for the case of Fork-Join Queues, while for general service times. In
contrast, this stability region is only valid for Fork-Join queues for only
exponential service times.

4.2 Characterization of Mean Latency

An exact analysis of the queuing latency of probabilistic scheduling is


still hard because local queues at different storage nodes are dependent
of each other as each batch of chunk requests are dispatched jointly.
48 Probabilistic Scheduling Approach

Since local queues at different storage nodes are dependent of each


other as each batch of chunk requests are jointly dispatched, the exact
analysis of the queuing latency of probabilistic scheduling is not tractable.
Thus, we will use probabilistic scheduling to bound the mean latency
and since probabilistic scheduling is a feasible strategy, the obtained
bound is an upper bound to the optimal strategy. We define Wi,j as
the random waiting time (sojourn time) in which a chunk request (for
file i) spends in the queue of node j. Typically, the latency of file i,
denoted as Qi , request is determined by the maximum latency that
ki chunk requests experience on distinct servers. These servers are
probabilistically scheduled with a prior known probabilities, i.e., πi,j .
Thus, we have

  
E[Qi ] , EWi,j EAi max Wi,j (4.11)
j∈Ai

where the first expectation EWj is taken over system queuing dynamics
and the second expectation EAi is taken over random dispatch decisions
Ai . Hence, we derive an upper-bound on the expected latency of a file
i, i.e., E[Qi ], as follows. Using Jensen’s inequality (Kuczma, 2009a), we
have for ti > 0

h i
eti E[Qi ] ≤ E eti Qi (4.12)

We notice from (4.12) that by bounding the moment generating


function of Qi , we are bounding the mean latency of file i. Then,

 
(a)
h i
E eti Qi = EAi ,Wi,j max eti Wi,j (4.13)
j∈Ai
  
ti Wi,j
= EAi EWi,j max e |Ai (4.14)
j∈Ai
 
(b) h i
eti Wi,j 
X
≤ EAi  EWi,j (4.15)
j∈Ai
4.2. Characterization of Mean Latency 49

 
h i
EAi  EWi,j eti Wi,j 1(j∈Ai ) 
X
= (4.16)
j
h i h i
EWi,j eti Wi,j EAi 1(j∈Ai )
X
= (4.17)
j
h i
EWi,j eti Wi,j P(j ∈ Ai )
X
= (4.18)
j
(c)
h i
πi,j EWi,j eti Wi,j
X
= (4.19)
j

where (a)Xfollows from (4.11) and (4.12), (b) follows by replacing the
max by and (c) follows by probabilistic scheduling. We note that
j∈Ai
j∈Ai
the only inequality here is for replacing the maximum by the sum.
However, since this term will be inside the logarithm for the mean
latency, the gap between the term and its bound becomes additive
rather than multiplicative. Since the request pattern is Poisson and the
service time is general distributed, the Laplace-Stieltjes Transform of
the waiting time Wi,j can be characterized using Pollaczek-Khinchine
formula for M/G/1 queues (Zwart and Boxma, 2000) as follows
h i (1 − ρj ) ti Zj (ti )
E eti Wi,j = (4.20)
ti − Λj (Zj (ti ) − 1)
h i
d
where ρj = Λj E [Xj ] = Λj dt Zj (ti ) |ti =0 and Zj (ti ) is the moment
generating function of the chunk service time. Plugging (4.20) in (4.19)
and substituting in (4.12), we get the following Theorem.

Theorem 4.5. The mean latency for file i is bounded by


 
m
1 X (1 − ρj )ti Zj (ti ) 
E[Qi ] ≤ log  πi,j (4.21)
ti j=1
ti − Λj (Zj (ti ) − 1)
h i
d
for any ti > 0, ρj = Λj dt Zj (ti ) |ti =0 , ρj < 1, and Λj (Zj (ti ) − 1) <
ti .

Note that the above Theorem holds only in the range of ti when
ti − Λj (Zj (t) − 1) > 0. Further, the server utilization ρj must be less
than 1 for stability of the system.
50 Probabilistic Scheduling Approach

We now specialize this result for shifted exponential distribution. Let


the service time distribution from server j, Xj , has probability density
function fXj (x), given as

α e−αj (x−βj ) for x ≥ βj
j
fXj (x) = (4.22)
0 for x < βj

Exponential distribution is a special case for βj = 0. The moment


generating function, Zj (t) is given as
αj βj t
Zj (t) = e for t < αj . (4.23)
αj − t

Using (4.23) in Theorem 4.5, we have

Corollary 4.6. The mean latency for file i for Shifted Exponential
Service time at each server is bounded by
 
m
1 X (1 − ρj )ti Zj (ti ) 
E[Qi ] ≤ log  πi,j (4.24)
ti j=1
ti − Λj (Zj (ti ) − 1)
 
1
for any ti > 0, ρj = Λj αj + βj , ρj < 1, ti (ti − αj + Λj ) +
αj βj t
Λj αj (eβj ti − 1) < 0, and Zj (t) = αj −t e .

Further, the exponential distribution has βj = 0, and the result for


exponential follows as a special case. We note that the bound presented
here has been shown to outperform that in (Xiang et al., 2016; Xiang et
al., 2014) in (Al-Abbasi and Aggarwal, 2020). The comparison between
the two is further illustrated in Figure 4.2. The bound in (Xiang et al.,
2016; Xiang et al., 2014) is further shown to be better than that in
(Joshi et al., 2014). Moreover, replication coding follows as a special
case when ki = 1 and thus the proposed upper bound for file download
can be used to bound the latency of replication based systems by setting
ki = 1.
For completeness, we will also present the latency bound provided
in (Xiang et al., 2016; Xiang et al., 2014).
4.3. Characterization of Tail Latency 51

Theorem 4.7. The mean latency of file i is bounded by


1X
 q 
E[Qi ] ≤ zi + πi,j E[Wi,j ] − zi + (E[Wi,j ] − zi )2 + Var[Wi,j ] ,
2 j∈A
i
(4.25)
where E[Wi,j ] and Var[Wi,j ] are given from calculating moments of
the moment generating function in (4.20).

Proof.
  
E[Qi ] = EWi,j EAi max Wi,j
j∈Ai
" "  + ##
≤ EWi,j EAi zi + max Wi,j − zi
j∈Ai
  
+
= EWi,j EAi zi + max [Wi,j − zi ]
j∈Ai
  

[Wi,j − zi ]+ 
X
≤ EWi,j EAi zi +
j∈Ai
  
1X
= EWi,j EAi zi + [Wi,j − zi + |Wi,j − zi |]
2 j∈A
i
 
1X
= EWi,j zi + πi,j [Wi,j − zi + |Wi,j − zi |]
2 j∈A
i

We note that E[Wi,j ] can be found from (4.20). Further, E[|Wi,j − zi |]


can be upper bounded as
q
E[|Wi,j − zi |] ≤ (E[Wi,j ] − zi )2 + Var[Wi,j ], (4.26)

where both E[Wi,j ] and Var[Wi,j ] can be found using (4.20).

4.3 Characterization of Tail Latency

Latency tail probability of file i is defined as the probability that


the latency tail is greater than (or equal) to a given number σ, i.e.,
Pr(Qi ≥ σ). Since evaluating Pr(Qi ≥ σ) in closed-form is hard (Lee
et al., 2017; Xiang et al., 2016; Xiang et al., 2014; Chen et al., 2014a;
52 Probabilistic Scheduling Approach

Huang et al., 2012b), we derive a tight upper bound on the latency tail
probability using Probabilistic Scheduling as follows (Aggarwal et al.,
2017b; Al-Abbasi et al., 2019a).
 
(d)
Pr (Qi ≥ σ) = Pr max Wi,j ≥ σ (4.27)
j∈Ai
  
= EAi EWi,j max Wi,j ≥ σ |Ai (4.28)
j∈Ai
 
= EAi ,Wi,j max 1(Wi,j ≥σ) (4.29)
j∈Ai
 
(e) X h i
≤ EAi ,Wi,j  1(Wi,j ≥σ)  (4.30)
j∈Ai
 
X
= EAi  [Pr(Wi,j ≥ σ)] (4.31)
j∈Ai
(f ) X
= πi,j [Pr(Wi,j ≥ σ)] (4.32)
j

where (d) follows from (4.11)2 , (e) follows by bounding maxj∈Ai by


j∈Ai and (f ) follows from probabilistic scheduling. To evaluate Pr(Wi,j ≥
P

σ), we use Markov Lemma, i.e.,

E[eti,j Wi,j ]
Pr(Wi,j ≥ σ) ≤
eti,j σ
(g) 1 (1 − ρj ) ti,j Zj (ti,j )
= ti,j σ (4.33)
e ti,j − Λj (Zj (ti,j ) − 1)
where (g) follows from (4.20). Plugging (4.33) in (4.32), we have the
following Lemma.
Theorem 4.8. Under probabilistic scheduling, the latency tail proba-
bility for file i, i.e., Pr (Qi ≥ σ) is bounded by

X πi,j (1 − ρj )ti,j Zj (ti,j )


Pr (Qi ≥ σ) ≤ (4.34)
j
eti,j σ ti,j − Λj (Zj (ti,j ) − 1)
2
As the time to reconstruct the file i is the maximum of the time of reconstructing
all the chunks from the set Ai .
4.4. Characterization of Asymptotic Latency 53

h i
d
for any ti,j > 0, ρj = Λj dt Zj (ti,j ) ti,j =0 , ρj < 1, and Λj (Zj (ti,j )−1) <
ti,j .

We now specialize the result to the case where the service times of
the servers are given in (4.22) in the following corollary.

Corollary 4.9. Under probabilistic scheduling and shifted exponential


service times, the latency tail probability for file i, i.e., Pr (Qi ≥ σ) is
bounded by

X πi,j (1 − ρj )ti,j Zj (ti,j )


Pr (Qi ≥ σ) ≤ (4.35)
j
eti,j σ ti,j − Λj (Zj (ti,j ) − 1)
 
1
for any ti,j > 0, ρj = Λj αj + βj , ρj < 1, ti,j (ti,j − αj + Λj ) +
αj βj t
Λj αj (eβj ti,j − 1) < 0, and Zj (t) = αj −t e .

4.4 Characterization of Asymptotic Latency

In this section, we consider homogeneous servers, and all files having


same size and erasure code (n, k). Further, we assume that the number
of servers m = n. In order to understand the asymptotic delay charac-
teristics, we also assume that πij = k/n, which chooses the k servers
uniformly at random. Jobs arrive over time according to a Poisson
process with rate Λ(n) , and each job (file request) consists of k (n) tasks
with k (n) ≤ n. Upon arrival, each job picks k (n) distinct servers uni-
formly at random from the n servers (uniform probabilistic scheduling)
and sends one task to each server. We assume that Λ(n) = nλ/k (n)
for a constant λ, where the constant λ is the task arrival rate to each
individual queue. Since different jobs choose servers independently, the
task arrival process to each queue is also a Poisson process, and the
rate is λ. The service times of tasks are i.i.d. following a c.d.f. G with
expectation 1/µ and a finite second moment. We think of the service
time of each task as being generated upon arrival: each task brings a
required service time with it, but the length of the required service time
is revealed to the system only when the task is completed. The load of
each queue, ρ = λ/µ, is then a constant and we assume that ρ < 1.
54 Probabilistic Scheduling Approach

(n)
As mentioned earlier, each queue is an M/G/1 queue. Let Wi (t)
denote the workload of server i’s queue at time t, i.e., the total remaining
service time of all the tasks in the queue, including the partially served
task in service. So the workload of a queue is the waiting time of an
incoming task to the queue before the server starts serving it. Let
(n) (n) (n)
W(n) (t) = W1 (t), W2 (t), . . . , Wn (t) . Then the workload process,


(W(n) (t), t ≥ 0), is Markovian and ergodic. The ergodicity can be proven
using the rather standard Foster-Lyapunov criteria (Meyn and Tweedie,
1993), so we omit it here. Therefore, the workload process has a unique
stationary distribution and W(n) (t) ⇒ W(n) (∞) as t → ∞.
Let a random variable T (n) represent this steady-state job delay.
Specifically, the distribution of T (n) is determined by the workload
W(n) (∞) in the following way. When a job comes into the system, its
tasks are sent to k (n) queues and experience the delays in these queues.
Since the queueing processes are symmetric over the indices of queues,
without loss of generality, we can assume that the tasks are sent to
the first k (n) queues for the purpose of computing the distribution of
T (n) . The delay of a task is the sum of its waiting time and service
(n)
time. So the task delay in queue i, denoted by Ti , can be written as
(n) (n)
Ti = Wi (∞) + Xi with Xi being the service time. Recall that the
Xi ’s are i.i.d.∼ G and independent of everything else. Since the job is
completed only when all its tasks are completed,
n o
(n) (n) (n)
T (n) = max T1 , T2 , . . . , Tk(n) . (4.36)

Let T̂ (n) be defined as the job delay given by independent task delays.
Specifically, T̂ (n) can be expressed as:
n o
(n) (n) (n)
T̂ (n) = max T̂1 , T̂2 , . . . , T̂k(n) , (4.37)
(n) (n) (n) (n)
where T̂1 , T̂2 , . . . , T̂k(n) are i.i.d. and each T̂i has the same distri-
(n) (n)
bution as Ti . Again, due to symmetry, all the Ti ’s have the same
(n)
distribution. Let F denote the c.d.f. of Ti , whose form is known from
the queueing theory literature. Then, we have the following explicit
form for T̂ (n) :
(n)
 
Pr T̂ (n) ≤ τ = (F (τ ))k , τ ≥ 0. (4.38)
4.4. Characterization of Asymptotic Latency 55

Remark 4.1. We note that even though the authors of (Wang et al.,
2019) related their results to Fork-Join queue, but need n = k, while
the results naturally hold for uniform probabilistic scheduling rather
than Fork-Join queues.

We first consider an asymptotic regime where the number of servers,


n, goes to infinity, and the number of tasks in a job, k (n) , is allowed
to increase with n. We establish the asymptotic independence of any
k (n) queues under the condition k (n) = o(n1/4 ). This greatly generalizes
the asymptotic-independence type of results in the literature where
asymptotic independence is shown only for a fixed constant number
of queues. As a consequence of our independence result, the job delay
converges to the maximum of independent task delays. More precisely,
we show that the distance between the distribution of job delay, T (n) ,
and the distribution of the job delay given by independent task delays,
T̂ (n) , goes to 0. This result indicates that assuming independence among
the delays of a job’s tasks gives a good approximation of job delay when
the system is large. Again, due to symmetry, we can focus on the first
k (n) queues without loss of generality.

Theorem 4.10 ((Wang et al., 2019)). Consider an n-server system in


(n)
the uniform probabilistic scheduling with k (n) = o(n1/4 ). Let π (n,k )
(n)
denote the joint distribution of the steady-state workloads W1 (∞),
(n) (n) (n)
W2 (∞), . . . , Wk(n) (∞), and π̂ (k ) denote the product distribution of
k (n) i.i.d. random variables, each of which follows a distribution that is
(n)
the same as the distribution of W1 (∞). Then
 (n) ) (n) )

lim dT V π (n,k , π̂ (k = 0. (4.39)
n→∞

Consequently, the steady-state job delay, T (n) , and the job delay given
by independent task delays as defined in (4.37), T̂ (n) , satisfy

lim sup Pr T (n) ≤ τ − Pr T̂ (n) ≤ τ


 
= 0. (4.40)
n→∞ τ ≥0

For the special case where the service times are exponentially dis-
tributed, the job delay asymptotics have explicit forms presented in
Corollary 4.11 below.
56 Probabilistic Scheduling Approach

Corollary 4.11 ((Wang et al., 2019)). Consider an n-server system in


the uniform probabilistic scheduling model with k (n) = o(n1/4 ), job
arrival rate Λ(n) = nλ/k (n) , and exponentially distributed service times
with mean 1/µ. Then the steady-state job delay, T (n) , converges as:
 k(n)
lim sup Pr T (n) ≤ τ − 1 − e−(µ−λ)τ

= 0, (4.41)
n→∞ τ ≥0

Specifically, if k (n) → ∞ as n → ∞, then


T (n)
⇒ 1, as n → ∞, (4.42)
Hk(n) /(µ − λ)

where Hk(n) is the k (n) -th harmonic number, and further,

E T (n)
 
lim = 1. (4.43)
n→∞ H (n) /(µ − λ)
k

The results above characterize job delay in the asymptotic regime


where n goes to infinity. In Theorem 4.12 below, we study the non-
asymptotic regime for any n and any k (n) with k (n) = k ≤ n, and we
establish the independence upper bound on job delay.

Theorem 4.12 ((Wang et al., 2019)). Consider an n-server system in


the uniform probabilistic scheduling model with k (n) = k ≤ n. Then the
steady-state job delay, T (n) , is stochastically upper bounded by the job
delay given by independent task delays as defined in (4.37), T̂ (n) , i.e.,

T (n) ≤st T̂ (n) , (4.44)

where “≤st ” denotes stochastic dominance. Specifically, for any τ ≥ 0,


(n)
Pr T (n) > τ ≤ Pr T̂ (n) > τ = 1 − (F (τ ))k
 
. (4.45)

We omit proofs for the results, while refer the reader to (Wang et al.,
2019) for the detailed proofs in this subsection.

4.5 Proof of Asymptotic Optimality for Heavy Tailed Service Rates

In this section, we quantify the tail index of service latency for arbi-
trary erasure-coded storage systems for Pareto-distributed file size and
4.5. Proof of Asymptotic Optimality for Heavy Tailed Service Rates 57

exponential service time. First, we derive the distribution of the waiting


time from a server. Next, we show that this time is a heavy-tailed with
tail-index α − 1. Then, we prove that the probabilistic scheduling based
algorithms achieve optimal tail index.

4.5.1 Assumptions and Chunk Size Distribution

We assume that the arrival of client requests for each file i of size kLi
Mb is assumed to form an independent Poisson process with a known
rate λi . Further, the chunk size Cei Mb is assumed to have a heavy
tail and follows a Pareto distribution with parameters (xm , α) with
shape parameter α > 2 (implying finite mean and variance). Thus, the
complementary cumulative distribution function (c.c.d.f.) of the chunk
size is given as

(x /x)α x ≥ x
m m
Pr(Cei > x) = (4.46)
0 x < xm

For α > 1, the mean is E[Cei ] = αxm /(α − 1). The service time per Mb
at server j, Xj is distributed as an exponential distribution the mean
service time 1/µj . Service time for a chunk of size C Mb is Xj C.
We will focus on the tail index of the waiting time to access each file.
In order to understand the tail index, let the waiting time for the files TW
has Pr(TW > x) of the order of x−d for large x, then the tail index is d.
More formally, the tail index d is defined as limx→∞ − log Pr(Tlog x
W >x)
. This
index gives the slope of the tail in the log-log scale of the complementary
CDF.

4.5.2 Waiting Time Distribution for a Chunk from a Server

In this Section, we will characterize the Laplace Stieltjes transform of


the waiting time distribution from a server, assuming that the arrival
of requests at a server is Poisson distributed with mean arrival rate Λj .
We first note that the service time per chunk on server j is given as
Bj = Xj Cei , where Cei is distributed as Pareto Distribution given above,
and Xj is exponential with parameter µj . Using this definition, we find
58 Probabilistic Scheduling Approach

that

Pr(Bj < y)
= Pr(Xj Cei < y)
Z ∞
1
= Pr(Xj < y/x)αxαm dx
x=xm xα+1
Z ∞
1
= (1 − exp(−µj y/x)) αxαm dx
x=xm xα+1
Z ∞
1
= 1− exp(−µj y/x)αxαm dx (4.47)
x=xm xα+1
Substitute t = µj y/x, and then dt = −µj y/x2 dx. Thus,

Pr(Bj > y)
Z ∞
1
= exp(−µj y/x)αxαm dx
x=xm xα+1
tα−1
Z µj y/xm
= exp(−t)αxαm dt
t=0 (µj y)α
1 µj y/xm
Z
= α(xm /µj )α exp(−t)tα−1 dt
y α t=0
= α(xm /µj )α γ(α, µj y/xm )/y α , (4.48)

where γ denote lower incomplete gamma function, given as γ(a, x) =


R x a−1
0 u exp(−u)du.
Since Pr(Bj > y) = V (y)/y α , where V (y) = α(xm /µj )α γ(α, µj y/xm )
is a slowly varying function, the asymptotic of the waiting time in heavy-
tailed limit can be calculated using the results in (Olvera-Cravioto et al.,
2011) as

Λ x1−α
Pr(W > x) ≈ V (x). (4.49)
1−ρα−1
Thus, we note that the waiting time from a server is heavy-tailed
with tail-index α − 1. Thus, we get the following result.

Theorem 4.13. Assume that the arrival rate for requests is Poisson
distributed, service time distribution is exponential and the chunk size
distribution is Pareto with shape parameter α. Then, the tail index for
the waiting time of chunk in the queue of a server is α − 1.
4.5. Proof of Asymptotic Optimality for Heavy Tailed Service Rates 59

4.5.3 Probabilistic Scheduling Achieves Optimal Tail Index


Having characterized the tail index of a single server with Poisson arrival
process and Pareto distributed file size, we will now give the tail index
for a general distributed storage system. The first result is that any
distributed storage system has a tail index of at most α − 1. For Poisson
arrivals, Pareto chunk sizes, and exponential chunk service times, the
tail index is at most α − 1.

Theorem 4.14. The tail index for distributed storage system is at most
α − 1.

Proof. In order to show this result, consider a genie server which is


combination of all the n servers together. The service rate of this server
is nj=1 µi per Mb. As a genie, we also assume that only one chunk is
P

enough to be served. In this case, the problem reduces to the single


server problem with Poisson arrival process and the result in Section
VI shows that the tail index is α − 1. Since even in the genie-aided case,
the tail index is α − 1, we cannot get any higher tail index.

The next result shows that the probabilistic scheduling achieves the
optimal tail index.

Theorem 4.15. The optimal tail index of α − 1 is achieved by proba-


bilistic scheduling.

Proof. In order to show that probabilistic scheduling achieves this tail


index, we consider the simple case where all the n-choose-k sets are
chosen equally likely for each file. Using this, we note that each server
is accessed with equal probability of πij = k/n. Thus, the arrival rate
at the server is Poisson and the tail index of the waiting time at the
server is α − 1.
The overall latency of a file chunk is the sum of the queue waiting
time and the service time. Since the service time has tail index of α, the
overall latency for a chunk is α − 1. Probability that latency is greater
than x is determined by the k th chunk to be received. The probability
is upper bounded by the sum of probability over all servers that waiting
time at a server is greater than x. This is because Pr(maxj (Aj ) ≥
60 Probabilistic Scheduling Approach

x) ≤ j Pr(Aj ≥ x) even when the random variables Aj are correlated.


P

Finite sum of terms, each with tail index α − 1 will still give the term
with tail index α − 1 thus proving that the tail index with probabilistic
scheduling is α − 1.

We note that even though we assumed a total of n servers, and the


erasure code being the same, the above can be extended to the case when
there are more than n servers with uniform placement of files and each
file using different erasure code. The upper bound argument does not
change as long as number of servers are finite. For the achievability with
probabilistic scheduling, we require that the chunks that are serviced
follow a Pareto distribution with shape parameter α. Thus, as long as
placed files on each server are placed with the same distribution and the
access pattern does not change the nature of distribution of accessed
chunks from a server, the result holds in general.

4.6 Simulations

We define q = (πi,j ∀i = 1, · · · , r and j = 1, · · · , m), and t = te1 , te2 , . . . ,

ter ; t1 , t2 , . . . , tr . Note that the values of ti ’s used for mean latency and
tail latency probability may be different and the parameters te and t
indicate these parameters for the two cases, respectively. Our goal is
to minimize the two proposed QoE metrics over the choice of access
decisions and auxiliary bound parameters. The objective can be mod-
eled as a convex combination of the two QoE metrics since this is a
multi-objective optimization.
To incorporate for weighted fairness and differentiated services,
we assign a positive weight wi for each QoE for file i. Without loss of
generality, each file i is weighted by the arrival rate λi in the objective (so
larger arrival rates are weighted higher). However, any other weights can
be incorporated to accommodate for weighted fairness or differentiated
P
services. Let λ = i λi be the total arrival rate. Hence, wi = λi /λ is
the ratio of file i requests. The first objective is the minimization of the
mean latency, averaged over all the file requests, and is given as i λλi Qi .
P

The second objective is the minimization of latency tail probability,


4.6. Simulations 61

averaged over all the file requests, and is given as i λλi Pr (Qi ≥ σ).
P

By using a special case of the expressions for the mean latency and
the latency tail probability in Sections 4.2 and 4.3, optimization of
a convex combination of the two QoE metrics can be formulated as
follows.

  
r m
X λi 1
θ log 
X (1 − ρj )ti Zj (ti ) 
e e
min qi,j  
i=1 λ ti ti − Λj Zj (tei ) − 1
e
j=1
e

m
X qi,j (1 − ρj )ti Zj (ti ) 
+(1 − θ) (4.50)
j=1 eti σ ti − Λj (Zj (ti ) − 1)
s.t.

αj
Zj (ti ) = eβj ti , ∀j (4.51)
αj − ti
Λj
ρj = + Λj βj < 1 , ∀j (4.52)
αj
X
Λj = λi qi,j , ∀j (4.53)
i
X
qi,j = ki , ∀i (4.54)
j

qi,j = 0, j ∈
/ Gi , ∀i, j (4.55)
qi,j ∈ [0, 1] , ∀i, j (4.56)
tei > 0 , ∀i (4.57)
ti > 0 , ∀i (4.58)
tei (tei − αj + Λj ) + Λj αj (eβjeti − 1) < 0 (4.59)
ti (ti − αj + Λj ) + Λj αj (eβj ti − 1) < 0 (4.60)

var q , t,
where θ ∈ [0, 1] is a trade-off factor that determines the relative sig-
nificance of mean latency and latency tail probability in the objective
function. By changing θ from θ = 1 to θ = 0, the solution for (4.50)
62 Probabilistic Scheduling Approach

spans the solutions that minimize the mean latency to ones that mini-
mize the tail latency probability. Note that constraint (4.52) gives the
load intensity of server j. Constraint (4.53) gives the aggregate arrival
rate Λj for each node for the given probabilistic scheduling probabili-
ties qi,j and arrival rates λi . Constraints (4.54)-(4.56) guarantee that
the scheduling probabilities are feasible. Also, Constraints (4.57)-(4.60)
ensure that the moment generating function given in (4.20) exists. Note
that the optimization over q helps decrease the overall latency which
gives significant flexibility over choosing the lowest-queue servers for
accessing the files. We further note that the optimization problem in
(4.50) is non-convex as, for instance, Constraint (4.59) is non-convex in
(q, t) jointly. In order to solve the problem, we can use an alternating op-
timization that divides the problem into two subproblems that optimize
one variable while fixing the another. In order to solve each subproblem,
we use the iNner cOnVex Approximation (NOVA) algorithm proposed
in (Scutari et al., 2017), which guarantees convergence to a stationary
point. Based on this, it can be shown that the alternating optimization
converges to a stationary point.
To validate our proposed algorithm for joint mean-tail latency and
evaluate its performance, we simulate our algorithm in a distributed
storage system of m = 12 distributed nodes, r = 1000 files, all of size
200 MB and using (7, 4). However, our model can be used for any given
number of storage servers, any number of files, and for any erasure
coding setting. We consider a shifted-exponential distribution for the
chunk service times as it has been shown in real system measurements
on Tahoe and Amazon S3 servers (Baccelli et al., 1989; Chen et al.,
2014a; S3, n.d.). The service time parameters αj and βj are shown in
Table 4.1. Unless otherwise explicitly stated, the arrival rate for the first
500 files is 0.002s−1 while for the next 500 files is set to be 0.003s−1 .
Node 1 Node 2 Node 3 Node 4 Node 5 Node 6
αj 18.23 24.06 11.88 17.06 20.19 23.91
Node 7 Node 8 Node 9 Node 10 Node 11 Node 12
αj 27.01 21.39 9.92 24.96 26.53 21.80

Table 4.1: Storage Node Parameters Used in our Simulation (Shift βj = 10 msec, ∀j
and rate α in 1/s).
4.6. Simulations 63

In order to initialize our algorithm, we use a random placement of


each file on 7 out of the 12 servers. Further, we set qi,j = k/n on the
placed servers with ti = 0.01 ∀i and j ∈ Gi . However, these choices of
qi,j and ti may not be feasible. Thus, we modify the initialization to be
closest norm feasible solution.
We compare the proposed approach with two baselines.

1. PSP (Projected Service-Rate Proportional Access) Policy: The access probabilities q are assigned proportional to the service rates of the storage nodes, i.e., q_{i,j} = k_i \mu_j / \sum_j \mu_j, where \mu_j = 1/(1/\alpha_j + \beta_j). This policy thus assigns requests to servers in proportion to their service rates. These access probabilities are projected onto the feasible region of (4.50) to ensure stability of the storage system (see the sketch after this list). With these fixed access probabilities, the QoE metrics are optimized over the auxiliary variables t using NOVA.

2. PEA (Projected Equal Access) Policy: In this strategy, we set q_{i,j} = k_i/n_i on the placed servers, with t_i = 0.01 for all i and j ∈ G_i. We then project q onto the closest (in norm) feasible solution given the above values of t. Finally, an optimization of the objective over t is performed using NOVA.
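For concreteness, the following small sketch (an illustration only; the placement G_i and the projection step are simplified assumptions) shows how the PEA and PSP access probabilities could be initialized for one file under the (7, 4) code and the rates of Table 4.1:

    import numpy as np

    alpha = np.array([18.23, 24.06, 11.88, 17.06, 20.19, 23.91,
                      27.01, 21.39, 9.92, 24.96, 26.53, 21.80])   # rates from Table 4.1
    beta = np.full(12, 0.010)                                     # 10 msec shift
    G_i = np.array([0, 1, 2, 4, 6, 8, 10])                        # hypothetical placement of file i
    k_i, n_i = 4, 7

    # PEA: equal access on the placed servers.
    q_pea = np.zeros(12)
    q_pea[G_i] = k_i / n_i

    # PSP: access proportional to service rates mu_j = 1/(1/alpha_j + beta_j),
    # normalized so the probabilities on the placed servers sum to k_i.
    mu = 1.0 / (1.0 / alpha + beta)
    q_psp = np.zeros(12)
    q_psp[G_i] = k_i * mu[G_i] / mu[G_i].sum()
    q_psp = np.clip(q_psp, 0.0, 1.0)   # crude stand-in for the full projection step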

Mean Latency: We first let θ = 1. We also compare with a third policy, which is based on optimizing the mean latency upper bound in Theorem 4.7. Figure 4.2 plots the effect of different arrival rates on the upper bound of the mean latency, where we compare our proposed algorithm with the three other policies. Here, the arrival rate of each file λi is varied from 0.2×λi to 1.2×λi, where λi is the base arrival rate. We note that our proposed algorithm outperforms all these strategies for the QoE metric of mean latency. Thus, both the access probabilities and the per-file auxiliary variables are important for the reduction of mean latency. We also note that uniformly accessing servers (PEA) and simple service-rate-based scheduling (PSP) are unable to adapt the requests to factors such as arrival rates and latency weights, thus leading to much higher latency. As expected, the mean latency increases with the arrival rates. However, at high arrival rates, we see a significant reduction in mean latency for our proposed approach. For example, we see, at the


Figure 4.2: Weighted mean latency for different file arrival rates. We vary the
arrival rate of file i from 0.2 × λi to 1.2 × λi , where λi is the base arrival rate.

highest arrival rate, an approximately 25% reduction in weighted mean latency as compared to the approach proposed in (Xiang et al., 2016; Xiang et al., 2014), given in Theorem 4.7.
Tail Latency: With λi as the base arrival rates and σ = 80 seconds, we increase the arrival rate of all files from 0.3λi to 2.1λi and
plot the weighted latency tail probability in Figure 4.3. We note that
our algorithm assigns differentiated latency for different files to keep
low weighted latency tail probability. We also observe that our proposed
algorithm outperforms all strategies for all arrival rates. For example,
at the highest arrival rate, the proposed approach performs much better
compared to (Aggarwal et al., 2017b; Al-Abbasi et al., 2019a), i.e.,
a significant reduction in tail probability from 0.04 to 0.02. Hence,
reducing the latency of the high arrival rate files and exploiting the role
of auxiliary variables result in reducing the overall weighted latency tail
probability.
Tradeoff: We investigate the tradeoff between weighted mean
latency and weighted latency tail probability in Figure 4.4. Intuitively,
if the mean latency decreases, the latency tail probability also reduces.
Thus, one may wonder whether the optimal point for decreasing the


Figure 4.3: Weighted latency tail probability for different file arrival rates. We vary
the arrival rate of file i from 0.3 × λi to 2.1 × λi , where λi is the base arrival rate.

mean latency and the latency tail probability is the same? From Figure
4.4, we answer this question negatively since for r = 1000 and m = 12,
we find out that the optimal mean latency is approximately 43% lower
as compared to the mean latency at the value of (q,t) that optimizes
the weighted latency tail probability. Hence, an efficient tradeoff point
between the two QoE metrics can be chosen based on the point on the
curve that is appropriate for the clients.

4.7 Notes and Open Problems

Probabilistic scheduling for erasure-coded storage systems was first proposed in (Xiang et al., 2016; Xiang et al., 2014), where the mean latency
was characterized. The theoretical analysis on joint latency-plus-cost
optimization is evaluated in Tahoe (B. Warner and Kinninmont, 2012),
which is an open-source, distributed file system based on the zfec erasure
coding library for fault tolerance. The mean latency expressions are
further extended in (Al-Abbasi and Aggarwal, 2018b). Differentiated
latency in erasure-coded storage by investigating weighted queue and
priority queue policies was considered in (Xiang et al., 2015b; Xiang
et al., 2017). The problem of erasure-coded storage in a data center
network needs to account for the limited bandwidth available at both


Figure 4.4: Tradeoff between weighted mean latency and weighted latency tail
probability obtained by varying θ in the objective function given by (4.50). We vary
θ (coefficient of weighted mean latency) from θ = 10−4 to θ = 10−6 . These values are
chosen carefully to bring the two QoE metrics to a comparable scale, since weighted
mean latency is orders of magnitude higher than weighted latency tail probability.

top-of-the-rack and aggregation switches, and differentiated service re-


quirements of the tenants. This is accounted for via efficient splitting of
network bandwidth among different intra- and inter-rack data flows for
different service classes in line with their traffic statistics (Xiang et al.,
2015a; Xiang et al., 2019). Erasure coding can lead to new caching
designs, for which the latency has been characterized (Aggarwal et al.,
2016; Aggarwal et al., 2017a). The proposed approach is prototyped
using Ceph, an open-source erasure-coded storage system (Weil et al.,
2006) and tested on a real-world storage testbed with an emulation of
real storage workload, as will be detailed in Chapter 7.
The evaluation of tail latency in erasure-coded storage systems using
probabilistic scheduling was first considered in (Aggarwal et al., 2017b).
This was further extended in (Al-Abbasi et al., 2019a), where the prob-
abilistic scheduling-based algorithms were shown to be (asymptotically)
optimal since they are able to achieve the exact tail index. The analysis in this chapter further extends these works and was shown above to outperform these previous analyses. These extended results appear
for the first time in this monograph. The authors of (Wang et al., 2019)

considered the asymptotic regime in n for the case of uniform probabilistic scheduling.
The results for mean and tail latency in this Chapter have been
extended from the works above, with an aim of giving a concise repre-
sentation for a general service process. The approach in this chapter has
also been used in (Al-Abbasi and Aggarwal, 2020), where TTL based
caching is also considered, and (Al-Abbasi and Aggarwal, 2018d), where
the results are extended to stall duration (and will be covered in detail
in Chapter 6).
The approach could be further extended in the following directions:

1. Placement of multiple chunks on the same node: This case arises when a group of storage nodes share a single bottleneck (e.g., outgoing bandwidth at a regional datacenter) and must be modeled by a single queue, or when, in small clusters, the number of storage nodes is less than the number of created file chunks (i.e., ni > m). As a result, multiple chunk requests corresponding to the same file request can be submitted to the same queue, which processes the requests sequentially and results in dependent chunk service times. The analysis in this chapter can be extended along the lines of (Xiang et al., 2016).

2. File can be retrieved from more than k nodes: We first note that a file can be retrieved by obtaining F_i/d_i amount of data from each of d_i ≥ k_i nodes with the same placement and the same (n_i, k_i) MDS code. To see this, consider that the content at each node is subdivided into B = \binom{d_i}{k_i} sub-chunks (we assume that each chunk can be perfectly divided and ignore the effect of non-perfect division). Let L = {L_1, · · · , L_B} be the list of all B combinations of the d_i servers such that each combination is of size k_i. In order to access the data, we get the m-th sub-chunk from all the servers in L_m, for all m = 1, 2, · · · , B (see the sketch at the end of this list). Thus, the total amount of data retrieved is F_i, accessed evenly from all the d_i nodes. Moreover, we have enough data to decode since k_i sub-chunks are available for each m and we assume a linear MDS code.

In this case, a smaller amount of data is obtained from a larger number of nodes. Obtaining data from more nodes means waiting for a higher order statistic, which increases latency, while the smaller amount of data from each node allows higher parallelization, which decreases latency. The number of nodes to access can thus be optimized. The analysis of the mean and tail latency can be extended following the approach in (Xiang et al., 2016).

3. Asymptotic Independence Results for Heterogeneous Files and Servers: We note that the result in Section 4.4 states that the steady-state job delay is upper bounded by the delay obtained with independent task delays. Further, the steady-state job delay was characterized for large n. However, these results hold only when all files have the same size and use the same erasure code. Further, the servers are assumed to be homogeneous, and the probabilistic scheduling probabilities are assumed to be equal. Extending these results when such assumptions do not hold is open.

4. Efficient Caching Mechanisms: Efficient caching can help


reduce both mean and tail latency. Different caching mechanisms
based on the Least-Recently-Used strategy and its adaptations have
been studied (Al-Abbasi and Aggarwal, 2020; Friedlander and
Aggarwal, 2019). Erasure-coded mechanisms of caching have also
been explored (Aggarwal et al., 2017a). In an independent line of
work, coded caching strategies have been proposed which use a
single central server (Pedarsani et al., 2015), with extensions to
distributed storage (Luo et al., 2019). Integrating efficient caching
mechanisms with distributed storage and evaluating them in terms
of latency is an interesting problem.
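Returning to the sub-chunking argument in item 2 above, the following sketch (purely illustrative; the parameters are assumptions) enumerates the B = C(d_i, k_i) combinations and verifies that each node serves a k_i/d_i fraction of the sub-chunks, so that every node supplies F_i/d_i of the data:

    from itertools import combinations

    d_i, k_i = 5, 3
    nodes = list(range(d_i))
    L = list(combinations(nodes, k_i))        # the B = C(d_i, k_i) combinations
    B = len(L)

    per_node = {v: 0 for v in nodes}
    for Lm in L:                              # fetch the m-th sub-chunk from each node in L_m
        for v in Lm:
            per_node[v] += 1

    # Each node appears in C(d_i-1, k_i-1) of the B combinations, i.e. a k_i/d_i
    # fraction, so it supplies F_i/d_i of the file in total.
    print(B, per_node)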
5
Delayed-Relaunch Scheduling Approach

In this Chapter, we introduce the model of Delayed Relaunch Scheduling


in Section 5.1, which is first proposed for distributed storage in this
monograph. This model generalizes the models of Fork-Join Scheduling
in Chapter 3.1 and the Probabilistic Scheduling in Chapter 4.1, and
thus the guarantees in those chapters hold for the relevant parameters.
Such a model is generalized from the earlier works on speculative
execution for cloud computing (Aktaş and Soljanin, 2019; Badita et al.,
2020a). Even though queueing analysis is not available in general for
this strategy, the analysis is provided for a single job. The inter-service
time of different chunks is provided in Section 5.2, which is used to characterize two metrics, the Mean Service Completion Time and the Mean Server Utilization Cost, in Section 5.3 for shifted exponential service times at homogeneous servers. Sections 5.4 and 5.5 contain simulation
results and notes on future directions, respectively.

5.1 Delayed-Relaunch Scheduling

We consider the system model introduced in Chapter 1.1. In Fork-Join


scheduling, the request was sent to all ni servers and the job completed
when ki servers finish execution. In Probabilistic scheduling, the request

69
70 Delayed-Relaunch Scheduling Approach

was sent to ki servers using a probabilistic approach. The delayed


relaunch scheduling sends the requests to the servers in stages. In stage
d, the request is sent to ni,d servers. The job is complete when ki servers
have finished execution. The time between the stages can either be deterministic, a random variable independent of the server completion times, or a random variable based on the different task completion times. Since in Fork-Join scheduling all ni servers may be busy processing the file, time is wasted at the ni − ki servers that will eventually not be used; the delayed scheduling aims to reduce this additional time by launching some servers later.
To illustrate the concept of delayed relaunch scheduling, see Figure 5.1 (where the file index i is suppressed). For a file request, ni,0 tasks are
requested at time t0 = 0, ni,1 tasks are requested at time t1 , and so
on. Based on these requests, the overall job is complete when ki servers
have finished execution.


Figure 5.1: This figure illustrates two-forking, by plotting the different completion
times on the real line, with the forked servers n0 = 4, n1 = 5, n2 = 3 at forking
points t0 = 0, t1 = 2, t2 = 4. The first task completes at s1 .

In this Chapter, we assume ni = n and ki = k for all i, and thus


index i will be suppressed. We assume homogeneous servers with shifted
exponential service times, Sexp(c, µ), where Sexp(c, µ) is as defined in
(3.14). Analysis of delayed relaunch scheduling is not as straightforward
due to the added complexity in choosing td . For Fork-Join scheduling,
n0 = n and others are zero, while in the case of probabilistic scheduling
n0 = k. In addition, the choice of servers has similar challenges as in
5.1. Delayed-Relaunch Scheduling 71

Probabilistic Scheduling. The cancellation of remaining tasks after k


tasks have finished execution is akin to Fork-Join scheduling. Another challenge in the scheduling is where in the queue the tasks requested at time t1 should be placed: at the tail, according to the job request times, or by some other rule. These challenges make the problem hard to
analyze. We note that since the approach is a generalization of Fork-Join
scheduling and Probabilistic Scheduling, the latency optimized over the
different parameters of Delayed Relaunch scheduling is lower than that
for Fork-Join scheduling and Probabilistic Scheduling. Thus, the tail
index optimality holds also for delayed relaunch scheduling.
In order to make progress, we do not consider the queueing of multiple requests in this chapter but focus on a single job. The analysis for a
stream of jobs and general service time distributions is left as a future
work. We assume a single-fork scheduling, where a file request starts
at n0 parallel servers at time t0 = 0, and adds n1 = n − n0 servers
at a random time instant t1 corresponding to service completion time
of the `0 th coded sub-task out of n0 initial servers. The total service
completion time is given by t2 when the remaining coded sub-tasks at
`1 = k − `0 servers are completed. Since we can’t have more service
completions than the number of servers in service, we have `0 ≤ n0 and
`0 + `1 = k ≤ n. The overall scheduling approach chooses the first n0
servers using probabilistic scheduling among the n servers having the
corresponding file. Further, the choice of n1 servers is from the rest of
n − n0 servers that were not chosen for the first phase. The request
is cancelled from n0 + n1 − k servers that still have the file chunk in
queue/service when k chunks have been received.
We note that the latency analysis of the delayed relaunch scheduling
is open. In this chapter, we will analyse the delayed relaunch scheduling
for a single job. In this case, there is no queue and there are no earlier jobs in the system. We will consider two metrics: the service completion time and the server utilization cost. The first indicates the latency in an extremely lightly loaded system, while the second indicates the amount of load created by the job for other jobs in the system.
The service completion time for k requested chunks is denoted
by t2 and the server utilization cost by W . We denote the service
completion time of rth coded sub-task in ith stage [ti , ti+1 ) by ti,r

where i ∈ {0, 1}. Since each stage consists of `i service completions, we


have r ∈ {0, . . . , `i } such that ti,0 = ti and ti,`i = ti+1,0 = ti+1 .
Assuming that a server is discarded after its chunk completion, we
can write the utilization cost in this case as the time-integral of the number of servers that are ON during the service interval [0, t2 ], multiplied by the server utilization cost per unit time:

W = \lambda \sum_{i=0}^{1} \sum_{r=0}^{\ell_i - 1} (t_{i,r+1} - t_{i,r}) \left[ \sum_{j=0}^{i} (n_j - \ell_j) + \ell_i - r \right] .   (5.1)

The total service completion time S = t2 can be written as the following


telescopic sum
S = \sum_{i=0}^{1} \sum_{r=1}^{\ell_i} (t_{i,r} - t_{i,r-1}) .   (5.2)

Thus, both the metrics rely on inter-service times ti,r − ti,r−1 , which
will be characterized in the next section, followed by the result on the
two metrics.
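As a simple illustration of (5.1) and (5.2), the following helper (hypothetical completion instants; not part of the analysis) evaluates the two metrics from stage-wise completion times:

    # t_stage[i] = [t_{i,0}, ..., t_{i,l_i}] are the completion instants of stage i.
    def cost_and_completion(t_stage, n_stage, l_stage, lam=1.0):
        W, S = 0.0, 0.0
        for i, times in enumerate(t_stage):
            for r in range(l_stage[i]):
                gap = times[r + 1] - times[r]
                active = sum(n_stage[j] - l_stage[j] for j in range(i + 1)) + l_stage[i] - r
                W += lam * gap * active            # (5.1): servers still ON during this gap
                S += gap                           # (5.2): telescopic sum of gaps
        return W, S

    # Made-up example: stage 0 has n0 = 4 servers and l0 = 2 completions,
    # stage 1 adds n1 = 3 servers and finishes the remaining l1 = 2 chunks.
    print(cost_and_completion([[0.0, 1.2, 1.9], [1.9, 2.5, 3.4]], [4, 3], [2, 2]))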

5.2 Characterization of Inter-Service Times of Different Chunks


for Single Job

Before we start the analysis, we describe some preliminary results and


definitions. Let Ti , i ∈ {1, · · · , n}, be a shifted exponential with rate µ
and shift c, such that the complementary distribution function F̄ = 1−F
can be written as
\bar{F}(x) \triangleq P\{T_i > x\} = 1_{\{x \in [0,c]\}} + e^{-\mu(x-c)} 1_{\{x \ge c\}} .   (5.3)

We see that T_i' \triangleq T_i − c are i.i.d. random variables distributed exponen-


tially with rate µ. We denote the jth order statistic of (T_1', . . . , T_n') by X_j^n. The jth order statistic of (T_1, . . . , T_n) is c + X_j^n. The distribution
of ordered statistics is given as follows.

Lemma 5.1. Let (X1 , . . . , Xn ) be n i.i.d. random variables with com-


mon distribution function F , and we denote the jth order statistics
of this collection by Xjn . Then the distribution of Xjn is given by
P\left\{ X_j^n \le x \right\} = \sum_{i=j}^{n} \binom{n}{i} F(x)^i \bar{F}(x)^{n-i} .
5.2. Characterization of Inter-Service Times of Different Chunks for
Single Job 73

The next result provides the distribution of the gaps between consecutive service completions.

Lemma 5.2. Denoting X_0^n = 0, from the memoryless property of T_i' we observe the following equality in joint distribution of the two vectors
\left( X_j^n - X_{j-1}^n : j \in [n] \right) = \left( \frac{T_j'}{n-j+1} : j \in [n] \right) .   (5.4)
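Lemma 5.2 is easy to verify numerically; the following Monte Carlo sketch (illustrative parameters) compares the empirical mean gaps between consecutive order statistics with 1/(μ(n − j + 1)):

    import numpy as np

    rng = np.random.default_rng(0)
    n, mu, trials = 6, 0.5, 200_000
    T = rng.exponential(1.0 / mu, size=(trials, n))
    gaps = np.diff(np.sort(T, axis=1), prepend=0.0, axis=1)     # X_j^n - X_{j-1}^n
    emp = gaps.mean(axis=0)
    theory = np.array([1.0 / (mu * (n - j + 1)) for j in range(1, n + 1)])
    print(np.round(emp, 3), np.round(theory, 3))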

We next introduce a definition of the Pochhammer function, which


will be used in the analysis.
Definition 5.1. We denote the Pochhammer function (a)_n \triangleq \frac{\Gamma(a+n)}{\Gamma(a)} and define the hypergeometric series as
{}_pF_q(z) \triangleq {}_pF_q\!\left[\begin{matrix} a_1, \ldots, a_p \\ b_1, \ldots, b_q \end{matrix}; z\right] = \sum_{n=0}^{\infty} \frac{\prod_{i=1}^{p} (a_i)_n \, z^n}{\prod_{j=1}^{q} (b_j)_n \, n!} .   (5.5)

Because generalizations of the above series also exist (Gasper et al.,


2004), this series is referred to here as the hypergeometric series rather
than as the generalized hypergeometric series.

Remark 5.1. For positive integers p, q and positive reals c, µ, we have


the following identity in terms of the hypergeometric series p Fq defined
in Definition 5.1
\int_0^c \frac{x e^{-\mu x} (1 - e^{-\mu x})^q (e^{-\mu x} - e^{-\mu c})^{p-q}}{(1 - e^{-\mu c})^{p+2}} \, dx
 = \frac{1}{(p+2)\,\mu^2\, \binom{p+1}{q+1}} \; {}_3F_2\!\left[\begin{matrix} 1, 1, q+2 \\ 2, p+3 \end{matrix}; 1 - e^{-\mu c}\right] .   (5.6)

Using the definition of hypergeometric series, it can be verified


that expression in Remark 5.1 simplifies to the expression below for
p = q = m − 1.

Corollary 5.3. For a positive integer m and positive reals c, µ, we have the following identity:
m\mu \int_0^c x e^{-\mu x} (1 - e^{-\mu x})^{m-1} \, dx = c\,(1 - e^{-\mu c})^m - c + \sum_{i=1}^{m} \frac{(1 - e^{-\mu c})^i}{i\mu} .

Having provided some definitions and the basic results, we analyze


the time between two service completions of the chunks. Recall that
we have two contiguous stages. The time interval [t0 , t1 ) corresponds

to the stage 0, and the interval [t1 , t2 ] corresponds to the stage 1. In


stage 0, we switch on n0 initial servers at instant t0 = 0. This stage is
completed at the single forking point denoted by the instant t1 , when `0
chunks out of n0 are completed. At the beginning of stage 1, additional
n1 = n − n0 servers are switched on, each working on a unique chunk.
The job is completed at the end of this second stage, when remaining
k − `0 chunks are completed. The kth service completion time is denoted
by t2 . We will separately analyze these two stages in the following.
We will first compute the mean of the interval [t0,r−1 , t0,r ) for each
r ∈ [`0 ].

Lemma 5.4. The mean time between two coded sub-task completions
in the single forking scheme for i.i.d. shifted exponential coded sub-task
completion times in stage 0 is
E[t_{0,r} - t_{0,r-1}] = \begin{cases} c + \dfrac{1}{\mu n_0}, & r = 1, \\[1ex] \dfrac{1}{\mu(n_0 - r + 1)}, & r \in \{2, \ldots, \ell_0\}. \end{cases}   (5.7)

Proof. Since t_{0,r} is the completion time of the first r coded sub-tasks out of n_0 parallel coded sub-tasks, we have t_{0,r} = c + X_r^{n_0}. Hence, for each r ∈ [\ell_0], we have
t_{0,r} - t_{0,r-1} = (c + X_r^{n_0}) - (c + X_{r-1}^{n_0}) .   (5.8)
The chunk requests are initiated at time t_{0,0} = t_0 = 0 and hence the first chunk is completed at t_{0,1} - t_{0,0} = c + X_1^{n_0}.
From Lemma 5.2, we can write the following equality in distribution
t_{0,r} - t_{0,r-1} = \begin{cases} c + \dfrac{T_1'}{n_0}, & r = 1, \\[1ex] \dfrac{T_r'}{n_0 - r + 1}, & r \in \{2, \ldots, \ell_0\}, \end{cases}   (5.9)
where (T_1', \ldots, T_n') are i.i.d. exponentially distributed random variables with rate \mu. Taking expectations on both sides, we get the result.

Having analyzed the Stage 0, we now compute the mean of the


interval [t1,r−1 , t1,r ) for each r ∈ [`1 ]. The difficulty in this computation
is that additional n1 servers that start working on chunk requests at

the single forking-time t1 , have an initial start-up time of c due to the


shifted exponential service distribution. Hence, none of these additional
n1 servers can complete service before time t1 + c. Whereas, some of the
n0 − `0 servers with unfinished chunk requests from stage 0 can finish
their chunks in this time-interval (t1 , t1 + c]. In general, the number of
chunk completions in the interval (t1 , t1 + c] is a random variable, which
we denote by N (t1 , t1 + c) ∈ {0, . . . , n0 − `0 }.
We first compute the probability mass function of this discrete
valued random variable N (t1 , t1 + c). We denote the event of j − `0
chunk completions in this interval (t_1, t_1 + c] for any \ell_0 \le j \le n_0 by
E_{j-\ell_0} \triangleq \left\{ N(t_1, t_1 + c) = j - \ell_0, \; t_1 = c + X_{\ell_0}^{n_0} \right\} .   (5.10)

Lemma 5.5. The probability distribution of the number of chunk com-


pletions N (t1 , t1 + c) in the interval (t1 , t1 + c] for `0 ≤ j ≤ n0 is given
by p_{j-\ell_0} \triangleq P(E_{j-\ell_0}), where
p_{j-\ell_0} = \binom{n_0 - \ell_0}{j - \ell_0} (1 - e^{-\mu c})^{j-\ell_0}\, e^{-(n_0 - j)\mu c} .   (5.11)
Proof. Let the number of service completions until time t_1 + c be j ∈ \{\ell_0, \ldots, n_0\}. We can write the event of j − \ell_0 service completions in the interval (t_1, t_1 + c] as
\left\{ X_j^{n_0} - X_{\ell_0}^{n_0} \le c \right\} \cap \left\{ X_{j+1}^{n_0} - X_{\ell_0}^{n_0} \le c \right\}^c .   (5.12)
From the definition of order statistics for continuous random variables, we have X_j^{n_0} < X_{j+1}^{n_0}. This implies that the intersection of the events \{X_j^{n_0} - X_{\ell_0}^{n_0} \le c\} and \{X_{j+1}^{n_0} - X_{\ell_0}^{n_0} \le c\} is \{X_{j+1}^{n_0} - X_{\ell_0}^{n_0} \le c\}.
Therefore, from the disjointness of complementary events and the probability axiom for the summation of disjoint events, it follows that
p_{j-\ell_0} = P\left\{ X_j^{n_0} - X_{\ell_0}^{n_0} \le c \right\} - P\left\{ X_{j+1}^{n_0} - X_{\ell_0}^{n_0} \le c \right\} .   (5.13)
Due to the memoryless property, we can write the above as
p_{j-\ell_0} = P\left\{ X_{j-\ell_0}^{n_0-\ell_0} \le c \right\} - P\left\{ X_{j+1-\ell_0}^{n_0-\ell_0} \le c \right\} .   (5.14)
From the order statistics of exponentially distributed random variables with rate \mu (Lemma 5.1), we get the required form of p_{j-\ell_0}.
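The probability mass function in Lemma 5.5 can also be checked by direct simulation; the sketch below (illustrative parameters) estimates the distribution of N(t1, t1 + c) and compares it with (5.11):

    import numpy as np
    from scipy.special import comb

    rng = np.random.default_rng(1)
    n0, l0, c, mu, trials = 8, 3, 1.0, 0.5, 200_000
    # i.i.d. Sexp(c, mu) completion times of the n0 stage-0 servers.
    T = np.sort(c + rng.exponential(1.0 / mu, size=(trials, n0)), axis=1)
    t1 = T[:, l0 - 1]                                   # l0-th completion defines t1
    N = ((T > t1[:, None]) & (T <= t1[:, None] + c)).sum(axis=1)
    emp = np.bincount(N, minlength=n0 - l0 + 1) / trials
    p = 1 - np.exp(-mu * c)
    theory = [comb(n0 - l0, m) * p**m * (1 - p)**(n0 - l0 - m) for m in range(n0 - l0 + 1)]
    print(np.round(emp, 4), np.round(theory, 4))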

Let {s1 , s2 , . . . , sn0 −`0 } be the chunk completion times in stage 1


after the forking time t_1, which by definition correspond to {t_{1,1} = t_1 + s_1, t_{1,2} = t_1 + s_2, \ldots, t_{1,n_0-\ell_0} = t_1 + s_{n_0-\ell_0}}. In stage 1, the chunk completions numbered r ∈ [j − \ell_0] are finished only by the n_0 − \ell_0 servers within time t_1 + c, since none of the n_1 servers started at the forking point t_1 is able to finish even a single chunk within the time t_1 + c, whereas the chunk completions numbered r ∈ {j − \ell_0 + 1, \ldots, k − \ell_0} are finished by the n − j servers, which include both the left-over initial servers and all the forked servers.
We next find the mean of the rth completion time in stage 1 conditioned on the event E_{j-\ell_0}.

Lemma 5.6. For any r ∈ [j − \ell_0] and α = 1 − e^{−cµ}, we have
E[s_r \,|\, E_{j-\ell_0}] = \begin{cases} {}_3F_2\!\left[\begin{matrix} 1, 1, r+1 \\ 2, j-\ell_0+2 \end{matrix}; \alpha\right] \dfrac{r\alpha}{\mu(j-\ell_0+1)}, & r < j - \ell_0, \\[2ex] c\left[ 1 - \alpha^{-r} + \sum_{i=1}^{r} \dfrac{\alpha^{i-r}}{i c \mu} \right], & r = j - \ell_0. \end{cases}   (5.15)

Proof. We denote m = j − `0 for convenience. Let N (t1 , t1 + c) = m,


then t1 + s1 , . . . , t1 + sm are the chunk completion times of the first m
servers out of n0 − `0 parallel servers in their memoryless phase in the
duration [t1 , t1 + c). In the duration [t1,r−1 , t1,r ) for r ∈ [m], there are
n0 − `0 − r + 1 parallel servers in their memoryless phase, and hence the
inter-service completion times (t1,r − t1,r−1 : r ∈ [m]) are independent
and distributed exponentially with parameter µr , (n0 − `0 − r + 1)µ.
Denoting s0 = 0, we have sr − sr−1 = t1,r − t1,r−1 for each r ∈ [m].
From the definition of µr ’s and pm , the independence of sr − sr−1 , and
rearrangement of terms we can write the conditional joint density of
vector s = (s_1, s_2, \ldots, s_m) given the event E_m as
f_{s_1, \ldots, s_m | E_m} = \prod_{i=1}^{m} \frac{i\mu\, e^{-\mu s_i}}{1 - e^{-c\mu}} .   (5.16)

From the definition of the task completion times, the possible values of
the vector s = (s1 , . . . , sm ) satisfy the constraint 0 < s1 < · · · < sm < c.
That is, we can write the set of possible values for the vector s as A_m, the set of vectors with increasing coordinates bounded between 0 and c:
A_m \triangleq \{ s \in \mathbb{R}^m : 0 < s_1 < \cdots < s_m < c \} .   (5.17)
This constraint couples the set of achievable values for the vector s,
and hence even though the conditional density has a product form, the
random variables (s1 , . . . , sm ) are not conditionally independent given
the event Em .
To compute the conditional expectation E [sr |Em ], we find the con-
ditional marginal density of sr given the event Em . To this end, we
integrate the conditional joint density of vector s over variables without
sr . In terms of sr ∈ (0, c), we can write the region of integration as the
following intersection of regions,
A_m^{-r} = \bigcap_{i<r} \{0 < s_i < s_{i+1}\} \;\; \bigcap_{i>r} \{s_{i-1} < s_i < c\} .   (5.18)
Using the conditional density of vector s defined in (5.16) in the above
equation, and denoting α , 1 − e−cµ and αr , 1 − e−µsr for clar-
ity of presentation, we can compute the conditional marginal density
function (Ross, 2019)
f_{s_r | E_m} = \frac{m\mu(1-\alpha_r)}{\alpha^m} \binom{m-1}{r-1} (\alpha_r)^{r-1} (\alpha - \alpha_r)^{m-r} .   (5.19)
The conditional mean E[s_r | E_m] = \int_0^c s_r f_{s_r|E_m}(s_r)\, ds_r is obtained by integrating against the conditional marginal density in (5.19) over s_r ∈ (0, c). For
r ∈ [m − 1], the result follows from the integral identity of Remark 5.1
for x = sr , q = r − 1, p = m − 1 and α = 1 − e−µc . Similarly, the result
for r = j − `0 follows from Corollary 5.3 for x = sm and m = j − `0 .
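The conditional means in (5.15) can be evaluated numerically with a generalized hypergeometric routine; the following sketch (illustrative parameters, using mpmath) computes E[s_r | E_m] for m = j − ℓ0:

    import mpmath as mp

    mu, c = mp.mpf('0.5'), mp.mpf('1.0')     # hypothetical shifted-exponential parameters
    m = 4                                    # m = j - l0, completions in (t1, t1 + c]
    alpha = 1 - mp.e**(-mu * c)

    def cond_mean_s(r):
        # Conditional mean E[s_r | E_m] from (5.15).
        if r < m:
            return mp.hyper([1, 1, r + 1], [2, m + 2], alpha) * r * alpha / (mu * (m + 1))
        if r == m:
            return c * (1 - alpha**(-r) + mp.fsum(alpha**(i - r) / (i * c * mu) for i in range(1, r + 1)))
        raise ValueError("requires r <= m")

    print([cond_mean_s(r) for r in range(1, m + 1)])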

In stage 1, for 1 ≤ r ≤ j − \ell_0, we have
t_{1,r} - t_{1,r-1} = \left( X_r^{n_0-\ell_0} - X_{r-1}^{n_0-\ell_0} \right) 1_{E_{j-\ell_0}} .   (5.20)
For j − \ell_0 + 2 ≤ r ≤ k − \ell_0, the difference t_{1,r} − t_{1,r−1} is equal to
\left( X_{r-j+\ell_0}^{n-j} - X_{r-j+\ell_0-1}^{n-j} \right) 1_{E_{j-\ell_0}} .   (5.21)

When r = j − `0 + 1, we write the time difference between rth and


(r − 1)th chunk completion instants as
t1,r − t1,r−1 = t1,r − (t1 + c) + (t1 + c) − t1,r−1 . (5.22)
For r = j − `0 + 1, we have t1,r−1 ≤ t1 + c < t1,r . In the disjoint
intervals [t1,r−1 , t1 + c) and [t1 + c, t1,r ), there are n0 − j and n − j i.i.d.
exponentially distributed parallel servers respectively. Since the age and
excess service times of exponential random variables are independent
at any constant time, we have independence of t1,r − (t1 + c) and
(t1 + c) − t1,r−1 for r = j − `0 + 1.
Conditioned on the event Ej−`0 of j − `0 chunk completions in the
interval (t1 , t1 + c], the conditional mean of inter-chunk completion time
in stage 1 is
E[(t_{1,r} - t_{1,r-1}) | E_{j-\ell_0}] = E[(s_r - s_{r-1})( 1\{j - \ell_0 > r - 1\}   (5.23)
 + 1\{j - \ell_0 = r - 1\} + 1\{j - \ell_0 < r - 1\}) | E_{j-\ell_0}] .   (5.24)
Lemma 5.7. For any r ∈ [k − \ell_0] and α = 1 − e^{−cµ}, the conditional mean E[(t_{1,r} − t_{1,r−1}) | E_{j-\ell_0}] equals
\begin{cases} {}_2F_1\!\left[\begin{matrix} 1, r \\ j-\ell_0+2 \end{matrix}; \alpha\right] \dfrac{r\alpha}{\mu(j-\ell_0+1)}, & r < j - \ell_0 + 1, \\[2ex] c\left[ \dfrac{1}{\alpha^{(r-1)}} - \sum_{i=1}^{r-1} \dfrac{\alpha^{i-r+1}}{i c \mu} \right] + \dfrac{1}{\mu(n-j)}, & r = j - \ell_0 + 1, \\[2ex] \dfrac{1}{\mu(n-\ell_0-r+1)}, & r > j - \ell_0 + 1. \end{cases}   (5.25)
Proof. Recall that, we have n0 − `0 parallel servers in their memoryless
phase working on individual chunks in the interval (t1 , t1 + c]. In this
duration, N (t1 , t1 + c) chunks are completed and additional n1 parallel
servers start their memoryless phase at time t1 + c.
We first consider the case when r − 1 > N (t1 , t1 + c) = j − `0 . This
implies that t1,r−1 > t1 + c and there are n − `0 − r + 1 parallel servers in
their memoryless phase working on remaining chunks. From Lemma 5.2,
the following equality holds in distribution
t_{1,r} - t_{1,r-1} = \frac{T_r'}{n - \ell_0 - r + 1} .   (5.26)
Recall that E_{j-\ell_0} \in \sigma(T_1', \ldots, T_{j-\ell_0+1}'), and since (T_i' : i \in \mathbb{N}) is an i.i.d. sequence, it follows that t_{1,r} - t_{1,r-1} is independent of the event E_{j-\ell_0}

for r > j − \ell_0 + 1, and hence E[t_{1,r} - t_{1,r-1} | E_{j-\ell_0}] = E[t_{1,r} - t_{1,r-1}]. The result then follows from the fact that E[T_i'] = 1/\mu.
We next consider the case when r − 1 = N (t1 , t1 + c) = j − `0 . By
definition of N (t1 , t1 + c), we have t1,r−1 ≤ t1 + c < t1,r . In the disjoint
intervals (t1,r−1 , t1 + c] and (t1 + c, t1,r ], there are n0 − j and n − j
i.i.d. exponentially distributed parallel servers respectively. Therefore,
writing t1,r − t1,r−1 as (t1,r − (t1 + c)) + ((t1 + c) − t1,r−1 ), and using
Lemma 5.2, we compute the conditional mean of the first part as
E[t_{1,r} - (t_1 + c) | E_{j-\ell_0}] = E\left[ \frac{T_{r+1}'}{n-j} \right] = \frac{1}{\mu(n-j)} .   (5.27)

By using the fact t1,r−1 = t1 +sr−1 , we can write the conditional mean of
the second part as E [t1 + c − t1,r−1 |Ej−`0 ] = c − E [sr−1 |Ej−`0 ] , where
E [sr−1 |Ej−`0 ] is given by Lemma 5.6. Summing these two parts, we get
the conditional expectation for r = j − `0 + 1.
For the case when r ∈ [j − `0 ], the result follows from Lemma 5.6
and the fact t1,r = t1 + sr .

We next compute the unconditional mean of inter-chunk com-


pletion time E [(t1,r − t1,r−1 )] by averaging out the conditional mean
E [(t1,r − t1,r−1 )|Ej−`0 ] over all possible values of j. We denote m = j−`0
for convenience.

Corollary 5.8. For each r ∈ [k − `0 ], by considering all possible values


of m from the set \{0, 1, \ldots, n_0 - \ell_0\}, the mean inter-service completion time in stage 1 is
E[t_{1,r} - t_{1,r-1}] = \sum_{m : m+1 < r} p_m \, \frac{1}{\mu(n - \ell_0 - r + 1)}   (5.28)
 + \sum_{m : m+1 = r} p_m \left[ c\left( \frac{1}{\alpha^{(r-1)}} - \sum_{i=1}^{r-1} \frac{\alpha^{i-r+1}}{i c \mu} \right) + \frac{1}{\mu(n - \ell_0 - m)} \right]   (5.29)
 + \sum_{m : m+1 > r} {}_2F_1\!\left[\begin{matrix} 1, r \\ m+2 \end{matrix}; \alpha\right] \frac{r\alpha}{\mu(m+1)} \, p_m .   (5.30)

Proof. The result follows by using Lemma 5.7 and from the tower
property of nested expectations

E [(t1,r − t1,r−1 )] = E [E [(t1,r − t1,r−1 )|Ej−`0 ]] , (5.31)



and the fact that N(t_1, t_1 + c) ∈ \{0, \ldots, n_0 − \ell_0\}, where p_m, defined in (5.11), is the probability that the number of service completions N(t_1, t_1 + c) in the interval (t_1, t_1 + c] equals m = j − \ell_0, and t_1 is the time of the \ell_0-th completion among the initial n_0 chunks.
N (t1 , t1 + c) in the interval (t1 , t1 + c] being m = j − `0 where t1
is the time of `0 completions of initial n0 chunks.

5.3 Characterization of Mean Service Completion Time and Mean


Server Utilization Cost for Single Job

We are now ready to compute the means of service completion time


and server utilization cost. We first consider the metrics in the Stage 0,
based on Lemma 5.4.

Lemma 5.9. Consider single-forking with i.i.d. shifted exponential


coded sub-task completion times and initial number of servers n0 in
stage 0. The mean forking time is given by
E[t_1] = c + \sum_{r=1}^{\ell_0} \frac{1}{\mu(n_0 - r + 1)} .   (5.32)

The mean server utilization cost in stage 0 is given by
E[W_0] = \frac{\lambda}{\mu}\left( \ell_0 + \mu n_0 c \right) .   (5.33)
Proof. We can write the completion time t1 of `0 th coded sub-task
out of n0 in parallel, as a telescopic sum of length of coded sub-task
completions given in (5.2). Taking expectations on both sides, the mean
forking point E [t1 ] follows from the linearity of expectations and
the mean length of each coded sub-task completion (5.7).
Taking expectation of the server utilization cost in (5.1), the mean
server utilization cost E [W0 ] in stage 0 follows from the linearity of expec-
tations and the mean length of each coded sub-task completion (5.7).

Next, we consider two possibilities for the initial number of servers


n0 : when n0 < k and otherwise.
Note that when n0 < k, then t2 > t1 + c almost surely, since k coded
sub-tasks can never be finished by initial n0 servers. The next result
computes the mean service completion time and mean server utilization
cost for n0 < k case.

Theorem 5.10. For the single forking case with n total servers for k sub-
tasks and initial number of servers n0 < k, the mean server utilization
cost is
E[W] = \lambda n c + \frac{\lambda k}{\mu} ,   (5.34)
and the mean service completion time is
E[t_2] = c + E[t_1] + \frac{1}{\mu} \sum_{j=\ell_0}^{n_0} p_{j-\ell_0} \sum_{i=j}^{k-1} \frac{1}{n - i} ,   (5.35)

where E [t1 ] is given in (5.32) and pj−`0 is given in (5.11).

Proof. The proof follows by substituting inter-chunk times in (5.1) and


(5.2) and simplifications. The details are omitted and can be found in
(Badita et al., 2020a).

From Theorem 5.10, we observe that the mean server utilization


cost remains the same for all values of the initial number of servers n0 < k and the forking threshold \ell_0. The mean service completion time decreases as we increase the number of initial servers, and thus n_0 = n yields the lowest mean service completion time. Further, the mean server utilization cost for n_0 = n can easily be shown to be \lambda n c + \lambda k/\mu, which is the same as that for all n_0 < k. Thus, as compared to no forking (n_0 = n), single forking with n_0 < k has the same mean server utilization cost while it has a higher mean service completion time. Hence, this regime does not provide any useful tradeoff point between service completion time and server utilization cost, being strictly worse than no forking, and the only region of interest for a system designer is n_0 \ge k, which is studied in the following.
For n0 ≥ k, the number of completed chunks `0 at the forking
point t1 are in {0, 1, . . . , k}. There are three different possibilities for
completing k chunks. First possibility is `0 = k, when all the required
k chunks are finished on initial n0 servers without any forking. In this
case, t2 = t1 . For the next two possibilities, `0 < k and hence forking is
needed.
Second possibility is `0 < k and `0 + N (t1 , t1 + c) = j ≤ k − 1,
where j − `0 service completions occur in the duration [t1 , t1 + c) and

`0 ≤ j ≤ k − 1. This implies that even though n0 > k, the total


chunks finished until instant t1 + c are still less than k and remaining
k − j > 0 chunks among the required k are completed only after t1 + c,
when n − j parallel servers are in their memoryless phase. In this case, t_2 = t_1 + c + X_{k-j}^{n-j} for N(t_1, t_1 + c) = j - \ell_0 \in \{0, \ldots, k - \ell_0 - 1\}.
Third possibility is when `0 < k and `0 + N (t1 , t1 + c) ≥ k. That is,
even though the chunks are forked on additional n1 servers at time t1 ,
the job is completed at k out of n0 initial servers before the constant
start-up time of these additional n1 servers is finished. This happens
when sk−`0 ≤ c and in this case, t2 = t1 + sk−`0 for N (t1 , t1 + c) ≥ k − `0 .
Recall that sk−`0 is the (k − `0 )th service completion in stage 1 after t1 .
Summarizing all the results, we write the service completion time in
the case n_0 \ge k and N(t_1, t_1 + c) = j - \ell_0 as
t_2 = t_1 + s_{k-\ell_0} 1_{\{\ell_0 < k \le j\}} + \left( c + X_{k-j}^{n-j} \right) 1_{\{\ell_0 \le j < k\}} .   (5.36)

For n0 ≥ k, the mean service completion time and the mean server
utilization cost are given in the following theorem.

Theorem 5.11. In the single forking scheme, for the case n_0 \ge k, the mean service completion time E[t_2] is
E[t_1] + \left[ \sum_{r=1}^{k-\ell_0} E[t_{1,r} - t_{1,r-1}] \right] 1_{\{\ell_0 < k\}}   (5.37)
and the mean server utilization cost E[W] is
E[W_0] + \lambda \left[ \sum_{r=1}^{k-\ell_0} (n - \ell_0 - r + 1)\, E[t_{1,r} - t_{1,r-1}] \right] 1_{\{\ell_0 < k\}} ,   (5.38)
where E[t_{1,r} - t_{1,r-1}] in the above expressions is given by Corollary 5.8.

Proof. In Lemma 5.9, we have already computed the mean completion


time E [t1 ] of stage 0, and the mean server utilization cost E [W0 ] in
stage 0. Recall that since completion of any k chunks suffice for the job
completion, the forking threshold `0 ≤ k.
We first consider the case when `0 = k. In this case, we do not need
to add any further servers because all the required tasks are already
finished in stage 0 itself. Hence, there is no need of forking in this case,

and the mean service completion time is given by E [t1 ] and the mean
server utilization cost is given by E [W0 ].
We next consider the case when `0 < k. In this case, the job comple-
tion occurs necessarily in stage 1. Thus, we need to compute E [t2 − t1 ]
and E [W1 ] in order to evaluate the mean service completion time E [t2 ]
and the mean server utilization cost E [W0 + W1 ]. The duration of
stage 1 can be written as a telescopic sum of inter-service times
t_2 - t_1 = \sum_{r=1}^{k-\ell_0} (t_{1,r} - t_{1,r-1}) .   (5.39)

Further, for \ell_0 < k, the number of servers that are active in stage 1 after the (r − 1)th service completion is n − \ell_0 − r + 1, and the associated cost
incurred in the interval [t1,r−1 , t1,r ) is λ(t1,r − t1,r−1 )(n − `0 − r + 1).
Therefore, we can write the server utilization cost in stage 1 as
W_1 = \lambda \sum_{r=1}^{k-\ell_0} (n - \ell_0 - r + 1)(t_{1,r} - t_{1,r-1}) .   (5.40)

The result follows by taking the mean of the duration t2 − t1 and the server utilization cost W1, using the linearity of expectations, and considering
both possible cases.

We observe that when n0 ≥ k, the mean server utilization cost


depends on the initial number of servers n0 as well as the total number
of servers n, unlike the case n0 < k where this cost depends only on the
total number of servers n.

5.4 Simulations

For numerical evaluation of mean service completion time and mean


server cost utilization for single forking systems, we choose the following
system parameters. We select the sub-task fragmentation of a single job
as k = 12, and a maximum redundancy factor of n/k = 2. That is, we
choose the total number of servers n = 24. We take the server utilization
cost rate to be λ = 1. Coded-task completion time at each server was
chosen to be an i.i.d. random variable having a shifted exponential
distribution. For numerical studies in this section, we choose the shift

parameter c = 1 and the exponential rate µ = 0.5. Since it was already


shown that n0 < k is not a useful regime, we consider the case where
n0 ≥ k.
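The curves discussed below can be reproduced directly from Lemma 5.9, Corollary 5.8, and Theorem 5.11; the following is a small numerical sketch (not the code used for the figures; n0 and ℓ0 below are illustrative choices) that evaluates E[S] and E[W] for one configuration:

    import numpy as np
    from scipy.special import comb, hyp2f1

    n, k, n0, l0 = 24, 12, 16, 6          # n0 and l0 are hypothetical design choices
    c, mu, lam = 1.0, 0.5, 1.0
    alpha = 1.0 - np.exp(-c * mu)

    def p_m(m):
        # Probability of m completions in (t1, t1 + c], Lemma 5.5 with j = l0 + m.
        return comb(n0 - l0, m) * alpha**m * np.exp(-(n0 - l0 - m) * mu * c)

    def mean_gap_stage1(r):
        # Mean inter-service time E[t_{1,r} - t_{1,r-1}], Corollary 5.8.
        total = 0.0
        for m in range(0, n0 - l0 + 1):
            if m + 1 < r:
                total += p_m(m) / (mu * (n - l0 - r + 1))
            elif m + 1 == r:
                bracket = alpha**(-(r - 1)) - sum(alpha**(i - r + 1) / (i * c * mu) for i in range(1, r))
                total += p_m(m) * (c * bracket + 1.0 / (mu * (n - l0 - m)))
            else:
                total += p_m(m) * hyp2f1(1, r, m + 2, alpha) * r * alpha / (mu * (m + 1))
        return total

    # Stage 0 (Lemma 5.9) plus stage 1 (Theorem 5.11), assuming l0 < k <= n0.
    E_t1 = c + sum(1.0 / (mu * (n0 - r + 1)) for r in range(1, l0 + 1))
    E_W0 = lam / mu * (l0 + mu * n0 * c)
    gaps = [mean_gap_stage1(r) for r in range(1, k - l0 + 1)]
    E_S = E_t1 + sum(gaps)
    E_W = E_W0 + lam * sum((n - l0 - r + 1) * g for r, g in enumerate(gaps, start=1))
    print(E_S, E_W)

Sweeping ℓ0 and n0 in such a sketch traces out the tradeoff curves shown below.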

Figure 5.2: For the setting n0 ≥ k, this graph displays the mean service completion
time E [S] as a function of fork task threshold `0 for single forking with the total
number of servers n = 24, the total needed coded sub-tasks k = 12, and different
numbers of initial servers n0 ∈ {12, 14, 16, 18, 20}. The single coded sub-task execution
time at servers are assumed to be i.i.d. shifted exponential distribution with shift
c = 1 and rate µ = 0.5.

To this end, we plot the mean service completion time in Fig-


ure 5.2 and the mean server utilization cost in Figure 5.3, both as a function
of fork-task threshold `0 ∈ [k], for different values of initial servers
n0 ∈ {12, 14, 16, 18, 20}. The analytical results in Theorem 5.11 are
substantiated by observing that the mean service completion time E [S]
increases with increase in fork-task threshold `0 and decreases with
increase in initial number of servers n0 . Further, the mean server utiliza-
tion cost E [W ] decreases with increase in fork-task threshold `0 . Thus,
there is a tradeoff between the two performance measures as a function
of fork-task threshold `0 . The tradeoff between the two performance
metrics of interest is plotted in Figure 5.4, which suggests that the
number of initial servers n0 and the forking threshold `0 affords a true
tradeoff between these metrics.
It is interesting to observe the behavior of mean server utilization
cost as a function of initial number of servers n0 in Figure 5.3. We note


Figure 5.3: For the setting n0 ≥ k, this graph displays the mean server utilization
cost E [W ] as a function of fork task threshold `0 for single forking with the total
number of servers n = 24, the total needed coded sub-tasks k = 12, and different
numbers of initial servers n0 ∈ {12, 14, 16, 18, 20}. The single coded sub-task execution
time at servers are assumed to be i.i.d. shifted exponential distribution with shift
c = 1 and rate µ = 0.5.

that for each fork-task threshold `0 , there exists an optimal number of


initial servers n0 that minimizes the server utilization cost. We further
observe in Figure 5.4 that for n_0 = 20, the mean service completion time increases by only 17.635% while the mean server utilization cost can be decreased by 8.3617% through an appropriate choice of \ell_0, as compared to the no-forking case n_0 = n. However, no value of \ell_0 can be chosen for n_0 = 20 that reduces the mean server utilization cost beyond 8.3617%. In order to obtain a further reduction in the mean server utilization cost, we can decrease n_0 to 18, which helps decrease the mean server utilization cost by 12.43% at the expense of a 31.888% increase in the mean service completion time as compared to the no-forking case n_0 = n. The intermediate points on the curve of n_0 = 18 further provide tradeoff points that can be chosen based on the desired combination of the two measures as required by the system designer. The choice of n_0 = 12 further helps decrease the mean server utilization cost by 24.976% at the expense of a 207.49% increase in the mean service completion time as compared to the no-forking case n_0 = n. Thus, we see that an appropriate choice of n_0 and \ell_0 provides tradeoff points that


Figure 5.4: For the setting n0 ≥ k, we have plotted the mean server utilization
cost E [W ] as a function of the mean service completion time E [S] by varying fork
task threshold `0 ∈ [n0 ] in single forking. The total number of servers considered
are n = 24, the total coded sub-task needed are k = 12. The single coded sub-task
execution time at servers are assumed to be i.i.d. shifted exponential distribution
with shift c = 1 and rate µ = 0.5. We have plotted the same curve for different values
of initial servers n0 ∈ {12, 14, 16, 18, 20}. For each curve, `0 increasing from left to
right.

help minimize the mean server utilization cost at the expense of the
mean service completion time.

5.5 Notes and Open Problems

This problem has been studied in the context of straggler mitigation


problem, where some tasks have run-time variability. Existing solution
techniques for straggler mitigation fall into two categories: i) Squashing
runtime variability via preventive actions such as blacklisting faulty
machines that frequently exhibit high variability (Dean and Ghemawat,
2008; Dean, 2012) or learning the characteristics of task-to-node as-
signments that lead to high variability and avoiding such problematic
task-node pairings (Yadwadkar and Choi, 2012), ii) Speculative exe-
cution by launching the tasks together with replicas and waiting only
for the fastest copy to complete (Ananthanarayanan et al., 2013; Ananthanarayanan et al., 2010; Zaharia et al., 2008; Melnik et al., 2010).

Because runtime variability is caused by intrinsically complex reasons,


preventive measures for stragglers could not fully solve the problem and
runtime variability continued plaguing the compute workloads (Dean,
2012; Ananthanarayanan et al., 2013). Speculative task execution on
the other hand has proved to be an effective remedy, and indeed the
most widely deployed solution for stragglers (Dean and Barroso, 2013;
Ren et al., 2015).
Even though the technique of delayed relaunch with erasure coding
of tasks was originally proposed for straggler mitigation, it is also
applicable to accessing erasure coded chunks from distributed storage.
The authors of (Aktaş and Soljanin, 2019) provided a single fork analysis
with coding, where k chunk requests are started at t = 0. Further, after
a fixed deterministic time ∆, additional n − k chunk requests are
started. While that work laid out an important problem, this chapter considers the following differences to the approach: (i) we allow for a general number of starting chunk requests, (ii) the start time of new chunk requests is
random and based on the completion time of certain number of chunk
requests rather than a fixed constant, and (iii) our framework allows for
an optimization of different parameters to provide a tradeoff between
service utilization cost and service completion time. As shown in the
evaluation results, the choice of n0 = k is not always optimal, which
additionally motivates such setup. This analysis has been considered in
(Badita et al., 2020a; Badita et al., 2020b).
The approach could be further extended in the following directions:

1. Queueing Analysis: The proposed framework in this chapter


considers a non-queueing system with a single job. The analysis
with multiple arrivals is open.

2. General Service Distribution: The analysis in this chapter is


limited to shifted-exponential service times. Even though the Pareto
distribution has been considered in (Aktaş and Soljanin, 2019),
considering general service time distribution is important.

3. Multiple Forking Points: We only considered single forking.


Additional benefits to more forking points is an open problem.

4. Use for Distributed Gradient Descent: The approach can be


used for straggler mitigation with gradient codes (Sasi et al., 2020),
where it has been shown that the delay allows for lower amount
of computation per node. Thus, the results of delayed relaunch
scheduling can be used for distributed gradient computations in
addition to that for erasure-coded storage.
6
Analyzing Latency for Video Content

In this Chapter, we extend the setup to assume that the servers store
video content. Rather than downloading the content, the users are
streaming the content, which makes the notion of stall duration more
important. We explain the system model in Section 6.1. The download and play times of different segments in a video are characterized in Section
6.2. This is further used to characterize upper bounds on mean stall
duration and tail stall duration in Sections 6.3 and 6.4, respectively.
Sections 6.5 and 6.6 contain simulation results and notes on future
directions, respectively.

6.1 Modeling Stall Duration for Video Requests

We consider a distributed storage system consisting of m heterogeneous


servers (also called storage nodes), denoted by M = {1, 2, . . . , m}. Each video file i, where i = 1, 2, . . . , r, is divided into Li equal segments,
Gi,1 , · · · , Gi,Li , each of length τ sec. Then, each segment Gi,j for j ∈
{1, 2, . . . , Li } is partitioned into ki fixed-size chunks and then encoded
using an (ni , ki ) Maximum Distance Separable (MDS) erasure code to
generate ni distinct chunks for each segment Gi,j . These coded chunks
are denoted as C_{i,j}^{(1)}, \cdots, C_{i,j}^{(n_i)}. The encoding setup is illustrated in


Figure 6.1.
The encoded chunks are stored on the disks of ni distinct storage
nodes. These storage nodes are represented by a set Si , such that
Si ⊆ M and ni = |Si |. Each server z ∈ Si stores all the chunks C_{i,j}^{(g_z)}
for all j and for some gz ∈ {1, · · · , ni }. In other words, each of the ni
storage nodes stores one of the coded chunks for the entire duration
of the video. The placement on the servers is illustrated in Figure 6.2,
where server 1 is shown to store first coded chunks of file i, third coded
chunks of file u and first coded chunks for file v.
The use of an (ni , ki ) MDS erasure code introduces a redundancy
factor of ni /ki which allows the video to be reconstructed from the
video chunks from any subset of ki -out-of-ni servers. We note that the
erasure-code can also help in recovery of the content i as long as ki of
the servers containing file i are available (Dimakis et al., 2010). Note
that replication along n servers is equivalent to choosing (n, 1) erasure
code. Hence, when a video i is requested, the request goes to a set Ai
of the storage nodes, where Ai ⊆ Si and ki = |Ai |. From each server
z ∈ Ai , all chunks C_{i,j}^{(g_z)} for all j, with g_z corresponding to the coded chunk placed on server z, are requested. The request is illustrated in
Figure 6.2. In order to play a segment q of video i, C_{i,q}^{(g_z)} should have
been downloaded from all z ∈ Ai . We assume that an edge router, which aggregates the requests of multiple users, is requesting the files. Thus, the connections between the servers and the edge router are considered as the bottleneck. Since the service provider only has control over this part of the network, and the last hop may not be under the control of the provider, the service provider can only guarantee the quality-of-service up to the edge router.
We assume that the files at each server are served in order of the
request according to a first-in-first-out (FIFO) policy. Further, the different chunks of a file are processed in the order of the video segments. This is depicted in Figure 6.3,
where for a server q, when a file i is requested, all the chunks are placed
in the queue where other video requests before this that have not yet
been served are waiting.
In order to schedule the requests for video file i to the ki servers,
the choice of ki -out-of-ni servers is important. Finding the optimal


Figure 6.1: A schematic illustrating the video fragmentation and erasure-coding processes. Video i is composed of Li segments. Each segment is partitioned into ki chunks and then encoded using an (ni , ki ) MDS code.

Figure 6.2: An illustration of a distributed storage system equipped with m nodes and storing 3 video files assuming (ni , ki ) erasure codes.

Figure 6.3: An example of the instantaneous queue status at server q, where q ∈ {1, 2, . . . , m}.

choice of these servers to compute the latency expressions is an open


problem to the best of our knowledge. Thus, this chapter uses a policy,
called Probabilistic Scheduling, given in Chapter 4.1. This policy allows
choice of every possible subset of ki nodes with certain probability.
Upon the arrival of a video file i, we randomly dispatch the batch of
ki chunk requests to an appropriate set of nodes (denoted by the set Ai
of servers for file i) with predetermined probabilities (P (Ai ) for set
Ai and file i). Then, each node buffers requests in a local queue and
processes in order and independently as explained before. From Chapter
4.1, we note that a probabilistic scheduling policy with feasible probabilities {P (Ai ) : ∀i, Ai } exists if and only if there exist conditional probabilities πij ∈ [0, 1] ∀i, j satisfying
\sum_{j=1}^{m} \pi_{ij} = k_i \;\; \forall i \quad \text{and} \quad \pi_{ij} = 0 \text{ if } j \notin S_i .
j=1

In other words, selecting each node j with probability πij would yield
a feasible choice of {P (Ai ) : ∀i , Ai }. Thus, we consider the request
probabilities πij as the probability that the request for video file i uses
server j. While probabilistic scheduling has been used to give bounds on the latency of file download, this chapter uses it to give bounds on the QoE for video streaming.
We note that it may not be ideal in practice for a server to finish one
video request before starting another since that increases delay for the
future requests. However, this can be easily alleviated by considering
that each server has multiple queues (streams) to the edge router
which can all be considered as separate servers. These multiple streams
can allow multiple parallel videos from the server. The probabilistic
scheduling can choose ki of the overall queues to access the content.
This extension can be seen in (Al-Abbasi and Aggarwal, 2018d).
We now describe a queuing model of the distributed storage system.
We assume that the arrival of client requests for each video i forms an
independent Poisson process with a known rate λi . The arrival of file
P
requests at node j forms a Poisson Process with rate Λj = i λi πi,j
which is the superposition of r Poisson processes each with rate λi πi,j .
We assume that the chunk service time for each coded chunk C_{i,l}^{(g_j)}

at server j, Xj , follows a shifted exponential distribution as has been


demonstrated in realistic systems (Xiang et al., 2016; Chen et al., 2014a).
The service time distribution for the chunk service time at server j, Xj ,
is given by the probability density function fj (x), which is
f_j(x) = \begin{cases} \alpha_j e^{-\alpha_j (x - \beta_j)}, & x \ge \beta_j \\ 0, & x < \beta_j \end{cases} .   (6.1)

We note that exponential distribution is a special case with βj = 0.


We note that the constant delays like the networking delay, and the
decoding time can be easily factored into the shift of the shifted expo-
nential distribution. Let M_j(t) = E\left[e^{t X_j}\right] be the moment generating function of X_j. Then, M_j(t) is given as
M_j(t) = \frac{\alpha_j}{\alpha_j - t}\, e^{\beta_j t}, \quad t < \alpha_j .   (6.2)
We note that the arrival rates are given in terms of the video files,
and the service rate above is provided in terms of the coded chunks at
each server. The client plays the video segment after all the ki chunks
for the segment have been downloaded and the previous segment has
been played. We also assume that there is a start-up delay of ds (in
seconds) for the video which is the duration in which the content can be
buffered but not played. This chapter will characterize the stall duration
and stall duration tail probability for this setting.

6.2 Modeling Download and Play Times

In order to understand the stall duration, we need to see the download


time of different coded chunks and the play time of the different segments
of the video.

6.2.1 Download Times of the Chunks from each Server


In this subsection, we will quantify the download time of a chunk for video file i from server j, which stores the chunks C_{i,q}^{(g_j)} for all q = 1, · · · , Li . We consider the download of the q-th chunk C_{i,q}^{(g_j)}. As seen in Figure 6.3, the download of C_{i,q}^{(g_j)} consists of two components: the waiting time of all

the video files in queue before file i request and the service time of
all chunks of video file i up to the q th chunk. Let Wj be the random
variable corresponding to the waiting time of all the video files in queue before the file i request, and Y_j^{(q)} be the (random) service time of coded chunk q for file i from server j. Then, the (random) download time for coded chunk q ∈ {1, · · · , Li} for file i at server j ∈ Ai, denoted D_{i,j}^{(q)}, is given as
D_{i,j}^{(q)} = W_j + \sum_{v=1}^{q} Y_j^{(v)} .   (6.3)

We will now find the distribution of Wj . We note that this is the


waiting time for the video files whose arrival rate is given as \Lambda_j = \sum_i \lambda_i \pi_{i,j}. Since the arrival rate of video files is Poisson, the waiting
time for the start of video download from a server j, Wj , is given by an
M/G/1 process. In order to find the waiting time, we would need to
find the service time statistics of the video files. Note that fj (x) gives
the service time distribution of only a chunk and not of the video files.
Video file i consists of Li coded chunks at server j (j ∈ Si ). The
total service time for video file i at server j if requested from server j,
STi,j , is given as
Li
X (v)
STi,j = Yj . (6.4)
v=1

The service time of the video files is given as

$$ R_j = ST_{i,j} \quad \text{with probability } \frac{\pi_{ij}\lambda_i}{\Lambda_j} \quad \forall i, \qquad (6.5) $$

since the service time is ST_{i,j} when file i is requested from server j. Let
R_j(s) = E[e^{-s R_j}] be the Laplace-Stieltjes Transform of R_j.

Lemma 6.1. The Laplace-Stieltjes Transform of R_j, R_j(s) = E[e^{-s R_j}],
is given as

$$ R_j(s) = \sum_{i=1}^{r} \frac{\pi_{ij}\lambda_i}{\Lambda_j} \left( \frac{\alpha_j e^{-\beta_j s}}{\alpha_j + s} \right)^{L_i}. \qquad (6.6) $$

Proof.

$$ \begin{aligned} R_j(s) &= \sum_{i=1}^{r} \frac{\pi_{ij}\lambda_i}{\Lambda_j}\, E\left[ e^{-s\, ST_{i,j}} \right] \\ &= \sum_{i=1}^{r} \frac{\pi_{ij}\lambda_i}{\Lambda_j}\, E\left[ e^{-s \sum_{\nu=1}^{L_i} Y_j^{(\nu)}} \right] \\ &= \sum_{i=1}^{r} \frac{\pi_{ij}\lambda_i}{\Lambda_j} \left( E\left[ e^{-s Y_j^{(1)}} \right] \right)^{L_i} \\ &= \sum_{i=1}^{r} \frac{\pi_{ij}\lambda_i}{\Lambda_j} \left( \frac{\alpha_j e^{-\beta_j s}}{\alpha_j + s} \right)^{L_i}. \end{aligned} \qquad (6.7) $$

Corollary 6.2. The moment generating function for the service time of
video files when requested from server j, B_j(t), is given by

$$ B_j(t) = \sum_{i=1}^{r} \frac{\pi_{ij}\lambda_i}{\Lambda_j} \left( \frac{\alpha_j e^{\beta_j t}}{\alpha_j - t} \right)^{L_i} \qquad (6.8) $$

for any t > 0 and t < α_j.

Proof. This corollary follows from (6.6) by setting t = −s.

The server utilization for the video files at server j is given as
ρ_j = Λ_j E[R_j]. Since E[R_j] = B_j'(0), using Corollary 6.2, we have

$$ \rho_j = \sum_{i} \pi_{ij}\lambda_i L_i \left( \beta_j + \frac{1}{\alpha_j} \right). \qquad (6.9) $$

Having characterized the service time distribution of the video files
via the Laplace-Stieltjes Transform R_j(s), the Laplace-Stieltjes Transform
of the waiting time W_j can be characterized using the Pollaczek-Khinchine
formula for M/G/1 queues (Zwart and Boxma, 2000), since the request
pattern is Poisson and the service time is generally distributed. Thus, the
Laplace-Stieltjes Transform of the waiting time W_j is given as

$$ E\left[ e^{-s W_j} \right] = \frac{(1-\rho_j)\, s}{s - \Lambda_j\left(1 - R_j(s)\right)}. \qquad (6.10) $$

Having characterized the Laplace-Stieltjes Transform of the waiting
time W_j and knowing the distribution of Y_j^{(v)}, the Laplace-Stieltjes
Transform of the download time D_{i,j}^{(q)} is given as

$$ E\left[ e^{-s D_{i,j}^{(q)}} \right] = \frac{(1-\rho_j)\, s}{s - \Lambda_j\left(1 - R_j(s)\right)} \left( \frac{\alpha_j e^{-\beta_j s}}{\alpha_j + s} \right)^{q}. \qquad (6.11) $$

We note that the expression above holds only in the range of s where
s − Λ_j(1 − R_j(s)) > 0 and α_j + s > 0. Further, the server utilization
ρ_j must be less than 1. The overall download time of all the chunks for
the segment G_{i,q} at the client, D_i^{(q)}, is given by

$$ D_i^{(q)} = \max_{j \in A_i} D_{i,j}^{(q)}. \qquad (6.12) $$
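As an illustration of how these pieces fit together, the sketch below (a numerical aid, not part of the monograph; all parameters are hypothetical) evaluates the Laplace-Stieltjes transform of the chunk download time in (6.11), assembling R_j(s) from (6.6) and the utilization ρ_j from (6.9).

import numpy as np

def R_j(s, alpha_j, beta_j, Lambda_j, lam, pi_j, L):
    # Laplace-Stieltjes transform of the video-file service time at server j, eq. (6.6)
    terms = (pi_j * lam / Lambda_j) * (alpha_j * np.exp(-beta_j * s) / (alpha_j + s)) ** L
    return terms.sum()

def download_time_lst(s, q, alpha_j, beta_j, lam, pi_j, L):
    # LST of D_{i,j}^{(q)} in (6.11); valid while the denominator stays positive
    Lambda_j = (lam * pi_j).sum()
    rho_j = (pi_j * lam * L * (beta_j + 1.0 / alpha_j)).sum()     # eq. (6.9)
    assert rho_j < 1, "server j must be stable"
    waiting = (1 - rho_j) * s / (s - Lambda_j * (1 - R_j(s, alpha_j, beta_j, Lambda_j, lam, pi_j, L)))
    service = (alpha_j * np.exp(-beta_j * s) / (alpha_j + s)) ** q
    return waiting * service

# toy instance: three videos placed on server j (hypothetical rates and lengths)
lam  = np.array([0.002, 0.003, 0.002])   # arrival rates lambda_i
pi_j = np.array([0.5, 0.4, 0.6])         # access probabilities pi_{i,j}
L    = np.array([75, 50, 100])           # number of chunks L_i per video
print(download_time_lst(s=0.05, q=3, alpha_j=0.6, beta_j=2.0, lam=lam, pi_j=pi_j, L=L))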

6.2.2 Play Time of Each Video Segment


Let T_i^{(q)} be the time at which the segment G_{i,q} is played (started) at
the client. The startup delay of the video is d_s. Then, the first segment
can be played at the maximum of the time the first segment can be
downloaded and the startup delay. Thus,

$$ T_i^{(1)} = \max\left( d_s,\; D_i^{(1)} \right). \qquad (6.13) $$

For 1 < q ≤ L_i, the play time of segment q of file i is given by
the maximum of the time it takes to download the segment and the
time at which the previous segment is played plus the time to play a
segment (τ seconds). Thus, the play time of segment q of file i, T_i^{(q)}, can
be expressed as

$$ T_i^{(q)} = \max\left( T_i^{(q-1)} + \tau,\; D_i^{(q)} \right). \qquad (6.14) $$

Equation (6.14) gives a recursive equation, which yields

$$ \begin{aligned} T_i^{(L_i)} &= \max\left( T_i^{(L_i-1)} + \tau,\; D_i^{(L_i)} \right) \\ &= \max\left( T_i^{(L_i-2)} + 2\tau,\; D_i^{(L_i-1)} + \tau,\; D_i^{(L_i)} \right) \\ &= \max\left( d_s + (L_i-1)\tau,\; \max_{z=2}^{L_i+1} \left( D_i^{(z-1)} + (L_i - z + 1)\tau \right) \right). \end{aligned} \qquad (6.15) $$

Since D_i^{(q)} = max_{j ∈ A_i} D_{i,j}^{(q)} from (6.12), T_i^{(L_i)} can be written as

$$ T_i^{(L_i)} = \max_{z=1}^{L_i+1} \max_{j \in A_i} \left( p_{i,j,z} \right), \qquad (6.16) $$

where

$$ p_{i,j,z} = \begin{cases} d_s + (L_i - 1)\tau, & z = 1 \\ D_{i,j}^{(z-1)} + (L_i - z + 1)\tau, & 2 \le z \le L_i + 1. \end{cases} \qquad (6.17) $$

We next give the moment generating function of p_{i,j,z}, which will be
used in the calculation of the QoE metrics in the next sections. Hence,
we define the following lemma.

Lemma 6.3. The moment generating function for p_{i,j,z} is given as

$$ E\left[ e^{t p_{i,j,z}} \right] = \begin{cases} e^{t(d_s + (L_i-1)\tau)}, & z = 1 \\ e^{t(L_i+1-z)\tau}\, Z_{i,j}^{(z-1)}(t), & 2 \le z \le L_i + 1, \end{cases} \qquad (6.18) $$

where

$$ Z_{i,j}^{(\ell)}(t) = E\left[ e^{t D_{i,j}^{(\ell)}} \right] = \frac{(1-\rho_j)\, t\, \left(M_j(t)\right)^{\ell}}{t - \Lambda_j \left(B_j(t) - 1\right)}. \qquad (6.19) $$

Proof. This follows by substituting t = −s in (6.11), where B_j(t) is given
by (6.8) and M_j(t) is given by (6.2). The expression holds when
t − Λ_j(B_j(t) − 1) > 0 and t < α_j for all j, since the moment generating
function does not exist if the above does not hold.

Ideally, the last segment should be completed by time d_s + L_i τ. The
difference between T_i^{(L_i)} and d_s + (L_i − 1)τ gives the stall duration.
Note that the stalls may occur before any segment; this difference
gives the sum of the durations of all the stall periods before any segment.
Thus, the stall duration for the request of file i, Γ^{(i)}, is given as

$$ \Gamma^{(i)} = T_i^{(L_i)} - d_s - (L_i - 1)\tau. \qquad (6.20) $$

In the next two sections, we will use this stall time to determine the
bounds on the mean stall duration and the stall duration tail probability.
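Before bounding these quantities analytically, the following Monte-Carlo sketch (an illustration only, with hypothetical numbers) applies the playback recursion (6.13)-(6.14) and the definition (6.20) to simulated segment download times.

import numpy as np

def stall_duration(D, ds, tau):
    # D[q] is the download completion time of segment q+1; returns Gamma^{(i)} of (6.20)
    T = max(ds, D[0])                     # eq. (6.13)
    for q in range(1, len(D)):
        T = max(T + tau, D[q])            # eq. (6.14)
    return T - ds - (len(D) - 1) * tau    # eq. (6.20)

rng = np.random.default_rng(1)
Li, tau, ds = 75, 4.0, 10.0               # hypothetical video: 75 four-second segments
# crude stand-in for the segment download times of (6.12): one waiting time plus
# the cumulative shifted-exponential service times of successive chunks
waiting = rng.exponential(5.0)
service = 2.0 + rng.exponential(scale=1.0 / 0.6, size=Li)
D = waiting + np.cumsum(service)
print(stall_duration(D, ds, tau))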

6.3 Characterization of Mean Stall Duration

In this section, we will provide a bound for the first QoE metric, which
is the mean stall duration for a file i. We will find the bound through
probabilistic scheduling, and since probabilistic scheduling is one feasible
strategy, the obtained bound is an upper bound to the optimal strategy.
Using (6.20), the expected stall time for file i is given as

$$ E\left[ \Gamma^{(i)} \right] = E\left[ T_i^{(L_i)} - d_s - (L_i-1)\tau \right] = E\left[ T_i^{(L_i)} \right] - d_s - (L_i-1)\tau. \qquad (6.21) $$

An exact evaluation of the play time of segment L_i is hard due
to the dependencies between the p_{i,j,z} random variables for different values
of j and z, where z ∈ {1, 2, ..., L_i + 1} and j ∈ A_i. Hence, we derive
an upper bound on the play time of segment L_i as follows. Using
Jensen's inequality (Kuczma, 2009b), we have for t_i > 0,

$$ e^{t_i E\left[ T_i^{(L_i)} \right]} \le E\left[ e^{t_i T_i^{(L_i)}} \right]. \qquad (6.22) $$

Thus, finding an upper bound on the moment generating function
of T_i^{(L_i)} can lead to an upper bound on the mean stall duration. Thus,
we will now bound the moment generating function of T_i^{(L_i)}.

$$ \begin{aligned} E\left[ e^{t_i T_i^{(L_i)}} \right] &\overset{(a)}{=} E\left[ \max_z \max_{j \in A_i} e^{t_i p_{ijz}} \right] \\ &= E_{A_i}\left[ E\left[ \max_z \max_{j \in A_i} e^{t_i p_{ijz}} \,\Big|\, A_i \right] \right] \\ &\overset{(b)}{\le} E_{A_i}\left[ \sum_{j \in A_i} E\left[ \max_z e^{t_i p_{ijz}} \right] \right] \\ &= E_{A_i}\left[ \sum_j F_{ij}\, \mathbf{1}_{\{j \in A_i\}} \right] \\ &= \sum_j F_{ij}\, E_{A_i}\left[ \mathbf{1}_{\{j \in A_i\}} \right] \\ &= \sum_j F_{ij}\, P(j \in A_i) \\ &\overset{(c)}{=} \sum_j F_{ij}\, \pi_{ij}, \end{aligned} \qquad (6.23) $$

where (a) follows from (6.16), (b) follows by upper bounding max_{j ∈ A_i}
by Σ_{j ∈ A_i}, (c) follows by probabilistic scheduling where P(j ∈ A_i) = π_{ij},
and F_{ij} = E[max_z e^{t_i p_{ijz}}]. We note that the only inequality here is for
replacing the maximum by the sum. Since this term will be inside the
logarithm for the mean stall latency, the gap between the term and its
bound becomes additive rather than multiplicative.
Substituting (6.23) in (6.22), we have

$$ E\left[ T_i^{(L_i)} \right] \le \frac{1}{t_i} \log\left( \sum_{j=1}^{m} \pi_{ij} F_{ij} \right). \qquad (6.24) $$

Let H_{ij} = Σ_{ℓ=1}^{L_i} e^{−t_i(d_s + (ℓ−1)τ)} Z_{i,j}^{(ℓ)}(t_i), where Z_{i,j}^{(ℓ)}(t) is defined in
equation (6.19). We note that H_{ij} can be simplified using the geometric
series formula as follows.
Lemma 6.4.

$$ H_{ij} = \frac{e^{-t_i(d_s - \tau)}\,(1-\rho_j)\, t_i}{t_i - \Lambda_j \left(B_j(t_i) - 1\right)} \cdot \frac{\widetilde{M}_j(t_i)\left(1 - \left(\widetilde{M}_j(t_i)\right)^{L_i}\right)}{1 - \widetilde{M}_j(t_i)}, \qquad (6.25) $$

where M̃_j(t_i) = M_j(t_i) e^{−t_i τ}, M_j(t_i) is given in (6.2), and B_j(t_i) is given
in (6.8).

Proof.

$$ \begin{aligned} H_{ij} &= \sum_{\ell=1}^{L_i} e^{-t_i(d_s + (\ell-1)\tau)}\, \frac{(1-\rho_j)\, t_i}{t_i - \Lambda_j (B_j(t_i) - 1)} \left( \frac{\alpha_j e^{t_i \beta_j}}{\alpha_j - t_i} \right)^{\ell} \\ &= \frac{e^{-t_i d_s}\, (1-\rho_j)\, t_i}{t_i - \Lambda_j (B_j(t_i) - 1)} \sum_{\ell=1}^{L_i} e^{-t_i(\ell-1)\tau} \left( \frac{\alpha_j e^{t_i \beta_j}}{\alpha_j - t_i} \right)^{\ell} \\ &= \frac{e^{-t_i(d_s - \tau)}\, (1-\rho_j)\, t_i}{t_i - \Lambda_j (B_j(t_i) - 1)} \sum_{\ell=1}^{L_i} \left( \frac{\alpha_j e^{t_i \beta_j}}{\alpha_j - t_i}\, e^{-t_i \tau} \right)^{\ell} \\ &= \frac{e^{-t_i(d_s - \tau)}\, (1-\rho_j)\, t_i}{t_i - \Lambda_j (B_j(t_i) - 1)} \cdot M_j(t_i) e^{-t_i \tau}\, \frac{1 - \left(M_j(t_i)\right)^{L_i} e^{-t_i L_i \tau}}{1 - M_j(t_i) e^{-t_i \tau}} \\ &= \frac{e^{-t_i(d_s - \tau)}\, (1-\rho_j)\, t_i}{t_i - \Lambda_j (B_j(t_i) - 1)} \cdot \frac{\widetilde{M}_j(t_i)\left(1 - \left(\widetilde{M}_j(t_i)\right)^{L_i}\right)}{1 - \widetilde{M}_j(t_i)}. \end{aligned} \qquad (6.26) $$

Substituting (6.24) in (6.21) and performing some manipulations, the mean
stall duration is bounded as follows.

Theorem 6.5. The mean stall duration time for file i is bounded by

$$ E\left[ \Gamma^{(i)} \right] \le \frac{1}{t_i} \log\left( \sum_{j=1}^{m} \pi_{ij} \left( 1 + H_{ij} \right) \right) \qquad (6.27) $$

for any t_i > 0, ρ_j = Σ_i π_{ij} λ_i L_i (β_j + 1/α_j), ρ_j < 1, and
Σ_{f=1}^{r} π_{fj} λ_f (α_j e^{β_j t_i}/(α_j − t_i))^{L_f} − (Λ_j + t_i) < 0, ∀j.

Proof. We first find an upper bound on F_{ij} as follows:

$$ \begin{aligned} F_{ij} &= E\left[ \max_z e^{t_i p_{ijz}} \right] \\ &\overset{(d)}{\le} \sum_z E\left[ e^{t_i p_{ijz}} \right] \\ &\overset{(e)}{=} e^{t_i(d_s + (L_i-1)\tau)} + \sum_{z=2}^{L_i+1} e^{t_i(L_i - z + 1)\tau}\, \frac{(1-\rho_j)\, t_i}{t_i - \Lambda_j(B_j(t_i)-1)} \left( \frac{\alpha_j e^{t_i\beta_j}}{\alpha_j - t_i} \right)^{z-1} \\ &\overset{(f)}{=} e^{t_i(d_s + (L_i-1)\tau)} + \sum_{\ell=1}^{L_i} e^{t_i(L_i - \ell)\tau}\, \frac{(1-\rho_j)\, t_i}{t_i - \Lambda_j(B_j(t_i)-1)} \left( \frac{\alpha_j e^{t_i\beta_j}}{\alpha_j - t_i} \right)^{\ell}, \end{aligned} \qquad (6.28) $$

where (d) follows by bounding the maximum by the sum, (e) follows
from (6.18), and (f) follows by substituting ℓ = z − 1.

Further, substituting the bounds (6.28) and (6.24) in (6.21), the
mean stall duration is bounded as follows:

$$ \begin{aligned} E\left[ \Gamma^{(i)} \right] &\le \frac{1}{t_i} \log\left( \sum_{j=1}^{m} \pi_{ij} \left[ e^{t_i(d_s + (L_i-1)\tau)} + \sum_{\ell=1}^{L_i} e^{t_i(L_i-\ell)\tau}\, Z_{i,j}^{(\ell)}(t_i) \right] \right) - \left( d_s + (L_i-1)\tau \right) \\ &= \frac{1}{t_i} \log\left( \sum_{j=1}^{m} \pi_{ij} \left[ e^{t_i(d_s + (L_i-1)\tau)} + \sum_{\ell=1}^{L_i} e^{t_i(L_i-\ell)\tau}\, Z_{i,j}^{(\ell)}(t_i) \right] \right) - \frac{1}{t_i} \log\left( e^{t_i(d_s + (L_i-1)\tau)} \right) \\ &= \frac{1}{t_i} \log\left( \sum_{j=1}^{m} \pi_{ij} \left[ 1 + \sum_{\ell=1}^{L_i} e^{-t_i(d_s + (\ell-1)\tau)}\, Z_{i,j}^{(\ell)}(t_i) \right] \right). \end{aligned} \qquad (6.29) $$

Note that Theorem 6.5 above holds only in the range of t_i for which
t_i − Λ_j(B_j(t_i) − 1) > 0, which reduces to
Σ_{f=1}^{r} π_{fj} λ_f (α_j e^{β_j t_i}/(α_j − t_i))^{L_f} − (Λ_j + t_i) < 0, ∀i, j, and α_j − t_i > 0.
Further, the server utilization ρ_j must be less than 1 for stability of the system.

We note that for the scenario where the files are downloaded rather
than streamed, a metric of interest is the mean download time. This is a
special case of our approach when the number of segments of each video
is one, i.e., L_i = 1. Thus, the mean download time of the file follows as a
special case of Theorem 6.5. This special case was discussed in detail in
Section 4.2.
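A small numerical sketch of the bound is given below (illustrative only; the single-server parameters are hypothetical and B_j uses the single-file shortcut of (6.8)). It computes H_ij from Lemma 6.4 and then the right-hand side of (6.27).

import numpy as np

def H_ij(ti, ds, tau, Li, alpha_j, beta_j, Lambda_j, rho_j, Bj):
    # Lemma 6.4, eq. (6.25)
    Mj = alpha_j / (alpha_j - ti) * np.exp(beta_j * ti)      # eq. (6.2)
    Mt = Mj * np.exp(-ti * tau)                              # tilde-M_j(t_i)
    denom = ti - Lambda_j * (Bj - 1.0)
    assert ti < alpha_j and denom > 0 and Mt < 1
    return (np.exp(-ti * (ds - tau)) * (1 - rho_j) * ti / denom) * Mt * (1 - Mt**Li) / (1 - Mt)

def mean_stall_bound(ti, pi_i, H_row):
    # right-hand side of Theorem 6.5, eq. (6.27)
    return np.log(np.sum(pi_i * (1.0 + H_row))) / ti

# toy single-server, single-file instance (all values hypothetical)
alpha_j, beta_j = 0.6, 0.05
Li, tau, ds, ti = 75, 4.0, 10.0, 0.01
lam, pi_ij = 0.002, 1.0
Lambda_j = lam * pi_ij
rho_j = pi_ij * lam * Li * (beta_j + 1.0 / alpha_j)                                     # eq. (6.9)
Bj = (pi_ij * lam / Lambda_j) * (alpha_j * np.exp(beta_j * ti) / (alpha_j - ti)) ** Li  # eq. (6.8)
h = H_ij(ti, ds, tau, Li, alpha_j, beta_j, Lambda_j, rho_j, Bj)
print(mean_stall_bound(ti, np.array([pi_ij]), np.array([h])))

In practice the auxiliary variable t_i is optimized, as in Section 6.5, to make this bound as tight as possible.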

6.4 Characterization of Tail Stall Duration

The stall duration tail probability of a file i is defined as the probability
that the stall duration Γ^{(i)} is greater than (or equal to) x. Since
evaluating Pr(Γ^{(i)} ≥ x) in closed form is hard (Huang et al., 2012b;
Joshi et al., 2014; Lee et al., 2017; Xiang et al., 2014; Xiang et al., 2016;
Chen et al., 2014a), we derive an upper bound on the stall duration tail
probability considering Probabilistic Scheduling as follows.

$$ \Pr\left( \Gamma^{(i)} \ge x \right) \overset{(a)}{=} \Pr\left( T_i^{(L_i)} \ge x + d_s + (L_i-1)\tau \right) = \Pr\left( T_i^{(L_i)} \ge \bar{x} \right), \qquad (6.30) $$

where (a) follows from (6.20) and x̄ = x + d_s + (L_i − 1)τ. Then,

$$ \begin{aligned} \Pr\left( T_i^{(L_i)} \ge \bar{x} \right) &\overset{(b)}{=} \Pr\left( \max_z \max_{j \in A_i} p_{ijz} \ge \bar{x} \right) \\ &= E_{A_i, p_{ijz}}\left[ \mathbf{1}_{\left\{ \max_z \max_{j\in A_i} p_{ijz} \ge \bar{x} \right\}} \right] \qquad (6.31) \\ &\overset{(c)}{=} E_{A_i, p_{ijz}}\left[ \max_{j \in A_i} \mathbf{1}_{\left\{ \max_z p_{ijz} \ge \bar{x} \right\}} \right] \\ &\overset{(d)}{\le} E_{A_i, p_{ijz}}\left[ \sum_{j \in A_i} \mathbf{1}_{\left\{ \max_z p_{ijz} \ge \bar{x} \right\}} \right] \\ &\overset{(e)}{=} \sum_j \pi_{ij}\, E_{p_{ijz}}\left[ \mathbf{1}_{\left\{ \max_z p_{ijz} \ge \bar{x} \right\}} \right] \\ &= \sum_j \pi_{ij}\, \Pr\left( \max_z p_{ijz} \ge \bar{x} \right), \end{aligned} \qquad (6.32) $$

where (b) follows from (6.16), (c) follows as both the max over z and the max
over A_i are over discrete indices (quantities) and do not depend on each other,
so they can be exchanged, (d) follows by replacing the max over A_i by the sum over A_i,
and (e) follows from probabilistic scheduling. Using the Markov inequality, we get

$$ \Pr\left( \max_z p_{ijz} \ge \bar{x} \right) \le \frac{E\left[ e^{t_i \max_z p_{ijz}} \right]}{e^{t_i \bar{x}}}. \qquad (6.33) $$

We further simplify to get

$$ \Pr\left( \max_z p_{ijz} \ge \bar{x} \right) \le \frac{E\left[ e^{t_i \max_z p_{ijz}} \right]}{e^{t_i \bar{x}}} = \frac{E\left[ \max_z e^{t_i p_{ijz}} \right]}{e^{t_i \bar{x}}} \overset{(f)}{=} \frac{F_{ij}}{e^{t_i \bar{x}}}, \qquad (6.34) $$

where (f) follows from (6.28). Substituting (6.34) in (6.32), we get the
stall duration tail probability as described in the following theorem.

Theorem 6.6. The stall duration tail probability for video file i is
bounded by

$$ \Pr\left( \Gamma^{(i)} \ge x \right) \le \sum_{j} \frac{\pi_{ij}}{e^{t_i x}} \left( 1 + e^{-t_i (d_s + (L_i-1)\tau)} H_{ij} \right) \qquad (6.35) $$

for any t_i > 0, ρ_j = Σ_i π_{ij} λ_i L_i (β_j + 1/α_j), ρ_j < 1,
Σ_{f=1}^{r} π_{fj} λ_f (α_j e^{β_j t_i}/(α_j − t_i))^{L_f} − (Λ_j + t_i) < 0, ∀i, j, and H_{ij} is given by
(6.25).

Proof. Substituting (6.34) in (6.32), we get

$$ \begin{aligned} \Pr\left( T_i^{(L_i)} \ge \bar{x} \right) &\le \sum_j \pi_{ij}\, \Pr\left( \max_z p_{ijz} \ge \bar{x} \right) \\ &\le \sum_j \pi_{ij}\, \frac{F_{ij}}{e^{t_i \bar{x}}} \\ &\overset{(g)}{\le} \sum_j \frac{\pi_{ij}}{e^{t_i \bar{x}}} \left( e^{t_i(d_s + (L_i-1)\tau)} + H_{ij} \right) \\ &= \sum_j \frac{\pi_{ij}\left( e^{t_i(d_s + (L_i-1)\tau)} + H_{ij} \right)}{e^{t_i(x + d_s + (L_i-1)\tau)}} \\ &= \sum_j \frac{\pi_{ij}}{e^{t_i x}} \left( 1 + e^{-t_i(d_s + (L_i-1)\tau)} H_{ij} \right), \end{aligned} \qquad (6.36) $$

where (g) follows from (6.28) and H_{ij} is given by (6.25).

We note that for the scenario, where the files are downloaded rather
than streamed, a metric of interest is the latency tail probability which
is the probability that the file download latency is greater than x. This
is a special case of our approach when the number of segments of each
video is one, or Li = 1. Thus, the latency tail probability of the file
follows as a special case of Theorem 6.6. In this special case, the result
reduces to that in (Aggarwal et al., 2017b).
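The tail bound of Theorem 6.6 can be evaluated in the same way; the sketch below reuses the H_ij value computed in the previous sketch (hard-coded here so the snippet is self-contained; all numbers remain hypothetical).

import numpy as np

def stall_tail_bound(x, ti, ds, tau, Li, pi_i, H_row):
    # right-hand side of Theorem 6.6, eq. (6.35)
    return np.sum(pi_i / np.exp(ti * x) * (1.0 + np.exp(-ti * (ds + (Li - 1) * tau)) * H_row))

pi_i = np.array([1.0])          # single server, as in the previous sketch
H_row = np.array([53.5])        # approximate H_ij value from the previous sketch
for x in (50.0, 100.0, 200.0):
    print(x, stall_tail_bound(x, ti=0.01, ds=10.0, tau=4.0, Li=75, pi_i=pi_i, H_row=H_row))

With t_i fixed at 0.01 the bound is loose; as in the mean-stall case, t_i is a free parameter and is optimized in Section 6.5 to tighten the bound for each x of interest.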

6.5 Simulations

Let π = (π_{ij}, ∀ i = 1, ..., r and j = 1, ..., m), S = (S_1, S_2, ..., S_r),
and t = (t̃_1, t̃_2, ..., t̃_r; t_1, t_2, ..., t_r). Note that the values of t_i used
for the mean stall duration and for the stall duration tail probability can be
different, and the parameters t̃ and t indicate these parameters for the
two cases, respectively. We wish to minimize the two proposed QoE
metrics over the choice of scheduling and access decisions. Since this
is a multi-objective optimization, the objective can be modeled as a
convex combination of the two QoE metrics.

Let λ = Σ_i λ_i be the total arrival rate. Then, λ_i/λ is the fraction of
requests for video i. The first objective is the minimization of the mean stall
duration, averaged over all the file requests, and is given as Σ_i (λ_i/λ) E[Γ^{(i)}].
The second objective is the minimization of the stall duration tail probability,
averaged over all the file requests, and is given as Σ_i (λ_i/λ) Pr(Γ^{(i)} ≥ x).
Using the expressions for the mean stall duration and the stall duration
tail probability in Sections 6.3 and 6.4, respectively, optimization of
a convex combination of the two QoE metrics can be formulated as
follows.

  
$$ \min \;\; \sum_i \frac{\lambda_i}{\lambda} \left[ \theta\, \frac{1}{\tilde{t}_i} \log\left( \sum_{j=1}^{m} \pi_{ij} \left( 1 + \widetilde{H}_{ij} \right) \right) + (1-\theta) \sum_j \frac{\pi_{ij}}{e^{t_i x}} \left( 1 + e^{-t_i (d_s + (L_i-1)\tau)} \overline{H}_{ij} \right) \right] \qquad (6.37) $$

s.t.

$$ \widetilde{H}_{ij} = \frac{e^{-\tilde{t}_i(d_s-\tau)}(1-\rho_j)\,\tilde{t}_i}{\tilde{t}_i - \Lambda_j\left(B_j(\tilde{t}_i)-1\right)}\, \widetilde{Q}_{ij}, \qquad (6.38) $$

$$ \overline{H}_{ij} = \frac{e^{-t_i(d_s-\tau)}(1-\rho_j)\, t_i}{t_i - \Lambda_j\left(B_j(t_i)-1\right)}\, \overline{Q}_{ij}, \qquad (6.39) $$

$$ \widetilde{Q}_{ij} = \frac{\widetilde{M}_j(\tilde{t}_i)\left( 1 - \left(\widetilde{M}_j(\tilde{t}_i)\right)^{L_i} \right)}{1 - \widetilde{M}_j(\tilde{t}_i)}, \qquad (6.40) $$

$$ \overline{Q}_{ij} = \frac{\widetilde{M}_j(t_i)\left( 1 - \left(\widetilde{M}_j(t_i)\right)^{L_i} \right)}{1 - \widetilde{M}_j(t_i)}, \qquad (6.41) $$

$$ \widetilde{M}_j(\tilde{t}_i) = \frac{\alpha_j e^{(\beta_j-\tau)\tilde{t}_i}}{\alpha_j - \tilde{t}_i}, \qquad (6.42) $$

$$ B_j(\tilde{t}_i) = \sum_{f=1}^{r} \frac{\lambda_f \pi_{fj}}{\Lambda_j} \left( \frac{\alpha_j e^{\beta_j \tilde{t}_i}}{\alpha_j - \tilde{t}_i} \right)^{L_f}, \qquad (6.43) $$

$$ \widetilde{M}_j(t_i) = \frac{\alpha_j e^{(\beta_j-\tau)t_i}}{\alpha_j - t_i}, \qquad (6.44) $$

$$ B_j(t_i) = \sum_{f=1}^{r} \frac{\lambda_f \pi_{fj}}{\Lambda_j} \left( \frac{\alpha_j e^{\beta_j t_i}}{\alpha_j - t_i} \right)^{L_f}, \qquad (6.45) $$

$$ \rho_j = \sum_{f=1}^{r} \pi_{fj}\lambda_f L_f \left( \beta_j + \frac{1}{\alpha_j} \right) < 1 \quad \forall j, \qquad (6.46) $$

$$ \Lambda_j = \sum_{f=1}^{r} \lambda_f \pi_{f,j} \quad \forall j, \qquad (6.47) $$

$$ \sum_{j=1}^{m} \pi_{i,j} = k_i, \qquad (6.48) $$

$$ \pi_{i,j} = 0 \;\text{ if }\; j \notin S_i, \qquad \pi_{i,j} \in [0,1], \qquad (6.49) $$

$$ |S_i| = n_i, \quad \forall i, \qquad (6.50) $$

$$ 0 < \tilde{t}_i < \alpha_j, \quad \forall j, \qquad (6.51) $$

$$ 0 < t_i < \alpha_j, \quad \forall j, \qquad (6.52) $$

$$ \alpha_j \left( e^{(\beta_j - \tau)\tilde{t}_i} - 1 \right) + \tilde{t}_i < 0, \quad \forall j, \qquad (6.53) $$

$$ \alpha_j \left( e^{(\beta_j - \tau)t_i} - 1 \right) + t_i < 0, \quad \forall j, \qquad (6.54) $$

$$ \sum_{f=1}^{r} \pi_{fj}\lambda_f \left( \frac{\alpha_j e^{\beta_j \tilde{t}_i}}{\alpha_j - \tilde{t}_i} \right)^{L_f} - \left( \Lambda_j + \tilde{t}_i \right) < 0, \quad \forall i, j, \qquad (6.55) $$

$$ \sum_{f=1}^{r} \pi_{fj}\lambda_f \left( \frac{\alpha_j e^{\beta_j t_i}}{\alpha_j - t_i} \right)^{L_f} - \left( \Lambda_j + t_i \right) < 0, \quad \forall i, j, \qquad (6.56) $$

$$ \text{var.} \quad \pi,\; t,\; S. \qquad (6.57) $$
Here, θ ∈ [0, 1] is a trade-off factor that determines the relative
significance of mean and tail probability of the stall durations in the
minimization problem. Varying θ = 0 to θ = 1, the solution for (6.37)
spans the solutions that minimize the mean stall duration to ones that
minimize the stall duration tail probability. Note that constraint (6.46)

gives the load intensity of server j. Constraint (6.47) gives the aggregate
arrival rate Λj for each node for the given probabilistic scheduling prob-
abilities πij and arrival rates λi . Constraints (6.49)-(6.50) guarantees
that the scheduling probabilities are feasible. Constraints (6.51)-(6.54)
ensure that M fj (t) exist for each tei and ti . Finally, Constraints (6.55)-
(6.56) ensure that the moment generating function given in (6.19) exists.
We note that the optimization over π helps decrease the objective
function and gives significant flexibility over choosing the lowest-queue
servers for accessing the files. The placement of the video files S helps
separate the highly accessed files on different servers thus reducing the
objective. Finally, the optimization over the auxiliary variables t gives a
tighter bound on the objective function. We note that the QoE for file
i is weighed by the arrival rate λi in the formulation. However, general
weights can be easily incorporated for weighted fairness or differentiated
services.
Note that the proposed optimization problem is a mixed integer
non-convex optimization as we have the placement over n servers and
the constraints (6.55) and (6.56) are non-convex in (π, t). The problem
can be solved using an optimization algorithm described in (Al-Abbasi
and Aggarwal, 2018d), which in part uses NOVA algorithm proposed
in (Scutari et al., 2017).
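The following is a much-simplified sketch of one block of such an alternating scheme: for a fixed placement S and access probabilities π it searches over the auxiliary variable t_i to tighten the mean-stall bound (θ = 1). This is not the algorithm of (Al-Abbasi and Aggarwal, 2018d); the server parameters, the single-file shortcut for B_j, and the use of a bounded scalar search are all simplifying assumptions made only for illustration.

import numpy as np
from scipy.optimize import minimize_scalar

def mean_stall_objective(ti, pi_i, servers):
    # objective of (6.37) for one file with theta = 1, using Lemma 6.4 for H_ij
    total = 0.0
    for pij, (alpha, beta, Lam, rho, Li, ds, tau) in zip(pi_i, servers):
        if not (0.0 < ti < alpha):
            return 1e12                                        # outside (6.51)
        Mj = alpha / (alpha - ti) * np.exp(beta * ti)
        Mt = Mj * np.exp(-ti * tau)
        Bj = (alpha * np.exp(beta * ti) / (alpha - ti)) ** Li  # single-file shortcut for (6.8)
        denom = ti - Lam * (Bj - 1.0)
        if denom <= 0.0 or Mt >= 1.0:
            return 1e12                                        # violates (6.53)/(6.55)
        H = np.exp(-ti * (ds - tau)) * (1 - rho) * ti / denom * Mt * (1 - Mt**Li) / (1 - Mt)
        total += pij * (1.0 + H)
    return np.log(total) / ti

# hypothetical two-server instance for one file: (alpha, beta, Lambda, rho, Li, ds, tau)
servers = [(0.6, 0.05, 0.002, 0.26, 75, 10.0, 4.0),
           (0.8, 0.05, 0.002, 0.20, 75, 10.0, 4.0)]
pi_i = np.array([0.5, 0.5])
res = minimize_scalar(lambda t: mean_stall_objective(t, pi_i, servers),
                      bounds=(1e-4, 0.5), method="bounded")
print(res.x, res.fun)     # best auxiliary t_i and the corresponding bound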
We simulate our algorithm in a distributed storage system of m = 12
distributed nodes, where each video file uses a (10, 4) erasure code. The
parameters for the storage servers are chosen as in Table 4.1, which were
chosen in (Xiang et al., 2016) in the experiments using the Tahoe testbed.
Further, the (10, 4) erasure code is used in HDFS-RAID at Facebook (al.,
2010) and Microsoft (Huang et al., 2012a). Unless otherwise explicitly
stated, we consider r = 1000 files, whose sizes are generated based on
a Pareto distribution (Arnold, 2015) with a shape factor of 2 and a scale of
300. We note that the Pareto distribution is considered
as it has been widely used in existing literature (Ramaswami et al.,
2014) to model video files and file-size distributions over networks. We
also assume that the chunk service time follows a shifted-exponential
distribution with rate α_j and shift β_j, whose values are shown in Table
4.1 and are generated at random and kept fixed for the experiments.
Unless explicitly stated, the arrival rate for the first 500 files is 0.002 s⁻¹,
while that for the next 500 files is set to be 0.003 s⁻¹. Chunk size τ is set to
be equal to 4 s. When generating video files, the video file
sizes are rounded up to a multiple of 4 sec. We note that a high load
scenario is considered for the numerical results. In order to initialize
our algorithm, we use a random placement of files on all the servers.
Further, we set π_ij = k/n on the placed servers with t_i = 0.01 ∀i and
j ∈ S_i. However, these choices of π_ij and t_i may not be feasible. Thus,
we modify the initialization of π to be the closest-norm feasible solution
given the above values of S and t; a simplified sketch of such a projection
step is given below.
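The sketch enforces only the simplex constraints (6.48)-(6.49) by Euclidean projection onto a capped simplex (bisecting on the dual variable), whereas the actual initialization must also respect the stability constraints (6.55)-(6.56). All values are placeholders.

import numpy as np

def project_access_probabilities(pi_row, placed, k_i, iters=60):
    # Euclidean projection of one file's access vector onto
    # {0 <= pi_ij <= 1, sum_j pi_ij = k_i, pi_ij = 0 for j not in S_i}
    x = np.where(placed > 0, pi_row, 0.0)
    lo, hi = x.min() - 1.0, x.max() + 1.0              # bracket the dual shift mu
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        proj = np.clip(x - mu, 0.0, 1.0) * placed
        if proj.sum() > k_i:
            lo = mu                                    # shift further to reduce the sum
        else:
            hi = mu
    return np.clip(x - 0.5 * (lo + hi), 0.0, 1.0) * placed

placed = np.array([1.0] * 10 + [0.0] * 2)              # |S_i| = n = 10 out of m = 12 servers
pi0 = placed * 0.55                                    # infeasible start: sums to 5.5 > k_i
pi_feasible = project_access_probabilities(pi0, placed, k_i=4)
print(pi_feasible.sum())                               # ~4.0, as required by (6.48)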
We compare the proposed approach with some baselines:

1. Random Placement, Optimized Access (RP-OA): In this strategy, the


placement is chosen at random where any n out of m servers are chosen
for each file, where each choice is equally likely. Given the random
placement, the variables t and π are optimized using the proposed
algorithm, where S-optimization is not performed.

2. Optimized Placement, Projected Equal Access (OP-PEA): The strat-


egy utilizes π, t and S as mentioned in the setup. Then, alternating
optimization over placement and t are performed using the proposed
algorithm.

3. Random Placement, Projected Equal Access (RP-PEA): In this


strategy, the placement is chosen at random where any n out of m
servers are chosen for each file, where each choice is equally likely.
Further, we set πij = k/n on the placed servers with ti = 0.01 ∀i and
j ∈ Si . We then modify the initialization of π to be closest norm feasible
solution given above values of S and t. Finally, an optimization over t is
performed with respect to the objective using the proposed algorithm.

4. OP-PSP (Optimized Placement-Projected Service-Rate Proportional


Allocation) Policy: The joint request scheduler chooses the access probabilities
to be proportional to the service rates of the storage nodes, i.e.,
π_ij = k_i μ_j / Σ_j μ_j. This policy assigns servers proportional to their service
rates. These access probabilities are projected toward feasible region
for a uniformly random placed files to ensure stability of the storage
system. With these fixed access probabilities, the weighted mean stall

Figure 6.4: Mean stall duration for different video arrival rates with different video
lengths.

duration and stall duration tail probability are optimized over the t,
and placement S.

5. RP-PSP (Random Placement-PSP) Policy: As compared to the OP-


PSP Policy, the chunks are placed uniformly at random. The weighted
mean stall duration and stall duration tail probability are optimized
over the choice of auxiliary variables t.
Figure 6.4 shows the effect of different video arrival rates on the
mean stall duration for videos of different lengths. The different sizes
use the Pareto-distributed lengths described above. We compare our
proposed algorithm with the five baseline policies and see that the
proposed algorithm outperforms all baseline strategies for the QoE
metric of mean stall duration. Thus, both the access and the placement of files
are important for the reduction of the mean stall duration. Further, we
see that the mean stall duration increases with the arrival rates, as expected.
Since the mean stall duration is more significant at high arrival rates,
we notice a significant improvement in mean stall duration of about
60% (from approximately 700 s to about 250 s) at the highest arrival rate in
Figure 6.4 as compared to the random placement and projected equal
access policy.
Figure 6.5 shows the decay of weighted stall duration tail probability
with respect to x (in seconds) for the proposed and the baseline strategies.

Figure 6.5: Stall duration tail probability for different values of x (in seconds).

In order to magnify the small differences, we plot the y-axis on a
logarithmic scale. We observe that the proposed algorithm gives orders of
magnitude improvement in the stall duration tail probabilities as compared to the
baseline strategies.
If the mean stall duration decreases, intuitively the stall duration
tail probability also reduces. Thus, a question arises whether the optimal
point for decreasing the mean stall duration and the stall duration tail
probability is the same. We answer this question in the negative: for
r = 1000 files of equal length 300 sec, we find that at the values of
(π, S) that optimize the mean stall duration, the stall duration tail
probability is 12 times higher as compared to the optimal stall duration
tail probability. Similarly, the optimal mean stall duration is 30% lower
as compared to the mean stall duration at the value of (π, S) that
optimizes the stall duration tail probability. Thus, an efficient tradeoff
point between the QoE metrics can be chosen based on the point on
the curve that is appropriate for the clients.

6.6 Notes and Open Problems

Servicing Video on Demand and Live TV Content from cloud servers


have been studied widely (Lee et al., 2013; Huang et al., 2011; He
et al., 2014; Chang et al., 2016; Oza and Gohil, 2016). The reliability
of content over the cloud servers has first been considered for video
streaming applications in (Al-Abbasi and Aggarwal, 2018d).

Figure 6.6: Tradeoff between mean stall duration and stall duration tail probability
obtained by varying θ.

In this
work, the mean and tail stall durations are characterized. Based on
this analysis of the metrics, joint placement of content and resource
optimization over the cloud servers has been considered. Even though
we consider a single stream from each server to the edge router, we can
extend the approach to multiple parallel streams. Multiple streams allow
multiple video files to be fetched in parallel, so that one file does not have
to wait behind another. This extension has also been studied in (Al-Abbasi and Aggarwal,
2018d). The results have been further extended to a Virtualized Content
Distribution Network (vCDN) architecture in (Al-Abbasi et al., 2019c),
which consists of a remote datacenter that stores the complete original video
data and multiple CDN sites (i.e., local cache servers) that only have
part of those data and are equipped with solid state drives (SSDs) for
higher throughput. A user request for video content not satisfied in the
local cache is directed to, and processed by, the remote datacenter. If
the required video content/chunk is not stored in the cache servers, multiple
parallel connections are established between a cache server and the edge
router, as well as between the cache servers and the origin server, to
support multiple video streams simultaneously. This work uses caching at the
CDN servers, but not at the edge routers. Using caching at edge routers

based on an adaptation of Least-Recently-Used (LRU) strategy, the stall


duration metrics have been characterized in (Al-Abbasi and Aggarwal,
2018a; Al-Abbasi et al., 2019b). These works assume a single video quality.
An approach to select one of the different quality levels to have
an efficient tradeoff between video quality and the stall duration metrics
has been considered in (Al-Abbasi and Aggarwal, 2018c; Alabbasi and
Aggarwal, 2018).
These works lay the foundations for many important future problems,
including

1. Adaptive Streaming: Adaptive streaming algorithms have been
considered for video streaming (Chen, 2012; Wang et al., 2013;
Elgabli et al., 2018a; Elgabli and Aggarwal, 2019a; Elgabli and
Aggarwal, 2019b; Elgabli et al., 2018c; Elgabli et al., 2018b; El-
gabli et al., 2019). However, the above works consider streaming
the entire video at the same quality. Considering the aspects of
adaptive video streaming to compute the stall duration metrics and
the video quality metrics jointly is an important problem.

2. Efficient Caching Algorithms: In the vCDN environment, we


consider optimized caching at the CDN servers and LRU based
caching at the edge router. However, with different file sizes,
different caching algorithms have been studied (Berger et al., 2017;
Halalai et al., 2017; Friedlander and Aggarwal, 2019). Considering
efficient caching strategies for video streaming is an important
problem for the future.
7
Lessons from prototype implementation

Various models and theories proposed in Chapters 2 to 5 provide mathe-


matical crystallization of erasure coded storage systems, by quantifying
different performance metrics, revealing important control knobs and
tradeoffs, illuminating opportunities for novel optimization algorithms,
and thus enabling us to rethink the design of erasure coded storage sys-
tems in practice. On the other hand, implementing these algorithms and
designs in practical storage systems allows us to validate/falsify different
modeling assumptions and provides crucial feedback to bridge the divide
between theory and practice. In this chapter, we introduce a practical
implementation to demonstrate the path for realizing erasure-coded
storage systems. Then, we provide numerical examples to illuminate
key design tradeoffs in this system. Finally, another application of the
models to distributed caching and content distribution is described, and
numerical examples discussed. We highlight key messages learned from
these experiments as remarks throughout this chapter.

7.1 Exemplary implementation of erasure-coded storage

Distributed systems such as Hadoop, AT&T Cloud Storage, Google


File System and Windows Azure have evolved to support different


types of erasure codes, in order to achieve the benefits of improved


storage efficiency while providing the same reliability as replication-
based schemes (Balaji et al., 2018). In particular, Reed-Solomon (RS)
codes have been implemented in the Azure production cluster and
resulted in the savings of millions of dollars for Microsoft (Huang et
al., 2012a; blog, 2012). Later, Locally Recoverable (LR) codes were
implemented in HDFS-RAID carried out in Amazon EC2 and a cluster
at Facebook in (Sathiamoorthy et al., 2013). Various erasure code plug-
ins and libraries have been developed in storage systems like Ceph
(Weil et al., 2006; Aggarwal et al., 2017a), Tahoe (Xiang et al., 2016),
Quantcast (QFS) (Ovsiannikov et al., 2013), and Hadoop (HDFS)
(Rashmi et al., 2014). In a separate line of work, efficient repair schemes
and traffic engineering in erasure-coded storage systems are discussed
in (Plank et al., 2009; Dimakis et al., 2010; Zhou and Tian, 2020; Li
et al., 2019).
In this chapter, we report an implementation of erasure coded storage
in Tahoe (B. Warner and Kinninmont, 2012), which is an open-source,
distributed filesystem based on the zfec erasure coding library. It provides
three special instances of a generic node: (a) Tahoe Introducer: it keeps
track of a collection of storage servers and clients and introduces them
to each other. (b) Tahoe Storage Server: it exposes attached storage
to external clients and stores erasure-coded shares. (c) Tahoe Client:
it processes upload/download requests and connects to storage servers
through a Web-based REST API and the Tahoe-LAFS (Least-Authority
File System) storage protocol over SSL.

Figure 7.1: Our Tahoe testbed with average ping (RTT) and bandwidth measurements
among three data centers in New Jersey, Texas, and California.

While Tahoe uses a default (10, 3) erasure code, it supports arbitrary


erasure code specification statically through a configuration file. In
Tahoe, each file is encrypted, and is then broken into a set of segments,
where each segment consists of k blocks. Each segment is then erasure-coded
to produce n blocks (using an (n, k) encoding scheme) and then
distributed to (ideally) n distinct storage servers. The set of blocks
on each storage server constitute a chunk. Thus, the file equivalently
consists of k chunks which are encoded into n chunks and each chunk
consists of multiple blocks1 . For chunk placement, the Tahoe client
randomly selects a set of available storage servers with enough storage
space to store n chunks. For server selection during file retrievals, the
client first asks all known servers for the storage chunks they might have.
Once it knows where to find the needed k chunks (from the k servers
that responds the fastest), it downloads at least the first segment from
those servers. This means that it tends to download chunks from the
“fastest” servers purely based on round-trip times (RTT).
To implement different scheduling algorithms in Tahoe, we would
need to modify the upload and download modules in the Tahoe storage
server and client to allow for customized and explicit server selection,
which is specified in the configuration file that is read by the client
when it starts. In addition, Tahoe performance suffers from its single-threaded
design on the client side, for which we had to use multiple
clients with separate ports to improve parallelism and bandwidth usage
during experiments.
We deployed 12 Tahoe storage servers as virtual machines in an
OpenStack-based data center environment distributed in New Jersey
(NJ), Texas (TX), and California (CA). Each site has four storage
servers. One additional storage client was deployed in the NJ data
center to issue storage requests. The deployment is shown in Figure 7.1
with average ping (round-trip time) and bandwidth measurements
listed among the three data centers. We note that while the distance
between CA and NJ is greater than that of TX and NJ, the maximum
bandwidth is higher in the former case. The RTT time measured by
ping does not necessarily correlate to the bandwidth number. Further,
the current implementation of Tahoe does not use up the maximum
available bandwidth, even with our multi-port revision.

1 If there are not enough servers, Tahoe will store multiple chunks on one server.
Also, the term “chunk” we use in this chapter is equivalent to the term “share” in
Tahoe terminology. The number of blocks in each chunk is equivalent to the number
of segments in each file.
Figure 7.2: Comparison of actual service time distribution and an exponential
distribution with the same mean. It verifies that actual service time does not follow
an exponential distribution, falsifying the assumption in previous work (Huang et al.,
2012b).

Remark 1: Actual service time can be approximated well by a shifted
exponential distribution. Using this testbed, we can run experiments to
understand the actual service time distribution on our testbed. We upload
a 50MB file using a (7, 4) erasure code and measure the chunk service
time. Figure 7.2 depicts the Cumulative Distribution Function (CDF)
of the chunk service time. Using the measured results, we get a mean
service time of 13.9 seconds with a standard deviation of 4.3 seconds,
a second moment of 211.8 s² and a third moment of 3476.8 s³. We
compare the distribution to the exponential distribution (with the same
mean and the same variance, respectively) and note that the two do not
match. It verifies that the actual service time does not follow an exponential
distribution, and therefore, the assumption of exponential service time
in (Huang et al., 2012b) is falsified by empirical data. The observation
is also evident because the measured distribution has essentially no probability
of very small service times, which an exponential distribution cannot capture.
Further, the mean and the standard deviation
are very different from each other and cannot be matched by any
exponential distribution.
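For reference, the measured moments above are consistent with a shifted-exponential fit obtained by the method of moments, as the small sketch below shows (the fit itself is our illustration, not a procedure from the experiments).

# For a shifted exponential X = beta + Exp(alpha): std(X) = 1/alpha, E[X] = beta + 1/alpha
mean, std = 13.9, 4.3                 # measured values reported above (seconds)

alpha_hat = 1.0 / std                 # ~0.23 per second
beta_hat = mean - std                 # ~9.6 seconds of fixed (network/decoding) delay
second_moment = beta_hat**2 + 2 * beta_hat / alpha_hat + 2 / alpha_hat**2

print(alpha_hat, beta_hat)            # fitted rate and shift
print(second_moment)                  # ~211.7 s^2, close to the measured 211.8 s^2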

7.2 Illuminating key design tradeoffs

We leverage the implementation prototype in Section 7.1 to illustrate


a number of crucial design tradeoffs in erasure coded storage systems.
While the exact tradeoff curves could vary significantly in different
systems/environments, the numerical examples presented in this section
nevertheless provide a visualization of the various design space available
in erasure-coded storage systems.

Figure 7.3: Comparison of joint latency and cost minimization with some obliv-
ious approaches. Algorithm JLCM minimizes latency-plus-cost over 3 dimensions:
load-balancing (LB), chunk placement (CP), and erasure code (EC), while any
optimizations over a subset of the dimensions is non-optimal.

Remark 2: Latency and storage cost tradeoff. The use of (ni , ki ) MDS
erasure code allows the content to be reconstructed from any subset
of ki -out-of-ni chunks, while it also introduces a redundancy factor
of ni /ki . To model storage cost, we assume that each storage node
j ∈ M charges a constant cost Vj per chunk. Since ki is determined
by content size and the choice of chunk size, we need to choose an
appropriate ni which not only introduces sufficient redundancy for
improving chunk availability, but also achieves a cost-effective solution.
We consider RTT plus expected queuing delay and transfer delay as
a measure of latency. To find the optimal parameters for scheduling,

we use the optimization using the latency upper bound in Theorem


4.7. To demonstrate this tradeoff, we use the theoretical models to
develop an algorithm for Joint Latency and Cost Minimization (JLCM)
and compare its performance with three oblivious schemes, each of
which minimize latency-plus-cost over only a subset of the 3 dimensions:
load-balancing (LB), chunk placement (CP), and erasure code (EC).
We run Algorithm JLCM for r = 3 files of sizes (150, 150, 100) MB
on our testbed, with V_j = $1 for every 25 MB of storage and a tradeoff
factor of θ = 2 sec/dollar. The result is shown in Figure 7.3. First,
even with the optimal erasure code and chunk placement (which means
the same storage cost as the optimal solution from Algorithm JLCM),
higher latency is observed in Oblivious LB, which schedules chunk
requests according to a load-balancing heuristic that selects storage
nodes with probabilities proportional to their service rates. Second, we
keep optimal erasure codes and employ a random chunk placement
algorithm, referred to as Random CP, which adopts the best outcome
of 10 random runs. The large latency increase resulting from Random CP
highlights the importance of joint chunk placement and load balancing in
reducing service latency. Finally, Maximum EC uses maximum possible
erasure code n = m and selects all nodes for chunk placement. Although
its latency is comparable to the optimal solution from Algorithm JLCM,
higher storage cost is observed. Minimum latency-plus-cost can only be
achieved by jointly optimizing over all 3 dimensions.

Remark 3: Latency distribution and coding strategy tradeoff. To


demonstrate the impact of coding strategies on latency distribution, we
choose files of size 150 MB and the same storage cost and tradeoff factor
as in the previous experiment. The files are divided into four classes (each
class has 250 files) with different erasure code parameters: class-1 files
use (n, k) = (12, 6), class-2 files use (n, k) = (10, 7), class-3 files use
(n, k) = (10, 6), and class-4 files use (n, k) = (8, 4). The aggregate
request arrival rates for the file classes are set to λ1 = λ4 = 0.0354/s and
λ2 = λ3 = 0.0236/s, which leads to an aggregate file request arrival
rate of 0.118/s.

Figure 7.4: Actual service latency distribution for 1000 files of size 150 MB using
erasure codes (12, 6), (10, 7), (10, 6), and (8, 4) for each quarter with aggregate
request arrival rates set to λ = 0.118/s.

We choose the values of the erasure codes so that the chunk sizes are
appropriate for our experiments and the file sizes are widely used by
today's data center storage users, and we assign the same arrival-rate
value to pairs of classes with different erasure codes in order to see the
latency distribution under different coding strategies. We retrieve the 1000 files
at the designated request arrival rates and plot the CDF of the download
latency for each file in Figure 7.4. We note that 95% of download requests
for files with erasure code (10, 7) complete within 100 s, while the same
percentage of requests for files using the (12, 6) erasure code complete
within 32 s due to the higher level of redundancy. In this experiment, erasure
code (12, 6) outperforms (8, 4) in latency even though they have the same
level of redundancy, because the latter has a larger chunk size when the file
sizes are set to be the same.
Remark 4: Latency and file size tradeoff. Increasing file size clearly gen-
erates high load on the storage system, thus resulting in higher latency.
To illustrate this tradeoff, we vary file size in the experiment from (30,
20)MB to (150, 100)MB and plot download latency of individual files 1,
2, 3, average latency, and the analytical latency upper bound (Xiang
et al., 2016) in Figure 7.5. We see that latency increases super-linearly
as file size grows, since it generates higher load on the storage system,
causing larger queuing latency (which is super-linear according to our
analysis). Further, smaller files always have lower latency because it is
less costly to achieve higher redundancy for these files. We also observe
that analytical latency bound in (Xiang et al., 2016) tightly follows

actual service latency.

Figure 7.5: Evaluation of different chunk sizes. Latency increases super-linearly as
file size grows due to queuing delay.

In one case, service latency exceeds the analytical bound by 0.5 seconds.
This is because the theoretical bound, which quantifies network and queuing
delay, does not take into account the Tahoe protocol overhead, which is
indeed small compared to the network and queuing delay.

Figure 7.6: Evaluation of different request arrival rates. As arrival rates increase,
latency increases and becomes more dominating in the latency-plus-cost objective
than storage cost.

Remark 5: Latency and arrival rate tradeoff. We increase the maximum


file request arrival rate from λi =1/(60sec) to λi =1/(30sec) (and
other arrival rates also increase accordingly), while keeping file size at
(150, 150, 100)M B. Actual service delay and the analytical bound (Xiang
et al., 2016) for each scenario is shown by a bar plot in Figure 7.6
and associated storage cost by a curve plot. As arrival rates increase,
latency increases and becomes more dominating in the latency-plus-
cost objective than storage cost. Thus, the marginal benefit of adding
more chunks (i.e., redundancy) eventually outweighs higher storage cost
introduced at the same time. Figure 7.6 also shows that to achieve a
minimization of the latency-plus-cost objective, an optimal solution
from theoretical models allows higher storage cost for larger arrival
rates, resulting in a nearly-linear growth of average latency as the
request arrival rates increase. For instance, Algorithm JLCM chooses
(10,6), (11,6), and (10,4) erasure codes at the largest arrival rates,
while (8,6), (9,6), and (7,4) codes are selected at the smallest arrival
rates in this experiment. We believe that this ability to autonomously
manage latency and storage cost for latency-plus-cost minimization
under different workload is crucial for practical distributed storage
systems relying on erasure coding.

Remark 6: Visualizing the tradeoff curve. We demonstrate the tradeoff
curve between latency and storage cost. Varying the tradeoff factor in
Algorithm JLCM from θ = 0.5 sec/dollar to θ = 100 sec/dollar for a fixed
file size of (150, 150, 100) MB and arrival rates λ_i = 1/(30 sec), 1/(30
sec), 1/(40 sec), we obtain a sequence of solutions, minimizing different
latency-plus-cost objectives. As θ increases, higher weight is placed on
the storage cost component of the latency-plus-cost objective, leading to
fewer file chunks in the storage system and higher latency. This tradeoff
is visualized in Figure 7.7. When θ = 0.5, the optimal solution chooses
(12,6), (11,6), and (9,4) erasure codes, which is nearly the maximum
erasure code length allowable in our experiment and leads to the highest
storage cost (i.e., 32 dollars), yet the lowest latency (i.e., 47 sec). On the
other hand, θ = 100 results in the choice of (6,6), (7,6), and (4,4) erasure
codes, which is almost the minimum possible cost for storing the three
files, with the highest latency of 65 seconds.
Figure 7.7: Visualization of the latency and cost tradeoff for varying θ = 0.5 second/dollar
to θ = 100 second/dollar. As θ increases, higher weight is placed on the
storage cost component of the latency-plus-cost objective, leading to fewer file chunks
and higher latency.

Further, the theoretical tradeoff calculated by the analytical bound (Xiang et al., 2016) is very close


to the actual measurement on our testbed. These results allow operators
to exploit the latency and cost tradeoff in an erasure-coded storage
system by selecting the best operating point.

7.3 Applications in Caching and Content Distribution

The application of erasure codes can go beyond data storage. In this


section, we briefly introduce a novel caching framework leveraging
erasure codes, known as functional caching (Aggarwal et al., 2016;
Aggarwal et al., 2017a). Historically, caching is a key solution to relieve
traffic burden on networks (Pedarsani et al., 2014). By storing large
chunks of popular data at different locations closer to end-users, caching
can greatly reduce congestion in the network and improve service delay
for processing file requests. It is very common for 20% of the video
content to be accessed 80% of the time, so caching popular content at
proxies significantly reduces the overall latency on the client side.

However, caching with erasure codes has not been well studied. The
current results for caching systems cannot automatically be carried
over to caches in erasure coded storage systems. First, using an (n, k)
maximum-distance-separable (MDS) erasure code, a file is encoded into
n chunks and can be recovered from any subset of k distinct chunks.
Thus, file access latency in such a system is determined by the delay
to access file chunks on hot storage nodes with slowest performance.
Significant latency reduction can be achieved by caching a few hot chunks
(and therefore alleviating system performance bottlenecks), whereas
caching additional chunks only has diminishing benefits. Second, caching
the most popular data chunks is often optimal because the cache-miss
rate and the resulting network load are proportional to each other.
However, this may not be true for an erasure-coded storage, where
cached chunks need not be identical to the transferred chunks. More
precisely, a function of the data chunks can be computed and cached, so
that the constructed new chunks, along with the existing chunks, also
satisfy the property of being an MDS code. There have been caching
schemes that cache the entire file (Nadgowda et al., 2014; Chang et al.,
2008; Zhu et al., 2004), while we can cache a partial file for an erasure-coded
system (an idea practically proposed for replicated storage systems in
(Naik et al., 2015)), which gives extra flexibility; the evaluation
results depict the advantage of caching partial files.

A new functional caching approach called Sprout that can efficiently


capitalize on existing file coding in erasure-coded storage systems has
been proposed in (Aggarwal et al., 2016; Aggarwal et al., 2017a). In
contrast to exact caching that stores d chunks identical to original
copies, our functional caching approach forms d new data chunks, which
together with the existing n chunks satisfy the property of being an
(n + d, k) MDS code. Thus, the file can now be recovered from any k out
of n + d chunks (rather than k out of n under exact caching), effectively
extending coding redundancy, as well as system diversity for scheduling
file access requests. The proposed functional caching approach saves
latency due to more flexibility to obtain k − d chunks from the storage
system at a very minimal additional computational cost of creating the
coded cached chunks.

Figure 7.8: An illustration of functional caching and exact caching in an erasure-coded
storage system with one file using a (5, 4) erasure code.

Example. Consider a datacenter storing a single file using a (5, 4) MDS


code. The file is split into ki = 4 chunks, denoted by A1 , A2 , A3 , A4 ,
and then linearly encoded to generate ni = 5 coded chunks F1 = A1 ,
F2 = A2, F3 = A3, F4 = A4, and F5 = A1 + A2 + A3 + A4 in a finite
field of order at least 5. Two compute servers in the datacenter access
this file and each is equipped with a cache of size C = 2 chunks as
depicted in Figure 7.8. The compute server on the right employs an
exact caching scheme and stores chunks F1 , F2 in the cache memory.
Thus, 2 out of 3 remaining chunks (i.e., F3 , F4 or F5 ) must be retrieved
to access the file, whereas chunks F1 , F2 and their host nodes will
not be selected for scheduling requests. Under functional caching, the
compute server on the left generates di = 2 new coded chunks, i.e.,
C1 = A1 + 2A2 + 3A3 + 4A4 and C2 = 4A1 + 3A2 + 2A3 + 1A4 , and saves
them in its cache memory. It is easy to see that chunks F1 , . . . , F5 and
C1 , C2 now form a (7, 4) erasure code. Thus, the file can be retrieved
by accessing C1 , C2 in the cache together with any 2 out of 5 chunks
from F1 , . . . , F5 . This allows an optimal request scheduling mechanism
to select the least busy chunks/nodes among all 5 possible candidates
in the system, so that the service latency is determined by the best 2
storage node with minimum queuing delay. In contrast, service latency
in exact caching is limited by the latency of accessing a smaller subset
of chunks F3 , F4 , and F5 . In order to have a (n, k) coded file in the
storage server, we can construct chunks by using an (n + k, k) MDS
code, where n chunks are stored in the storage server. The remaining k

out of the n + k coded chunks are assigned to be in part in cache based


on the contents of the file in the cache. Thus, irrespective of the value
of d ≤ k, we ascertain that (n + d, k) code, formed with n coded chunks
in the storage server and k coded chunks in the cache, will be MDS.
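The construction can be checked end-to-end with a few lines of code. The sketch below is our own illustration using a standard Reed-Solomon (Vandermonde) construction over GF(257) rather than the specific coefficients above: the k file chunks are the coefficients of a polynomial, the n storage chunks and d cached chunks are its evaluations at distinct points, and any k of the n + d evaluations recover the file by Lagrange interpolation, which is exactly the (n + d, k) MDS property that functional caching relies on.

p = 257   # prime field size; chunk symbols are assumed to fit in one byte

def eval_poly(coeffs, x):
    # evaluate the message polynomial sum_i coeffs[i] * x^i over GF(p)
    y = 0
    for c in reversed(coeffs):
        y = (y * x + c) % p
    return y

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % p
    return out

def lagrange_recover(points):
    # recover the k message symbols from any k (x, y) evaluations over GF(p)
    k = len(points)
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                basis = poly_mul(basis, [(-xj) % p, 1])
                denom = denom * (xi - xj) % p
        scale = yi * pow(denom, p - 2, p) % p
        for deg, b in enumerate(basis):
            coeffs[deg] = (coeffs[deg] + scale * b) % p
    return coeffs

k, n, d = 4, 5, 2
message = [11, 22, 33, 44]                       # the k original chunks A1..A4 (toy symbols)
xs = list(range(1, n + d + 1))                   # n storage points plus d cache points
chunks = [(x, eval_poly(message, x)) for x in xs]
storage, cache = chunks[:n], chunks[n:]

# the client holds the d cached chunks and fetches any k - d = 2 storage chunks
recovered = lagrange_recover(cache + storage[2:4])
print(recovered == message)                      # True: the (n + d, k) code is MDS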
Figure 7.9: Comparison of average latency of functional caching and Tahoe's native
storage system without caching, with varying average arrival rates for r = 1000 files
of 200MB, where the cache size is fixed at 2500.

Utilizing the theoretical models developed for quantifying service
latency in erasure-coded storage systems, such as (Xiang et al., 2016), we
can obtain a latency bound for functional caching and use it to formulate
and solve a cache-content optimization problem. Due to space limitations,
we skip the technical details and refer readers to (Aggarwal et al.,
2017a), where an extension of Theorem 4.7 is provided with functional
caching. Implementing functional caching on the Tahoe testbed, we fix
the file size to be 200MB and vary the file request arrival rates, with the average
request arrival rate of the r = 1000 files in {0.0149/sec, 0.0225/sec,
0.0301/sec, 0.0384/sec, 0.0456/sec}. The cache size is fixed, and set to
be 2500. The actual average service latency of the files for each request arrival
rate is shown by a bar plot in Figure 7.9. In this experiment we also compare
the performance of functional caching with Tahoe's built-in native
storage system without caching. Figure 7.9 shows that our algorithm with
caching outperforms Tahoe native storage in terms of average latency
for all the average arrival rates. Functional caching gives an average
49% reduction in latency. Future work includes designing various cache
replacement policies with respect to erasure codes, as well as developing
a theoretical framework to quantify and optimize the resulting latency.
Acknowledgements

The authors would like to thank their collaborators for contributions


to this line of work: Yih-Farn Robin Chen and Yu Xiang at AT&T
Labs-Research, Moo-Ryong Ra at Amazon, Vinay Vaishampayan at
City University of NY, Ajay Badita and Parimal Parag at IISc Banga-
lore, Abubakr Alabassi, Jingxian Fan, and Ciyuan Zhang at Purdue
University, and Chao Tian at Texas A&M University.
The authors would like to thank Alexander Barg at the University
of Maryland for the many suggestions on the manuscript. The authors
are also grateful to anonymous reviewers for valuable comments that
have significantly improved the manuscript.

References

Al-Abbasi, A. O. and V. Aggarwal. 2018a. “EdgeCache: An optimized


algorithm for CDN-based over-the-top video streaming services”. In:
IEEE INFOCOM 2018 - IEEE Conference on Computer Communi-
cations Workshops (INFOCOM WKSHPS). 202–207. doi: 10.1109/
INFCOMW.2018.8407016.
Al-Abbasi, A. O. and V. Aggarwal. 2018b. “Mean latency optimization
in erasure-coded distributed storage systems”. In: IEEE INFOCOM
2018 - IEEE Conference on Computer Communications Workshops
(INFOCOM WKSHPS). 432–437. doi: 10.1109/INFCOMW.2018.
8406958.
Al-Abbasi, A. O. and V. Aggarwal. 2018c. “Stall-Quality Tradeoff for
Cloud-based Video Streaming”. In: 2018 International Conference
on Signal Processing and Communications (SPCOM). 6–10. doi:
10.1109/SPCOM.2018.8724450.
Al-Abbasi, A. O. and V. Aggarwal. 2018d. “Video streaming in dis-
tributed erasure-coded storage systems: Stall duration analysis”.
IEEE/ACM Transactions on Networking. 26(4): 1921–1932.
Al-Abbasi, A. O. and V. Aggarwal. 2020. “TTLCache: Minimizing
Latency in Erasure-coded Storage through Time To Live Caching”.
IEEE Transactions on Network and Service Management.


Al-Abbasi, A. O., V. Aggarwal, and T. Lan. 2019a. “Ttloc: Taming tail


latency for erasure-coded cloud storage systems”. IEEE Transactions
on Network and Service Management. 16(4): 1609–1623.
Al-Abbasi, A. O., V. Aggarwal, and M.-R. Ra. 2019b. “Multi-tier
caching analysis in cdn-based over-the-top video streaming systems”.
IEEE/ACM Transactions on Networking. 27(2): 835–847.
Al-Abbasi, A., V. Aggarwal, T. Lan, Y. Xiang, M.-R. Ra, and Y.-F.
Chen. 2019c. “Fasttrack: Minimizing stalls for cdn-based over-the-top
video streaming systems”. IEEE Transactions on Cloud Computing.
Abdelkefi, A. and Y. Jiang. 2011. “A Structural Analysis of Network De-
lay”. In: Communication Networks and Services Research Conference
(CNSR), 2011 Ninth Annual. 41–48. doi: 10.1109/CNSR.2011.15.
Aggarwal, V., Y. R. Chen, T. Lan, and Y. Xiang. 2017a. “Sprout: A Func-
tional Caching Approach to Minimize Service Latency in Erasure-
Coded Storage”. IEEE/ACM Transactions on Networking. 25(6):
3683–3694. issn: 1063-6692. doi: 10.1109/TNET.2017.2749879.
Aggarwal, V., Y.-F. Chen, T. Lan, and Y. Xiang. 2016. “Sprout: A
functional caching approach to minimize service latency in erasure-
coded storage”. In: Distributed Computing Systems (ICDCS), 2016
IEEE 36th International Conference on.
Aggarwal, V., J. Fan, and T. Lan. 2017b. “Taming Tail Latency for
Erasure-coded, Distributed Storage Systems”. In: in Proc. IEEE
Infocom.
Aggarwal, V., C. Tian, V. A. Vaishampayan, and Y.-F. R. Chen. 2014.
“Distributed data storage systems with opportunistic repair”. In:
IEEE INFOCOM 2014-IEEE Conference on Computer Communi-
cations. IEEE. 1833–1841.
Aguilera, M. K., R. Janakiraman, and L. Xu. 2005. “Using erasure
codes efficiently for storage in a distributed system”. In: Dependable
Systems and Networks, 2005. DSN 2005. Proceedings. International
Conference on. 336–345. doi: 10.1109/DSN.2005.96.
Aktaş, M. F. and E. Soljanin. 2019. “Straggler mitigation at scale”.
IEEE/ACM Transactions on Networking. 27(6): 2266–2279.
al., D. B. et. 2010. “HDFS RAID Hadoop User Group Meeting”. In:
Meeting, Nov.

Alabbasi, A. and V. Aggarwal. 2018. “Optimized Video Streaming


over Cloud: A Stall-Quality Trade-off”. ArXiv e-prints. June. arXiv:
1806.09466 [cs.NI].
Ananthanarayanan, G., A. Ghodsi, S. Shenker, and I. Stoica. 2013.
“Effective straggler mitigation: Attack of the clones”. In: Presented
as part of the 10th {USENIX} Symposium on Networked Systems
Design and Implementation ({NSDI} 13). 185–198.
Ananthanarayanan, G., A. Ghodsi, S. Shenker, and I. Stoica. Submitted.
“Why Let Resources Idle? Aggressive Cloning of Jobs with Dolly”.
In: Presented as part of the. USENIX. url: https://www.usenix.org/
conference/hotcloud12/why-let-resources-idle-aggressive-cloning-
jobs-dolly.
Ananthanarayanan, G., S. Kandula, A. G. Greenberg, I. Stoica, Y. Lu, B.
Saha, and E. Harris. 2010. “Reining in the Outliers in Map-Reduce
Clusters using Mantri.” In: Osdi. Vol. 10. No. 1. 24.
Angell, T. 2002. “The Farkas-Minkowski Theorem”. Tech. rep. www.
math.udel.edu/~angell/Opt/farkas.pdf.
Angus, J. E. 1988. “On computing MTBF for a k-out-of-n: G repairable
system”. IEEE Transactions on Reliability. 37(3): 312–313.
Arnold, B. C. 2015. Pareto distribution. Wiley Online Library.
Arnold, B. C. and R. A. Groeneveld. 1979. “Bounds on expectations
of linear systematic statistics based on dependent samples”. The
Annals of Statistics. 7(1): 220–223.
B. Warner, Z. Wilcox-O’Hearn, and R. Kinninmont. 2012. “Tahoe-LAFS
docs”. Tech. rep. https://tahoe-lafs.org/trac/tahoe-lafs.
Baccelli, F., A. Makowski, and A. Shwartz. 1989. “The fork-join queue
and related systems with synchronization constraints: stochastic
ordering and computable bounds”. Advances in Applied Probability:
629–660.
Badita, A., P. Parag, and V. Aggarwal. 2020a. “Sequential addition of
coded sub-tasks for straggler mitigation”. In: Proceedings of IEEE
Infocom.
Badita, A., P. Parag, and V. Aggarwal. 2020b. “Sequential addition of
coded sub-tasks for straggler mitigation”. Submitted to IEEE/ACM
Transactions on Networking.
Badita, A., P. Parag, and J.-F. Chamberland. 2019. “Latency Analysis
for Distributed Coded Storage Systems”. IEEE Transactions on
Information Theory. 65(8): 4683–4698.
Balaji, S. B., M. N. Krishnan, M. Vajha, V. Ramkumar, B. Sasidharan,
and P. V. Kumar. 2018. “Erasure Coding for Distributed Storage:
An Overview”. arXiv: 1806.04437 [cs.IT].
Berger, D. S., R. K. Sitaraman, and M. Harchol-Balter. 2017. “AdaptSize:
Orchestrating the hot object memory cache in a content delivery
network”. In: 14th USENIX Symposium on Networked Systems
Design and Implementation (NSDI 17). 483–498.
Microsoft Research Blog. 2012. “A better way to store data”. url:
https://www.microsoft.com/en-us/research/blog/better-way-store-data/.
Chang, F., J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M.
Burrows, T. Chandra, A. Fikes, and R. E. Gruber. 2008. “Bigtable:
A Distributed Storage System for Structured Data”. ACM Trans.
Comput. Syst. 26(2): 4:1–4:26. issn: 0734-2071. doi: 10.1145/1365815.
1365816.
Chang, H.-Y., K.-B. Chen, and H.-C. Lu. 2016. “A novel resource
allocation mechanism for live cloud-based video streaming service”.
Multimedia Tools and Applications: 1–18.
Chen, M. 2012. “AMVSC: a framework of adaptive mobile video stream-
ing in the cloud”. In: Global Communications Conference (GLOBE-
COM), 2012 IEEE. IEEE. 2042–2047.
Chen, P. M., E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson.
1994. “RAID: High-performance, reliable secondary storage”. ACM
Computing Surveys (CSUR). 26(2): 145–185.
Chen, S., Y. Sun, U. Kozat, L. Huang, P. Sinha, G. Liang, X. Liu, and
N. Shroff. 2014a. “When Queuing Meets Coding: Optimal-Latency
Data Retrieving Scheme in Storage Clouds”. In: Proceedings of IEEE
Infocom.
Chen, S., Y. Sun, L. Huang, P. Sinha, G. Liang, X. Liu, N. B. Shroff,
et al. 2014b. “When queueing meets coding: Optimal-latency data
retrieving scheme in storage clouds”. In: IEEE INFOCOM 2014-
IEEE Conference on Computer Communications. IEEE. 1042–1050.
Dean, J. 2012. “Achieving rapid response times in large online services”.
Dean, J. and L. A. Barroso. 2013. “The tail at scale”. Communications
of the ACM. 56(2): 74–80.
Dean, J. and S. Ghemawat. 2008. “MapReduce: simplified data pro-
cessing on large clusters”. Communications of the ACM. 51(1): 107–
113.
Dimakis, A., V. Prabhakaran, and K. Ramchandran. 2004. “Distributed
data storage in sensor networks using decentralized erasure codes”.
In: Signals, Systems and Computers, 2004. Conference Record of
the Thirty-Eighth Asilomar Conference on. Vol. 2. 1387–1391 Vol.2.
doi: 10.1109/ACSSC.2004.1399381.
Dimakis, A. G., P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ram-
chandran. 2010. “Network coding for distributed storage systems”.
IEEE Transactions on Information Theory. 56(9): 4539–4551.
Downey, A. 2001. “The structural cause of file size distributions”. In:
Modeling, Analysis and Simulation of Computer and Telecommuni-
cation Systems, 2001. Proceedings. Ninth International Symposium
on. 361–370. doi: 10.1109/MASCOT.2001.948888.
Elgabli, A. and V. Aggarwal. 2019a. “Fastscan: Robust low-complexity
rate adaptation algorithm for video streaming over http”. IEEE
Transactions on Circuits and Systems for Video Technology.
Elgabli, A. and V. Aggarwal. 2019b. “SmartStreamer: Preference-Aware
Multipath Video Streaming Over MPTCP”. IEEE Transactions on
Vehicular Technology. 68(7): 6975–6984.
Elgabli, A., V. Aggarwal, S. Hao, F. Qian, and S. Sen. 2018a. “LBP: Ro-
bust rate adaptation algorithm for SVC video streaming”. IEEE/ACM
Transactions on Networking. 26(4): 1633–1645.
Elgabli, A., M. Felemban, and V. Aggarwal. 2018b. “GiantClient: Video
hotspot for multi-user streaming”. IEEE Transactions on Circuits
and Systems for Video Technology. 29(9): 2833–2843.
Elgabli, A., M. Felemban, and V. Aggarwal. 2019. “GroupCast: Preference-
aware cooperative video streaming with scalable video coding”.
IEEE/ACM Transactions on Networking. 27(3): 1138–1150.
Elgabli, A., K. Liu, and V. Aggarwal. 2018c. “Optimized preference-
aware multi-path video streaming with scalable video coding”. IEEE
Transactions on Mobile Computing. 19(1): 159–172.
Ferner, U. J., M. Médard, and E. Soljanin. 2012. “Toward sustainable
networking: Storage area networks with network coding”. In: 2012
50th Annual Allerton Conference on Communication, Control, and
Computing (Allerton). IEEE. 517–524.
Fidler, M. and Y. Jiang. 2016. “Non-asymptotic delay bounds for (k,
l) fork-join systems and multi-stage fork-join networks”. In: IEEE
INFOCOM 2016-The 35th Annual IEEE International Conference
on Computer Communications. IEEE. 1–9.
Flatto, L. and S. Hahn. 1984. “Two parallel queues created by arrivals
with two demands I”. SIAM Journal on Applied Mathematics. 44(5):
1041–1053.
Friedlander, E. and V. Aggarwal. 2019. “Generalization of LRU cache
replacement policy with applications to video streaming”. ACM
Transactions on Modeling and Performance Evaluation of Computing
Systems (TOMPECS). 4(3): 1–22.
Gardner, K., S. Zbarsky, S. Doroudi, M. Harchol-Balter, and E. Hyytia.
2015. “Reducing latency via redundant requests: Exact analysis”.
ACM SIGMETRICS Performance Evaluation Review. 43(1): 347–
360.
Gasper, G. and M. Rahman. 2004. Basic hypergeometric series. Vol. 96.
Cambridge University Press.
Goparaju, S., I. Tamo, and R. Calderbank. 2014. “An improved sub-
packetization bound for minimum storage regenerating codes”. IEEE
Transactions on Information Theory. 60(5): 2770–2779.
Halalai, R., P. Felber, A.-M. Kermarrec, and F. Taiani. 2017. “Agar: A
caching system for erasure-coded data”. In: 2017 IEEE 37th Inter-
national Conference on Distributed Computing Systems (ICDCS).
IEEE. 23–33.
Harrison, P. and S. Zertal. 2003. “Queueing Models with Maxima of
Service Times”. In: Computer Performance Evaluation. Modelling
Techniques and Tools. Ed. by P. Kemper and W. H. Sanders. Berlin,
Heidelberg: Springer Berlin Heidelberg. 152–168.
He, J., Y. Wen, J. Huang, and D. Wu. 2014. “On the Cost–QoE trade-
off for cloud-based video streaming under Amazon EC2’s pricing
models”. IEEE Transactions on Circuits and Systems for Video
Technology. 24(4): 669–680.
Huang, C., H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and
S. Yekhanin. 2012a. “Erasure Coding in Windows Azure Storage”. In:
Proceedings of the 2012 USENIX Conference on Annual Technical
Conference. USENIX ATC’12. USENIX Association.
Huang, L., S. Pawar, H. Zhang, and K. Ramchandran. 2012b. “Codes
can reduce queueing delay in data centers”. In: Information Theory
Proceedings (ISIT), 2012 IEEE International Symposium on. 2766–
2770. doi: 10.1109/ISIT.2012.6284026.
Huang, Z., C. Mei, L. E. Li, and T. Woo. 2011. “CloudStream: Delivering
high-quality streaming videos through a cloud-based SVC proxy”.
In: INFOCOM, 2011 Proceedings IEEE. IEEE. 201–205.
Joshi, G., Y. Liu, and E. Soljanin. 2014. “On the Delay-Storage Trade-
Off in Content Download from Coded Distributed Storage Systems”.
Selected Areas in Communications, IEEE Journal on. 32(5): 989–997.
issn: 0733-8716. doi: 10.1109/JSAC.2014.140518.
Joshi, G., E. Soljanin, and G. Wornell. 2017. “Efficient redundancy tech-
niques for latency reduction in cloud systems”. ACM Transactions
on Modeling and Performance Evaluation of Computing Systems
(TOMPECS). 2(2): 1–30.
Kim, C. and A. K. Agrawala. 1989. “Analysis of the fork-join queue”.
IEEE Transactions on computers. 38(2): 250–255.
Kuczma, M. 2009a. An introduction to the theory of functional equations
and inequalities: Cauchy’s equation and Jensen’s inequality. Springer
Science & Business Media.
Kuczma, M. 2009b. An introduction to the theory of functional equations
and inequalities: Cauchy’s equation and Jensen’s inequality. Springer
Science & Business Media.
Kumar, A., R. Tandon, and T. C. Clancy. 2017. “On the Latency and
Energy Efficiency of Distributed Storage Systems”. IEEE Trans-
actions on Cloud Computing. 5(2): 221–233. issn: 2372-0018. doi:
10.1109/TCC.2015.2459711.
Lee, K., N. B. Shah, L. Huang, and K. Ramchandran. 2017. “The MDS
queue: Analysing the latency performance of erasure codes”. IEEE
Transactions on Information Theory. 63(5): 2822–2842.
Lee, K., L. Yan, A. Parekh, and K. Ramchandran. 2013. “A VoD Sys-
tem for Massively Scaled, Heterogeneous Environments: Design and
Implementation”. In: 2013 IEEE 21st International Symposium on
Modelling, Analysis and Simulation of Computer and Telecommuni-
cation Systems. IEEE. 1–10.
Li, X., Z. Yang, J. Li, R. Li, P. P. C. Lee, Q. Huang, and Y. Hu.
2019. “Repair Pipelining for Erasure-Coded Storage: Algorithms
and Evaluation”. arXiv: 1908.01527 [cs.DC].
Liang, G. and U. C. Kozat. 2013. “FAST CLOUD: Pushing the Enve-
lope on Delay Performance of Cloud Storage with Coding”. CoRR.
abs/1301.1294. arXiv: 1301.1294. url: http://arxiv.org/abs/1301.1294.
Luo, T., V. Aggarwal, and B. Peleato. 2019. “Coded caching with
distributed storage”. IEEE Transactions on Information Theory.
65(12): 7742–7755.
Lv, Q., P. Cao, E. Cohen, K. Li, and S. Shenker. 2002. “Search and
Replication in Unstructured Peer-to-peer Networks”. In: Proceedings
of the 16th International Conference on Supercomputing. ICS ’02.
New York, New York, USA: ACM. 84–95. isbn: 1-58113-483-5. doi:
10.1145/514191.514206. url: http://doi.acm.org/10.1145/514191.514206.
Melnik, S., A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton,
and T. Vassilakis. 2010. “Dremel: interactive analysis of web-scale
datasets”. Proceedings of the VLDB Endowment. 3(1-2): 330–339.
Meyn, S. P. and R. L. Tweedie. 1993. “Stability of Markovian pro-
cesses III: Foster–Lyapunov criteria for continuous-time processes”.
Advances in Applied Probability. 25(3): 518–548.
Nadgowda, S. J., R. C. Sreenivas, S. Gupta, N. Gupta, and A. Verma.
2014. “C2P: Co-operative Caching in Distributed Storage Systems”.
In: Service-Oriented Computing: 12th International Conference,
ICSOC 2014, Paris, France, November 3-6, 2014. Proceedings. Ed. by
X. Franch, A. K. Ghose, G. A. Lewis, and S. Bhiri. Berlin, Heidelberg:
Springer Berlin Heidelberg. 214–229. isbn: 978-3-662-45391-9. doi:
10.1007/978-3-662-45391-9_15.
Naik, M., F. Schmuck, and R. Tewari. 2015. “Read and write requests
to partially cached files”. US Patent 9,098,413. url: http://www.
google.com/patents/US9098413.
Nelson, R. and A. N. Tantawi. 1988. “Approximate analysis of fork/join
synchronization in parallel queues”. IEEE transactions on computers.
37(6): 739–743.
Olvera-Cravioto, M., J. Blanchet, and P. Glynn. 2011. “On the transition
from heavy traffic to heavy tails for the M/G/1 queue: the regularly
varying case”. The Annals of Applied Probability. 21(2): 645–668.
Ovsiannikov, M., S. Rus, D. Reeves, P. Sutter, S. Rao, and J. Kelly. 2013.
“The quantcast file system”. Proceedings of the VLDB Endowment.
6(11): 1092–1101.
Oza, N. and N. Gohil. 2016. “Implementation of cloud based live stream-
ing for surveillance”. In: Communication and Signal Processing
(ICCSP), 2016 International Conference on. IEEE. 0996–0998.
Paganini, F., A. Tang, A. Ferragut, and L. Andrew. 2012. “Network
Stability Under Alpha Fair Bandwidth Allocation With General
File Size Distribution”. Automatic Control, IEEE Transactions on.
57(3): 579–591. issn: 0018-9286. doi: 10.1109/TAC.2011.2160013.
Papadatos, N. 1995. “Maximum variance of order statistics”. Annals of
the Institute of Statistical Mathematics. 47(1): 185–193.
Papailiopoulos, D. S., A. G. Dimakis, and V. R. Cadambe. 2013. “Repair
optimal erasure codes through hadamard designs”. IEEE Transac-
tions on Information Theory. 59(5): 3021–3037.
Parag, P., A. Bura, and J.-F. Chamberland. 2017. “Latency analysis for
distributed storage”. In: IEEE INFOCOM 2017-IEEE Conference
on Computer Communications. IEEE. 1–9.
Pedarsani, R., M. A. Maddah-Ali, and U. Niesen. 2014. “Online coded
caching”. In: IEEE International Conference on Communications,
ICC 2014, Sydney, Australia, June 10-14, 2014. 1878–1883. doi:
10.1109/ICC.2014.6883597.
Pedarsani, R., M. A. Maddah-Ali, and U. Niesen. 2015. “Online coded
caching”. IEEE/ACM Transactions on Networking. 24(2): 836–845.
Plank, J. S., J. Luo, C. D. Schuman, L. Xu, Z. Wilcox-O’Hearn, et al.
2009. “A Performance Evaluation and Examination of Open-Source
Erasure Coding Libraries for Storage”. In: FAST. Vol. 9. 253–265.
Ramaswami, V., K. Jain, R. Jana, and V. Aggarwal. 2014. “Modeling
Heavy Tails in Traffic Sources for Network Performance Evalua-
tion”. English. In: Computational Intelligence, Cyber Security and
Computational Models. Ed. by G. S. S. Krishnan, R. Anitha, R. S.
Lekshmi, M. S. Kumar, A. Bonato, and M. Graña. Vol. 246. Ad-
vances in Intelligent Systems and Computing. Springer India. 23–44.
isbn: 978-81-322-1679-7. doi: 10.1007/978-81-322-1680-3_4. url:
http://dx.doi.org/10.1007/978-81-322-1680-3_4.
Rashmi, K. V., N. B. Shah, and P. V. Kumar. 2011. “Optimal exact-
regenerating codes for distributed storage at the MSR and MBR
points via a product-matrix construction”. IEEE Transactions on
Information Theory. 57(8): 5227–5239.
Rashmi, K. V., N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K.
Ramchandran. 2014. “A ‘hitchhiker’s’ guide to fast and efficient data
reconstruction in erasure-coded data centers”. In: Proceedings of the
2014 ACM Conference on SIGCOMM. 331–342.
Ren, X., G. Ananthanarayanan, A. Wierman, and M. Yu. 2015. “Hopper:
Decentralized speculation-aware cluster scheduling at scale”. In:
Proceedings of the 2015 ACM Conference on Special Interest Group
on Data Communication. 379–392.
Ross, S. M. 2019. “Introduction to Probability Theory”. In: Introduction
to Probability Models. Academic Press.
Amazon S3. “Amazon Simple Storage Service”. Available online at
http://aws.amazon.com/s3/.
Sasi, S., V. Lalitha, V. Aggarwal, and B. S. Rajan. 2020. “Straggler
Mitigation with Tiered Gradient Codes”. IEEE Transactions on
Communications.
Sathiamoorthy, M., M. Asteris, D. Papailiopoulos, A. G. Dimakis, R.
Vadali, S. Chen, and D. Borthakur. 2013. “XORing Elephants: Novel
Erasure Codes for Big Data”. arXiv: 1301.3791 [cs.IT].
Scutari, G., F. Facchinei, and L. Lampariello. 2017. “Parallel and
Distributed Methods for Constrained Nonconvex Optimization-Part
I: Theory”. IEEE Transactions on Signal Processing. 65(8): 1929–
1944. issn: 1053-587X. doi: 10.1109/TSP.2016.2637317.
Suh, C. and K. Ramchandran. 2011. “Exact-repair MDS code construc-
tion using interference alignment”. IEEE Transactions on Informa-
tion Theory. 57(3): 1425–1442.
Tian, C., B. Sasidharan, V. Aggarwal, V. A. Vaishampayan, and P. V.
Kumar. 2015. “Layered exact-repair regenerating codes via embed-
ded error correction and block designs”. IEEE Transactions on
Information Theory. 61(4): 1933–1947.
Varki, E., A. Merchant, H. Chen, et al. 2008. “The M/M/1 fork-join
queue with variable sub-tasks”. unpublished, available online.
Vulimiri, A., O. Michel, P. B. Godfrey, and S. Shenker. 2012. “More is
less: reducing latency via redundancy”. In: Proceedings of the 11th
ACM Workshop on Hot Topics in Networks. 13–18.
Wang, W., M. Harchol-Balter, H. Jiang, A. Scheller-Wolf, and R. Srikant.
2019. “Delay asymptotics and bounds for multitask parallel jobs”.
Queueing Systems. 91(3-4): 207–239.
Wang, X., M. Chen, T. T. Kwon, L. Yang, and V. C. Leung. 2013.
“AMES-cloud: a framework of adaptive mobile video streaming and
efficient social video sharing in the clouds”. IEEE Transactions on
Multimedia. 15(4): 811–820.
Weatherspoon, H. and J. Kubiatowicz. 2002. “Erasure Coding Vs. Repli-
cation: A Quantitative Comparison”. English. In: Peer-to-Peer Sys-
tems. Ed. by P. Druschel, F. Kaashoek, and A. Rowstron. Vol. 2429.
Lecture Notes in Computer Science. Springer Berlin Heidelberg. 328–
337. isbn: 978-3-540-44179-3.
Weil, S. A., S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn.
2006. “Ceph: A scalable, high-performance distributed file system”.
In: Proceedings of the 7th symposium on Operating systems design
and implementation. 307–320.
Xiang, Y., V. Aggarwal, Y. R. Chen, and T. Lan. 2019. “Differentiated
Latency in Data Center Networks with Erasure Coded Files Through
Traffic Engineering”. IEEE Transactions on Cloud Computing. 7(2):
495–508. issn: 2372-0018. doi: 10.1109/TCC.2017.2648785.
Xiang, Y., T. Lan, V. Aggarwal, and Y. F. Chen. 2017. “Optimizing
Differentiated Latency in Multi-Tenant, Erasure-Coded Storage”.
IEEE Transactions on Network and Service Management. 14(1):
204–216. issn: 1932-4537. doi: 10.1109/TNSM.2017.2658440.
Xiang, Y., V. Aggarwal, Y.-F. Chen, and T. Lan. 2015a. “Taming La-
tency in Data Center Networking with Erasure Coded Files”. In:
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM
International Symposium on. 241–250. doi: 10.1109/CCGrid.2015.
142.
Xiang, Y., T. Lan, V. Aggarwal, and Y.-F. Chen. 2015b. “Multi-tenant
Latency Optimization in Erasure-Coded Storage with Differentiated
Services”. In: Distributed Computing Systems (ICDCS), 2015 IEEE
35th International Conference on. 790–791. doi: 10.1109/ICDCS.
2015.111.
Xiang, Y., T. Lan, V. Aggarwal, and Y. F. R. Chen. 2014. “Joint Latency
and Cost Optimization for Erasurecoded Data Center Storage”.
SIGMETRICS Perform. Eval. Rev. 42(2): 3–14. issn: 0163-5999.
doi: 10.1145/2667522.2667524. url: http://doi.acm.org/10.1145/
2667522.2667524.
Xiang, Y., T. Lan, V. Aggarwal, and Y.-F. R. Chen. 2016. “Joint latency
and cost optimization for erasure-coded data center storage”.
IEEE/ACM Transactions on Networking (TON). 24(4): 2443–2457.
Yadwadkar, N. J. and W. Choi. 2012. “Proactive straggler avoidance
using machine learning”. White paper, University of California, Berkeley.
Zaharia, M., A. Konwinski, A. D. Joseph, R. H. Katz, and I. Sto-
ica. 2008. “Improving MapReduce performance in heterogeneous
environments”. In: OSDI. Vol. 8. No. 4. 7.
Zhou, T. and C. Tian. 2020. “Fast erasure coding for data storage: a com-
prehensive study of the acceleration techniques”. ACM Transactions
on Storage (TOS). 16(1): 1–24.
Zhu, Q., A. Shankar, and Y. Zhou. 2004. “PB-LRU: A Self-tuning
Power Aware Storage Cache Replacement Algorithm for Conserving
Disk Energy”. In: Proceedings of the 18th Annual International
Conference on Supercomputing. ICS ’04. Saint-Malo, France: ACM. 79–88.
isbn: 1-58113-839-3. doi: 10.1145/1006209.1006221.
Zwart, A. and O. J. Boxma. 2000. “Sojourn time asymptotics in the
M/G/1 processor sharing queue”. Queueing systems. 35(1-4): 141–
166.