arXiv:2005.10855v1
Vaneet Aggarwal
Purdue University
[email protected]
Tian Lan
George Washington University
[email protected]
Contents

1 Introduction
1.1 Erasure Coding in Distributed Storage
1.2 Key Challenges in Latency Characterization
1.3 Problem Taxonomy
1.4 Outline of the Monograph
1.5 Notes
References
Modeling and Optimization of
Latency in Erasure-coded Storage
Systems
Vaneet Aggarwal1 and Tian Lan2
1 Purdue University
2 George Washington University
ABSTRACT
As consumers increasingly engage in social networking and e-commerce activities, businesses grow to rely on Big Data analytics for intelligence, and traditional IT infrastructures continue to migrate to the cloud and edge, demand for distributed data storage is rising at an unprecedented speed. Erasure coding has quickly emerged as a promising technique to reduce storage cost while providing reliability similar to that of replicated systems, and has been widely adopted by companies like Facebook, Microsoft, and Google. However, it also brings new challenges in characterizing and optimizing the access latency when erasure codes are used in distributed storage. The aim of this monograph is to provide a review of recent progress (both theoretical and practical) on systems that employ erasure codes for distributed storage.
In this monograph, we will first identify the key challenges
and taxonomy of the research problems and then give an
overview of different approaches that have been developed
to quantify and model latency of erasure-coded storage. This
includes recent work leveraging MDS-Reservation, Fork-Join,
Probabilistic, and Delayed-Relaunch scheduling policies, as
well as their applications to characterize access latency (e.g.,
In this chapter, we introduce the problem in Section 1.1. This is followed by the key challenges in the problem in Section 1.2. Section 1.3 explains the different approaches to the problem considered in this monograph. Section 1.4 gives the outline of the remaining chapters, and Section 1.5 provides additional notes.
1. We will use storage nodes and storage servers interchangeably throughout this monograph.
2. While we make the assumption of fixed chunk size here to simplify the problem formulation, the results can be easily extended to variable chunk sizes. Nevertheless, fixed chunk sizes are indeed used by many existing storage systems (Dimakis et al., 2004; Aguilera et al., 2005; Lv et al., 2002).
[Figure 1.1: file A is encoded into chunks 1: a1, 2: a2, 3: a1+a2, 4: a1+2a2 under (4,2) coding, and file B into chunks 5: b1, 6: b2, 7: b1+b2 under (3,2) coding; the chunks are spread over five storage nodes, and a scheduler dispatches incoming requests.]
Figure 1.1: An erasure-coded storage of 2 files, each partitioned into 2 blocks and encoded using (4, 2) and (3, 2) MDS codes, respectively. The resulting file chunks are spread over 5 storage nodes. Any file request must be processed by 2 distinct nodes that have the desired chunks. Nodes 3 and 4 are shared and can process requests for both files.
In Figure 1.1, files A and B are encoded using (4, 2) and (3, 2) MDS codes, respectively; file A has chunks $A_1, A_2, A_3, A_4$, and file B has chunks $B_1, B_2, B_3$. As depicted in Fig. 1.2, each file request arrives as a batch of $k_i = 2$ chunk requests, e.g., $(R_1^{A,1}, R_1^{A,2})$, $(R_2^{A,1}, R_2^{A,2})$, and $(R_1^{B,1}, R_1^{B,2})$, where $R_i^{A,j}$ denotes the $i$th request of file A and $j = 1, 2$ denotes the first or second chunk request of this file request. Denote the five nodes (from left to right) as servers 1, 2, 3, 4, and 5, and initialize 4 file requests for file A and 3 file requests for file B, i.e., requests for the different files have different arrival rates. The two chunks of one file request can be any two different chunks
[Figure 1.2: batches of chunk requests $R_i^{A,j}$ and $R_i^{B,j}$ queued at the five storage servers and dispatched by the scheduler.]
3. Queueing analysis is not applicable to delayed relaunch.
Table 1.2: The different regimes for the known results of the different scheduling
algorithms
Table 1.2 describes the different cases in which the analysis of these algorithms has been studied. The first row is for a single file and homogeneous servers. As mentioned earlier, the MDS-Reservation scheduling does not achieve the optimal stability region for
[Figure 1.3: mean latency (in seconds) versus arrival rate for Fork-Join scheduling, probabilistic scheduling, and MDS-Reservation(1000).]
Figure 1.3: Comparison of the different strategies in simulation. We note that prob-
abilistic scheduling outperforms the other strategies with the considered parameters.
choose the best one among these. Note that even though fork-join queues have not been analyzed for heterogeneous servers, the results indicate the simulated performance. The simulation results are provided in Fig. 1.3. We note that MDS-Reservation and the fork-join queue do not achieve the optimal stable throughput region, and their mean latency can be seen to diverge at lower arrival rates. Further, we note that probabilistic scheduling performs better than fork-join scheduling for all arrival rates in this system.
1.5 Notes
2.1 MDS-Reservation Queue
The key idea in analyzing these MDS queues is to show that the corresponding Markov chains belong to a class of processes known as Quasi-Birth-Death (QBD) processes (Lee et al., 2017). Thus, the steady-state distribution can be obtained by exploiting the properties of QBD processes. More precisely, a birth-death process is defined as a continuous-time Markov process on discrete states $\{0, 1, 2, \ldots\}$, with transition rate $\lambda$ from state $i$ to $i+1$, transition rate $\mu$ from state $i+1$ to $i$, and rates $\mu_0, \lambda_0$ to and from the boundary state $i = 0$, respectively. A QBD process is a generalization of such birth-death processes whose states $i$ are each replaced by a set of states, known as a level. Thus a QBD process can have transitions both within a level and between levels.
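To make the matrix-geometric structure concrete, the following toy sketch (ours, not from (Lee et al., 2017)) computes the rate matrix R of a QBD, for which the stationary distribution satisfies $\pi_{l+1} = \pi_l R$, using the degenerate case where each level is a single state (an M/M/1 birth-death chain):

import numpy as np

# A minimal sketch (ours): for a QBD with level-up, within-level, and
# level-down generator blocks A0, A1, A2, the stationary distribution is
# matrix-geometric, pi_{l+1} = pi_l @ R, where R is the minimal solution of
# A0 + R A1 + R^2 A2 = 0. We compute R by fixed-point iteration.
def qbd_rate_matrix(A0, A1, A2, iters=5000):
    R = np.zeros_like(A0)
    A1_inv = np.linalg.inv(A1)
    for _ in range(iters):
        R = -(A0 + R @ R @ A2) @ A1_inv
    return R

# Degenerate example: an M/M/1 birth-death chain is a QBD with 1x1 levels.
lam, mu = 1.0, 2.0
R = qbd_rate_matrix(np.array([[lam]]), np.array([[-(lam + mu)]]), np.array([[mu]]))
print(R)  # converges to [[lam/mu]] = [[0.5]]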
2.2 Characterization of Latency Upper Bound via MDS-Reservation Scheduling
Proof. We briefly summarize the proof in (Lee et al., 2017). For any state of the system $(w_1, w_2, \ldots, w_t, m) \in \{0, 1, \ldots, k\}^t \times \{0, 1, \ldots, \infty\}$, define
$$q = \begin{cases} 0 & \text{if } w_1 = 0, \\ t & \text{else if } w_t \neq 0, \\ \arg\max\{\tau : w_\tau \neq 0,\ 1 \le \tau \le t\} & \text{otherwise.} \end{cases} \qquad (2.2)$$
Then we can find the number of waiting request batches (b), the number
of idle servers in the system (z), the number of jobs of ith waiting batch
in the servers (si ), and the number of jobs of ith waiting batch in the
buffer (wi ) as follows:
$$b = \begin{cases} 0 & \text{if } q = 0, \\ q & \text{else if } 0 < q < t, \\ t + \frac{m - \sum_j w_j - n}{k} & \text{otherwise,} \end{cases} \qquad (2.3)$$

$$z = n - \left(m - \sum_j w_j - (b - t)^+ k\right), \qquad (2.4)$$

$$s_i = \begin{cases} w_{i+1} - w_i & \text{if } i \in \{1, \ldots, q-1\}, \\ k - z - w_i & \text{if } i = q, \\ 0 & \text{if } i \in \{q+1, \ldots, b\}, \end{cases} \quad \text{for } i \in \{1, \ldots, b\}. \qquad (2.5)$$
Using this queue model, we can also find the stability region of the MDS-Reservation(t) scheduling policy. While an exact characterization is intractable in general, bounds on the maximum stability region, defined as the maximum possible number of requests that can be served by the system per unit time (without resulting in infinite queue lengths), are given in (Lee et al., 2017).
Theorem 2.3 ((Lee et al., 2017)). For any given (n, k) and t > 1, the maximum throughput $\lambda^*_{\text{Resv}(t)}$ in the stability region satisfies the following inequalities when k is treated as a constant:
$$\left(1 - O(n^{-2})\right)\frac{n}{k}\mu \;\le\; \lambda^*_{\text{Resv}(t)} \;\le\; \frac{n}{k}\mu. \qquad (2.7)$$
Proof. First, we note that for t ≥ 2, the latency of each of the MDS-Reservation(t) queues is upper bounded by that of MDS-Reservation(1), since fewer batches of chunk requests are blocked and kept from moving forward into the servers as t increases.

Next, we evaluate the maximum throughput in the stability region of MDS-Reservation(1) by exploiting properties of QBD systems. We follow the proof in (Lee et al., 2017). Using the QBD process representation in Equation (2.6), the maximum throughput $\lambda^*_{\text{Resv}(t)}$ of any QBD system is the value of $\lambda$ such that there exists $v$ satisfying $v^T(A_0 + A_1 + A_2) = 0$ and $v^T A_0 \mathbf{1} = v^T A_1 \mathbf{1}$, where $\mathbf{1}$ is an all-one vector. For fixed values of $\mu$ and $k$, it is easy to verify that the matrices $A_0$, $A_1$, and $A_2$ are affine transformations of the arrival rate $\lambda$. Plugging the values of $A_0$, $A_1$, and $A_2$ into the QBD representation of MDS-Reservation(1) queues, we can show that such a vector $v$ exists if $\lambda^*_{\text{Resv}(1)} \ge (1 - O(n^{-2}))\frac{n}{k}\mu$.

It then follows that $\lambda^*_{\text{Resv}(t)} \ge \lambda^*_{\text{Resv}(1)} \ge (1 - O(n^{-2}))\frac{n}{k}\mu$ for any t ≥ 2. The upper bound on $\lambda^*_{\text{Resv}(t)}$ is straightforward: since each batch consists of k chunk requests, the rate at which batches exit the system (for all n servers combined) is at most $n\mu/k$.
same file request must be processed by distinct servers after the first
t requests. It applies the MDS scheduling policy whenever there are t
or fewer file requests (i.e., t or fewer batches of chunk requests) in the
system, while ignoring the requirement of distinct servers when there
are more than t file requests.
Theorem 2.4 ((Lee et al., 2017)). The Markovian representation of the $M^k/M/n(t)$ queue has a state space $\{0, 1, \ldots, k\}^t \times \{0, 1, \ldots, \infty\}$. It is a QBD process with boundary states $\{0, 1, \ldots, k\}^t \times \{0, 1, \ldots, n + tk\}$ and levels $\{0, 1, \ldots, k\}^t \times \{n - k + 1 + jk, \ldots, n + jk\}$ for $j \in \{t + 1, \ldots, \infty\}$.
Proof. We again define q for any system state $(w_1, w_2, \ldots, w_t, m) \in \{0, 1, \ldots, k\}^t \times \{0, 1, \ldots, \infty\}$ as in Equation (2.2). The values of b, z, $s_i$, and $w_i$ can be derived accordingly and are identical to those in Section 2.2. These equations capture the entire state transitions. It is then easy to see that the $M^k/M/n(t)$ queue satisfies the following two properties: i) any transition changes the value of m by at most k; and ii) for $m \ge n - k + 1 + tk$, the transition from any state $(\mathbf{w}, m)$ to any other state $(\mathbf{w}', m' \ge n - k + 1 + tk)$ depends on $m \bmod k$ and not on the actual value of m. This results in a QBD process with the boundary states and levels described in the theorem.
under (n, k) MDS codes, which sends each file request (redundantly) to v > k servers. Clearly, upon completion of any k-out-of-v chunk requests, the file request is considered served, and the remaining v − k active chunk requests can be canceled and removed from the system. It is easy to see that redundant requests can reduce the latency of individual requests at the expense of an increase in overall queuing delay, due to the use of additional resources on the v − k straggler requests. We note that when k = 1, the redundant-request policy reduces to a replication-based scheme.
Formally, an MDS queue with redundant requests is associated with five parameters, (n, k), [λ, µ], and the redundancy level v ≥ k, satisfying the following modified assumptions: i) file requests arrive in batches of v chunk requests each; ii) each of the v chunk requests in a batch can be served by an arbitrary set of v distinct servers; iii) each batch of v chunk requests is served when any k of the v requests are served.
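To see why redundancy shortens the service phase, note that with i.i.d. exponential chunk service times the time to collect k out of v completions is the k-th order statistic of v exponentials, with mean $(H_v - H_{v-k})/\mu$, which decreases in v. A small sketch (ours; it ignores queueing, which Theorem 2.6 below accounts for):

# Mean of the k-th order statistic of v i.i.d. Exp(mu) variables: (H_v - H_{v-k})/mu.
def mean_k_of_v(k, v, mu=1.0):
    harmonic = lambda n: sum(1.0 / j for j in range(1, n + 1))
    return (harmonic(v) - harmonic(v - k)) / mu

for v in range(5, 11):        # k = 5 chunk completions out of v redundant requests
    print(v, round(mean_k_of_v(5, v), 3))   # decreasing in v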
While empirical results in (Ananthanarayanan et al., Submitted; Liang and Kozat, 2013; Vulimiri et al., 2012) demonstrated that the use of redundant requests can lead to smaller latency under various settings, a theoretical analysis of the latency, and thus a quantification of the benefits, is still an open problem. Nevertheless, structural results have been obtained in (Ananthanarayanan et al., Submitted) using MDS-queue models, e.g., to show that request flooding can indeed reduce latency in certain special cases.
Theorem 2.6. Consider a homogeneous MDS(n, k) queue with Poisson arrivals, exponential service times, and identical service rates. If the system is stable in the absence of redundant requests, then a system with the maximum number v = n of redundant requests achieves a strictly smaller average latency than any other redundant-request policy, including no redundancy (v = k) and time-varying redundancy.
Proof. Consider two systems: system S1 with redundancy level v < n and system S2 with redundancy level n. We need to prove that under the same sequence of events (i.e., arrivals and server completions), the number of batches remaining in system S1 is at least as large as that in S2 at any given time. To this end, we use the notion "time z" to denote the time immediately following the zth arrival/departure event.
2.5 Simulations
Figure 2.2: A comparison of average file access latency and its upper/lower bounds through MDS-Reservation(t) and $M^k/M/n(t)$ scheduling policies.
The proposed latency bounds, using MDS-Reservation(t) and $M^k/M/n(t)$ scheduling policies respectively, have been compared in (Lee et al., 2017) through numerical examples. For an MDS system with (n, k) = (10, 5) and µ = 1, Figure 2.2 plots the average file access latency for various scheduling policies. Here, the average latency bounds under MDS-Reservation(t) and $M^k/M/n(t)$ scheduling policies are computed by applying Little's Law to the stationary distribution. A Monte-Carlo simulation is employed to numerically find the exact latency of MDS scheduling.
Figure 2.3: Simulation results showing the reduction of average latency with an
increase in the redundant level v for an MDS(10,5) queue.
For a homogeneous MDS(n, k) queue with redundant requests, Figure 2.3 shows the simulated file access latency for varying redundancy levels v. It corroborates the analysis that when the service times are i.i.d. exponential, the average latency is minimized by v = n requests. Further, the average latency appears to strictly decrease with an increase in the redundancy level v. It is unclear, however, whether this property carries over to general service time distributions.
2.6 Notes and Open Problems

The study of latency using MDS queues was initiated by (Huang et al., 2012b), which considers a special case of the "block-one-scheduling" policy to obtain an upper bound on service latency. For arbitrary service time distributions, an analysis of the blocking probability was presented in (Ferner et al., 2012) in the absence of a shared request buffer. Later, these results were extended in (Lee et al., 2017) to general MDS-Reservation(t) queues, and a tighter upper bound on request latency was provided. There are a number of open problems that can be considered in future work.
We now provide bounds on the expected file latency, which is the mean response time $T_{(n,k)}$ of the (n, k) fork-join system. It is the expected time that a job spends in the system, from its arrival until k out of n of its tasks are served by their respective nodes.

Since the n tasks are served by independent M/M/1 queues, intuition suggests that $T_{(n,k)}$ is the expected value of the $k$th order statistic of n exponential service times. However, this is not true, which makes the analysis of $T_{(n,k)}$ challenging. The order-statistics approach does not work because of the cancellation of tasks in the queues: when a job completes, its remaining tasks abandon their queues, and this abandonment has to be taken into account.
Let $H_{x,y}^z$ be a generalized harmonic number of order z defined by
$$H_{x,y}^z = \sum_{j=x+1}^{y} \frac{1}{j^z}. \qquad (3.1)$$
Theorem 3.1 ((Joshi et al., 2014)). The expected file latency, $T_{(n,k)}$, satisfies
$$T_{(n,k)} \le \frac{H^1_{n-k,n}}{\mu} + \frac{\lambda\left[H^2_{n-k,n} + (H^1_{n-k,n})^2\right]}{2\mu^2\left(1 - \rho H^1_{n-k,n}\right)}, \qquad (3.2)$$
where $\lambda$ is the request arrival rate, $\mu$ is the service rate at each queue, and $\rho = \lambda/\mu$ is the load factor. We note that the bound is valid only when $\rho H^1_{n-k,n} < 1$.
Proof. To find this upper bound, we use a model called the split-merge system, which is similar to, but easier to analyze than, the fork-join system. In the (n, k) fork-join queueing model, after a node serves a task, it can start serving the next task in its queue. On the contrary, in the split-merge model, the n nodes are blocked until k of them finish service. Thus, the job departs all the queues at the same time. Due to this blocking of nodes, the mean response time of the (n, k) split-merge model is an upper bound on (and a pessimistic estimate of) $T_{(n,k)}$ for the (n, k) fork-join system.

The (n, k) split-merge system is equivalent to an M/G/1 queue where arrivals are Poisson with rate λ and the service time is a random variable S distributed according to the $k$th order statistic of the exponential distribution.
The mean and variance of S are given as
$$\mathbb{E}[S] = \frac{H^1_{n-k,n}}{\mu} \quad \text{and} \quad \operatorname{Var}[S] = \frac{H^2_{n-k,n}}{\mu^2}. \qquad (3.3)$$
The Pollaczek-Khinchin formula (Zwart and Boxma, 2000) gives the mean response time T of an M/G/1 queue in terms of the mean and variance of S as
$$T = \mathbb{E}[S] + \frac{\lambda\left(\mathbb{E}[S]^2 + \operatorname{Var}[S]\right)}{2\left(1 - \lambda\mathbb{E}[S]\right)}. \qquad (3.4)$$
Substituting the values of $\mathbb{E}[S]$ and $\operatorname{Var}[S]$ given by (3.3), we get the upper bound (3.2). Note that the Pollaczek-Khinchin formula is valid only when $\frac{1}{\lambda} > \mathbb{E}[S]$, the stability condition of the M/G/1 queue. Since $\mathbb{E}[S]$ increases with k, there exists a $k_0$ such that the M/G/1 queue is unstable for all $k \ge k_0$. The inequality $\frac{1}{\lambda} > \mathbb{E}[S]$ can be simplified to $\rho H^1_{n-k,n} < 1$, which is the condition for validity of the upper bound given in Theorem 3.1.
We also note that the stability condition for the upper bound is $\rho H^1_{n-k,n} < 1$, which is not the same as the stability condition of the fork-join queue, $\lambda < n\mu/k$. This shows that the upper-bound technique is loose and does not give an efficient bound in the region close to $\lambda = n\mu/k$. We now find a lower bound on the latency in the following theorem.
Theorem 3.2 ((Joshi et al., 2014)). The expected file latency, $T_{(n,k)}$, satisfies
$$T_{(n,k)} \ge \sum_{j=0}^{k-1} \frac{1}{(n-j)\mu - \lambda}, \qquad (3.5)$$
where λ is the request arrival rate and µ is the service rate at each queue.
Proof. The lower bound in (3.5) is a generalization of the bound for the (n, n) fork-join system derived in (Varki et al., 2008). The bound for the (n, n) system is derived by considering that a job goes through n stages of processing. A job is said to be in the $j$th stage if j out of n tasks have been served by their respective nodes, for $0 \le j \le n-1$. The job waits for the remaining $n-j$ tasks to be served, after which it departs the system. For the (n, k) fork-join system, since we only need k tasks to finish service, each job now goes through k stages of processing. In the $j$th stage, where $0 \le j \le k-1$, j tasks have been served, and the job departs when $k-j$ more tasks finish service.
We now show that the service rate of a job in the $j$th stage of processing is at most $(n-j)\mu$. Consider two jobs B1 and B2 in the $i$th and $j$th stages of processing, respectively. Let $i > j$, that is, B1 has completed more tasks than B2. Job B2 moves to the $(j+1)$th stage when one of its $n-j$ remaining tasks completes. If all these tasks are at the heads of their respective queues, the service rate for job B2 is exactly $(n-j)\mu$. However, since $i > j$, B1's task could be ahead of B2's in one of the $n-j$ pending queues, due to which that task of B2 cannot be immediately served. Hence, the service rate of a job in the $j$th stage of processing is at most $(n-j)\mu$.
Thus, the time for a job to move from the $j$th to the $(j+1)$th stage is lower bounded by $1/((n-j)\mu - \lambda)$, the mean response time of an M/M/1 queue with arrival rate λ and service rate $(n-j)\mu$. The total mean response time is the sum of the mean response times of each of the k stages of processing and is bounded below as in the statement of the theorem.

We note that the lower bound does not achieve the optimal stability region, as it gives the stability threshold $\lambda < (n-k+1)\mu$.
An approximate characterization of latency has also been studied (Badita et al., 2019). The approach follows the structure of the lower bound mentioned above, which goes in stages. A job is said to be in the $j$th stage if j out of n tasks have been served by their respective nodes, for $0 \le j \le k-1$. Since the job goes from stage 0 to stage 1, all the way to stage $k-1$, and is then served when k chunks have been serviced, the procedure is akin to a tandem queue where a service at stage j leads to stage $j+1$. Thus, we consider k tandem queues for the approximation, which are assumed to be uncoupled, labeled as queue $j \in \{0, \cdots, k-1\}$. The arrivals at tandem queue 0 are the external arrivals, which are Poisson at rate λ. Since it is a tandem queue and the service time is assumed to be exponential, the arrival rate at each queue will be λ (Ross, 2019). In the case of the lower bound, the service rate for tandem queue j was taken as $(n-j)\mu$; this is where a better approximation will be used. We let $\gamma_j$ be the approximate service rate of queue j and $\pi_j(r)$ be the probability that the queue length of tandem queue j is r.

The service rate of queue $k-1$ is $\gamma_{k-1} = (n-k+1)\mu$, as in the lower bound. For the other queues, the service rate includes µ and the additional resources from the later queues, which for the lower bound became $(n-j)\mu$. However, the later queues are not always empty, in which case their resources cannot be used to serve the earlier queues. In the approximation, we let the resources of the later queues help the earlier queues only when they are empty. Using the additional resources of tandem queue $k-1$ to serve requests at queue $k-2$ when queue $k-1$ is empty gives $\gamma_{k-2} = \mu + \gamma_{k-1}\pi_{k-1}(0)$; since queue $k-1$ behaves as an M/M/1 queue, $\pi_{k-1}(0) = 1 - \lambda/\gamma_{k-1}$, so that $\gamma_{k-2} = \mu + \gamma_{k-1} - \lambda$. Proceeding back with the same method,
$$T_{(n,k)} \approx \sum_{j=0}^{k-1} \frac{1}{\gamma_j - \lambda} = \sum_{j=0}^{k-1} \frac{1}{(n-j)\mu - (k-j)\lambda}. \qquad (3.8)$$
This is summarized in the following lemma.
Lemma 3.3 ((Badita et al., 2019)). The expected file latency, $T_{(n,k)}$, can be approximated as
$$T_{(n,k)} \approx \sum_{j=0}^{k-1} \frac{1}{(n-j)\mu - (k-j)\lambda}, \qquad (3.9)$$
where λ is the request arrival rate and µ is the service rate at each queue.
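To make these characterizations concrete, the following Python sketch (ours; parameter values are illustrative) evaluates the split-merge upper bound (3.2), the staged lower bound (3.5), and the tandem-queue approximation (3.9):

def H(x, y, z):
    """Generalized harmonic number H_{x,y}^z = sum_{j=x+1}^{y} 1/j^z, as in (3.1)."""
    return sum(1.0 / j**z for j in range(x + 1, y + 1))

def upper_bound(n, k, lam, mu):
    """Split-merge upper bound (3.2); valid only when rho * H^1_{n-k,n} < 1."""
    h1, h2 = H(n - k, n, 1), H(n - k, n, 2)
    rho = lam / mu
    assert rho * h1 < 1, "outside the validity region of the bound"
    return h1 / mu + lam * (h2 + h1**2) / (2 * mu**2 * (1 - rho * h1))

def lower_bound(n, k, lam, mu):
    """Staged lower bound (3.5); stage j is an M/M/1 queue with rate (n-j)*mu."""
    return sum(1.0 / ((n - j) * mu - lam) for j in range(k))

def approximation(n, k, lam, mu):
    """Tandem-queue approximation (3.9)."""
    return sum(1.0 / ((n - j) * mu - (k - j) * lam) for j in range(k))

n, k, lam, mu = 10, 5, 0.5, 1.0
print(upper_bound(n, k, lam, mu), approximation(n, k, lam, mu), lower_bound(n, k, lam, mu))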
3.3 Extension to General Service Time Distributions

Theorem 3.4 ((Joshi et al., 2014)). The mean response time $T_{(n,k)}$ of an (n, k) fork-join system with general service time X such that $\mathbb{E}[X] = \frac{1}{\mu}$
We note that the stability condition for the upper bound of latency is $\lambda\left(\frac{1}{\mu} + \sigma\sqrt{\frac{k-1}{n-k+1}}\right) < 1$. For deterministic service times, $\sigma = 0$.
Theorem 3.5 ((Joshi et al., 2017)). The expected file latency, $T_{(n,k)}$, satisfies
$$T_{(n,k)} \ge \beta + \frac{1}{n\alpha} + \frac{\lambda\left[\left(\beta + \frac{1}{n\alpha}\right)^2 + \left(\frac{1}{n\alpha}\right)^2\right]}{2\left(1 - \lambda\left(\beta + \frac{1}{n\alpha}\right)\right)} + \sum_{j=1}^{k-1} \frac{\beta + \frac{1}{n\alpha}}{(n-j) - \lambda\left(\beta + \frac{1}{n\alpha}\right)}, \qquad (3.15)$$
where λ is the request arrival rate and the service distribution at each queue is Sexp(β, α).
We now extend the setup to the case of r files, where each file i is encoded using an $(n, k_i)$ MDS code. We assume that file i is of size $l_i$. The arrival process for file i is assumed to be Poisson with rate $\lambda_i$. The service time at each server is assumed to follow an exponential distribution with service rate µ (per unit file size). The effective service rate at any server for file i is $\mu_i = \frac{k_i\mu}{l_i}$, since each server stores a $1/k_i$ fraction of the data. Let $\rho_i = \frac{\lambda_i}{\mu_i}$ be the server utilization factor for file i. The following result describes the conditions for the queues to be stable using fork-join queueing.
Lemma 3.6 ((Kumar et al., 2017)). For the system to be stable using the fork-join queueing system, we require
$$\left(\sum_{i=1}^{r} k_i\lambda_i\right)\left(\sum_{i=1}^{r} \frac{\lambda_i l_i}{k_i}\right) < n\mu\left(\sum_{i=1}^{r} \lambda_i\right). \qquad (3.16)$$
Proof. Jobs of file i enter the queue with rate $\lambda_i$. Each file-i job is serviced by the system when $k_i$ sub-tasks of that job are completed. The remaining $n - k_i$ sub-tasks are then cleared from the system. Thus, for each request of file i, a fraction $\frac{n-k_i}{n}$ of the sub-tasks are deleted, and hence the effective arrival rate of file i at any server is $\lambda_i\left(1 - \frac{n-k_i}{n}\right) = \frac{k_i\lambda_i}{n}$. Thus the overall arrival rate at any server, $\lambda_{\text{eff}}$, is
$$\lambda_{\text{eff}} = \sum_{i=1}^{r} \frac{k_i\lambda_i}{n}, \qquad (3.17)$$
where (3.18) follows from the assumption that the service time for file i is exponential with rate $\mu_i$. To ensure stability, the net arrival rate should be less than the average service rate at each server. Thus, from (3.17) and (3.18), we obtain (3.16).
Theorem 3.7 ((Kumar et al., 2017)). The average latency for job requests of file i using fork-join queueing is upper-bounded as follows:
$$T_i \le \underbrace{\frac{H^1_{n-k_i,n}}{\mu_i}}_{\text{Service time}} + \underbrace{\frac{\sum_{i=1}^{r} \lambda_i\left[H^2_{n-k_i,n} + (H^1_{n-k_i,n})^2\right]/\mu_i^2}{2\,(1 - S_r)}}_{\text{Waiting time}}. \qquad (3.19)$$
Without loss of generality, assume the files are relabeled such that $k_1 \le k_2 \le \cdots \le k_r$. The next theorem provides a lower bound on the latency of file i.
Theorem 3.8 ((Kumar et al., 2017)). The average latency for file i is lower-bounded as follows:
$$T_i \ge \sum_{s=1}^{k_i}\left(\underbrace{\frac{t_{s,i}}{\lambda_i}}_{\text{service time}} + \underbrace{\frac{\sum_{j=c_{s,i}+1}^{r} \frac{t_{s,j}^2}{\lambda_j}}{1 - \sum_{j=c_{s,i}+1}^{r} t_{s,j}}}_{\text{waiting time}}\right), \qquad (3.22)$$
where $t_{s,i} = \frac{\lambda_i}{(n-s+1)\mu_i}$, and $c_{s,i}$ is given as
$$c_{s,i} = \begin{cases} 0, & 1 \le s \le k_1, \\ 1, & k_1 < s \le k_2, \\ \vdots & \\ i-1, & k_{i-1} < s \le k_i. \end{cases} \qquad (3.23)$$
$$T^i_{\text{FCFS},s} = \mathbb{E}[S^s_i] + \frac{\lambda\,\mathbb{E}[(S^s)^2]}{2\left(1 - \lambda\,\mathbb{E}[S^s]\right)}, \qquad (3.24)$$
where $S^s$ is a random variable denoting the service time for any sub-task in stage s and $S^s_i$ denotes the service time for a sub-task of class i in stage s, which are given as
$$\mathbb{E}[S^s] = \sum_{i=c_{s,i}+1}^{r} p_i\,\mathbb{E}[S^s_i], \qquad \mathbb{E}[(S^s)^2] = \sum_{i=c_{s,i}+1}^{r} p_i\,\mathbb{E}[(S^s_i)^2], \qquad (3.25)$$
where $p_i = \frac{\lambda_i}{\sum_{i=1}^{r}\lambda_i}$. Substituting (3.25) in (3.24), we get
$$T^i_{s,c_{s,i}} = \mathbb{E}[S^s_i] + \frac{\sum_{j=c_{s,i}+1}^{r} \lambda_j\,\mathbb{E}[(S^s_j)^2]}{2\left(1 - \sum_{j=c_{s,i}+1}^{r} \lambda_j\,\mathbb{E}[S^s_j]\right)}. \qquad (3.26)$$
Now we note that at any stage s, the maximum possible service rate for a request of file j that is not yet finished is $(n-s+1)\mu_j$. This happens when all the remaining sub-tasks of the request of file j are at the heads of their buffers. Thus, we can enhance the latency performance in each stage s by approximating it with an M/G/1 system with service rate $(n-s+1)\mu_j$ for a request of file j. Then, the average latency for a sub-task of a request of file i in stage s is lower bounded as
$$T^i_{s,c_{s,i}} \ge \frac{1}{(n-s+1)\mu_i} + \frac{\sum_{j=c_{s,i}+1}^{r} \frac{\lambda_j}{\left((n-s+1)\mu_j\right)^2}}{1 - \sum_{j=c_{s,i}+1}^{r} \frac{\lambda_j}{(n-s+1)\mu_j}}. \qquad (3.27)$$
Finally, the average latency for file i in this enhanced system is simply $\sum_{s=1}^{k_i} T^i_{s,c_{s,i}}$. This gives us the result as in the statement of the theorem.
3.5 Simulations
Figure 3.2: This graph displays the latency as the number of servers n increases. Throughout, the code rate is kept constant at k/n = 0.5, the arrival rate is set to λ = 0.3, and the service rate of each server is µ = 0.5. The approximate result, upper bound, and lower bound of Section 3.2 are depicted along with the simulation results.
The (n, k) fork-join system was first proposed in (Joshi et al., 2014) to analyze content download latency from erasure-coded distributed storage for exponential service times. They consider that a content file coded into n chunks can be recovered by accessing any k out of the n chunks.
Figure 3.3: This graph displays the latency as k increases. We let n = 24, λ = 0.45, and µ = k/n. The approximate result, upper bound, and lower bound of Section 3.2 are depicted along with the simulation results.
4.1 Probabilistic Scheduling

We assume the model given in Section 1.1. Under $(n_i, k_i)$ MDS codes, each file i can be retrieved by processing a batch of $k_i$ chunk requests at distinct nodes that store the file chunks. Recall that each encoded file i is spread over $n_i$ nodes, denoted by a set $S_i$. Upon the arrival of a file-i request, in probabilistic scheduling we randomly dispatch the batch of $k_i$ chunk requests to a set $A_i \subseteq S_i$ of $k_i$ distinct nodes, chosen with probability $\mathbb{P}(A_i)$.
[Figure: batches of chunk requests $R_i^{A,j}$ and $R_i^{B,j}$ queued at the storage servers and dispatched by the scheduler, as in Figure 1.2.]
Lemma 4.1. For given erasure codes and chunk placement, the average service latency of probabilistic scheduling with feasible probabilities $\{\mathbb{P}(A_i) : \forall i, A_i\}$ upper bounds the latency of optimal scheduling.
The next result formally shows that the optimization can be transformed into an equivalent form, which only requires $\sum_i n_i$ variables.
Proof. We first prove that the conditions $\sum_{j=1}^{m} \pi_{i,j} = k_i\ \forall i$ and $\pi_{i,j} \in [0, 1]$ are necessary. $\pi_{i,j} \in [0, 1]$ for all i, j is obvious due to its definition. Then, it is easy to show that
$$\sum_{j=1}^{m} \pi_{i,j} = \sum_{j=1}^{m} \sum_{A_i \subseteq S_i} \mathbf{1}_{\{j \in A_i\}}\,\mathbb{P}(A_i) = \sum_{A_i \subseteq S_i} \sum_{j \in A_i} \mathbb{P}(A_i) = \sum_{A_i \subseteq S_i} k_i\,\mathbb{P}(A_i) = k_i. \qquad (4.3)$$
It is easy to show that $\sum_{j \in S_i} \hat{\pi}_{i,j} = \pi_{i,h} + \sum_{j \in S_i} \pi_{i,j} = k_i$ and $\hat{\pi}_{i,j} \in [0, 1]$, because $\hat{\pi}_{i,j} = \max(u, \pi_{i,j}) \in [0, 1]$. Here we used the fact that $u < 1$, since $k_i = \sum_{j \in S_i} \hat{\pi}_{i,j} \ge \sum_{j \in S_i} u \ge k_i u$. Therefore, the system of linear equations in (4.4) with $\hat{\pi}_{i,j}$ on the right-hand side must have a non-negative solution due to our induction assumption for $n_i = |S_i|$. Furthermore, without loss of generality, we assume that $y_h \ge y_j$ for all $j \in S_i$ (otherwise a different h can be chosen). It implies that
$$\begin{aligned} \sum_{j \in S_i} y_j\hat{\pi}_{i,j} &= \sum_{j \in S_i} y_j\left(\pi_{i,j} + [u - \pi_{i,j}]^+\right) \\ &\overset{(a)}{\le} \sum_{j \in S_i} y_j\pi_{i,j} + \sum_{j \in S_i} y_h\,[u - \pi_{i,j}]^+ \\ &\overset{(b)}{=} \sum_{j \in S_i} y_j\pi_{i,j} + y_h \sum_{j \in S_i} [u - \pi_{i,j}]^+ \\ &\overset{(c)}{=} \sum_{j \in S_i} y_j\pi_{i,j} + y_h\,\pi_{i,h} \;\overset{(d)}{\le}\; 0, \end{aligned} \qquad (4.8)$$
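While the proof above establishes that feasible marginals can be realized, one also needs, in practice, to sample a set $A_i$ of $k_i$ distinct nodes with given inclusion probabilities $\pi_{i,j}$. One standard constructive option, sketched here purely as an illustration (ours, not the construction from the references), is systematic sampling:

import random

# A sketch (ours): systematic sampling selects a k-subset such that node j is
# included with marginal probability pi[j]. Requires sum(pi) = k and pi[j] <= 1.
# One uniform u is drawn; node j is selected whenever a threshold u, u+1, ...,
# u+k-1 falls inside j's sub-interval of the cumulative sums.
def sample_subset(pi, k, rng=random):
    u = rng.random()
    chosen, cum, t = [], 0.0, None
    t = u
    for j, p in enumerate(pi):
        cum += p
        while t < cum and len(chosen) < k:
            chosen.append(j)   # p <= 1 ensures each node is selected at most once
            t += 1.0
    return chosen

# Example: marginals summing to k = 2 over four nodes.
pi = [0.9, 0.5, 0.4, 0.2]
counts = [0] * 4
for _ in range(100_000):
    for j in sample_subset(pi, 2):
        counts[j] += 1
print([c / 100_000 for c in counts])   # approximately [0.9, 0.5, 0.4, 0.2]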
$$\mathbb{E}[Q_i] \triangleq \mathbb{E}_{W_{i,j}}\left[\mathbb{E}_{A_i}\left[\max_{j \in A_i} W_{i,j}\right]\right], \qquad (4.11)$$
where the first expectation $\mathbb{E}_{W_{i,j}}$ is taken over the system queuing dynamics and the second expectation $\mathbb{E}_{A_i}$ is taken over the random dispatch decisions $A_i$. Hence, we derive an upper bound on the expected latency of a file i, i.e., $\mathbb{E}[Q_i]$, as follows. Using Jensen's inequality (Kuczma, 2009a), we have for $t_i > 0$
$$e^{t_i\mathbb{E}[Q_i]} \le \mathbb{E}\left[e^{t_iQ_i}\right]. \qquad (4.12)$$

$$\begin{aligned} \mathbb{E}\left[e^{t_iQ_i}\right] &\overset{(a)}{=} \mathbb{E}_{A_i,W_{i,j}}\left[\max_{j \in A_i} e^{t_iW_{i,j}}\right] \\ &= \mathbb{E}_{A_i}\left[\mathbb{E}_{W_{i,j}}\left[\max_{j \in A_i} e^{t_iW_{i,j}} \,\middle|\, A_i\right]\right] \\ &\overset{(b)}{\le} \mathbb{E}_{A_i}\left[\sum_{j \in A_i} \mathbb{E}_{W_{i,j}}\left[e^{t_iW_{i,j}}\right]\right] \\ &= \sum_{j} \mathbb{E}_{A_i}\left[\mathbb{E}_{W_{i,j}}\left[e^{t_iW_{i,j}}\right]\mathbf{1}_{(j \in A_i)}\right] \\ &= \sum_{j} \mathbb{E}_{W_{i,j}}\left[e^{t_iW_{i,j}}\right]\mathbb{E}_{A_i}\left[\mathbf{1}_{(j \in A_i)}\right] \\ &= \sum_{j} \mathbb{E}_{W_{i,j}}\left[e^{t_iW_{i,j}}\right]\mathbb{P}(j \in A_i) \\ &\overset{(c)}{=} \sum_{j} \pi_{i,j}\,\mathbb{E}_{W_{i,j}}\left[e^{t_iW_{i,j}}\right], \end{aligned} \qquad (4.13)\text{–}(4.19)$$
where (a) follows from (4.11) and (4.12), (b) follows by replacing the maximum by $\sum_{j \in A_i}$, and (c) follows by probabilistic scheduling. We note that the only inequality here is the replacement of the maximum by the sum. However, since this term will be inside the logarithm for the mean latency, the gap between the term and its bound becomes additive rather than multiplicative. Since the arrival process is Poisson and the service time is generally distributed, the Laplace-Stieltjes transform of the waiting time $W_{i,j}$ can be characterized using the Pollaczek-Khinchine formula for M/G/1 queues (Zwart and Boxma, 2000) as follows:
$$\mathbb{E}\left[e^{t_iW_{i,j}}\right] = \frac{(1 - \rho_j)\,t_i\,Z_j(t_i)}{t_i - \Lambda_j\left(Z_j(t_i) - 1\right)}, \qquad (4.20)$$
where $\rho_j = \Lambda_j\mathbb{E}[X_j] = \Lambda_j\frac{d}{dt}Z_j(t)\big|_{t=0}$ and $Z_j(t)$ is the moment generating function of the chunk service time. Plugging (4.20) into (4.19) and substituting into (4.12), we get the following theorem.

Note that the above theorem holds only in the range of $t_i$ for which $t_i - \Lambda_j\left(Z_j(t_i) - 1\right) > 0$. Further, the server utilization $\rho_j$ must be less than 1 for stability of the system.
Corollary 4.6. The mean latency for file i with shifted-exponential service time at each server is bounded by
$$\mathbb{E}[Q_i] \le \frac{1}{t_i}\log\left[\sum_{j=1}^{m} \pi_{i,j}\,\frac{(1 - \rho_j)\,t_i\,Z_j(t_i)}{t_i - \Lambda_j\left(Z_j(t_i) - 1\right)}\right] \qquad (4.24)$$
for any $t_i > 0$, $\rho_j = \Lambda_j\left(\frac{1}{\alpha_j} + \beta_j\right)$, $\rho_j < 1$, $t_i(t_i - \alpha_j + \Lambda_j) + \Lambda_j\alpha_j\left(e^{\beta_jt_i} - 1\right) < 0$, and $Z_j(t) = \frac{\alpha_j}{\alpha_j - t}e^{\beta_jt}$.
Proof.
$$\begin{aligned} \mathbb{E}[Q_i] &= \mathbb{E}_{W_{i,j}}\left[\mathbb{E}_{A_i}\left[\max_{j \in A_i} W_{i,j}\right]\right] \\ &\le \mathbb{E}_{W_{i,j}}\left[\mathbb{E}_{A_i}\left[z_i + \left(\max_{j \in A_i} W_{i,j} - z_i\right)^+\right]\right] \\ &= \mathbb{E}_{W_{i,j}}\left[\mathbb{E}_{A_i}\left[z_i + \max_{j \in A_i}\left[W_{i,j} - z_i\right]^+\right]\right] \\ &\le \mathbb{E}_{W_{i,j}}\left[\mathbb{E}_{A_i}\left[z_i + \sum_{j \in A_i}\left[W_{i,j} - z_i\right]^+\right]\right] \\ &= \mathbb{E}_{W_{i,j}}\left[\mathbb{E}_{A_i}\left[z_i + \frac{1}{2}\sum_{j \in A_i}\left(W_{i,j} - z_i + |W_{i,j} - z_i|\right)\right]\right] \\ &= \mathbb{E}_{W_{i,j}}\left[z_i + \frac{1}{2}\sum_{j}\pi_{i,j}\left(W_{i,j} - z_i + |W_{i,j} - z_i|\right)\right] \end{aligned}$$
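As a numerical illustration (ours; all parameter values are assumptions, not from the text), the bound (4.24) can be evaluated and optimized over the free parameter $t_i$ by a simple grid search:

import math

def Z(t, alpha, beta):
    """Moment generating function of a shifted-exponential service time."""
    return alpha / (alpha - t) * math.exp(beta * t)

def mean_latency_bound(t, pi, Lam, alpha, beta):
    """Upper bound (4.24) on E[Q_i] for a given t > 0; inf if t is infeasible."""
    total = 0.0
    for j in range(len(pi)):
        if t >= alpha[j]:
            return float("inf")
        z = Z(t, alpha[j], beta[j])
        denom = t - Lam[j] * (z - 1.0)
        if denom <= 0:   # requires t - Lambda_j * (Z_j(t) - 1) > 0
            return float("inf")
        rho = Lam[j] * (beta[j] + 1.0 / alpha[j])   # rho_j < 1 assumed
        total += pi[j] * (1.0 - rho) * t * z / denom
    return math.log(total) / t

pi = [0.5, 0.5, 0.0]                  # scheduling probabilities pi_{i,j} (k_i = 1 here)
Lam = [0.4, 0.5, 0.3]                 # aggregate arrival rate Lambda_j at each server
alpha, beta = [2.0, 2.5, 1.8], [0.1, 0.1, 0.1]
best = min(mean_latency_bound(0.05 * s, pi, Lam, alpha, beta) for s in range(1, 40))
print(best)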
Huang et al., 2012b), we derive a tight upper bound on the latency tail
probability using Probabilistic Scheduling as follows (Aggarwal et al.,
2017b; Al-Abbasi et al., 2019a).
$$\begin{aligned} \Pr(Q_i \ge \sigma) &\overset{(d)}{=} \Pr\left(\max_{j \in A_i} W_{i,j} \ge \sigma\right) \\ &= \mathbb{E}_{A_i}\left[\Pr\left(\max_{j \in A_i} W_{i,j} \ge \sigma \,\middle|\, A_i\right)\right] \\ &= \mathbb{E}_{A_i,W_{i,j}}\left[\max_{j \in A_i} \mathbf{1}_{(W_{i,j} \ge \sigma)}\right] \\ &\overset{(e)}{\le} \mathbb{E}_{A_i,W_{i,j}}\left[\sum_{j \in A_i} \mathbf{1}_{(W_{i,j} \ge \sigma)}\right] \\ &= \mathbb{E}_{A_i}\left[\sum_{j \in A_i} \Pr(W_{i,j} \ge \sigma)\right] \\ &\overset{(f)}{=} \sum_{j} \pi_{i,j}\Pr(W_{i,j} \ge \sigma). \end{aligned} \qquad (4.27)\text{–}(4.32)$$
By the Chernoff bound, for any $t_{i,j} > 0$,
$$\Pr(W_{i,j} \ge \sigma) \le \frac{\mathbb{E}\left[e^{t_{i,j}W_{i,j}}\right]}{e^{t_{i,j}\sigma}} \overset{(g)}{=} \frac{1}{e^{t_{i,j}\sigma}}\,\frac{(1 - \rho_j)\,t_{i,j}\,Z_j(t_{i,j})}{t_{i,j} - \Lambda_j\left(Z_j(t_{i,j}) - 1\right)}, \qquad (4.33)$$
where (g) follows from (4.20). Plugging (4.33) into (4.32), we have the following theorem.
Theorem 4.8. Under probabilistic scheduling, the latency tail probability for file i, i.e., $\Pr(Q_i \ge \sigma)$, is bounded by
$$\Pr(Q_i \ge \sigma) \le \sum_{j} \pi_{i,j}\,\frac{1}{e^{t_{i,j}\sigma}}\,\frac{(1 - \rho_j)\,t_{i,j}\,Z_j(t_{i,j})}{t_{i,j} - \Lambda_j\left(Z_j(t_{i,j}) - 1\right)}$$
for any $t_{i,j} > 0$, $\rho_j = \Lambda_j\frac{d}{dt}Z_j(t)\big|_{t=0}$, $\rho_j < 1$, and $\Lambda_j\left(Z_j(t_{i,j}) - 1\right) < t_{i,j}$.
We now specialize the result to the case where the service times of
the servers are given in (4.22) in the following corollary.
As mentioned earlier, each queue is an M/G/1 queue. Let $W_i^{(n)}(t)$ denote the workload of server i's queue at time t, i.e., the total remaining service time of all the tasks in the queue, including the partially served task in service. So the workload of a queue is the waiting time of an incoming task to the queue before the server starts serving it. Let $\mathbf{W}^{(n)}(t) = \left(W_1^{(n)}(t), W_2^{(n)}(t), \ldots, W_n^{(n)}(t)\right)$. Then the workload process, $(\mathbf{W}^{(n)}(t), t \ge 0)$, is Markovian and ergodic. The ergodicity can be proven using the rather standard Foster-Lyapunov criteria (Meyn and Tweedie, 1993), so we omit it here. Therefore, the workload process has a unique stationary distribution, and $\mathbf{W}^{(n)}(t) \Rightarrow \mathbf{W}^{(n)}(\infty)$ as $t \to \infty$.
Let a random variable $T^{(n)}$ represent this steady-state job delay. Specifically, the distribution of $T^{(n)}$ is determined by the workload $\mathbf{W}^{(n)}(\infty)$ in the following way. When a job comes into the system, its tasks are sent to $k^{(n)}$ queues and experience the delays in those queues. Since the queueing processes are symmetric over the indices of the queues, without loss of generality we can assume that the tasks are sent to the first $k^{(n)}$ queues for the purpose of computing the distribution of $T^{(n)}$. The delay of a task is the sum of its waiting time and service time. So the task delay in queue i, denoted by $T_i^{(n)}$, can be written as $T_i^{(n)} = W_i^{(n)}(\infty) + X_i$, with $X_i$ being the service time. Recall that the $X_i$'s are i.i.d. $\sim G$ and independent of everything else. Since the job is completed only when all its tasks are completed,
$$T^{(n)} = \max\left\{T_1^{(n)}, T_2^{(n)}, \ldots, T_{k^{(n)}}^{(n)}\right\}. \qquad (4.36)$$
Let $\hat{T}^{(n)}$ be defined as the job delay given by independent task delays. Specifically, $\hat{T}^{(n)}$ can be expressed as
$$\hat{T}^{(n)} = \max\left\{\hat{T}_1^{(n)}, \hat{T}_2^{(n)}, \ldots, \hat{T}_{k^{(n)}}^{(n)}\right\}, \qquad (4.37)$$
where $\hat{T}_1^{(n)}, \hat{T}_2^{(n)}, \ldots, \hat{T}_{k^{(n)}}^{(n)}$ are i.i.d. and each $\hat{T}_i^{(n)}$ has the same distribution as $T_i^{(n)}$. Again, due to symmetry, all the $T_i^{(n)}$'s have the same distribution. Let F denote the c.d.f. of $T_i^{(n)}$, whose form is known from the queueing theory literature. Then, we have the following explicit form for $\hat{T}^{(n)}$:
$$\Pr\left(\hat{T}^{(n)} \le \tau\right) = \left(F(\tau)\right)^{k^{(n)}}, \quad \tau \ge 0. \qquad (4.38)$$
Remark 4.1. We note that although the authors of (Wang et al., 2019) related their results to the fork-join queue, that connection requires n = k; the results naturally hold for uniform probabilistic scheduling rather than for fork-join queues.
Consequently, the steady-state job delay, T (n) , and the job delay given
by independent task delays as defined in (4.37), T̂ (n) , satisfy
For the special case where the service times are exponentially dis-
tributed, the job delay asymptotics have explicit forms presented in
Corollary 4.11 below.
$$\lim_{n \to \infty} \frac{\mathbb{E}\left[T^{(n)}\right]}{H_{k^{(n)}}/(\mu - \lambda)} = 1. \qquad (4.43)$$
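As a quick sanity check of (4.43) in the exponential case (a sketch of ours): the steady-state sojourn time of an M/M/1 queue is exponential with rate $\mu - \lambda$, so the independent-delay surrogate $\hat{T}^{(n)}$ is a maximum of k i.i.d. such variables, with mean exactly $H_k/(\mu - \lambda)$:

import random

lam, mu, k, N = 0.5, 1.0, 10, 200_000
est = sum(max(random.expovariate(mu - lam) for _ in range(k)) for _ in range(N)) / N
H_k = sum(1.0 / j for j in range(1, k + 1))
print(est, H_k / (mu - lam))   # the two values should nearly coincide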
We omit the proofs of the results in this subsection and refer the reader to (Wang et al., 2019) for the details.
4.5 Proof of Asymptotic Optimality for Heavy Tailed Service Rates

In this section, we quantify the tail index of service latency for arbitrary erasure-coded storage systems with Pareto-distributed file sizes and exponentially distributed service times.
We assume that the arrival of client requests for each file i of size $kL_i$ Mb forms an independent Poisson process with a known rate $\lambda_i$. Further, the chunk size $\widetilde{C}_i$ Mb is assumed to be heavy-tailed, following a Pareto distribution with parameters $(x_m, \alpha)$ and shape parameter $\alpha > 2$ (implying finite mean and variance). Thus, the complementary cumulative distribution function (c.c.d.f.) of the chunk size is given as
$$\Pr\left(\widetilde{C}_i > x\right) = \begin{cases} (x_m/x)^{\alpha}, & x \ge x_m, \\ 0, & x < x_m. \end{cases} \qquad (4.46)$$
For $\alpha > 1$, the mean is $\mathbb{E}[\widetilde{C}_i] = \alpha x_m/(\alpha - 1)$. The service time per Mb at server j, $X_j$, is exponentially distributed with mean $1/\mu_j$. The service time for a chunk of size C Mb is $X_jC$.
We will focus on the tail index of the waiting time to access each file. To understand the tail index, suppose the waiting time $T_W$ for the files has $\Pr(T_W > x)$ of the order of $x^{-d}$ for large x; then the tail index is d. More formally, the tail index is defined as $d = \lim_{x \to \infty} \frac{-\log\Pr(T_W > x)}{\log x}$. This index gives the slope of the tail of the complementary CDF on a log-log scale.
Let $B_j = X_j\widetilde{C}_i$ denote the service time of a chunk at server j. Then,
$$\begin{aligned} \Pr(B_j < y) &= \Pr\left(X_j\widetilde{C}_i < y\right) \\ &= \int_{x_m}^{\infty} \Pr(X_j < y/x)\,\alpha x_m^{\alpha}\frac{1}{x^{\alpha+1}}\,dx \\ &= \int_{x_m}^{\infty} \left(1 - \exp(-\mu_j y/x)\right)\alpha x_m^{\alpha}\frac{1}{x^{\alpha+1}}\,dx \\ &= 1 - \int_{x_m}^{\infty} \exp(-\mu_j y/x)\,\alpha x_m^{\alpha}\frac{1}{x^{\alpha+1}}\,dx. \end{aligned} \qquad (4.47)$$
Substituting $t = \mu_j y/x$, so that $dt = -\mu_j y/x^2\,dx$, we obtain
$$\begin{aligned} \Pr(B_j > y) &= \int_{x_m}^{\infty} \exp(-\mu_j y/x)\,\alpha x_m^{\alpha}\frac{1}{x^{\alpha+1}}\,dx \\ &= \int_{0}^{\mu_j y/x_m} \exp(-t)\,\alpha x_m^{\alpha}\frac{t^{\alpha-1}}{(\mu_j y)^{\alpha}}\,dt \\ &= \alpha\left(\frac{x_m}{\mu_j}\right)^{\alpha}\frac{1}{y^{\alpha}}\int_{0}^{\mu_j y/x_m} e^{-t}\,t^{\alpha-1}\,dt \\ &= \alpha\left(\frac{x_m}{\mu_j}\right)^{\alpha}\gamma(\alpha, \mu_j y/x_m)/y^{\alpha}, \end{aligned} \qquad (4.48)$$
where $\gamma(\cdot,\cdot)$ denotes the lower incomplete gamma function. The M/G/1 waiting time W with such heavy-tailed service times then satisfies, for large x,
$$\Pr(W > x) \approx \frac{\Lambda\,x^{1-\alpha}}{(1-\rho)(\alpha-1)}\,V(x). \qquad (4.49)$$
Thus, we note that the waiting time at a server is heavy-tailed with tail index $\alpha - 1$, and we get the following result.

Theorem 4.13. Assume that requests arrive according to a Poisson process, the service time distribution is exponential, and the chunk-size distribution is Pareto with shape parameter α. Then the tail index for the waiting time of a chunk in the queue of a server is $\alpha - 1$.
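A small Monte Carlo sketch (ours) illustrates Theorem 4.13 by simulating the waiting time through the Lindley recursion and estimating the tail slope empirically (the estimate is rough, since the asymptotic regime is reached slowly):

import math, random

alpha, x_m, mu, lam, N = 3.0, 1.0, 1.0, 0.3, 10**6
random.seed(1)
W, samples = 0.0, []
for _ in range(N):
    chunk = x_m * (1.0 - random.random()) ** (-1.0 / alpha)   # Pareto(x_m, alpha)
    b = random.expovariate(mu) * chunk                         # chunk service time B_j
    a = random.expovariate(lam)                                # interarrival time
    W = max(W + b - a, 0.0)                                    # Lindley recursion
    samples.append(W)
samples.sort()
y1, y2 = samples[int(0.99 * N)], samples[int(0.9999 * N)]
d = -(math.log(1e-4) - math.log(1e-2)) / (math.log(y2) - math.log(y1))
print(d)   # should be near alpha - 1 = 2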
Theorem 4.14. The tail index for a distributed storage system is at most $\alpha - 1$.

The next result shows that probabilistic scheduling achieves the optimal tail index. A finite sum of terms, each with tail index $\alpha - 1$, still has tail index $\alpha - 1$, thus proving that the tail index with probabilistic scheduling is $\alpha - 1$.
4.6 Simulations
We define $\mathbf{q} = (\pi_{i,j}\ \forall i = 1, \cdots, r \text{ and } j = 1, \cdots, m)$ and $\mathbf{t} = (\tilde{t}_1, \tilde{t}_2, \ldots, \tilde{t}_r;\ t_1, t_2, \ldots, t_r)$. Note that the values of the $t_i$'s used for the mean latency and the tail latency probability may be different, and the parameters $\tilde{t}$ and $t$ indicate these parameters for the two cases, respectively. Our goal is to minimize the two proposed QoE metrics over the choice of access decisions and auxiliary bound parameters. The objective can be modeled as a convex combination of the two QoE metrics, since this is a multi-objective optimization.
To incorporate weighted fairness and differentiated services, we assign a positive weight $w_i$ to each QoE metric for file i. Without loss of generality, each file i is weighted by its arrival rate $\lambda_i$ in the objective (so larger arrival rates are weighted higher). However, any other weights can be incorporated to accommodate weighted fairness or differentiated services. Let $\lambda = \sum_i \lambda_i$ be the total arrival rate; then $w_i = \lambda_i/\lambda$ is the fraction of file-i requests. The first objective is the minimization of the mean latency, averaged over all the file requests, and is given as $\sum_i \frac{\lambda_i}{\lambda}\,Q_i$. The second objective is the minimization of the latency tail probability, averaged over all the file requests, and is given as $\sum_i \frac{\lambda_i}{\lambda}\Pr(Q_i \ge \sigma)$.
By using a special case of the expressions for the mean latency and the latency tail probability in Sections 4.2 and 4.3, optimization of a convex combination of the two QoE metrics can be formulated as follows.
$$\min \;\; \sum_{i=1}^{r}\frac{\lambda_i}{\lambda}\left[\theta\,\frac{1}{\tilde{t}_i}\log\left(\sum_{j=1}^{m} q_{i,j}\,\frac{(1-\rho_j)\,\tilde{t}_i\,Z_j(\tilde{t}_i)}{\tilde{t}_i - \Lambda_j\left(Z_j(\tilde{t}_i)-1\right)}\right) + (1-\theta)\sum_{j=1}^{m}\frac{q_{i,j}}{e^{t_i\sigma}}\,\frac{(1-\rho_j)\,t_i\,Z_j(t_i)}{t_i - \Lambda_j\left(Z_j(t_i)-1\right)}\right] \qquad (4.50)$$
subject to
$$Z_j(t_i) = \frac{\alpha_j}{\alpha_j - t_i}e^{\beta_jt_i},\ \forall j, \qquad (4.51)$$
$$\rho_j = \frac{\Lambda_j}{\alpha_j} + \Lambda_j\beta_j < 1,\ \forall j, \qquad (4.52)$$
$$\Lambda_j = \sum_i \lambda_i q_{i,j},\ \forall j, \qquad (4.53)$$
$$\sum_j q_{i,j} = k_i,\ \forall i, \qquad (4.54)$$
$$q_{i,j} = 0,\ j \notin G_i,\ \forall i, j, \qquad (4.55)$$
$$q_{i,j} \in [0, 1],\ \forall i, j, \qquad (4.56)$$
$$\tilde{t}_i > 0,\ \forall i, \qquad (4.57)$$
$$t_i > 0,\ \forall i, \qquad (4.58)$$
$$\tilde{t}_i(\tilde{t}_i - \alpha_j + \Lambda_j) + \Lambda_j\alpha_j\left(e^{\beta_j\tilde{t}_i} - 1\right) < 0, \qquad (4.59)$$
$$t_i(t_i - \alpha_j + \Lambda_j) + \Lambda_j\alpha_j\left(e^{\beta_jt_i} - 1\right) < 0, \qquad (4.60)$$
over the variables $(\mathbf{q}, \mathbf{t})$,
where θ ∈ [0, 1] is a trade-off factor that determines the relative sig-
nificance of mean latency and latency tail probability in the objective
function. By changing θ from θ = 1 to θ = 0, the solution for (4.50)
62 Probabilistic Scheduling Approach
spans the solutions that minimize the mean latency to ones that mini-
mize the tail latency probability. Note that constraint (4.52) gives the
load intensity of server j. Constraint (4.53) gives the aggregate arrival
rate Λj for each node for the given probabilistic scheduling probabili-
ties qi,j and arrival rates λi . Constraints (4.54)-(4.56) guarantee that
the scheduling probabilities are feasible. Also, Constraints (4.57)-(4.60)
ensure that the moment generating function given in (4.20) exists. Note
that the optimization over q helps decrease the overall latency which
gives significant flexibility over choosing the lowest-queue servers for
accessing the files. We further note that the optimization problem in
(4.50) is non-convex as, for instance, Constraint (4.59) is non-convex in
(q, t) jointly. In order to solve the problem, we can use an alternating optimization that divides the problem into two subproblems, each optimizing one variable while fixing the other. To solve each subproblem, we use the iNner cOnVex Approximation (NOVA) algorithm proposed in (Scutari et al., 2017), which guarantees convergence to a stationary point. Based on this, it can be shown that the alternating optimization converges to a stationary point.
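Schematically, the alternating optimization has the following structure (a sketch of ours; solve_q_subproblem and solve_t_subproblem stand in for the inner NOVA convex-approximation solves and are hypothetical, not a real library API):

def alternating_optimization(q0, t0, objective, solve_q_subproblem,
                             solve_t_subproblem, tol=1e-6, max_iters=1000):
    """Alternate between the two subproblems until the objective stops improving."""
    q, t = q0, t0
    prev = objective(q, t)
    for _ in range(max_iters):
        q = solve_q_subproblem(t, q)   # optimize scheduling probabilities q with t fixed
        t = solve_t_subproblem(q, t)   # optimize auxiliary bound parameters t with q fixed
        cur = objective(q, t)
        if prev - cur < tol:           # each step is non-increasing, so this terminates
            break
        prev = cur
    return q, t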
To validate our proposed algorithm for joint mean-tail latency and evaluate its performance, we simulate our algorithm in a distributed storage system of m = 12 distributed nodes and r = 1000 files, all of size 200 MB and using a (7, 4) erasure code. However, our model can be used for any given number of storage servers, any number of files, and any erasure coding setting. We consider a shifted-exponential distribution for the chunk service times, as it has been shown to fit real system measurements on Tahoe and Amazon S3 servers (Baccelli et al., 1989; Chen et al., 2014a; S3, n.d.). The service time parameters $\alpha_j$ and $\beta_j$ are shown in Table 4.1. Unless otherwise explicitly stated, the arrival rate for the first 500 files is 0.002 s⁻¹, while for the next 500 files it is set to 0.003 s⁻¹.
Table 4.1: Storage node parameters used in our simulation (shift βj = 10 ms for all j; rate αj in 1/s).

Node   1      2      3      4      5      6
αj     18.23  24.06  11.88  17.06  20.19  23.91
Node   7      8      9      10     11     12
αj     27.01  21.39  9.92   24.96  26.53  21.80
Figure 4.2: Weighted mean latency for different file arrival rates. We vary the
arrival rate of file i from 0.2 × λi to 1.2 × λi , where λi is the base arrival rate.
[Figure 4.3: weighted latency tail probability versus arrival rate for the PEA approach, the PSP approach, the approach in (Aggarwal et al., 2017b), and the proposed approach.]
Figure 4.3: Weighted latency tail probability for different file arrival rates. We vary
the arrival rate of file i from 0.3 × λi to 2.1 × λi , where λi is the base arrival rate.
mean latency and the latency tail probability is the same? From Figure 4.4, we answer this question negatively: for r = 1000 and m = 12, we find that the optimal mean latency is approximately 43% lower than the mean latency at the value of (q, t) that optimizes the weighted latency tail probability. Hence, an efficient tradeoff point between the two QoE metrics can be chosen based on the point on the curve that is appropriate for the clients.
[Figure 4.4: weighted latency tail probability versus weighted mean latency as θ decreases from $10^{-4}$ to $10^{-6}$.]
Figure 4.4: Tradeoff between weighted mean latency and weighted latency tail probability obtained by varying θ in the objective function given by (4.50). We vary θ (the coefficient of weighted mean latency) from $\theta = 10^{-4}$ to $\theta = 10^{-6}$. These values are chosen carefully to bring the two QoE metrics to a comparable scale, since the weighted mean latency is orders of magnitude higher than the weighted latency tail probability.
5 Delayed-Relaunch Scheduling Approach
Figure 5.1: This figure illustrates two-forking by plotting the different completion times on the real line, with the forked servers $n_0 = 4$, $n_1 = 5$, $n_2 = 3$ at forking points $t_0 = 0$, $t_1 = 2$, $t_2 = 4$. The first task completes at $s_1$.
$$W = \lambda\sum_{i=0}^{1}\sum_{r=0}^{\ell_i-1}(t_{i,r+1} - t_{i,r})\left[\sum_{j=0}^{i}(n_j - \ell_j) + \ell_i - r\right]. \qquad (5.1)$$
Thus, both metrics rely on the inter-service times $t_{i,r} - t_{i,r-1}$, which will be characterized in the next section, followed by the results on the two metrics. The next result provides the mean gap between two successive sub-task completions.
Lemma 5.4. The mean time between two coded sub-task completions in the single-forking scheme, for i.i.d. shifted-exponential coded sub-task completion times in stage 0, is
$$\mathbb{E}\left[t_{0,r} - t_{0,r-1}\right] = \begin{cases} c + \frac{1}{\mu n_0}, & r = 1, \\ \frac{1}{\mu(n_0 - r + 1)}, & r \in \{2, \ldots, \ell_0\}. \end{cases} \qquad (5.7)$$
Proof. Since $t_{0,r}$ is the completion time of the first r coded sub-tasks out of $n_0$ parallel coded sub-tasks, we have $t_{0,r} = c + X_r^{n_0}$. Hence, for each $r \in [\ell_0]$, we have
$$t_{0,r} - t_{0,r-1} = \left(c + X_r^{n_0}\right) - \left(c + X_{r-1}^{n_0}\right). \qquad (5.8)$$
The chunk requests are initiated at time $t_{0,0} = t_0 = 0$, and hence the first chunk is completed at $t_{0,1} - t_{0,0} = c + X_1^{n_0}$.

From Lemma 5.2, we can write the following equality in distribution:
$$t_{0,r} - t_{0,r-1} = \begin{cases} c + \frac{T_1^0}{n_0}, & r = 1, \\ \frac{T_r^0}{n_0 - r + 1}, & r \in \{2, \ldots, \ell_0\}, \end{cases} \qquad (5.9)$$
where $(T_1^0, \ldots, T_{n_0}^0)$ are i.i.d. exponentially distributed random variables with rate µ. Taking expectations on both sides, we get the result.
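A short Monte Carlo sketch (ours) verifies (5.7) by sampling order statistics of shifted-exponential task times:

import random

c, mu, n0, N = 1.0, 0.5, 6, 100_000
gaps = [0.0] * n0
for _ in range(N):
    times = sorted(c + random.expovariate(mu) for _ in range(n0))
    prev = 0.0
    for r, t in enumerate(times):      # accumulate mean inter-completion gaps
        gaps[r] += (t - prev) / N
        prev = t
for r in range(1, n0 + 1):
    pred = c + 1 / (mu * n0) if r == 1 else 1 / (mu * (n0 - r + 1))
    print(r, round(gaps[r - 1], 3), round(pred, 3))   # empirical vs. (5.7)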
From the definition of the task completion times, the possible values of the vector $s = (s_1, \ldots, s_m)$ satisfy the constraint $0 < s_1 < \cdots < s_m < c$. That is, we can write the set of possible values for the vector s as $\mathcal{A}_m$, the set of vectors with increasing coordinates bounded in (0, c):
$$\mathcal{A}_m \triangleq \{s \in \mathbb{R}^m : 0 < s_1 < \cdots < s_m < c\}. \qquad (5.17)$$
This constraint couples the set of achievable values for the vector s, and hence, even though the conditional density has a product form, the random variables $(s_1, \ldots, s_m)$ are not conditionally independent given the event $E_m$.
To compute the conditional expectation $\mathbb{E}[s_r|E_m]$, we find the conditional marginal density of $s_r$ given the event $E_m$. To this end, we integrate the conditional joint density of the vector s over all variables other than $s_r$. In terms of $s_r \in (0, c)$, we can write the region of integration as the following intersection of regions:
$$\mathcal{A}_m^{-r} = \bigcap_{i<r}\{0 < s_i < s_{i+1}\} \;\cap\; \bigcap_{i>r}\{s_{i-1} < s_i < c\}. \qquad (5.18)$$
Using the conditional density of the vector s defined in (5.16) in the above equation, and denoting $\alpha \triangleq 1 - e^{-c\mu}$ and $\alpha_r \triangleq 1 - e^{-\mu s_r}$ for clarity of presentation, we can compute the conditional marginal density function (Ross, 2019)
$$f_{s_r|E_m} = \frac{m\mu(1 - \alpha_r)}{\alpha^m}\binom{m-1}{r-1}(\alpha_r)^{r-1}(\alpha - \alpha_r)^{m-r}. \qquad (5.19)$$
The conditional mean $\mathbb{E}[s_r|E_m] = \int_0^c s_r\,f_{s_r|E_m}\,ds_r$ is obtained by integrating against the conditional marginal density in (5.19) over $s_r \in (0, c)$. For $r \in [m-1]$, the result follows from the integral identity of Remark 5.1 for $x = s_r$, $q = r-1$, $p = m-1$, and $\alpha = 1 - e^{-\mu c}$. Similarly, the result for $r = j - \ell_0$ follows from Corollary 5.3 for $x = s_m$ and $m = j - \ell_0$.
By using the fact $t_{1,r-1} = t_1 + s_{r-1}$, we can write the conditional mean of the second part as $\mathbb{E}\left[t_1 + c - t_{1,r-1} \,\middle|\, E_{j-\ell_0}\right] = c - \mathbb{E}\left[s_{r-1} \,\middle|\, E_{j-\ell_0}\right]$, where $\mathbb{E}\left[s_{r-1} \,\middle|\, E_{j-\ell_0}\right]$ is given by Lemma 5.6. Summing these two parts, we get the conditional expectation for $r = j - \ell_0 + 1$.

For the case when $r \in [j - \ell_0]$, the result follows from Lemma 5.6 and the fact $t_{1,r} = t_1 + s_r$.
Proof. The result follows from Lemma 5.7 and the tower property of nested expectations.
Theorem 5.10. For the single-forking case with n total servers for k sub-tasks and initial number of servers $n_0 < k$, the mean server utilization cost is
$$\mathbb{E}[W] = \lambda nc + \frac{\lambda k}{\mu}, \qquad (5.34)$$
and the mean service completion time is
$$\mathbb{E}[t_2] = c + \mathbb{E}[t_1] + \frac{1}{\mu}\sum_{j=\ell_0}^{n_0} p_{j-\ell_0}\sum_{i=j}^{k-1}\frac{1}{n - i}. \qquad (5.35)$$
For $n_0 \ge k$, the mean service completion time and the mean server utilization cost are given in the following theorem.

and the mean service completion time is given by $\mathbb{E}[t_1]$ and the mean server utilization cost is given by $\mathbb{E}[W_0]$.
We next consider the case $\ell_0 < k$. In this case, the job completion necessarily occurs in stage 1. Thus, we need to compute $\mathbb{E}[t_2 - t_1]$ and $\mathbb{E}[W_1]$ in order to evaluate the mean service completion time $\mathbb{E}[t_2]$ and the mean server utilization cost $\mathbb{E}[W_0 + W_1]$. The duration of stage 1 can be written as a telescopic sum of inter-service times:
$$t_2 - t_1 = \sum_{r=1}^{k-\ell_0-1}(t_{1,r} - t_{1,r-1}). \qquad (5.39)$$
Further, for $\ell_0 < k$, the number of servers that are active in stage 1 after the $(r-1)$th service completion is $n - \ell_0 - r + 1$, and the associated cost incurred in the interval $[t_{1,r-1}, t_{1,r})$ is $\lambda(t_{1,r} - t_{1,r-1})(n - \ell_0 - r + 1)$. Therefore, we can write the server utilization cost in stage 1 as
$$W_1 = \lambda\sum_{r=1}^{k-\ell_0-1}(n - \ell_0 - r + 1)(t_{1,r} - t_{1,r-1}). \qquad (5.40)$$
The result follows by taking the mean of the duration $t_2 - t_1$ and of the server utilization cost $W_1$, using the linearity of expectations, and considering both possible cases.
5.4 Simulations
Figure 5.2: For the setting $n_0 \ge k$, this graph displays the mean service completion time $\mathbb{E}[S]$ as a function of the fork task threshold $\ell_0$ for single forking, with the total number of servers n = 24, the total number of needed coded sub-tasks k = 12, and different numbers of initial servers $n_0 \in \{12, 14, 16, 18, 20\}$. The coded sub-task execution times at the servers are assumed to be i.i.d. shifted exponential with shift c = 1 and rate µ = 0.5.
Figure 5.3: For the setting $n_0 \ge k$, this graph displays the mean server utilization cost $\mathbb{E}[W]$ as a function of the fork task threshold $\ell_0$ for single forking, with the total number of servers n = 24, the total number of needed coded sub-tasks k = 12, and different numbers of initial servers $n_0 \in \{12, 14, 16, 18, 20\}$; a no-forking baseline with $n_0 = 24$ is also shown. The coded sub-task execution times at the servers are assumed to be i.i.d. shifted exponential with shift c = 1 and rate µ = 0.5.
Figure 5.4: For the setting $n_0 \ge k$, we plot the mean server utilization cost $\mathbb{E}[W]$ as a function of the mean service completion time $\mathbb{E}[S]$ by varying the fork task threshold $\ell_0 \in [n_0]$ in single forking. The total number of servers is n = 24 and the total number of needed coded sub-tasks is k = 12. The coded sub-task execution times at the servers are assumed to be i.i.d. shifted exponential with shift c = 1 and rate µ = 0.5. The curve is plotted for different values of initial servers $n_0 \in \{12, 14, 16, 18, 20\}$, along with a no-forking baseline with $n_0 = 24$. For each curve, $\ell_0$ increases from left to right.
help minimizing the mean server utilization cost at the expense of the
mean service completion time.
6 Analyzing Latency for Video Content

In this chapter, we extend the setup to assume that the servers store video content. Rather than downloading the content, the users stream it, which makes the notion of stall duration more important. We explain the system model in Section 6.1. The download and play times of the different segments in a video are characterized in Section 6.2. This is further used to characterize upper bounds on the mean stall duration and the tail stall duration in Sections 6.3 and 6.4, respectively. Sections 6.5 and 6.6 contain simulation results and notes on future directions, respectively.
6.1 Modeling Stall Duration for Video Requests

The encoded chunks are stored on the disks of $n_i$ distinct storage nodes. These storage nodes are represented by a set $S_i$, such that $S_i \subseteq M$ and $n_i = |S_i|$. Each server $z \in S_i$ stores all the chunks $C_{i,j}^{(g_z)}$ for all j and for some $g_z \in \{1, \cdots, n_i\}$. In other words, each of the $n_i$ storage nodes stores one of the coded chunks for the entire duration of the video. The placement on the servers is illustrated in Figure 6.2, where server 1 is shown to store the first coded chunks of file i, the third coded chunks of file u, and the first coded chunks of file v.
The use of an $(n_i, k_i)$ MDS erasure code introduces a redundancy factor of $n_i/k_i$, which allows the video to be reconstructed from the video chunks of any subset of $k_i$-out-of-$n_i$ servers. We note that the erasure code can also help in recovery of the content i as long as $k_i$ of the servers containing file i are available (Dimakis et al., 2010). Note that replication across n servers is equivalent to choosing an (n, 1) erasure code. Hence, when a video i is requested, the request goes to a set $A_i$ of the storage nodes, where $A_i \subseteq S_i$ and $k_i = |A_i|$. From each server $z \in A_i$, all chunks $C_{i,j}^{(g_z)}$ for all j, with $g_z$ corresponding to the chunk index placed on server z, are requested. The request is illustrated in Figure 6.2. In order to play a segment q of video i, $C_{i,q}^{(g_z)}$ should have been downloaded from all $z \in A_i$. We assume that an edge router, which aggregates multiple users, requests the files. Thus, the connections between the servers and the edge router are considered the bottleneck. Since the service provider only has control over this part of the network and the last hop may not be under the control of the provider, the service provider can only guarantee the quality-of-service up to the edge router.
We assume that the files at each server are served in the order requested, following a first-in-first-out (FIFO) policy. Further, the different chunks of a video are processed in playback order. This is depicted in Figure 6.3, where for a server q, when a file i is requested, all of its chunks are placed in the queue behind previously arrived video requests that have not yet been served.

In order to schedule the requests for video file i to the $k_i$ servers, the choice of $k_i$-out-of-$n_i$ servers is important. Finding the optimal
[Figure 6.2: video files i, u, and v are encoded and placed on the storage servers; requests for video i are dispatched by a joint scheduler to $k_i$-out-of-$n_i$ servers.]
$$\sum_{j=1}^{m}\pi_{ij} = k_i\ \ \forall i \quad\text{and}\quad \pi_{ij} = 0 \text{ if } j \notin S_i.$$
In other words, selecting each node j with probability $\pi_{ij}$ would yield a feasible choice of $\{\mathbb{P}(A_i) : \forall i, A_i\}$. Thus, we consider the request probability $\pi_{ij}$ as the probability that the request for video file i uses server j. While probabilistic scheduling has been used to give bounds on the latency of file download, this chapter uses the scheduling to give bounds on the QoE for video streaming.
We note that it may not be ideal in practice for a server to finish one
video request before starting another since that increases delay for the
future requests. However, this can be easily alleviated by considering
that each server has multiple queues (streams) to the edge router
which can all be considered as separate servers. These multiple streams
can allow multiple parallel videos from the server. The probabilistic
scheduling can choose ki of the overall queues to access the content.
This extension can be seen in (Al-Abbasi and Aggarwal, 2018d).
We now describe a queuing model of the distributed storage system. We assume that the arrival of client requests for each video i forms an independent Poisson process with a known rate $\lambda_i$. The arrival of file requests at node j forms a Poisson process with rate $\Lambda_j = \sum_i \lambda_i\pi_{i,j}$, which is the superposition of r Poisson processes, each with rate $\lambda_i\pi_{i,j}$. We assume that the chunk service time for each coded chunk $C_{i,l}^{(g_j)}$
6.2 Modeling Download and Play Times
the video files in queue before the file-i request and the service time of all chunks of video file i up to the $q$th chunk. Let $W_j$ be the random variable corresponding to the waiting time of all the video files in queue before the file-i request, and let $Y_j^{(q)}$ be the (random) service time of coded chunk q for file i from server j. Then, the (random) download time for coded chunk $q \in \{1, \cdots, L_i\}$ for file i at server $j \in A_i$, $D_{i,j}^{(q)}$, is given as
$$D_{i,j}^{(q)} = W_j + \sum_{v=1}^{q} Y_j^{(v)}. \qquad (6.3)$$
since the service time is $ST_{i,j}$ when file i is requested from server j. Let $R_j(s) = \mathbb{E}\left[e^{-sR_j}\right]$ be the Laplace-Stieltjes transform of $R_j$.

Lemma 6.1. The Laplace-Stieltjes transform of $R_j$, $R_j(s) = \mathbb{E}\left[e^{-sR_j}\right]$, is given as
$$R_j(s) = \sum_{i=1}^{r}\frac{\pi_{ij}\lambda_i}{\Lambda_j}\left(\frac{\alpha_j e^{-\beta_js}}{\alpha_j + s}\right)^{L_i}. \qquad (6.6)$$
Proof.
$$\begin{aligned} R_j(s) &= \sum_{i=1}^{r}\frac{\pi_{ij}\lambda_i}{\Lambda_j}\,\mathbb{E}\left[e^{-s\,ST_{i,j}}\right] \\ &= \sum_{i=1}^{r}\frac{\pi_{ij}\lambda_i}{\Lambda_j}\,\mathbb{E}\left[e^{-s\sum_{\nu=1}^{L_i}Y_j^{(\nu)}}\right] \\ &= \sum_{i=1}^{r}\frac{\pi_{ij}\lambda_i}{\Lambda_j}\left(\mathbb{E}\left[e^{-sY_j^{(1)}}\right]\right)^{L_i} \\ &= \sum_{i=1}^{r}\frac{\pi_{ij}\lambda_i}{\Lambda_j}\left(\frac{\alpha_j e^{-\beta_js}}{\alpha_j + s}\right)^{L_i}. \end{aligned} \qquad (6.7)$$
Corollary 6.2. The moment generating function for the service time of video files when requested from server j, $B_j(t)$, is given by
$$B_j(t) = \sum_{i=1}^{r}\frac{\pi_{ij}\lambda_i}{\Lambda_j}\left(\frac{\alpha_j e^{\beta_jt}}{\alpha_j - t}\right)^{L_i}. \qquad (6.8)$$
$$\mathbb{E}\left[e^{-sD_{i,j}^{(q)}}\right] = \frac{(1-\rho_j)\,s}{s - \Lambda_j\left(1 - R_j(s)\right)}\left(\frac{\alpha_j}{\alpha_j + s}e^{-\beta_js}\right)^{q}. \qquad (6.11)$$
Since segment q can be played only after its coded pieces have been downloaded from all the chosen servers, the download time of chunk q is
$$D_i^{(q)} = \max_{j\in A_i} D_{i,j}^{(q)}. \qquad (6.12)$$
The first chunk begins playing after the startup delay $d_s$, once it has been downloaded:
$$T_i^{(1)} = \max\left(d_s,\; D_i^{(1)}\right). \qquad (6.13)$$
Each subsequent chunk z plays when both the previous chunk has finished (after $\tau$ seconds) and chunk z has been downloaded, i.e., $T_i^{(z)} = \max\left(T_i^{(z-1)} + \tau,\; D_i^{(z)}\right)$. Unrolling this recursion,
$$\begin{aligned}
T_i^{(L_i)} &= \max\left(T_i^{(L_i-1)} + \tau,\; D_i^{(L_i)}\right)\\
&= \max\left(T_i^{(L_i-2)} + 2\tau,\; D_i^{(L_i-1)} + \tau,\; D_i^{(L_i)}\right)\\
&= \max\Big(d_s + (L_i - 1)\tau,\; \max_{z=2}^{L_i+1}\left(D_i^{(z-1)} + (L_i - z + 1)\tau\right)\Big), \qquad (6.15)
\end{aligned}$$
which can be written compactly as
$$T_i^{(L_i)} = \max_{z=1}^{L_i+1}\;\max_{j\in A_i}\; p_{i,j,z}, \qquad (6.16)$$
where
$$p_{i,j,z} = \begin{cases} d_s + (L_i - 1)\tau, & z = 1\\[2pt] D_{i,j}^{(z-1)} + (L_i - z + 1)\tau, & 2 \le z \le L_i + 1. \end{cases} \qquad (6.17)$$
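The recursion behind (6.13)-(6.16) is easy to simulate; the following sketch (our illustration) computes the stall duration of a single request from given chunk download times, using $\Gamma^{(i)} = T_i^{(L_i)} - d_s - (L_i - 1)\tau$ as in (6.21) below.

```python
def stall_duration(D, d_s, tau):
    """Stall duration of one video request via the play-time recursion.

    D: download completion times D_i^{(q)} of chunks q = 1..L_i (already
       maximized over the k_i chosen servers, as in (6.12));
    d_s: startup delay; tau: playback duration of one chunk.
    """
    L = len(D)
    T = max(d_s, D[0])              # (6.13): chunk 1 plays at max(d_s, D^{(1)})
    for q in range(1, L):
        T = max(T + tau, D[q])      # chunk q+1 plays when the previous chunk
                                    # ends AND chunk q+1 has been downloaded
    return T - d_s - (L - 1) * tau  # total stall = T^{(L_i)} - d_s - (L_i-1)*tau
```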
In the next two sections, we will use this stall time to determine the
bounds on the mean stall duration and the stall duration tail probability.
6.3 Characterization of Mean Stall Duration

In this section, we provide a bound on the first QoE metric, the mean stall duration for a file i. We obtain the bound through probabilistic scheduling; since probabilistic scheduling is one feasible strategy, the obtained bound is an upper bound for the optimal strategy.
Using (6.20), the expected stall time for file i is given as
$$\mathbb{E}\left[\Gamma^{(i)}\right] = \mathbb{E}\left[T_i^{(L_i)} - d_s - (L_i - 1)\tau\right] = \mathbb{E}\left[T_i^{(L_i)}\right] - d_s - (L_i - 1)\tau \qquad (6.21)$$
By Jensen's inequality, for any $t_i > 0$,
$$e^{t_i\,\mathbb{E}\left[T_i^{(L_i)}\right]} \le \mathbb{E}\left[e^{t_i T_i^{(L_i)}}\right]. \qquad (6.22)$$
$$\begin{aligned}
\mathbb{E}\left[e^{t_i T_i^{(L_i)}}\right] &\overset{(a)}{=} \mathbb{E}\left[\max_z \max_{j\in A_i} e^{t_i p_{ijz}}\right]\\
&= \mathbb{E}_{A_i}\left[\mathbb{E}\left[\max_z \max_{j\in A_i} e^{t_i p_{ijz}} \,\middle|\, A_i\right]\right]\\
&\overset{(b)}{\le} \mathbb{E}_{A_i}\left[\sum_{j\in A_i}\mathbb{E}\left[\max_z e^{t_i p_{ijz}}\right]\right]\\
&= \mathbb{E}_{A_i}\left[\sum_j F_{ij}\,\mathbf{1}_{\{j\in A_i\}}\right]\\
&= \sum_j F_{ij}\,\mathbb{E}_{A_i}\left[\mathbf{1}_{\{j\in A_i\}}\right]\\
&= \sum_j F_{ij}\,P\left(j\in A_i\right)\\
&\overset{(c)}{=} \sum_j F_{ij}\,\pi_{ij} \qquad (6.23)
\end{aligned}$$
where (a) follows from (6.16), (b) follows by upper bounding $\max_{j\in A_i}$ by $\sum_{j\in A_i}$, (c) follows from probabilistic scheduling, where $P(j\in A_i) = \pi_{ij}$, and $F_{ij} = \mathbb{E}\left[\max_z e^{t_i p_{ijz}}\right]$. We note that the only inequality here is the replacement of the maximum by the sum. Since this term appears inside a logarithm in the mean stall latency, the gap between the term and its bound becomes additive rather than multiplicative.
Substituting (6.23) in (6.22), we have
$$\mathbb{E}\left[T_i^{(L_i)}\right] \le \frac{1}{t_i}\log\left(\sum_{j=1}^{m}\pi_{ij} F_{ij}\right). \qquad (6.24)$$
Let $H_{ij} = \sum_{\ell=1}^{L_i} e^{-t_i\left(d_s + (\ell-1)\tau\right)} Z_{i,j}^{(\ell)}(t_i)$, where $Z_{i,j}^{(\ell)}(t)$ is defined in equation (6.19). We note that $H_{ij}$ can be simplified using the geometric series formula as follows.
Lemma 6.4.
$$H_{ij} = \frac{e^{-t_i(d_s-\tau)}\left(1-\rho_j\right)t_i}{t_i - \Lambda_j\left(B_j(t_i)-1\right)}\;\widetilde{M}_j(t_i)\,\frac{1 - \widetilde{M}_j(t_i)^{L_i}}{1-\widetilde{M}_j(t_i)}, \qquad (6.25)$$
where $\widetilde{M}_j(t_i) = M_j(t_i)e^{-t_i\tau}$, $M_j(t_i)$ is given in (6.2), and $B_j(t_i)$ is given in (6.8).
Proof.
$$\begin{aligned}
H_{ij} &= \sum_{\ell=1}^{L_i} e^{-t_i(d_s+(\ell-1)\tau)}\,\frac{(1-\rho_j)\,t_i}{t_i - \Lambda_j\left(B_j(t_i)-1\right)}\left(\frac{\alpha_j e^{t_i\beta_j}}{\alpha_j - t_i}\right)^{\ell}\\
&= \frac{e^{-t_i d_s}(1-\rho_j)\,t_i}{t_i - \Lambda_j\left(B_j(t_i)-1\right)}\sum_{\ell=1}^{L_i} e^{-t_i(\ell-1)\tau}\left(\frac{\alpha_j e^{t_i\beta_j}}{\alpha_j - t_i}\right)^{\ell}\\
&= \frac{e^{-t_i(d_s-\tau)}(1-\rho_j)\,t_i}{t_i - \Lambda_j\left(B_j(t_i)-1\right)}\sum_{\ell=1}^{L_i}\left(e^{-t_i\tau}\,\frac{\alpha_j e^{t_i\beta_j}}{\alpha_j - t_i}\right)^{\ell}\\
&= \frac{e^{-t_i(d_s-\tau)}(1-\rho_j)\,t_i}{t_i - \Lambda_j\left(B_j(t_i)-1\right)}\left(M_j(t_i)e^{-t_i\tau}\,\frac{1-\left(M_j(t_i)\right)^{L_i}e^{-t_i L_i\tau}}{1-M_j(t_i)e^{-t_i\tau}}\right)\\
&= \frac{e^{-t_i(d_s-\tau)}(1-\rho_j)\,t_i}{t_i - \Lambda_j\left(B_j(t_i)-1\right)}\;\widetilde{M}_j(t_i)\,\frac{1-\widetilde{M}_j(t_i)^{L_i}}{1-\widetilde{M}_j(t_i)} \qquad (6.26)
\end{aligned}$$
Theorem 6.5. The mean stall duration for file i is bounded by
$$\mathbb{E}\left[\Gamma^{(i)}\right] \le \frac{1}{t_i}\log\left(\sum_{j=1}^{m}\pi_{ij}\left(1 + H_{ij}\right)\right) \qquad (6.27)$$
for any $t_i > 0$ satisfying $\rho_j = \sum_i \pi_{ij}\lambda_i L_i\left(\beta_j + \frac{1}{\alpha_j}\right) < 1$ and
$$\sum_{f=1}^{r}\pi_{fj}\lambda_f\left(\frac{\alpha_j e^{\beta_j t_i}}{\alpha_j - t_i}\right)^{L_f} - \left(\Lambda_j + t_i\right) < 0, \quad \forall j.$$
Proof. The proof first bounds $F_{ij}$ as in (6.28), where (d) follows by bounding the maximum by the sum, (e) follows from (6.18), and (f) follows by substituting $\ell = z - 1$.
Further, substituting the bounds (6.28) and (6.24) in (6.21), the mean stall duration is bounded as follows.
$$\begin{aligned}
\mathbb{E}\left[\Gamma^{(i)}\right] &\le \frac{1}{t_i}\log\left[\sum_{j=1}^{m}\pi_{ij}\left(e^{t_i(d_s+(L_i-1)\tau)} + \sum_{\ell=1}^{L_i} e^{t_i(L_i-\ell)\tau}\, Z_{i,j}^{(\ell)}(t_i)\right)\right] - \left(d_s + (L_i-1)\tau\right)\\
&= \frac{1}{t_i}\log\left[\sum_{j=1}^{m}\pi_{ij}\left(e^{t_i(d_s+(L_i-1)\tau)} + \sum_{\ell=1}^{L_i} e^{t_i(L_i-\ell)\tau}\, Z_{i,j}^{(\ell)}(t_i)\right)\right] - \frac{1}{t_i}\log e^{t_i(d_s+(L_i-1)\tau)}\\
&= \frac{1}{t_i}\log\sum_{j=1}^{m}\pi_{ij}\left(1 + \sum_{\ell=1}^{L_i} e^{-t_i(d_s+(\ell-1)\tau)}\, Z_{i,j}^{(\ell)}(t_i)\right) \qquad (6.29)
\end{aligned}$$
Note that Theorem 6.5 above holds only in the range of $t_i$ where $t_i - \Lambda_j\left(B_j(t_i)-1\right) > 0$, which reduces to $\sum_{f=1}^{r}\pi_{fj}\lambda_f\left(\frac{\alpha_j e^{\beta_j t_i}}{\alpha_j - t_i}\right)^{L_f} - \left(\Lambda_j + t_i\right) < 0$, $\forall i, j$, and $\alpha_j - t_i > 0$. Further, the server utilization $\rho_j$ must be less than 1 for stability of the system.
We note that for the scenario where the files are downloaded rather than streamed, a metric of interest is the mean download time. This is a special case of our approach in which the number of segments of each video is one, i.e., $L_i = 1$. Thus, the mean download time of a file follows as a special case of Theorem 6.5. This special case was discussed in detail in Section 4.2.
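For a numerical evaluation of Theorem 6.5, the bound (6.27) can be computed directly from the system parameters; a sketch (our code; it assumes $t_i$ already satisfies the stated validity conditions and performs no feasibility checking) is:

```python
import numpy as np

def mean_stall_bound(t_i, lam, L, pi, alpha, beta, tau, d_s, i):
    """Evaluate the bound (6.27) on E[Gamma^(i)] for video i at a given t_i > 0.

    lam, L: per-file arrival rates lambda_i and chunk counts L_i (length r);
    pi[i, j] = pi_{ij}; alpha, beta: per-server service parameters.
    """
    Lambda = (lam[:, None] * pi).sum(axis=0)                    # Lambda_j
    rho = ((lam * L)[:, None] * pi).sum(axis=0) * (beta + 1.0 / alpha)
    total = 0.0
    for j in range(len(alpha)):
        M = alpha[j] * np.exp(beta[j] * t_i) / (alpha[j] - t_i)  # M_j(t_i)
        Mt = M * np.exp(-t_i * tau)                              # M~_j(t_i)
        B = ((pi[:, j] * lam / Lambda[j]) * M ** L).sum()        # B_j(t_i), (6.8)
        H = (np.exp(-t_i * (d_s - tau)) * (1.0 - rho[j]) * t_i
             / (t_i - Lambda[j] * (B - 1.0))
             * Mt * (1.0 - Mt ** L[i]) / (1.0 - Mt))             # H_ij, (6.25)
        total += pi[i, j] * (1.0 + H)
    return np.log(total) / t_i
```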
6.4 Stall Duration Tail Probability

The stall duration tail probability for video file i is the probability that the stall duration $\Gamma^{(i)}$ exceeds a given threshold x. Using (6.20),
$$\Pr\left(\Gamma^{(i)} \ge x\right) \overset{(a)}{=} \Pr\left(T_i^{(L_i)} \ge x + d_s + (L_i - 1)\tau\right) = \Pr\left(T_i^{(L_i)} \ge \bar{x}\right), \qquad (6.30)$$
where $\bar{x} = x + d_s + (L_i - 1)\tau$. Then,
$$\begin{aligned}
\Pr\left(T_i^{(L_i)} \ge \bar{x}\right) &\overset{(b)}{=} \Pr\left(\max_z \max_{j\in A_i} p_{ijz} \ge \bar{x}\right)\\
&\overset{(c)}{=} \mathbb{E}_{A_i, p_{ijz}}\left[\max_{j\in A_i}\mathbf{1}_{\{\max_z p_{ijz} \ge \bar{x}\}}\right]\\
&\overset{(d)}{\le} \mathbb{E}_{A_i, p_{ijz}}\left[\sum_{j\in A_i}\mathbf{1}_{\{\max_z p_{ijz} \ge \bar{x}\}}\right]\\
&\overset{(e)}{=} \sum_j \pi_{ij}\,\mathbb{E}_{p_{ijz}}\left[\mathbf{1}_{\{\max_z p_{ijz} \ge \bar{x}\}}\right]\\
&= \sum_j \pi_{ij}\,\Pr\left(\max_z p_{ijz} \ge \bar{x}\right) \qquad (6.32)
\end{aligned}$$
where (b) follows from (6.16), (c) follows since the maximum over z and the maximum over $j \in A_i$ are over discrete indices that do not depend on each other, so they can be exchanged, (d) follows by replacing the maximum over $A_i$ by the sum, and (e) follows from probabilistic scheduling. Using the Markov inequality, we get
$$\Pr\left(\max_z p_{ijz} \ge \bar{x}\right) \le \frac{\mathbb{E}\left[e^{t_i \max_z p_{ijz}}\right]}{e^{t_i \bar{x}}} = \frac{\mathbb{E}\left[\max_z e^{t_i p_{ijz}}\right]}{e^{t_i \bar{x}}} \overset{(f)}{=} \frac{F_{ij}}{e^{t_i \bar{x}}} \qquad (6.34)$$
where (f) follows from (6.28). Substituting (6.34) in (6.32), we get the
stall duration tail probability as described in the following theorem.
Theorem 6.6. The stall duration tail probability for video file i is bounded by
$$\Pr\left(\Gamma^{(i)} \ge x\right) \le \sum_j \frac{\pi_{ij}}{e^{t_i x}}\left(1 + e^{-t_i\left(d_s+(L_i-1)\tau\right)} H_{ij}\right) \qquad (6.35)$$
for any $t_i > 0$ satisfying $\rho_j = \sum_i \pi_{ij}\lambda_i L_i\left(\beta_j + \frac{1}{\alpha_j}\right) < 1$ and $\sum_{f=1}^{r}\pi_{fj}\lambda_f\left(\frac{\alpha_j e^{\beta_j t_i}}{\alpha_j - t_i}\right)^{L_f} - \left(\Lambda_j + t_i\right) < 0$, $\forall i, j$, where $H_{ij}$ is given by (6.25).
Proof.
$$\begin{aligned}
\Pr\left(T_i^{(L_i)} \ge \bar{x}\right) &\le \sum_j \pi_{ij}\,\Pr\left(\max_z p_{ijz} \ge \bar{x}\right)\\
&\le \sum_j \pi_{ij}\,\frac{F_{ij}}{e^{t_i \bar{x}}}\\
&\overset{(g)}{\le} \sum_j \pi_{ij}\,\frac{e^{t_i(d_s+(L_i-1)\tau)} + H_{ij}}{e^{t_i \bar{x}}}\\
&= \sum_j \pi_{ij}\,\frac{e^{t_i(d_s+(L_i-1)\tau)} + H_{ij}}{e^{t_i\left(x+d_s+(L_i-1)\tau\right)}}\\
&= \sum_j \frac{\pi_{ij}}{e^{t_i x}}\left(1 + e^{-t_i\left(d_s+(L_i-1)\tau\right)} H_{ij}\right) \qquad (6.36)
\end{aligned}$$
We note that for the scenario where the files are downloaded rather than streamed, a metric of interest is the latency tail probability, i.e., the probability that the file download latency exceeds x. This is a special case of our approach in which the number of segments of each video is one, i.e., $L_i = 1$. Thus, the latency tail probability of a file follows as a special case of Theorem 6.6; in this special case, the result reduces to that in (Aggarwal et al., 2017b).
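Theorem 6.6 can be evaluated numerically in the same way as Theorem 6.5; a sketch (our code) mirroring the earlier one:

```python
import numpy as np

def stall_tail_bound(x, t_i, lam, L, pi, alpha, beta, tau, d_s, i):
    """Evaluate the bound (6.35) on Pr(Gamma^(i) >= x), with H_ij from (6.25).
    Assumes t_i satisfies the validity conditions stated with Theorem 6.6."""
    Lambda = (lam[:, None] * pi).sum(axis=0)
    rho = ((lam * L)[:, None] * pi).sum(axis=0) * (beta + 1.0 / alpha)
    bound = 0.0
    for j in range(len(alpha)):
        M = alpha[j] * np.exp(beta[j] * t_i) / (alpha[j] - t_i)
        Mt = M * np.exp(-t_i * tau)
        B = ((pi[:, j] * lam / Lambda[j]) * M ** L).sum()
        H = (np.exp(-t_i * (d_s - tau)) * (1.0 - rho[j]) * t_i
             / (t_i - Lambda[j] * (B - 1.0))
             * Mt * (1.0 - Mt ** L[i]) / (1.0 - Mt))
        bound += (pi[i, j] / np.exp(t_i * x)
                  * (1.0 + np.exp(-t_i * (d_s + (L[i] - 1) * tau)) * H))
    return bound
```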
6.5 Simulations
Using the expressions for the mean stall duration and the stall duration tail probability in Sections 6.3 and 6.4, respectively, an optimization of a convex combination of the two QoE metrics can be formulated as follows.
$$\min \; \sum_i \frac{\lambda_i}{\overline{\lambda}}\left[\theta\,\frac{1}{\tilde{t}_i}\log\left(\sum_{j=1}^{m}\pi_{ij}\left(1+\widetilde{H}_{ij}\right)\right) + (1-\theta)\sum_j \frac{\pi_{ij}}{e^{t_i x}}\left(1 + e^{-t_i\left(d_s+(L_i-1)\tau\right)} H_{ij}\right)\right] \qquad (6.37)$$
where $\overline{\lambda} = \sum_i \lambda_i$ and $\widetilde{H}_{ij}$ denotes $H_{ij}$ evaluated at the auxiliary variable $\tilde{t}_i$,
$$Q_{ij} = \widetilde{M}_j(t_i)\,\frac{1 - \widetilde{M}_j(t_i)^{L_i}}{1-\widetilde{M}_j(t_i)}, \qquad (6.41)$$
$$\widetilde{M}_j(\tilde{t}_i) = \frac{\alpha_j e^{(\beta_j-\tau)\tilde{t}_i}}{\alpha_j - \tilde{t}_i}, \qquad (6.42)$$
$$B_j(\tilde{t}_i) = \sum_{f=1}^{r}\frac{\lambda_f\pi_{fj}}{\Lambda_j}\left(\frac{\alpha_j e^{\beta_j \tilde{t}_i}}{\alpha_j - \tilde{t}_i}\right)^{L_f}, \qquad (6.43)$$
$$\widetilde{M}_j(t_i) = \frac{\alpha_j e^{(\beta_j-\tau)t_i}}{\alpha_j - t_i}, \qquad (6.44)$$
$$B_j(t_i) = \sum_{f=1}^{r}\frac{\lambda_f\pi_{fj}}{\Lambda_j}\left(\frac{\alpha_j e^{\beta_j t_i}}{\alpha_j - t_i}\right)^{L_f}, \qquad (6.45)$$
$$\rho_j = \sum_{f=1}^{r}\pi_{fj}\lambda_f L_f\left(\beta_j + \frac{1}{\alpha_j}\right) < 1 \quad \forall j, \qquad (6.46)$$
$$\Lambda_j = \sum_{f=1}^{r}\lambda_f\pi_{f,j} \quad \forall j, \qquad (6.47)$$
$$\sum_{j=1}^{m}\pi_{i,j} = k_i \quad \forall i, \qquad (6.48)$$
$$\pi_{i,j} = 0 \ \text{ if } \ j\notin S_i, \quad \pi_{i,j}\in[0,1], \qquad (6.49)$$
$$|S_i| = n_i \quad \forall i, \qquad (6.50)$$
$$0 < \tilde{t}_i < \alpha_j \quad \forall j, \qquad (6.51)$$
$$0 < t_i < \alpha_j \quad \forall j, \qquad (6.52)$$
$$\alpha_j\left(e^{(\beta_j-\tau)\tilde{t}_i} - 1\right) + \tilde{t}_i < 0 \quad \forall j, \qquad (6.53)$$
$$\alpha_j\left(e^{(\beta_j-\tau)t_i} - 1\right) + t_i < 0 \quad \forall j, \qquad (6.54)$$
$$\sum_{f=1}^{r}\pi_{fj}\lambda_f\left(\frac{\alpha_j e^{\beta_j \tilde{t}_i}}{\alpha_j - \tilde{t}_i}\right)^{L_f} - \left(\Lambda_j + \tilde{t}_i\right) < 0 \quad \forall i, j, \qquad (6.55)$$
$$\sum_{f=1}^{r}\pi_{fj}\lambda_f\left(\frac{\alpha_j e^{\beta_j t_i}}{\alpha_j - t_i}\right)^{L_f} - \left(\Lambda_j + t_i\right) < 0 \quad \forall i, j, \qquad (6.56)$$
$$\text{var.} \quad \pi, \mathbf{t}, S \qquad (6.57)$$
Here, θ ∈ [0, 1] is a trade-off factor that determines the relative significance of the mean and the tail probability of the stall durations in the minimization problem. Varying θ from 0 to 1, the solution of (6.37) spans the solutions that minimize the stall duration tail probability to those that minimize the mean stall duration. Note that constraint (6.46) gives the load intensity of server j, and constraint (6.47) gives the aggregate arrival rate Λj at each node for the given probabilistic scheduling probabilities πij and arrival rates λi. Constraints (6.49)-(6.50) guarantee that the scheduling probabilities are feasible. Constraints (6.51)-(6.54) ensure that M̃j(t) exists for each t̃i and ti. Finally, constraints (6.55)-(6.56) ensure that the moment generating function given in (6.19) exists.
We note that the optimization over π helps decrease the objective function and gives significant flexibility in preferring lightly loaded servers when accessing files. The placement of the video files S helps separate highly accessed files across different servers, thus reducing the objective. Finally, the optimization over the auxiliary variables t yields a tighter bound on the objective function. We note that the QoE for file i is weighted by the arrival rate λi in the formulation; however, general weights can easily be incorporated for weighted fairness or differentiated services.
Note that the proposed optimization problem is a mixed-integer non-convex optimization, since we have the placement over n servers and constraints (6.55) and (6.56) are non-convex in (π, t). The problem can be solved using the optimization algorithm described in (Al-Abbasi and Aggarwal, 2018d), which in part uses the NOVA algorithm proposed in (Scutari et al., 2017).
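As a simple illustration of the alternating structure (our sketch, not the NOVA-based algorithm of the cited reference), one can alternate a grid refinement of the auxiliary variables t with a projected numerical-gradient step on π for a fixed placement S, assuming a callable `objective(pi, t)` that evaluates (6.37) and returns infinity when any of (6.46)-(6.56) fails:

```python
import numpy as np

def project_rows(pi, k):
    """Crude feasibility repair (not an exact Euclidean projection):
    clip to [0, 1], then rescale each row to sum to k_i."""
    pi = np.clip(pi, 0.0, 1.0)
    return pi * (k[:, None] / pi.sum(axis=1, keepdims=True))

def alternating_opt(objective, pi0, t0, k, iters=100, lr=1e-3, eps=1e-6):
    """Alternate between t and pi for a fixed placement S."""
    pi, t = pi0.copy(), t0.copy()
    for _ in range(iters):
        # coordinate-wise grid refinement of each auxiliary variable t_i
        for i in range(len(t)):
            cand = t[i] * np.array([0.5, 0.9, 1.0, 1.1, 1.5])
            vals = []
            for c in cand:
                t_try = t.copy(); t_try[i] = c
                vals.append(objective(pi, t_try))
            t[i] = cand[int(np.argmin(vals))]
        # numerical-gradient descent step on pi, then repair feasibility
        grad = np.zeros_like(pi)
        base = objective(pi, t)
        for idx in np.ndindex(pi.shape):
            pi_try = pi.copy(); pi_try[idx] += eps
            grad[idx] = (objective(pi_try, t) - base) / eps
        pi = project_rows(pi - lr * grad, k)
    return pi, t
```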
We simulate our algorithm in a distributed storage system of m = 12 distributed nodes, where each video file uses a (10, 4) erasure code. The parameters for the storage servers are chosen as in Table 4.1, which were used in (Xiang et al., 2016) in experiments on the Tahoe testbed. Further, a (10, 4) erasure code is used in HDFS-RAID at Facebook (al., 2010) and Microsoft (Huang et al., 2012a). Unless otherwise explicitly stated, we consider r = 1000 files, whose sizes are generated from a Pareto distribution (Arnold, 2015) with shape factor 2 and scale 300. We note that the Pareto distribution is considered since it has been widely used in the existing literature (Ramaswami et al., 2014) to model video files and file-size distributions over networks. We also assume that the chunk service time follows a shifted-exponential distribution with rate αj and shift βj, whose values are shown in Table 4.1 and are generated at random and kept fixed for the experiments.
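For reproducibility, the file-length generation can be sketched as follows (our code; NumPy's `pareto` draws the Lomax form, so the classical Pareto with scale 300 is obtained by shifting):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 4.0                                  # chunk (segment) duration in seconds
# Pareto(shape=2, scale=300) video lengths, as in the setup above
lengths = 300.0 * (1.0 + rng.pareto(2.0, size=1000))
lengths = np.ceil(lengths / tau) * tau     # round up to a multiple of 4 s
L = (lengths / tau).astype(int)            # L_i: number of chunks per video
```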
Unless explicitly stated, the arrival rate for the first 500 files is 0.002 s−1, while that for the next 500 files is 0.003 s−1. The chunk duration τ is set to 4 seconds, and the generated video file lengths are rounded up to a multiple of 4 seconds. We note that a high-load scenario is considered for the numerical results. In order to initialize our algorithm, we use a random placement of files on all the servers. Further, we set πij = ki/ni on the chosen servers, with ti = 0.01 ∀i and j ∈ Si. However, these choices of πij and ti may not be feasible; thus, we modify the initialization of π to the closest (in norm) feasible solution given the above values of S and t. We compare the proposed approach with the baselines RP-OA, OP-PSP, RP-PSP, OP-PEA, and RP-PEA, whose results are shown in Figures 6.4 and 6.5.
[Figure 6.4: Mean stall duration for different video arrival rates with different video lengths, comparing the proposed algorithm with RP-OA, OP-PSP, RP-PSP, OP-PEA, and RP-PEA; average stall time (sec) vs. arrival rate (×10⁻³).]
Here, the mean stall duration and stall duration tail probability are optimized over the auxiliary variables t and the placement S.
[Figure 6.5: Stall duration tail probability for different values of x (in seconds), comparing the proposed algorithm with RP-OA, OP-PSP, RP-PSP, OP-PEA, and RP-PEA.]
[Figure 6.6: Tradeoff between mean stall duration (sec) and stall duration tail probability, obtained by varying θ from 10⁻⁴ to 10⁻⁶.]
7 Lessons from prototype implementation
[Figure 7.1: Our Tahoe testbed with average ping (RTT) and bandwidth measurements among three data centers in New Jersey, Texas, and California; e.g., 194 Mbps bandwidth and 73.5 ms RTT on one inter-datacenter path.]
[Figure 7.3: Comparison of joint latency and cost minimization with some oblivious approaches. Algorithm JLCM minimizes latency-plus-cost over 3 dimensions: load-balancing (LB), chunk placement (CP), and erasure code (EC), while any optimization over a subset of the dimensions is suboptimal.]
Remark 2: Latency and storage cost tradeoff. The use of an (ni, ki) MDS erasure code allows the content to be reconstructed from any subset of ki-out-of-ni chunks, while it also introduces a redundancy factor of ni/ki. To model storage cost, we assume that each storage node j ∈ M charges a constant cost Vj per chunk. Since ki is determined by the content size and the choice of chunk size, we need to choose an appropriate ni that not only introduces sufficient redundancy to improve chunk availability, but also achieves a cost-effective solution. We consider RTT plus expected queuing delay and transfer delay as a measure of latency. To find the optimal parameters for scheduling,
[Figure 7.4: Actual service latency distribution (empirical CDF) for 1000 files of size 150 MB using erasure codes (12, 6), (10, 7), (10, 6), and (8, 4) for each quarter, with aggregate request arrival rate set to λ = 0.118/s.]
We set the arrival rates for the two classes to the same value to observe the latency distribution under different coding strategies. We retrieve the 1000 files at the designated request arrival rate and plot the CDF of download latency for each file in Figure 7.4. We note that 95% of download requests for files with erasure code (10, 7) complete within 100 s, while the same percentage of requests for files using the (12, 6) erasure code complete within 32 s due to the higher level of redundancy. In this experiment, erasure code (12, 6) outperforms (8, 4) in latency even though they have the same level of redundancy, because the latter has a larger chunk size when file sizes are set to be the same.
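For instance, for a 150 MB file, the (12, 6) code produces chunks of 150/6 = 25 MB while the (8, 4) code produces chunks of 150/4 = 37.5 MB; both have redundancy factor 2 (300 MB stored in total), but each chunk request under (8, 4) occupies a server roughly 50% longer.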
Remark 4: Latency and file size tradeoff. Increasing the file size clearly generates a higher load on the storage system, thus resulting in higher latency. To illustrate this tradeoff, we vary the file sizes in the experiment from (30, 20) MB to (150, 100) MB and plot the download latency of individual files 1, 2, and 3, the average latency, and the analytical latency upper bound (Xiang et al., 2016) in Figure 7.5. We see that latency increases super-linearly as file size grows, since larger files generate a higher load on the storage system, causing larger queuing latency (which is super-linear according to our analysis). Further, smaller files always have lower latency because it is less costly to achieve higher redundancy for these files. We also observe that the analytical latency bound in (Xiang et al., 2016) tightly follows
[Figure 7.5: Download latency (sec) vs. file size (50-200 MB) for individual files, the average, and the analytical upper bound.]
the actual service latency. In one case, the service latency exceeds the analytical bound by 0.5 seconds. This is because the theoretical bound, which quantifies network and queuing delay, does not take into account the Tahoe protocol overhead; this overhead is nonetheless small compared to the network and queuing delay.
[Figure 7.6: Evaluation of different request arrival rates (0.1-0.125 /sec), showing average latency (sec) and storage cost per file (US dollars). As arrival rates increase, latency increases and becomes more dominant in the latency-plus-cost objective than storage cost.]
[Figure 7.7: Visualization of the latency and cost tradeoff for varying θ = 0.5 second/dollar to θ = 100 second/dollar; latency (sec) vs. average storage cost per user (US dollars). As θ increases, higher weight is placed on the storage cost component of the latency-plus-cost objective, leading to fewer file chunks and higher latency.]
However, caching with erasure codes has not been well studied, and the current results for caching systems cannot automatically be carried over to caches in erasure-coded storage systems. First, using an (n, k) maximum-distance-separable (MDS) erasure code, a file is encoded into n chunks and can be recovered from any subset of k distinct chunks. Thus, file access latency in such a system is determined by the delay to access file chunks on the hot storage nodes with the slowest performance. Significant latency reduction can be achieved by caching a few hot chunks (and thereby alleviating system performance bottlenecks), whereas caching additional chunks brings only diminishing benefits. Second, caching the most popular data chunks is often optimal for replicated systems because the cache-miss rate and the resulting network load are proportional to each other. However, this may not be true for erasure-coded storage, where cached chunks need not be identical to the transferred chunks. More precisely, a function of the data chunks can be computed and cached, so that the constructed new chunks, together with the existing chunks, still satisfy the property of being an MDS code. There have been caching schemes that cache entire files (Nadgowda et al., 2014; Chang et al., 2008; Zhu et al., 2004), while we can cache partial files for an erasure-coded system (proposed in practice for replicated storage systems in (Naik et al., 2015)), which gives extra flexibility; the evaluation results depict the advantage of caching partial files.
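A toy example (our construction, over a prime field rather than the GF(2^w) arithmetic of production codes) illustrates why such functional chunks work: with a Reed-Solomon code, every chunk is an evaluation of the same degree-(k−1) polynomial, so a cached evaluation at a fresh point is a brand-new chunk that remains MDS-compatible with the stored ones.

```python
p = 2_147_483_647  # a Mersenne prime; our toy symbol field GF(p)

def rs_encode(data, points):
    """Evaluate the polynomial with coefficients `data` (k symbols) at `points`."""
    return [sum(c * pow(x, e, p) for e, c in enumerate(data)) % p for x in points]

def rs_decode(chunks, points, k):
    """Recover the k data symbols by solving the Vandermonde system
    (naive Gaussian elimination mod p; fine for a toy example)."""
    A = [[pow(x, e, p) for e in range(k)] + [y]
         for x, y in zip(points[:k], chunks[:k])]
    for col in range(k):
        piv = next(r for r in range(col, k) if A[r][col])
        A[col], A[piv] = A[piv], A[col]
        inv = pow(A[col][col], p - 2, p)          # modular inverse of the pivot
        A[col] = [v * inv % p for v in A[col]]
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(v - f * w) % p for v, w in zip(A[r], A[col])]
    return [A[r][k] for r in range(k)]

data = [12345, 67890]                    # k = 2 data symbols of one file
storage = rs_encode(data, [1, 2, 3, 4])  # a (4, 2) code on four storage nodes
cached = rs_encode(data, [5])[0]         # functional cache: a *new* coded chunk
# the cached chunk plus any single stored chunk recovers the file:
assert rs_decode([cached, storage[0]], [5, 1], 2) == data
```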
[Figure 7.9: Comparison of the average latency of functional caching and Tahoe's native storage system without caching, with varying average arrival rates (0.0149-0.0456 /sec) for r = 1000 files of 200 MB, where the cache size is fixed at 2500.]
References
Naik, M., F. Schmuck, and R. Tewari. 2015. “Read and write requests
to partially cached files”. US Patent 9,098,413. url: http://www.
google.com/patents/US9098413.
Nelson, R. and A. N. Tantawi. 1988. "Approximate analysis of fork/join synchronization in parallel queues". IEEE Transactions on Computers. 37(6): 739–743.
Olvera-Cravioto, M., J. Blanchet, and P. Glynn. 2011. “On the transition
from heavy traffic to heavy tails for the M/G/1 queue: the regularly
varying case”. The Annals of Applied Probability. 21(2): 645–668.
Ovsiannikov, M., S. Rus, D. Reeves, P. Sutter, S. Rao, and J. Kelly. 2013.
“The quantcast file system”. Proceedings of the VLDB Endowment.
6(11): 1092–1101.
Oza, N. and N. Gohil. 2016. “Implementation of cloud based live stream-
ing for surveillance”. In: Communication and Signal Processing
(ICCSP), 2016 International Conference on. IEEE. 0996–0998.
Paganini, F., A. Tang, A. Ferragut, and L. Andrew. 2012. "Network Stability Under Alpha Fair Bandwidth Allocation With General File Size Distribution". IEEE Transactions on Automatic Control. 57(3): 579–591. issn: 0018-9286. doi: 10.1109/TAC.2011.2160013.
Papadatos, N. 1995. “Maximum variance of order statistics”. Annals of
the Institute of Statistical Mathematics. 47(1): 185–193.
Papailiopoulos, D. S., A. G. Dimakis, and V. R. Cadambe. 2013. “Repair
optimal erasure codes through hadamard designs”. IEEE Transac-
tions on Information Theory. 59(5): 3021–3037.
Parag, P., A. Bura, and J.-F. Chamberland. 2017. “Latency analysis for
distributed storage”. In: IEEE INFOCOM 2017-IEEE Conference
on Computer Communications. IEEE. 1–9.
Pedarsani, R., M. A. Maddah-Ali, and U. Niesen. 2014. “Online coded
caching”. In: IEEE International Conference on Communications,
ICC 2014, Sydney, Australia, June 10-14, 2014. 1878–1883. doi:
10.1109/ICC.2014.6883597.
Pedarsani, R., M. A. Maddah-Ali, and U. Niesen. 2015. “Online coded
caching”. IEEE/ACM Transactions on Networking. 24(2): 836–845.
Plank, J. S., J. Luo, C. D. Schuman, L. Xu, Z. Wilcox-O'Hearn, et al. 2009. "A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries for Storage." In: FAST. Vol. 9. 253–265.
Xiang, Y., V. Aggarwal, Y.-F. Chen, and T. Lan. 2015a. “Taming La-
tency in Data Center Networking with Erasure Coded Files”. In:
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM
International Symposium on. 241–250. doi: 10.1109/CCGrid.2015.
142.
Xiang, Y., T. Lan, V. Aggarwal, and Y.-F. Chen. 2015b. “Multi-tenant
Latency Optimization in Erasure-Coded Storage with Differentiated
Services”. In: Distributed Computing Systems (ICDCS), 2015 IEEE
35th International Conference on. 790–791. doi: 10.1109/ICDCS.
2015.111.
Xiang, Y., T. Lan, V. Aggarwal, and Y. F. R. Chen. 2014. "Joint Latency and Cost Optimization for Erasure-coded Data Center Storage". SIGMETRICS Perform. Eval. Rev. 42(2): 3–14. issn: 0163-5999. doi: 10.1145/2667522.2667524. url: http://doi.acm.org/10.1145/2667522.2667524.
Xiang, Y., T. Lan, V. Aggarwal, and Y.-F. R. Chen. 2016. "Joint latency and cost optimization for erasure-coded data center storage". IEEE/ACM Transactions on Networking (TON). 24(4): 2443–2457.
Yadwadkar, N. J. and W. Choi. 2012. "Proactive straggler avoidance using machine learning". White paper, University of California, Berkeley.
Zaharia, M., A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica. 2008. "Improving MapReduce performance in heterogeneous environments." In: OSDI. Vol. 8. No. 4. 7.
Zhou, T. and C. Tian. 2020. “Fast erasure coding for data storage: a com-
prehensive study of the acceleration techniques”. ACM Transactions
on Storage (TOS). 16(1): 1–24.
Zhu, Q., A. Shankar, and Y. Zhou. 2004. "PB-LRU: A Self-tuning Power Aware Storage Cache Replacement Algorithm for Conserving Disk Energy". In: Proceedings of the 18th Annual International Conference on Supercomputing. ICS '04. Saint-Malo, France: ACM. 79–88. isbn: 1-58113-839-3. doi: 10.1145/1006209.1006221.
Zwart, A. and O. J. Boxma. 2000. "Sojourn time asymptotics in the M/G/1 processor sharing queue". Queueing Systems. 35(1-4): 141–166.