ESDB - Processing Extremely Skewed Workloads in Real-Time
ABSTRACT
With the rapid growth of cloud computing, efficient management of multi-tenant databases has become a vital challenge for cloud service providers. It is particularly important for Alibaba, which hosts a distributed multi-tenant database supporting one of the world's largest e-commerce platforms. It serves tens of millions [...]
[...] to add customized attributes (e.g., sizes and materials of clothes, weights of food) to the commodities, which leads to diverse data schemas (i.e., different transaction logs have different attributes). Since it is unrealistic to add columns for all customized attributes to the table, we build an "attributes" column where all customized attributes are concatenated together as a string. In practice, over 1500 sub-attributes have been added to the transaction database. Before the adoption of ESDB, querying the "attributes" column was highly inefficient, since it is non-trivial to build a suitable index for such a non-standard string column.

In practice, sellers often need to perform full-text search queries that contain non-standard strings or keywords. For example, bookstore sellers search the transaction status of book transactions by keywords in the auction titles. Although MySQL provides fuzzy search (e.g., LIKE, REGEXP) as well as full-text search using full-text indexing, the support for these functions is limited and the performance is unstable, especially with the rapid growth of the transaction logs over the years. Consequently, this drove us to shift our backbone from MySQL to document-oriented databases.

Skewed and unpredictable workloads. Another challenge is that the workload distribution in the sellers' database is extremely skewed because of the tremendous variation in the number of transactions conducted by different sellers. The variation is further magnified at the kickoff of major sale and promotion events, during which the overall throughput increases dramatically. For example, Figure 1 shows the normalized throughput of the top 1000 sellers in the first 10 seconds of the Single's Day Global Shopping Festival, 2021. In the figure, the normalized throughput roughly follows a power law curve where the aggregated throughput of the top 10 sellers comprises 14.14% of the total throughput. The unpredictability of the throughput distribution adds another layer of complexity: the throughput of different sellers fluctuates substantially over time, depending on multiple factors such as the availability of merchandise promotions, the readiness of stock preparation, etc. In practice, the popularity of sellers can change significantly in a short time period, and the peaks of top sellers are difficult to forecast.

Before the adoption of ESDB, the transaction logs of each seller were uniquely assigned to a MySQL instance based on the seller ID. In practice, when a top seller's promotion starts and brings in voluminous transactions, the write throughput can easily overwhelm the instance's write capability. On the other hand, most instances (for ordinary sellers) are deeply under-utilized. The consequences of such skewed workloads are wasted resources on idle instances, failed real-time queries on hotspot instances, and an overall performance degradation. For example, the write delays (i.e., a metric that evaluates how long ESDB takes to complete a write of a transaction log into the seller's database) of large tenants could rise to over 100 minutes in early years, in which case the sellers would lose the capability of adjusting their sale strategy based on the market's response. A straightforward multi-tenant solution is to allow multiple sellers to share a database instance, e.g., by routing sellers' transaction logs to instances through consistent hashing [42, 49]. However, the distribution of transaction logs is still skewed due to the inherent imbalance of transactions performed by different sellers. Therefore, we need a workload-adaptive load balancing mechanism, especially with the high fluctuation and unpredictability in the workloads.

Contributions. In this paper, we introduce ESDB, a cloud-native multi-tenant database which features processing extremely skewed workloads. ESDB is built upon Elasticsearch [17] and inherits its core characteristics, such as full-text search, distributed indexing and querying, and the use of double hashing [5] as its request routing policy. The novelties of ESDB lie in a new load balancing technique that enables workload-adaptive routing of multi-tenant workloads, and in optimizations that overcome the shortcomings of Elasticsearch as a database, such as high RT for multi-column queries and the high cost of index computation. More concretely, we make the following contributions in this paper:
• We present the architecture of ESDB, a cloud-native document-oriented database which has been running on Alibaba Cloud for 5 years as the main transaction database behind Alibaba's e-commerce platform. ESDB provides support for elastic writing under extremely skewed workloads, as well as highly efficient ad-hoc queries and real-time analysis.
• We introduce dynamic secondary hashing as the solution to the performance degradation caused by the high skewness of the workloads. This mechanism enables real-time workload balancing to achieve balanced write throughput across instances. At the same time, unlike double hashing, which requires expensive distributed reads to fetch results from virtually all instances, it avoids the heavy burden induced by distributed queries.
• We further introduce several optimizations, such as the adoption of a query optimizer, physical replication, and frequency-based indexing. These optimizations enable ESDB to incorporate features of a document-oriented database (e.g., high scalability and strong support for full-text retrieval) while providing low RTs for ad-hoc queries.
• Finally, we evaluate ESDB both in a laboratory environment with simulated workloads and in a production environment with real-world workloads. The experimental results show that ESDB is able to balance write workloads at different skewness scales and achieves high query throughput with low latencies. Our results show that the deployment of ESDB succeeds in reducing write delays even at the spike of the Single's Day Global Shopping Festival.

The remainder of this paper is organized as follows. In Section 2, we introduce as background document-oriented databases and the general concept of load balancing. Section 3 presents the architectural overview of ESDB. We then introduce the design of dynamic secondary hashing in Section 4, and the optimizations in Section 5. Section 6 presents the evaluation of ESDB. Finally, we discuss related work in Section 7 and conclude in Section 8.

2 BACKGROUND AND MOTIVATION
In this section, we introduce as background document-oriented databases and a particular example, Elasticsearch, which is used as the basis of ESDB. We then discuss the general concept of load balancing, as well as the motivation for dynamic secondary hashing.

2.1 Document-oriented Database
A document-oriented database is a subclass of NoSQL database where the data is stored in the form of documents. Unlike a relational database, a document-oriented database does not have a predefined data format and thus supports flexible schemas. Compared to other NoSQL [...]
Figure 3: Architecture of ESDB.

[...] processes workloads independently. In an ESDB cluster, shards and replicas (each shard has one replica) are randomly allocated to different nodes. Nodes play different roles: each node works as both a coordinator (on the control layer) and a worker (on the execution layer); each cluster further elects one node as the master node.
Figure 3 shows an architectural overview of ESDB. In addition to the three layers of workload processing, this figure also includes ESDB's key features that enable real-time workload balancing and read/write optimizations. In the rest of this section, we introduce these components in more detail.

3.1 Application layer
The Application layer consists of ESDB's write clients and query clients. We briefly introduce these two kinds of clients:
Write clients. Accompanied by ESDB's load balancer, ESDB's write clients use the following techniques to accelerate writing and alleviate hotspots. (1) One-hop routing. The original write clients of Elasticsearch are transport clients which are unaware of workloads' destinations [12]. These transport clients route workloads to coordinators in a round-robin fashion, which leads to two-hop routing (write client → coordinator → worker). In ESDB, we allow the transport clients to be aware of the routing policies. In this way, we achieve one-hop routing (write client → worker) and thus accelerate writing. (2) Hotspot isolation. In write clients, all workloads are temporarily buffered in a queue before they are routed to their corresponding workers batch by batch. Once a worker is overloaded (possibly caused by both skewed write workloads and slow read queries), the queue will be blocked and the write delay will rise. In order to solve this problem, ESDB implements hotspot isolation, which isolates workloads of hotspots to a separate queue, such that they will not negatively affect other workloads. (3) Workload batching. When a write client detects that a row (identified by its row ID) will be frequently modified in a short period of time, it will batch-execute the workloads by aggregating these modifications and only materializing the eventual state of this row. By adopting workload batching, the write clients avoid repeated writes to the same row and thus improve write throughput.
Query clients. ESDB inherits the RESTful API and the query language ES-DSL from Elasticsearch. However, ES-DSL is less user-friendly compared to SQL, and it does not support all expressions and data types of SQL (e.g., the type conversion expression date_format). We therefore need a tool to rewrite SQL queries into ES-DSL queries. In order to solve this problem, we develop a plugin, Xdriver4ES, as a bridge between SQL and ES-DSL.
Xdriver4ES adopts a smart translator which generates cost-effective ES-DSL queries from SQL queries. Unlike SQL, ES-DSL encodes query ASTs directly, which are then parsed to generate execution plans. Therefore, instead of building ASTs from SQL queries, Xdriver4ES adopts the following two optimization techniques: (1) CNF/DNF conversion. Considering queries as boolean formulas, Xdriver4ES converts them into CNF/DNFs in order to reduce the depth of ASTs. (2) Predicate merge. Xdriver4ES merges predicates that involve the same column in order to reduce the width of ASTs (e.g., merging tenant_id=1 OR tenant_id=2 into tenant_id IN (1,2)). Xdriver4ES further utilizes a mapping module which converts the query results into a format that a SQL engine understands. For example, we implement in this module built-in functions of SQL, such as data type conversion and IFNULL. In this way, Xdriver4ES allows the execution of SQL queries on ESDB.

3.2 Control layer
The control layer consists of a master node, coordinator nodes and a monitor which collects metrics for workload balancing. In an ESDB cluster, the master node works as the manager of the whole cluster. It is responsible for cluster management tasks such as managing metadata, shard allocation and rebalancing, and tracking and monitoring nodes. The coordinator nodes are mainly responsible for routing read/write workloads to the corresponding worker nodes [8]. Specifically, during the query execution phase, coordinators first collect row IDs of the selected rows from all involved shards, and then fetch the corresponding raw data (indicated by the retrieved row ID list). Therefore, coordinators contain a query result aggregator that is in charge of row ID collection and performs aggregation operations (e.g., global sort, sum, avg).
Load balancer. ESDB's load balancer is developed to generate and commit routing rules which instruct the read/write routers. Since it is enabled by dynamic secondary hashing, the routing rules are also called secondary hashing rules. Once the monitor detects hotspots, the coordinators will generate new routing rules that essentially split and extend the shards that are hosting the current hotspots. By exploiting new routing rules, ESDB distributes workloads of large tenants to more shards and thus alleviates the over-burdened nodes. Once it receives new routing rules from coordinators, the master node is then responsible for committing these rules. In this way, the load balancer ensures the consistency of workload routing in the whole cluster. We present more design details of the load balancer in Section 4.
Frequency-based indexing. In order to mitigate the cost of maintaining indices of the sub-attributes of the "attributes" column (see Section 2.1), we adopt frequency-based indexing (i.e., we build indices only for the most frequently queried sub-attributes). This idea is derived from the observation that sub-attributes' read/write frequencies are skewed. For example, some generic sub-attributes, such as "activity" that indicates what e-commerce activity the seller [...]
[...] improvement in query latency with the cost of a slightly increased storage overhead. We demonstrate the effectiveness of frequency-based indexing in Section 6.3.3.

3.3 Execution layer
The execution layer consists of worker nodes which maintain shards and replicas on their local SSDs and execute read/write workloads. The worker nodes have a shared-nothing architecture where each worker maintains its own storage, independent of other workers.
Write execution and data replication. When executing write workloads, ESDB inherits the feature of near real-time search [7] from Elasticsearch. Raw data and indices are temporarily written into an in-memory buffer before they are periodically refreshed to segment files and become available to search. In order to address persistence and durability, ESDB adopts the Translog [11], which persists on disk. Every write workload is added to the Translog once it is successfully submitted. In this way, data that has not been flushed [3] to disk can be safely recovered from Translogs if ESDB encounters a process crash or hardware failure. Moreover, segment merge [9] is another important mechanism, which merges smaller segments into a large segment. It costs computation resources but effectively improves query efficiency.
Elasticsearch adopts a logical replication scheme [4]: a primary shard forwards a write workload to all its replicas once this workload has been locally executed. In other words, the same write workload is executed separately by each replica, which causes n-fold computation overhead to the cluster (n is the number of replicas). In ESDB, write workloads are still forwarded and added to Translogs on replicas in real time, but are never executed by replicas. Instead, ESDB implements physical replication of segment files. We describe the details of physical replication in Section 5.2.
Query optimizer. For query workloads, ESDB builds optimized query plans using a rule-based optimizer. Instead of using an index scan for all columns, ESDB provides more operations such as composite index scan and sequential scan. Through exploiting combinations of different operations, ESDB significantly reduces query latencies. Details of the optimizations are presented in Section 5.

4 LOAD BALANCING
The load balancer of ESDB is designed to meet the following two requirements: (1) Query efficiency. Data of multiple tenants should be placed on as few shards as possible in order to avoid query executions across too many shards. (2) Load balancing. The distribution of workloads across multiple shards should be as uniform as possible in order to avoid overloading a single shard. In practice, we make a trade-off between these two contradictory requirements by limiting data of small tenants to a single shard and distributing data of large tenants across multiple shards (see Figure 2 (c)).

4.1 Dynamic Secondary Hashing
The key intuition of dynamic secondary hashing is to adopt a workload-adaptive offset function L(k1) in the secondary hashing step of double hashing. Compared to Equation 1, the fixed maximum offset s is replaced with L(k1):

p = (h1(k1) + h2(k2) mod L(k1)) mod N    (2)

where 1 ≤ L(k1) ≤ N, L(k1) ∈ ℤ, and L(k1) depends on the real-time workloads of tenant k1.

Figure 4: Comparison of workload routing across 64 shards between hashing, double hashing and dynamic secondary hashing. Colored nodes represent the shards to which workloads of attribute k1 are routed.

Figure 4 depicts three different routing policies: hashing, double hashing and dynamic secondary hashing. Hashing can only route a workload based on the one-level hash of the partition key (e.g., tenant ID) and thus falls short when it needs to balance multi-tenant workloads. Ordinary double hashing is able to route workloads of a tenant to a fixed set of consecutive shards based on two-level hashing of two keys (i.e., tenant ID and record ID), but fails to manage dynamic workloads. Dynamic secondary hashing is inspired by double hashing and also routes workloads to consecutive shards. However, it is capable of extending to successive shards when the workloads increase (e.g., L(k1) increases from 8 to 16 in Figure 4).

In practice, the maximum offset in the secondary hashing, s = L(k1), relies on two metrics: (1) Current storage. We assume that tenants with a larger storage proportion are more likely to have large forthcoming workloads. Therefore, we select larger s for tenants [...]

Algorithm 1 ESDB load balancer
1: K ← collect the set of tenants
2: P ← collect the set of shards
3: R ← ∅    ⊲ initialize secondary hashing rule list
4: t_init ← select effective time
5: S(K) ← collect current storage of each k
6: for each k in K do
7:     r ← S(k) / Σ_{k∈K} S(k)
8:     s ← ComputeOffsetSize(r)
9:     R ← UpdateRuleList(R, t_init, s, k)
10: end for
11: while Service on do
12:     t_update ← select effective time
13:     T(K) ← collect periodic write throughput of each k
14:     for each k in K do
15:         r ← T(k) / Σ_{k∈K} T(k)
16:         if CheckHotSpot(r) then
17:             s ← ComputeOffsetSize(r)
18:             R ← UpdateRuleList(R, t_update, s, k)
19:         end if
20:     end for
21: end while
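To make the routing rule concrete, the following Python sketch implements Equation (2) together with a toy version of Algorithm 1's hotspot check. It is illustrative only: the hash functions, the 5% hotspot threshold and the doubling policy are assumptions for the example, not ESDB's actual implementation.

```python
import hashlib

NUM_SHARDS = 64  # N, matching the example in Figure 4

def _hash(tag: str, value: str) -> int:
    # Stand-in for the two hash functions h1/h2 (assumed, not ESDB's real ones).
    return int(hashlib.md5((tag + value).encode()).hexdigest(), 16)

# Secondary hashing rule list: tenant -> current offset L(k1).
# Small tenants default to L(k1) = 1, i.e., a single shard.
offsets: dict[str, int] = {}

def route(tenant_id: str, record_id: str) -> int:
    """Equation (2): p = (h1(k1) + h2(k2) mod L(k1)) mod N."""
    L = offsets.get(tenant_id, 1)
    return (_hash("h1", tenant_id) + (_hash("h2", record_id) % L)) % NUM_SHARDS

def rebalance(throughput: dict[str, float], hotspot_ratio: float = 0.05) -> None:
    """Toy analogue of Algorithm 1's update loop: grow L(k1) for hot tenants."""
    total = sum(throughput.values()) or 1.0
    for tenant, tps in throughput.items():
        if tps / total > hotspot_ratio:              # CheckHotSpot(r)
            L = offsets.get(tenant, 1)
            offsets[tenant] = min(NUM_SHARDS, 2 * L)  # e.g., extend 8 -> 16 shards
```

Under this scheme a tenant's records land on L(k1) consecutive shards starting at h1(k1) mod N, so enlarging L(k1) extends the tenant to the successive shards as sketched in Figure 4; the effective-time mechanism discussed later in Section 4 is what keeps routing consistent across such rule changes.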
[...] all executed records). If this condition is satisfied, the participant blocks all workloads whose creation times are later than the effective time and replies to the master node with an acceptance message. Otherwise, the participant reports an error to the master node. If the master node receives any error message or detects any timeout (a participant does not respond within T/2), this secondary hashing rule is aborted. Otherwise, the commit phase begins.
Commit Phase. The master node broadcasts a commit message as well as the secondary hashing rule to all participating coordinators. Since all nodes have reached consensus in the last phase, they will accept the commit message and add the rule to their local secondary hashing rule lists. Once the commit phase is complete, the nodes remove the block on workload execution (i.e., they continue to process workloads with creation times greater than the effective time).
Choice of time interval. The time interval T provides a buffering period for the system to reach consensus on the secondary hashing rules. T should be much larger than the roundtrip latency of broadcasting (e.g., 100ms) and the maximum of the local clock deviations across the cluster (no more than 1s in ESDB) to ensure strict consistency. At the same time, T should be shorter than our expected time of load balancing (e.g., 60s) for effectiveness. As long as T is larger than the time for the cluster to reach consensus, ESDB's consensus protocol achieves non-blocking workload processing.
Fault tolerance. Although ESDB's consensus protocol ensures strict consistency of R, it still suffers from network partitions and node failures during the commit phase. ESDB adopts an automatic fault-tolerance solution. However, it relies on the detection of network partitions and node failures; that is, it needs to differentiate temporarily unresponsive nodes from failed nodes. A typical solution uses a pre-defined timeout; in ESDB, we manually verify a raised alarm (a node becoming unresponsive) to definitively decide whether a node failure or a network partition has occurred.

5 OPTIMIZATION
ESDB focuses mainly on multi-column SFW (SELECT-FROM-WHERE) queries on a single table, where multiple predicates are connected by AND and OR operators. Before the deployment of our optimizations, ESDB's query performance was decent when queries involved only a few columns. However, we observed more than 10x overhead from ESDB's multi-column queries compared to that of a MySQL cluster. (Queries from sellers usually involve more than 10 columns.) After a performance analysis, we identified that the suboptimal performance mainly results from Lucene's rigid query plans. To address this problem, we introduce a query optimizer (see Section 5.1).

5.1 Query Optimizer
As an example, consider a query involving four columns (shown in Figure 6). The execution plan generated by Lucene is shown in Figure 7. First, Lucene generates posting lists, which record the row IDs of the selected rows, for each column by searching the corresponding indices. Then it aggregates the posting lists through intersections and unions. This query plan introduces large overhead since the posting lists are generated sequentially. It becomes even more time-consuming when the selectivity of a column is high and the posting list grows prohibitively large. In order to solve this problem, we let ESDB incorporate features from relational databases: composite indices, sequential scan and a rule-based optimizer (RBO) which produces cost-effective query plans.

Figure 7: Example query plan of Elasticsearch. A, B, C, D, E and F represent posting lists generated by the corresponding operations (e.g., index searches on predicates such as tenant_id = 10086, created_time between '2021-09-16 00:00:00' and '2021-09-17 00:00:00', and status = 1).

Composite index. In relational databases, a composite index is built on multiple columns which are concatenated in a specific order. Taking advantage of composite indices usually avoids time-consuming operations, such as table scans, and thus accelerates queries that involve multiple columns [24]. Meanwhile, composite indices have limited applicability, as the columns must comply with the leftmost sequence (i.e., the leftmost principle). For example, if the composite index is built on two columns column1_column2, we can query about either (column1) or (column1, column2); on the other hand, queries about (column2) cannot leverage this composite index. In order to increase the availability of composite indices, DBAs are expected to manually build composite indices among a massive number of column combinations [27].
Elasticsearch uses the Bkd-tree [55] to index numeric and multi-dimensional (e.g., geo-information) data. The Bkd-tree is an index structure which combines the kd-tree [57] and the B+ tree. Unlike the B+ tree, the Bkd-tree enables division of the search space along different dimensions. Therefore, it is not necessary to follow the leftmost principle, which makes the composite index more flexible. Furthermore, it optimizes disk I/O and significantly reduces the overheads of insertion and deletion by dynamically maintaining a forest of kd-trees [55]. However, the Bkd-tree suffers from the curse of dimensionality, where search performance degrades as the number of dimensions grows [15]. In ESDB, we build concatenated columns and one-dimensional Bkd-trees on these columns as the composite indices. Although such a design has less flexibility, ESDB's composite index search is fast and is able to cover most query workloads in practical application scenarios.
Another challenge of composite indices is the growing key size when we concatenate columns, which makes operations like key comparisons expensive. In order to solve this problem, we take advantage of common prefixes: since the concatenated keys are sorted in the composite index, the leaf nodes in the Bkd-tree usually contain keys that share a common prefix. By leveraging the common prefixes, we manage to increase storage efficiency and reduce the cost of key comparisons.
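To illustrate the concatenated-key composite index and the common-prefix idea described above, here is a minimal Python sketch. It is not ESDB's on-disk format; the separator, the leaf layout and the example values are assumptions for illustration.

```python
import os

SEP = "\x00"  # assumed separator between concatenated column values

def composite_key(*column_values: str) -> str:
    # e.g., a (tenant_id, created_time) composite key
    return SEP.join(column_values)

def compress_leaf(sorted_keys: list[str]) -> tuple[str, list[str]]:
    """Store a sorted leaf block as (shared prefix, suffixes) to shrink long keys."""
    prefix = os.path.commonprefix(sorted_keys) if sorted_keys else ""
    return prefix, [k[len(prefix):] for k in sorted_keys]

def leaf_contains(leaf: tuple[str, list[str]], key: str) -> bool:
    # The shared prefix is compared once; per-key comparisons touch only the suffixes.
    prefix, suffixes = leaf
    return key.startswith(prefix) and key[len(prefix):] in suffixes

keys = sorted(composite_key("10086", ts) for ts in
              ("2021-09-16 00:10:00", "2021-09-16 08:30:00", "2021-09-16 09:00:00"))
leaf = compress_leaf(keys)
assert leaf_contains(leaf, composite_key("10086", "2021-09-16 08:30:00"))
```

Because all keys of a single tenant share the tenant-ID prefix, a leaf holding that tenant's entries stores the prefix once, which is the storage and comparison saving the paragraph above refers to.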
[...] larger visibility delay of Segment B. In order to solve this problem, we further introduce pre-replication of merged segments. When a merged segment is generated, the primary shard immediately starts to replicate it to the replicas. This pre-replication runs independently alongside the quick incremental replications. In this way, merged segments never appear in the segment diff and thus have limited influence on the replication of refreshed segments.

6 EVALUATION
In this section, we present the evaluation results of ESDB to demonstrate its capability of processing skewed write workloads while retaining high throughput and low latency for distributed queries.

6.1 Experimental Setup
All experiments are performed on a cluster consisting of 11 ECS virtual machines (ecs.c7.2xlarge) on Alibaba Cloud. Each virtual machine contains 8 vCPUs, 16GB memory and a 1TB SSD disk. We use three machines to simulate ESDB's write and query clients and the remaining eight machines as the worker nodes of an ESDB cluster.
In order to simulate real-time processing of Alibaba's e-commerce transaction logs, we build a benchmark which generates random workloads based on the template of our transaction logs and collects metrics of the ESDB cluster in real time. During the evaluation, the simulated workloads are routed to 512 shards located on the eight worker nodes. The simulated workloads contain columns for transaction ID (an auto-increment unique key), tenant ID and creation time, which are essential for ESDB's workload balancer.
In order to simulate different levels of skewness, we let the workload generators sample tenant IDs from a Zipf distribution tunable by a skewness factor θ. The sampling size of tenant k is set to be proportional to (1/k)^θ. We select 5 different θs: 0, 0.5, 1, 1.5 and 2. When θ = 0, the Zipf distribution is effectively reduced to a uniform distribution. When θ = 1, the simulated workloads are the closest to real workloads. Simulations with θ = 1.5 and θ = 2 rarely happen in our production environment, but serve to evaluate the performance of ESDB in the case of extreme skewness.

Figure 10: Comparisons of three routing policies when θ = 1. Figures (a) and (b) respectively present write throughput and average delay with different generating rates.

Figure 11: Write throughput and average delay of three routing policies with different skewness factors θ.

6.2 Balanced Write
6.2.1 Write Throughput and Delay. In the first set of our experiments, we measure the cluster throughput and the write delay to evaluate the performance of three different routing policies:
• Hashing, the baseline policy without any workload balancing;
• Double hashing, another baseline policy that distributes the data of each tenant to 8 shards;
• Dynamic secondary hashing, the routing policy used by ESDB's load balancer.
Figure 10 presents the cluster throughput and write delays when θ = 1 with different data generating rates. In Figure 10 (a), we observe that the throughput of hashing reaches its limit at around 90K TPS while the other two do not stop until they reach 140K. This is mainly because hashing fails to balance the skewed workloads and thus wastes resources that could have been used to handle workloads targeted at hotspots. On the contrary, dynamic secondary hashing manages to balance the skewed workloads and therefore has performance close to that of double hashing, which is the optimal option since its data is uniformly distributed across the nodes.
Figure 10 (b) shows how the average delay changes as the data generating rate grows. The delays of all three routing policies rise when the generating rate surpasses their throughput upper bounds. However, we observe that the delay of hashing increases rapidly after it reaches its throughput upper bound, while the other two have smoother trends. This figure further demonstrates that dynamic secondary hashing significantly outperforms hashing and has close-to-optimal write performance.
We use Figure 11 to show that dynamic secondary hashing is capable of balancing workloads with different skewness factors. In this set of experiments, we collect the average write throughput during a period of more than 15 minutes for more stable results. Figure 11 (a) shows the write throughput with the three routing policies when the data generating rate is 160K TPS. When the skewness factor θ = 0, the workload is naturally balanced and all three policies exhibit similar write throughput and practically reach the [...]
Figure 13: Write throughput and CPU usage with hashing (a), double hashing (b) and dynamic secondary hashing (c). Bars
represent throughput, lines represent CPU usage. Figure (d) shows normalized shard sizes with three routing policies.
Figure 14: Real-time write throughput with three routing policies in 6 minutes.

Figure 15: Write throughput (a) and average CPU usage of the cluster (b) with logical replication and physical replication.

Figure 16: Query throughput of the top 2000 tenants with three routing policies.

Figure 17: Average (a) and quantile (b) query latencies of the top 100 tenants with and without ESDB's query optimizer.
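For reference, the skewed workload generation from Section 6.1 (tenant k sampled with probability proportional to (1/k)^θ) can be sketched as below. This is an illustrative generator, not the benchmark's actual code; the helper name and the use of random.choices are assumptions.

```python
import random

def make_tenant_sampler(num_tenants: int, theta: float):
    """Sample tenant IDs from a truncated Zipf(theta) distribution over 1..num_tenants."""
    ids = list(range(1, num_tenants + 1))
    weights = [1.0 / (k ** theta) for k in ids]   # proportional to (1/k)^theta
    def sample(batch_size: int) -> list[int]:
        return random.choices(ids, weights=weights, k=batch_size)
    return sample

# theta = 0 degenerates to a uniform distribution; theta = 1 is closest to production skew.
sample = make_tenant_sampler(num_tenants=100_000, theta=1.0)
batch = sample(10_000)   # tenant IDs for one batch of simulated transaction logs
```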
6.3.1 Query Throughput. In this experiment, we evaluate the query throughput when we issue queries on an ESDB cluster consisting of eight worker nodes, 512 shards and 40M simulated transaction logs. These transaction logs belong to 100K tenants with a skewness factor θ = 1. For each of the top 2000 tenants, we let three machines concurrently generate SQL queries and send requests to the ESDB cluster to evaluate the upper bound of the query throughput. In order to collect more stable results, we add a LIMIT 100 clause to every SQL query statement, which avoids fetching too many rows.
Figure 16 shows the query throughput for the top 2000 tenants with the three routing policies. When using double hashing, each tenant's data is distributed to 8 shards, which means a query has to be expanded to 8 subqueries, one for each shard. Therefore, the query throughput for double hashing is much lower than for the other two routing policies. On the contrary, dynamic secondary hashing achieves query throughput as high as hashing for both large tenants and small tenants. This is because, in our experiments, dynamic secondary hashing distributes a tenant's data to a smaller set of shards. Therefore, the number of subqueries is notably smaller than with double hashing, and this increases the query throughput by as much as 63% (for the smaller tenants).
Compared to hashing, dynamic secondary hashing also has its own advantage for queries issued to large tenants: since the shard sizes for large tenants are much smaller compared to those of hashing (Figure 13 (d)), subqueries can be executed in parallel and therefore complete faster. For this reason, we do not observe a significant drop in query throughput for large tenants. When processing queries issued to small tenants, both dynamic secondary hashing and hashing execute only one subquery on the target shard and have similar performance.
6.3.2 ESDB's Query Optimizer. In order to prove the effectiveness of ESDB's query optimizer, we build a query set of 1000 queries for each of the top 100 tenants. We then collect the total time consumed to finish the execution of the query set with a single-threaded query client. The target database is the same as the one used in the previous experiments for evaluating the query throughput. As shown in Figure 17 (a), query latencies decrease after enabling the query optimizer for all of the top 100 tenants. Figure 17 (b) further confirms that ESDB's query optimizer is able to reduce query latencies, and that the query latency stays under 200 ms even at the 99th percentile. Overall, with ESDB's query optimizer, the average query latency is improved by 2.41 times, and the latency of the queries issued to the largest tenant is improved most significantly, by 5.08 times.
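The latency measurement in Section 6.3.2 can be reproduced in spirit with a single-threaded loop like the one below; execute_sql is a placeholder for the SQL-over-ESDB path (e.g., via Xdriver4ES) and is not a real API, and the percentile computation is only one reasonable choice.

```python
import statistics
import time

def run_query_set(execute_sql, queries: list[str]) -> dict[str, float]:
    """Issue one tenant's query set sequentially and report average and p99 latency."""
    latencies_ms = []
    for sql in queries:  # e.g., 1000 queries, each ending with "LIMIT 100"
        start = time.perf_counter()
        execute_sql(sql)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "avg_ms": statistics.mean(latencies_ms),
        "p99_ms": statistics.quantiles(latencies_ms, n=100)[98],  # 99th percentile
    }
```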
Figure 18: Average (a) and quantile (b) query latencies of the top 100 tenants with and without frequency-based indices.

Figure 19: Max write delay and average query latency at the beginning of the Single's Day Global Shopping Festival, 2021.

6.3.3 Frequency-based indexing. Imitating the real-world data from our production environment, the simulated "attributes" column in our benchmark consists of 1500 sub-attributes whose frequencies are skewed (the top 30 sub-attributes appear in about 50% of both write and query workloads). When generating "attributes" for each simulated row, we sample 20 sub-attributes from a Zipf distribution (θ = 1). In this experiment, we build indices only for the top 30 sub-attributes, which incurs only 6.7% storage overhead. When generating query workloads, we append a filter on a sub-attribute, which is also sampled from the Zipf distribution, to the query template used in Section 6.3. Figure 18 (a) and (b) show the average and quantile query latencies for the top 100 tenants. We observe that, with frequency-based indices, query latencies improve significantly; the average query latency of the top 100 tenants is reduced by as much as 94.1%.

6.4 Online Performance
In our last experiment, we evaluate ESDB's performance in a production environment. More concretely, ESDB is used to support Alibaba's e-commerce platform during the 2021 Single's Day Global Shopping Festival. Figure 19 shows the max write delay and average query latency during a period of approximately 30 minutes around the beginning of the festival. We observe that the max write delay starts to rise notably at 00:00 am due to the dramatic increase of workloads. After the detection of hotspots and the adoption of secondary hashing rules, it takes less than 7 minutes for ESDB to process the workloads generated during the first few seconds after 00:00 am, and write delays are fully eliminated after the adaptation. This is a significant improvement over the previous year's max write delay, which could be as high as over 100 minutes. In addition, ESDB retains a decent average query latency during the first 30 minutes of the shopping festival. The average query latency does not surpass 164ms even when both the write and query throughputs are extremely high.

7 RELATED WORK
Different load balancing techniques have been proposed for different applications. Google Slicer [14] proposes a weighted-move sharding algorithm, which automatically merges cold slices and splits hot slices based on "weight" (a metric to evaluate skewness) and "key churn" (the cost of split/merge). Facebook's Shard Manager [47] moves hot shards out of overloaded servers. CockroachDB [62], Spanner [28], HBase [18] and Yak [44] use resharding methods which automatically split and move shards of "hot" tenants. Live migration is another type of migration-based load balancing technique which moves entire database applications of hotspots [...] and Slacker [21] adopt different cost optimizations in order to minimize the service interruption and downtime. Albatross [29] is a live migration technique used in shared-storage database architectures. Instead of migrating data, Albatross migrates the database cache and the state of transactions. Although effective, migration-based load balancers introduce extra bandwidth and computation overheads, consuming resources that are already very limited.
E-Store [61] identifies tuple-level hotspots and uses smart heuristics to generate optimal data migration plans for load balancing. SPORE [38] uses self-adaptive replication of popular key-value tuples in distributed memory caching systems. Compared to data migration, SPORE incurs less overhead and disperses workloads of "hot" tuples to multiple nodes. SWAT [50] implements a load balancing method which swaps the roles of primary and secondary replicas to process imbalanced workloads. Although lightweight, these three methods are not appropriate for our application because our major skewness is caused by the imbalance of tenants rather than of tuples or replicas. Centrifuge [13] uses temporary leases between continuous key ranges and servers to provide consistency for in-memory server pools. It balances workloads by changing the mapping from virtual nodes to physical worker nodes, which cannot be used to address a single hotspot. LogStore [25] achieves real-time workload balancing by maintaining and updating a routing table during runtime. Using a max-flow algorithm, LogStore generates routing plans which maximize overall write throughput. However, LogStore's router has no read-your-writes consistency guarantee, and this makes it risky to process UPDATE and DELETE workloads.

8 CONCLUSION
This paper presents ESDB, a cloud-native document-oriented database which supports elastic writes for extremely skewed workloads and efficient ad-hoc queries. ESDB adopts dynamic secondary hashing, a lightweight load balancing technique which eliminates hotspots of multi-tenant workloads in real time. Compared to hashing and double hashing, dynamic secondary hashing provides both efficient queries and load balancing, and thus overcomes the shortcomings of both techniques. In addition, we introduce optimizations that significantly reduce computation overheads and query latencies. We evaluate ESDB both in a laboratory environment with simulated workloads and in a production environment with real-world workloads. Our results show that ESDB is able to enhance write throughput and reduce write delays when processing extremely skewed workloads, as well as maintain high throughput and low latency for ad-hoc queries on distributed multi-tenant data.