ESDB - Processing Extremely Skewed Workloads in Real-Time
ABSTRACT
With the rapid growth of cloud computing, efficient management of multi-tenant databases has become a vital challenge for cloud service providers. It is particularly important for Alibaba, which hosts a distributed multi-tenant database supporting one of the world's largest e-commerce platforms. It serves tens of millions [...]
[...] to add customized attributes (e.g., sizes and materials of clothes, weights of food) to the commodities, which leads to diverse data schemas (i.e., different transaction logs have different attributes). Since it is unrealistic to add columns for all customized attributes to the table, we build an "attributes" column where all customized attributes are concatenated together as a string. In practice, over 1500 sub-attributes have been added to the transaction database. Before the adoption of ESDB, querying the "attributes" column was highly inefficient, since it is non-trivial to build a suitable index for such a non-standard string column.

In practice, sellers often need to perform full-text search queries that contain non-standard strings or keywords. For example, bookstore sellers search the transaction status of book transactions by keywords in the auction titles. Although MySQL provides fuzzy search (e.g., LIKE, REGEXP) as well as full-text search using full-text indexing, the support for these functions is limited and the performance is unstable, especially with the rapid growth of the transaction logs over the years. Consequently, this drove us to shift our backbone from MySQL to document-oriented databases.

Skewed and unpredictable workloads. Another challenge is that the workload distribution in the sellers' database is extremely skewed because of the tremendous variation in the number of transactions conducted by different sellers. The variation is further magnified at the kickoff of major sale and promotion events, during which the overall throughput increases dramatically. For example, Figure 1 shows the normalized throughput of the top 1000 sellers in the first 10 seconds of the Single's Day Global Shopping Festival, 2021. In the figure, the normalized throughput roughly follows a power law curve where the aggregated throughput of the top 10 sellers comprises 14.14% of the total throughput. The unpredictability of the throughput distribution adds another layer of complexity: the throughput of different sellers fluctuates substantially over time, depending on multiple factors such as the availability of merchandise promotions, the readiness of stock preparation, etc. In practice, the popularity of sellers can change significantly in a short time period, and the peaks of top sellers are difficult to forecast.

Before the adoption of ESDB, the transaction logs of each seller were uniquely assigned to a MySQL instance based on the seller ID. In practice, when a top seller's promotion starts and brings in voluminous transactions, the write throughput can easily overwhelm the instance's write capability. On the other hand, most instances (for ordinary sellers) are deeply under-utilized. The consequences of such skewed workloads are wasted resources on idle instances, failed real-time queries on hotspot instances, and an overall performance degradation. For example, the write delays (i.e., a metric that evaluates how long ESDB takes to complete a write of a transaction log into the seller's database) of large tenants could rise to over 100 minutes in early years, in which case the sellers would lose the capability of adjusting their sale strategy based on the market's response. A straightforward multi-tenant solution is to allow multiple sellers to share a database instance, e.g., by routing sellers' transaction logs to instances through consistent hashing [42, 49]. However, the distribution of transaction logs is still skewed due to the inherent imbalance of transactions performed by different sellers. Therefore, we need a workload-adaptive load balancing mechanism, especially with the high fluctuation and unpredictability in the workloads.

Contributions. In this paper, we introduce ESDB, a cloud-native multi-tenant database which features processing extremely skewed workloads. ESDB is built upon Elasticsearch [17] and inherits its core characteristics, such as full-text search, distributed indexing and querying, and the use of double hashing [5] as its request routing policy. The novelties of ESDB lie in a new load balancing technique that enables workload-adaptive routing of multi-tenant workloads, and in optimizations that overcome the shortcomings of Elasticsearch as a database, such as high RT for multi-column queries and the high cost of index computation. More concretely, we make the following contributions in this paper:
• We present the architecture of ESDB, a cloud-native document-oriented database which has been running on Alibaba Cloud for 5 years as the main transaction database behind Alibaba's e-commerce platform. ESDB provides support for elastic writing under extremely skewed workloads, as well as highly efficient ad-hoc queries and real-time analysis.
• We introduce dynamic secondary hashing as the solution to the performance degradation caused by the high skewness of the workloads. This mechanism enables real-time workload balancing to achieve balanced write throughput across instances. At the same time, unlike double hashing, which requires expensive distributed reads to fetch results from virtually all instances, it avoids the heavy burden induced by distributed queries.
• We further introduce several optimizations, such as the adoption of a query optimizer, physical replication, and frequency-based indexing. These optimizations enable ESDB to incorporate features of a document-oriented database (e.g., high scalability and strong support for full-text retrieval) while providing low RTs for ad-hoc queries.
• Finally, we evaluate ESDB both in a laboratory environment with simulated workloads and in a production environment with real-world workloads. The experimental results show that ESDB is able to balance write workloads at different skewness scales and achieves high query throughput with low latencies. Our results show that the deployment of ESDB succeeds in reducing write delays even at the spike of the Single's Day Global Shopping Festival.

The remainder of this paper is organized as follows. In Section 2, we introduce as background document-oriented databases and the general concept of load balancing. Section 3 presents the architectural overview of ESDB. We then introduce the design of dynamic secondary hashing in Section 4, and the optimizations in Section 5. Section 6 presents the evaluation of ESDB. Finally, we discuss related work in Section 7 and conclude in Section 8.

2 BACKGROUND AND MOTIVATION
In this section, we introduce as background document-oriented databases and a particular example, Elasticsearch, which is used as the basis of ESDB. We then discuss the general concept of load balancing, as well as the motivation for dynamic secondary hashing.

2.1 Document-oriented Database
A document-oriented database is a subclass of NoSQL database where the data is stored in the form of documents. Unlike a relational database, a document-oriented database does not have a predefined data format and thus supports flexible schemas. Compared to other NoSQL [...]
Figure 3: Architecture of ESDB.

[...] processes workloads independently. In an ESDB cluster, shards and replicas (each shard has one replica) are randomly allocated to different nodes. Nodes play different roles: each node works as both a coordinator (on the control layer) and a worker (on the execution layer); each cluster further elects one node as the master node.
Figure 3 shows an architectural overview of ESDB. In addition to the three layers of workload processing, this figure also includes ESDB's key features that enable real-time workload balancing and read/write optimizations. In the rest of this section, we introduce these components in more detail.

3.1 Application layer
The Application layer consists of ESDB's write clients and query clients. We briefly introduce these two kinds of clients:
Write clients. Accompanied by ESDB's load balancer, ESDB's write clients use the following techniques to accelerate writing and alleviate hotspots. (1) One-hop routing. The original write clients of Elasticsearch are transport clients which are unaware of workloads' destinations [12]. These transport clients route workloads to coordinators in a round-robin fashion, which leads to two-hop routing (write client → coordinator → worker). In ESDB, we allow the transport clients to be aware of the routing policies. In this way, we achieve one-hop routing (write client → worker) and thus accelerate writing. (2) Hotspot isolation. In write clients, all workloads are temporarily buffered in a queue before they are routed to their corresponding workers batch by batch. Once a worker is overloaded (possibly caused by both skewed write workloads and slow read queries), the queue will be blocked and the write delay will rise. In order to solve this problem, ESDB implements hotspot isolation, which isolates workloads of hotspots to a separate queue, such that they will not negatively affect other workloads. (3) Workload batching. When a write client detects that a row (identified by its row ID) will be frequently modified in a short period of time, it will batch-execute the workloads by aggregating these modifications and only materializing the eventual state of this row. By adopting workload batching, the write clients avoid repeated writes to the same row and thus improve write throughput.
Query clients. ESDB inherits the RESTful API and the query language ES-DSL from Elasticsearch. However, ES-DSL is less user-friendly compared to SQL, and it does not support all expressions and data types of SQL (e.g., the type conversion expression date_format). We therefore need a tool to rewrite SQL queries into ES-DSL queries. In order to solve this problem, we develop a plugin, Xdriver4ES, as a bridge between SQL and ES-DSL.
Xdriver4ES adopts a smart translator which generates cost-effective ES-DSL queries from SQL queries. Unlike SQL, ES-DSL encodes query ASTs directly, which are then parsed to generate execution plans. Therefore, instead of building ASTs from SQL queries, Xdriver4ES adopts the following two optimization techniques: (1) CNF/DNF conversion. Considering queries as boolean formulas, Xdriver4ES converts them into CNF/DNFs in order to reduce the depth of ASTs. (2) Predicate merge. Xdriver4ES merges predicates that involve the same column in order to reduce the width of ASTs (e.g., merging tenant_id=1 OR tenant_id=2 into tenant_id IN (1,2)). Xdriver4ES further utilizes a mapping module which converts the query results into a format that a SQL engine understands. For example, we implement in this module built-in functions of SQL, such as data type conversion and IFNULL. In this way, Xdriver4ES allows the execution of SQL queries on ESDB.

3.2 Control layer
The control layer consists of a master node, coordinator nodes and a monitor which collects metrics for workload balancing. In an ESDB cluster, the master node works as the manager of the whole cluster. It is responsible for cluster management tasks such as managing metadata, shard allocation and rebalancing, and tracking and monitoring nodes. The coordinator nodes are mainly responsible for routing read/write workloads to the corresponding worker nodes [8]. Specifically, during the query execution phase, coordinators first collect row IDs of the selected rows from all involved shards, and then fetch the corresponding raw data (indicated by the retrieved row ID list). Therefore, coordinators contain a query result aggregator that is in charge of row ID collection and performs aggregation operations (e.g., global sort, sum, avg).
Load balancer. ESDB's load balancer is developed to generate and commit routing rules which instruct the read/write routers. Since it is enabled by dynamic secondary hashing, the routing rules are also called secondary hashing rules. Once the monitor detects hotspots, the coordinators will generate new routing rules that essentially split and extend the shards that are hosting the current hotspots. By exploiting new routing rules, ESDB distributes workloads of large tenants to more shards and thus alleviates the over-burdened nodes. Once it receives new routing rules from coordinators, the master node is then responsible for committing these rules. In this way, the load balancer ensures the consistency of workload routing in the whole cluster. We present more design details of the load balancer in Section 4.
Frequency-based indexing. In order to mitigate the cost of maintaining indices of the sub-attributes of the "attributes" column (see Section 2.1), we adopt frequency-based indexing (i.e., we build indices only for the most frequently queried sub-attributes). This idea is derived from the observation that sub-attributes' read/write frequencies are skewed. For example, some generic sub-attributes, such as "activity" that indicates what e-commerce activity the seller [...]
[...] improvement in query latency with the cost of a slightly increased storage overhead. We demonstrate the effectiveness of frequency-based indexing in Section 6.3.3.

3.3 Execution layer
The execution layer consists of worker nodes which maintain shards and replicas on their local SSDs and execute read/write workloads. The worker nodes have a shared-nothing architecture where each worker maintains its own storage, independent of other workers.
Write execution and data replication. When executing write workloads, ESDB inherits the feature of near real-time search [7] from Elasticsearch. Raw data and indices are temporarily written into an in-memory buffer before they are periodically refreshed to segment files and become available to search. In order to address persistence and durability, ESDB adopts the Translog [11], which persists on disk. Every write workload is added to the Translog once it is successfully submitted. In this way, data that has not been flushed [3] to disk can be safely recovered from Translogs if ESDB encounters a process crash or hardware failure. Moreover, segment merge [9] is another important mechanism, which merges smaller segments into a large segment. It costs computation resources but effectively improves query efficiency.
Elasticsearch adopts a logical replication scheme [4]: a primary shard forwards a write workload to all its replicas once this workload has been locally executed. In other words, the same write workload is executed separately by each replica, which causes n-fold computation overhead to the cluster (n is the number of replicas). In ESDB, write workloads are still forwarded and added to Translogs on replicas in real time, but are never executed by replicas. Instead, ESDB implements physical replication of segment files. We describe the details of physical replication in Section 5.2.
Query optimizer. For query workloads, ESDB builds optimized query plans using a rule-based optimizer. Instead of using an index scan for all columns, ESDB provides more operations such as composite index scan and sequential scan. Through exploiting combinations of different operations, ESDB significantly reduces query latencies. Details of the optimizations are presented in Section 5.

4 LOAD BALANCING
The load balancer of ESDB is designed to meet the following two requirements: (1) Query efficiency. Data of multiple tenants should be placed on as few shards as possible in order to avoid query executions across too many shards. (2) Load balancing. The distribution of workloads across multiple shards should be as uniform as possible in order to avoid overloading a single shard. In practice, we make a trade-off between these two contradictory requirements by limiting data of small tenants to a single shard and distributing data of large tenants across multiple shards (see Figure 2 (c)).

4.1 Dynamic Secondary Hashing
The key intuition of dynamic secondary hashing is to adopt a workload-adaptive offset function L(k1) in the secondary hashing step of double hashing. Compared to Equation 1, the fixed maximum offset s is replaced with L(k1):

p = (h1(k1) + h2(k2) mod L(k1)) mod N    (2)

where 1 ≤ L(k1) ≤ N, L(k1) ∈ ℤ, and L(k1) depends on the real-time workloads of tenant k1.

Figure 4: Comparison of workload routing across 64 shards between hashing, double hashing and dynamic secondary hashing. Colored nodes represent the shards to which workloads of attribute k1 are routed.

Figure 4 depicts three different routing policies: hashing, double hashing and dynamic secondary hashing. Hashing can only route a workload based on the one-level hash of the partition key (e.g., tenant ID) and thus falls short when it needs to balance multi-tenant workloads. Ordinary double hashing is able to route workloads of a tenant to a fixed set of consecutive shards based on two-level hashing of two keys (i.e., tenant ID and record ID), but fails to manage dynamic workloads. Dynamic secondary hashing is inspired by double hashing and also routes workloads to consecutive shards. However, it is capable of extending to successive shards when the workloads increase (e.g., L(k1) increases from 8 to 16 in Figure 4).

In practice, the maximum offset in the secondary hashing, s = L(k1), relies on two metrics: (1) Current storage. We assume that tenants with a larger storage proportion are more likely to have large forthcoming workloads. Therefore, we select larger s for tenants [...]

Algorithm 1 ESDB load balancer
1: K ← collect the set of tenants
2: P ← collect the set of shards
3: R ← ∅    ⊲ initialize secondary hashing rule list
4: t_init ← select effective time
5: S(K) ← collect current storage of each k
6: for each k in K do
7:     r ← S(k) / Σ_{k∈K} S(k)
8:     s ← ComputeOffsetSize(r)
9:     R ← UpdateRuleList(R, t_init, s, k)
10: end for
11: while Service on do
12:     t_update ← select effective time
13:     T(K) ← collect periodic write throughput of each k
14:     for each k in K do
15:         r ← T(k) / Σ_{k∈K} T(k)
16:         if CheckHotSpot(r) then
17:             s ← ComputeOffsetSize(r)
18:             R ← UpdateRuleList(R, t_update, s, k)
19:         end if
20:     end for
21: end while
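To make the routing rule concrete, the following Python sketch implements Equation (2) together with a toy version of Algorithm 1's hotspot check. It is illustrative only: the hash functions, the 5% hotspot threshold and the doubling policy are assumptions for the example, not ESDB's actual implementation.

```python
import hashlib

NUM_SHARDS = 64  # N, matching the example in Figure 4

def _hash(tag: str, value: str) -> int:
    # Stand-in for the two hash functions h1/h2 (assumed, not ESDB's real ones).
    return int(hashlib.md5((tag + value).encode()).hexdigest(), 16)

# Secondary hashing rule list: tenant -> current offset L(k1).
# Small tenants default to L(k1) = 1, i.e., a single shard.
offsets: dict[str, int] = {}

def route(tenant_id: str, record_id: str) -> int:
    """Equation (2): p = (h1(k1) + h2(k2) mod L(k1)) mod N."""
    L = offsets.get(tenant_id, 1)
    return (_hash("h1", tenant_id) + (_hash("h2", record_id) % L)) % NUM_SHARDS

def rebalance(throughput: dict[str, float], hotspot_ratio: float = 0.05) -> None:
    """Toy analogue of Algorithm 1's update loop: grow L(k1) for hot tenants."""
    total = sum(throughput.values()) or 1.0
    for tenant, tps in throughput.items():
        if tps / total > hotspot_ratio:              # CheckHotSpot(r)
            L = offsets.get(tenant, 1)
            offsets[tenant] = min(NUM_SHARDS, 2 * L)  # e.g., extend 8 -> 16 shards
```

Under this scheme a tenant's records land on L(k1) consecutive shards starting at h1(k1) mod N, so enlarging L(k1) extends the tenant to the successive shards as sketched in Figure 4; the effective-time mechanism discussed later in Section 4 is what keeps routing consistent across such rule changes.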
[...] all executed records). If this condition is satisfied, the participant blocks all workloads whose creation times are later than the effective time and replies to the master node with an acceptance message. Otherwise, the participant reports an error to the master node. If the master node receives any error message or detects any timeout (a participant does not respond within T/2), this secondary hashing rule is aborted. Otherwise, the commit phase begins.
Commit Phase. The master node broadcasts a commit message as well as the secondary hashing rule to all participating coordinators. Since all nodes have reached consensus in the last phase, they will accept the commit message and add the rule to their local secondary hashing rule lists. Once the commit phase is complete, the nodes remove the block on workload execution (i.e., they continue to process workloads with creation times greater than the effective time).
Choice of time interval. The time interval T provides a buffering period for the system to reach consensus on the secondary hashing rules. T should be much larger than the roundtrip latency of broadcasting (e.g., 100ms) and the maximum of the local clock deviations across the cluster (no more than 1s in ESDB) to ensure strict consistency. At the same time, T should be shorter than our expected time of load balancing (e.g., 60s) for effectiveness. As long as T is larger than the time for the cluster to reach consensus, ESDB's consensus protocol achieves non-blocking workload processing.
Fault tolerance. Although ESDB's consensus protocol ensures strict consistency of R, it still suffers from network partitions and node failures during the commit phase. ESDB adopts an automatic fault-tolerance solution. However, it relies on the detection of network partitions and node failures; that is, it needs to differentiate temporarily unresponsive nodes from failed nodes. A typical solution uses a pre-defined timeout; in ESDB, we manually verify a raised alarm (a node becoming unresponsive) to definitively decide whether a node failure or a network partition has occurred.

5 OPTIMIZATION
ESDB focuses mainly on multi-column SFW (SELECT-FROM-WHERE) queries on a single table, where multiple predicates are connected by AND and OR operators. Before the deployment of our optimizations, ESDB's query performance was decent when queries involved only a few columns. However, we observed more than 10x overhead from ESDB's multi-column queries compared to that of a MySQL cluster. (Queries from sellers usually involve more than 10 columns.) After a performance analysis, we identified that the suboptimal performance mainly results from Lucene's rigid query plans. To address this problem, we introduce a query optimizer (see Section 5.1).

5.1 Query Optimizer
As an example, consider a query involving four columns (shown in Figure 6). The execution plan generated by Lucene is shown in Figure 7. First, Lucene generates posting lists, which record the row IDs of the selected rows, for each column by searching the corresponding indices. Then it aggregates the posting lists through intersections and unions. This query plan introduces large overhead since the posting lists are generated sequentially. It becomes even more time-consuming when the selectivity of a column is high and the posting list grows prohibitively large. In order to solve this problem, we let ESDB incorporate features from relational databases: composite indices, sequential scan and a rule-based optimizer (RBO) which produces cost-effective query plans.

Figure 7: Example query plan of Elasticsearch. A, B, C, D, E and F represent posting lists generated by the corresponding operations (e.g., index searches on predicates such as tenant_id = 10086, created_time between '2021-09-16 00:00:00' and '2021-09-17 00:00:00', and status = 1).

Composite index. In relational databases, a composite index is built on multiple columns which are concatenated in a specific order. Taking advantage of composite indices usually avoids time-consuming operations, such as table scans, and thus accelerates queries that involve multiple columns [24]. Meanwhile, composite indices have limited applicability, as the columns must comply with the leftmost sequence (i.e., the leftmost principle). For example, if the composite index is built on two columns column1_column2, we can query about either (column1) or (column1, column2); on the other hand, queries about (column2) cannot leverage this composite index. In order to increase the availability of composite indices, DBAs are expected to manually build composite indices among a massive number of column combinations [27].
Elasticsearch uses the Bkd-tree [55] to index numeric and multi-dimensional (e.g., geo-information) data. The Bkd-tree is an index structure which combines the kd-tree [57] and the B+ tree. Unlike the B+ tree, the Bkd-tree enables division of the search space along different dimensions. Therefore, it is not necessary to follow the leftmost principle, which makes the composite index more flexible. Furthermore, it optimizes disk I/O and significantly reduces the overheads of insertion and deletion by dynamically maintaining a forest of kd-trees [55]. However, the Bkd-tree suffers from the curse of dimensionality, where search performance degrades as the number of dimensions grows [15]. In ESDB, we build concatenated columns and one-dimensional Bkd-trees on these columns as the composite indices. Although such a design has less flexibility, ESDB's composite index search is fast and is able to cover most query workloads in practical application scenarios.
Another challenge of composite indices is the growing key size when we concatenate columns, which makes operations like key comparisons expensive. In order to solve this problem, we take advantage of common prefixes: since the concatenated keys are sorted in the composite index, the leaf nodes in the Bkd-tree usually contain keys that share a common prefix. By leveraging the common prefixes, we manage to increase storage efficiency and reduce the cost of key comparisons.
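To illustrate the concatenated-key composite index and the common-prefix idea described above, here is a minimal Python sketch. It is not ESDB's on-disk format; the separator, the leaf layout and the example values are assumptions for illustration.

```python
import os

SEP = "\x00"  # assumed separator between concatenated column values

def composite_key(*column_values: str) -> str:
    # e.g., a (tenant_id, created_time) composite key
    return SEP.join(column_values)

def compress_leaf(sorted_keys: list[str]) -> tuple[str, list[str]]:
    """Store a sorted leaf block as (shared prefix, suffixes) to shrink long keys."""
    prefix = os.path.commonprefix(sorted_keys) if sorted_keys else ""
    return prefix, [k[len(prefix):] for k in sorted_keys]

def leaf_contains(leaf: tuple[str, list[str]], key: str) -> bool:
    # The shared prefix is compared once; per-key comparisons touch only the suffixes.
    prefix, suffixes = leaf
    return key.startswith(prefix) and key[len(prefix):] in suffixes

keys = sorted(composite_key("10086", ts) for ts in
              ("2021-09-16 00:10:00", "2021-09-16 08:30:00", "2021-09-16 09:00:00"))
leaf = compress_leaf(keys)
assert leaf_contains(leaf, composite_key("10086", "2021-09-16 08:30:00"))
```

Because all keys of a single tenant share the tenant-ID prefix, a leaf holding that tenant's entries stores the prefix once, which is the storage and comparison saving the paragraph above refers to.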
[...] larger visibility delay of Segment B. In order to solve this problem, we further introduce pre-replication of merged segments. When a merged segment is generated, the primary shard immediately starts to replicate it to the replicas. This pre-replication runs independently alongside the quick incremental replications. In this way, merged segments never appear in the segment diff and thus have limited influence on the replication of refreshed segments.

6 EVALUATION
In this section, we present the evaluation results of ESDB to demonstrate its capability of processing skewed write workloads while retaining high throughput and low latency for distributed queries.

6.1 Experimental Setup
All experiments are performed on a cluster consisting of 11 ECS virtual machines (ecs.c7.2xlarge) on Alibaba Cloud. Each virtual machine contains 8 vCPUs, 16GB memory and a 1TB SSD disk. We use three machines to simulate ESDB's write and query clients and the remaining eight machines as the worker nodes of an ESDB cluster.
In order to simulate real-time processing of Alibaba's e-commerce transaction logs, we build a benchmark which generates random workloads based on the template of our transaction logs and collects metrics of the ESDB cluster in real time. During the evaluation, the simulated workloads are routed to 512 shards located on the eight worker nodes. The simulated workloads contain columns for transaction ID (an auto-increment unique key), tenant ID and creation time, which are essential for ESDB's workload balancer.
In order to simulate different levels of skewness, we let the workload generators sample tenant IDs from a Zipf distribution tunable by a skewness factor θ. The sampling size of tenant k is set to be proportional to (1/k)^θ. We select 5 different θs: 0, 0.5, 1, 1.5 and 2. When θ = 0, the Zipf distribution is effectively reduced to a uniform distribution. When θ = 1, the simulated workloads are the closest to real workloads. Simulations with θ = 1.5 and θ = 2 rarely happen in our production environment, but serve to evaluate the performance of ESDB in the case of extreme skewness.

Figure 10: Comparisons of three routing policies when θ = 1. Figures (a) and (b) respectively present write throughput and average delay with different generating rates.

Figure 11: Write throughput and average delay of three routing policies with different skewness factors θ.

6.2 Balanced Write
6.2.1 Write Throughput and Delay. In the first set of our experiments, we measure the cluster throughput and the write delay to evaluate the performance of three different routing policies:
• Hashing, the baseline policy without any workload balancing;
• Double hashing, another baseline policy that distributes the data of each tenant to 8 shards;
• Dynamic secondary hashing, the routing policy used by ESDB's load balancer.
Figure 10 presents the cluster throughput and write delays when θ = 1 with different data generating rates. In Figure 10 (a), we observe that the throughput of hashing reaches its limit at around 90K TPS while the other two do not stop until they reach 140K. This is mainly because hashing fails to balance the skewed workloads and thus wastes resources that could have been used to handle workloads targeted at hotspots. On the contrary, dynamic secondary hashing manages to balance the skewed workloads and therefore has performance close to that of double hashing, which is the optimal option since its data is uniformly distributed across the nodes.
Figure 10 (b) shows how the average delay changes as the data generating rate grows. The delays of all three routing policies rise when the generating rate surpasses their throughput upper bounds. However, we observe that the delay of hashing increases rapidly after it reaches its throughput upper bound, while the other two have smoother trends. This figure further demonstrates that dynamic secondary hashing significantly outperforms hashing and has close-to-optimal write performance.
We use Figure 11 to show that dynamic secondary hashing is capable of balancing workloads with different skewness factors. In this set of experiments, we collect the average write throughput during a period of more than 15 minutes for more stable results. Figure 11 (a) shows the write throughput with the three routing policies when the data generating rate is 160K TPS. When the skewness factor θ = 0, the workload is naturally balanced and all three policies exhibit similar write throughput and practically reach the [...]
Figure 13: Write throughput and CPU usage with hashing (a), double hashing (b) and dynamic secondary hashing (c). Bars
represent throughput, lines represent CPU usage. Figure (d) shows normalized shard sizes with three routing policies.
Figure 14: Real-time write throughput with three routing policies in 6 minutes.

Figure 15: Write throughput (a) and average CPU usage of the cluster (b) with logical replication and physical replication.

Figure 16: Query throughput of the top 2000 tenants with three routing policies.

Figure 17: Average (a) and quantile (b) query latencies of the top 100 tenants with and without ESDB's query optimizer.
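For reference, the skewed workload generation from Section 6.1 (tenant k sampled with probability proportional to (1/k)^θ) can be sketched as below. This is an illustrative generator, not the benchmark's actual code; the helper name and the use of random.choices are assumptions.

```python
import random

def make_tenant_sampler(num_tenants: int, theta: float):
    """Sample tenant IDs from a truncated Zipf(theta) distribution over 1..num_tenants."""
    ids = list(range(1, num_tenants + 1))
    weights = [1.0 / (k ** theta) for k in ids]   # proportional to (1/k)^theta
    def sample(batch_size: int) -> list[int]:
        return random.choices(ids, weights=weights, k=batch_size)
    return sample

# theta = 0 degenerates to a uniform distribution; theta = 1 is closest to production skew.
sample = make_tenant_sampler(num_tenants=100_000, theta=1.0)
batch = sample(10_000)   # tenant IDs for one batch of simulated transaction logs
```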
6.3.1 Query Throughput. In this experiment, we evaluate the query throughput when we issue queries on an ESDB cluster consisting of eight worker nodes, 512 shards and 40M simulated transaction logs. These transaction logs belong to 100K tenants with a skewness factor θ = 1. For each of the top 2000 tenants, we let three machines concurrently generate SQL queries and send requests to the ESDB cluster to evaluate the upper bound of the query throughput. In order to collect more stable results, we add a LIMIT 100 clause to every SQL query statement, which avoids fetching too many rows.
Figure 16 shows the query throughput for the top 2000 tenants with the three routing policies. When using double hashing, each tenant's data is distributed to 8 shards, which means a query has to be expanded to 8 subqueries, one for each shard. Therefore, the query throughput for double hashing is much lower than for the other two routing policies. On the contrary, dynamic secondary hashing achieves query throughput as high as hashing for both large tenants and small tenants. This is because, in our experiments, dynamic secondary hashing distributes a tenant's data to a smaller set of shards. Therefore, the number of subqueries is notably smaller than with double hashing, and this increases the query throughput by as much as 63% (for the smaller tenants).
Compared to hashing, dynamic secondary hashing also has its own advantage for queries issued to large tenants: since the shard sizes for large tenants are much smaller compared to those of hashing (Figure 13 (d)), subqueries can be executed in parallel and therefore complete faster. For this reason, we do not observe a significant drop in query throughput for large tenants. When processing queries issued to small tenants, both dynamic secondary hashing and hashing execute only one subquery on the target shard and have similar performance.
6.3.2 ESDB's Query Optimizer. In order to prove the effectiveness of ESDB's query optimizer, we build a query set of 1000 queries for each of the top 100 tenants. We then collect the total time consumed to finish the execution of the query set with a single-threaded query client. The target database is the same as the one used in the previous experiments for evaluating the query throughput. As shown in Figure 17 (a), query latencies decrease after enabling the query optimizer for all of the top 100 tenants. Figure 17 (b) further confirms that ESDB's query optimizer is able to reduce query latencies, and that the query latency stays under 200 ms even at the 99th percentile. Overall, with ESDB's query optimizer, the average query latency is improved by 2.41 times, and the latency of the queries issued to the largest tenant is improved most significantly, by 5.08 times.
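The latency measurement in Section 6.3.2 can be reproduced in spirit with a single-threaded loop like the one below; execute_sql is a placeholder for the SQL-over-ESDB path (e.g., via Xdriver4ES) and is not a real API, and the percentile computation is only one reasonable choice.

```python
import statistics
import time

def run_query_set(execute_sql, queries: list[str]) -> dict[str, float]:
    """Issue one tenant's query set sequentially and report average and p99 latency."""
    latencies_ms = []
    for sql in queries:  # e.g., 1000 queries, each ending with "LIMIT 100"
        start = time.perf_counter()
        execute_sql(sql)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "avg_ms": statistics.mean(latencies_ms),
        "p99_ms": statistics.quantiles(latencies_ms, n=100)[98],  # 99th percentile
    }
```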
Figure 18: Average (a) and quantile (b) query latencies of the top 100 tenants with and without frequency-based indices.

Figure 19: Max write delay and average query latency at the beginning of the Single's Day Global Shopping Festival, 2021.

6.3.3 Frequency-based indexing. Imitating the real-world data from our production environment, the simulated "attributes" column in our benchmark consists of 1500 sub-attributes whose frequencies are skewed (the top 30 sub-attributes appear in about 50% of both write and query workloads). When generating "attributes" for each simulated row, we sample 20 sub-attributes from a Zipf distribution (θ = 1). In this experiment, we build indices only for the top 30 sub-attributes, which incurs only 6.7% storage overhead. When generating query workloads, we append a filter on a sub-attribute, which is also sampled from the Zipf distribution, to the query template used in Section 6.3. Figure 18 (a) and (b) show the average and quantile query latencies for the top 100 tenants. We observe that, with frequency-based indices, query latencies improve significantly; the average query latency of the top 100 tenants is reduced by as much as 94.1%.

6.4 Online Performance
In our last experiment, we evaluate ESDB's performance in a production environment. More concretely, ESDB is used to support Alibaba's e-commerce platform during the 2021 Single's Day Global Shopping Festival. Figure 19 shows the max write delay and average query latency during a period of approximately 30 minutes around the beginning of the festival. We observe that the max write delay starts to rise notably at 00:00 am due to the dramatic increase of workloads. After the detection of hotspots and the adoption of secondary hashing rules, it takes less than 7 minutes for ESDB to process the workloads generated during the first few seconds after 00:00 am, and write delays are fully eliminated after the adaptation. This is a significant improvement over the previous year's max write delay, which could be as high as over 100 minutes. In addition, ESDB retains a decent average query latency during the first 30 minutes of the shopping festival. The average query latency does not surpass 164ms even when both the write and query throughputs are extremely high.

7 RELATED WORK
Different load balancing techniques have been proposed for different applications. Google Slicer [14] proposes a weighted-move sharding algorithm, which automatically merges cold slices and splits hot slices based on "weight" (a metric to evaluate skewness) and "key churn" (the cost of split/merge). Facebook's Shard Manager [47] moves hot shards out of overloaded servers. CockroachDB [62], Spanner [28], HBase [18] and Yak [44] use resharding methods which automatically split and move shards of "hot" tenants. Live migration is another type of migration-based load balancing technique which moves entire database applications of hotspots [...] and Slacker [21] adopt different cost optimizations in order to minimize the service interruption and downtime. Albatross [29] is a live migration technique used in shared-storage database architectures. Instead of migrating data, Albatross migrates the database cache and the state of transactions. Although effective, migration-based load balancers introduce extra bandwidth and computation overheads, consuming resources that are already very limited.
E-Store [61] identifies tuple-level hotspots and uses smart heuristics to generate optimal data migration plans for load balancing. SPORE [38] uses self-adaptive replication of popular key-value tuples in distributed memory caching systems. Compared to data migration, SPORE incurs less overhead and disperses workloads of "hot" tuples to multiple nodes. SWAT [50] implements a load balancing method which swaps the roles of primary and secondary replicas to process imbalanced workloads. Although lightweight, these three methods are not appropriate for our application because our major skewness is caused by the imbalance of tenants rather than of tuples or replicas. Centrifuge [13] uses temporary leases between continuous key ranges and servers to provide consistency for in-memory server pools. It balances workloads by changing the mapping from virtual nodes to physical worker nodes, which cannot be used to address a single hotspot. LogStore [25] achieves real-time workload balancing by maintaining and updating a routing table during runtime. Using a max-flow algorithm, LogStore generates routing plans which maximize overall write throughput. However, LogStore's router has no read-your-writes consistency guarantee, and this makes it risky to process UPDATE and DELETE workloads.

8 CONCLUSION
This paper presents ESDB, a cloud-native document-oriented database which supports elastic writes for extremely skewed workloads and efficient ad-hoc queries. ESDB adopts dynamic secondary hashing, a lightweight load balancing technique which eliminates hotspots of multi-tenant workloads in real time. Compared to hashing and double hashing, dynamic secondary hashing provides both efficient queries and load balancing, and thus overcomes the shortcomings of both techniques. In addition, we introduce optimizations that significantly reduce computation overheads and query latencies. We evaluate ESDB both in a laboratory environment with simulated workloads and in a production environment with real-world workloads. Our results show that ESDB is able to enhance write throughput and reduce write delays when processing extremely skewed workloads, as well as maintain high throughput and low latency for ad-hoc queries on distributed multi-tenant data.