Cloud-Native Transactions and Analytics in SingleStore

ABSTRACT
The last decade has seen a remarkable rise in specialized database systems. Systems for transaction processing, data warehousing, time series analysis, full-text search, data lakes, in-memory caching, document storage, queuing, graph processing, and geo-replicated operational workloads are now available to developers. A belief has taken hold that a single general-purpose database is not capable of running varied workloads at a reasonable cost with strong performance, at the level of scale and concurrency people demand today. There is value in specialization, but the complexity and cost of using multiple specialized systems in a single application environment is becoming apparent. This realization is driving developers and IT decision makers to seek databases capable of powering a broader set of use cases when they adopt a new database. Hybrid transactional and analytical processing (HTAP) databases have been developed to try to tame some of this chaos.
In this paper we introduce SingleStoreDB (S2DB), formerly called MemSQL, a distributed general-purpose SQL database designed to have the versatility to run both operational and analytical workloads with good performance. It was one of the earliest distributed HTAP databases on the market. It can scale out to efficiently utilize 100s of hosts, 1000s of cores, and 10s of TBs of RAM while still providing a user experience similar to a single-host SQL database such as Oracle or SQL Server. S2DB's unified table storage runs both transactional and analytical workloads efficiently, with operations like fast scans, seeks, filters, aggregations, and updates. This is accomplished through a combination of rowstore, columnstore, and vectorization techniques, the ability to seek efficiently into a columnstore using secondary indexes, and the use of in-memory rowstore buffers for recently modified data. It avoids design simplifications (e.g., only supporting batch loading, or limiting the query surface area to particular patterns of queries) that sacrifice the ability to run a broad set of workloads.
Today, after 10 years of development, S2DB runs demanding production workloads for some of the world's largest financial, telecom, high-tech, and energy companies. These customers drove the product towards a database capable of running a breadth of workloads across their organizations, often replacing two or three different databases with S2DB. The design of S2DB's storage, transaction processing, and query processing was developed to maintain this versatility.

CCS CONCEPTS
Information systems ~ Data management systems ~ Database management system engines ~ DBMS engine architectures

KEYWORDS
Databases, Distributed Systems, Separation of Storage and Compute, Transactions and Analytics

ACM Reference format:
Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. 2022. Cloud-Native Transactions and Analytics in SingleStore. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22), June 12-17, 2022, Philadelphia, PA, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3514221.3526055

1 Introduction
The market is saturated with specialized database engines. As of January 2022, DB-Engines [18] ranks over 350 different databases. Amazon Web Services alone supports 15+ different database products [1]. There is value in special-case systems [2], but when applications end up built as a complex web of different databases, much of that value is eroded. Developers are manually rebuilding the general-purpose databases of old via ETL and data flows between specialized databases.
We believe two industry trends have driven this proliferation of new databases. The first trend is the shift to cloud-native architectures designed to take advantage of elastic cloud infrastructure. Cloud blob stores (S3 [3]) and block storage (EBS [44]) allow databases to tap into almost limitless, highly available, and durable data storage. Elastic compute instances (EC2 [4]) allow databases to bring more compute to bear at a moment's notice to deal with a complex query or a spike in throughput. The second trend is the demand from developers to store more data and access it with lower latency and higher throughput. Modern applications generate a lot of data. This performance and data capacity requirement is often combined with a desire for flexible data access. These access patterns are application-specific but can range from low-latency, high-throughput writes (including updates) for real-time data loading and deduplication, to efficient batch loading and complex analytical queries over the same data. Application developers have never been more demanding of databases.
A common approach to tackle these requirements is to use a domain-specific database for different components of an application. In contrast, we believe it is possible to design a database that can take advantage of elastic cloud infrastructure while satisfying a breadth of requirements for transactional and analytical workloads. There are many benefits for users in having a single integrated, scalable database that can handle many application types. These include: reduced training requirements for developers, reduced need to move and transform data, a reduction in the number of copies of data that must be stored (and the resulting reduction in storage costs), reduced software license costs, and reduced hardware costs. Furthermore, S2DB enables modern workloads to provide interactive real-time insights and decision-making, supporting both high-throughput, low-latency writes and complex analytical queries over ever-changing data, with end-to-end latency of seconds to sub-seconds from new data arriving to analytical results. This outcome is difficult to achieve with multiple domain-specific databases.
Moreover, adding incrementally more functionality to cover different use cases with a single distributed DBMS leverages existing fundamental qualities that any distributed data management system needs to provide. This yields more functionality per unit of engineering effort on the part of the vendor, contributing to lower net costs for the customer. For example, specialized scale-out systems for full-text search may need cluster management, transaction management, high availability, and disaster recovery, just like a scale-out relational system requires. Some specialized systems may forgo some of these capabilities for expediency, compromising reliability.
This paper introduces the architecture of the SingleStore database engine, a cloud-native database that excels at running complex interactive queries over large datasets (100s of terabytes) as well as running high-throughput, low-latency read and write queries with predictable response times (millions of rows written or updated per second). The same SingleStore database engine is used both in SingleStore Managed Service, a cloud database service, and in the SingleStoreDB (S2DB) database product, which can be installed wherever desired. In the rest of the paper, we'll simply refer to the SingleStore database engine as S2DB.
S2DB can support a breadth of workloads over disaggregated storage by pushing only cold data to blob storage and making intelligent use of local state when running queries to minimize network use. The rest of this paper expands on other important design decisions made while building S2DB. We believe our design represents a good trade-off between efficiency and flexibility and can help simplify application development by avoiding complex data pipelines.
This paper presents two key components of S2DB that are important for cloud-native transactional and analytical workloads [42].

Separation of storage and compute
S2DB is able to make efficient use of the cloud storage hierarchy (local memory, local disks, and blob storage) based on data hotness. This is an obvious design, yet most cloud data warehouses that support using blob storage as a shared remote disk don't do it for newly written data. They force new data for a write transaction to be written out to blob storage before that transaction can be considered committed or durable [26, 27, 30]. This in effect forces hot data to be written to the blob store, harming write latency. S2DB can commit on local disk and push data asynchronously to blob storage. This gives S2DB all the advantages of separation of storage and compute without the write latency penalty of a cloud data warehouse. For example, S2DB:
• Can store more data than fits on local disks by keeping cold data in blob storage and only the working set of recently queried data on local disks.
• Stores history in blob storage (deleted data can be retained). This enables point-in-time restores to points in the past without needing to take explicit backups or copy any data on a restore.
• Can provision multiple read-only replicas of a database from blob storage without any impact on the read-write master copy of the database. The read-only replicas are created on their own hosts (called a workspace in S2DB) and can be attached to and detached from the workspace on demand. This allows S2DB to support OLTP and OLAP workloads over the same data while using isolated compute for each.

Unified table storage
S2DB tables support transactions that need both the scan performance of a columnstore (scanning 100s of millions to trillions of rows in a second [49]) and the seek performance of rowstore indexes to speed up point reads and writes. In S2DB, both OLAP and OLTP workloads use a single unified table storage design. Data doesn't need to be copied or replicated into different data layouts, as other HTAP systems often do [23]. S2DB's unified table storage internally makes use of both rowstore and columnstore formats, but end users need not be aware of this. At a high level, the design is that of a columnstore with modifications to better support selective reads and writes, in a manner that has very little impact on the columnstore's compression and table scan performance. The columnstore data is organized as a log-structured merge (LSM) tree [8], with secondary hash indexes supported to speed up OLTP workloads. Unified tables support sort keys, secondary keys, shard keys, unique keys, and row-level locking, which is an extensive and unique set of features for table storage in columnstore format. Unified table storage is also sometimes referred to as universal storage [39] for its ability to handle a universal set of workloads.
The rest of the paper is structured as follows. Section 2 gives a brief overview of how an S2DB cluster functions: how it distributes data, maintains high availability, and runs queries. Section 3 describes how S2DB separates storage and compute without sacrificing support for low-latency writes. Section 4 details the design of our unified table storage, which we believe is ideal for HTAP. Section 5 describes how query execution adapts to tradeoffs between different data access methods on the unified table storage. Section 6 shows some experimental results using industry-standard benchmarks to demonstrate that S2DB is competitive with both operational and analytical databases on benchmarks specific to each workload.
2 Background on SingleStoreDB
SingleStoreDB is a horizontally partitioned, shared-nothing DBMS [35] that is optionally able to use shared storage such as a blob store for cold data. An S2DB cluster is made up of aggregator nodes, which coordinate queries, and leaf nodes, which hold copies of partitions of data and are responsible for the bulk of the compute for queries. Each leaf holds several partitions of data. Each partition is either a master, which can serve both reads and writes, or a replica, which can only serve reads.
Tables are distributed across partitions by hash-partitioning a user-configurable set of columns called the shard key. This enables fast query execution for point reads and for query shapes that do not require moving data between leaves. When join conditions or group-by columns match their referenced tables' shard keys, S2DB pushes execution down to individual partitions, avoiding any data movement. Otherwise, SingleStore redistributes data during query execution, performed as a broadcast or reshuffle operation, as described in [46]. S2DB's query processor is able to run complex analytical queries, such as those in the TPC-H and TPC-DS benchmarks, competitively with cloud data warehouses [19, 46].
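To make the shard-key routing concrete, the following simplified Python sketch (our illustration, not S2DB source code; the table, the column names, and the choice of hash function are assumptions) shows how a row can be mapped to a partition by hashing its shard key, and why a join whose keys match the shard keys of both tables can be evaluated partition-locally.

    import hashlib

    NUM_PARTITIONS = 16  # fixed at database creation time in this sketch

    def partition_for(shard_key_values):
        """Route a row to a partition by hashing its shard key columns."""
        h = hashlib.sha1("|".join(str(v) for v in shard_key_values).encode()).hexdigest()
        return int(h, 16) % NUM_PARTITIONS

    # A row of orders(order_id, customer_id, amount) sharded on (customer_id):
    row = {"order_id": 1001, "customer_id": 42, "amount": 18.50}
    p = partition_for([row["customer_id"]])

    # A join orders JOIN customers ON orders.customer_id = customers.customer_id
    # can run partition-locally when both tables are sharded on customer_id,
    # because matching rows hash to the same partition on the same leaf.
    assert partition_for([42]) == p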
SingleStore also supports full-query code generation targeting LLVM [40] through an intermediate bytecode. LLVM compilation happens asynchronously while the query begins running via a bytecode interpreter. The compiled LLVM code is hot-swapped in during query execution when compilation completes. Using native code generation to execute queries reduces the instructions needed to run a query compared to the more typical hand-built interpreters in other SQL databases [41]. The details of S2DB's query compilation pipeline are omitted from this paper.
S2DB maintains high availability (HA) by storing multiple replicas of each partition on different nodes in the cluster. By default, data is replicated synchronously to the replicas as transactions commit on the master partitions. Read queries never run on HA replicas; they exist only for durability and availability. Queries only run on master partitions or on specifically created read replicas. Since HA replicas exist on the same set of hosts that store masters for other partitions (see Figure 2), using HA replicas to run queries wouldn't help spread the load across the cluster (the same hosts are already busy running queries). S2DB does support the creation of read replicas for scaling out queries on other hosts without any master partitions, the details of which are described in section 3.3. HA replicas are hot copies of the data on the master partition, such that a replica can pick up the query workload immediately after a failover without needing any warm-up. Failovers and auto-healing are coordinated by a special aggregator node called the master aggregator. If a node stops responding to heartbeats for long enough, then the replica partitions for any of its master partitions will be promoted to master and take over running queries.
S2DB also supports the creation of asynchronously replicated replicas in different regions or data centers for disaster recovery. These cross-region replicas can act as another layer of HA in the event of a full region outage. They are queryable by read queries by default, so they can also be used to scale out reads. Failovers across regions are not automated in S2DB today; they must be triggered by a DBA.

2.1 Table storage formats
S2DB uses two storage types internally: an in-memory rowstore backed by a lock-free skiplist [6], and a disk-based columnstore [5]. In early versions of S2DB, users had to choose either storage type on a per-table basis according to the workload characteristics. Unified table storage (described in section 4) combines both formats internally to support OLAP and OLTP workloads using a single storage design.

2.1.1 Rowstore storage
Each index in an S2DB in-memory rowstore table uses a lock-free skiplist to index the rows. A node in the skiplist corresponds to a row, and each node stores a linked list of versions of the row to implement multiversion concurrency control, so that readers don't need to wait on writers. Writes use pessimistic concurrency control, implemented using row locks stored on each skiplist node to handle concurrent writes to the same row. Each version of the row is stored as a fixed-size struct (variable-length fields are stored as pointers) according to the table schema, along with bookkeeping information such as the timestamp and the commit status of the version.
In addition to writing to the in-memory skiplists, write operations also write the affected rows to a log before committing. A log is created for each database partition, and it is persisted to disk and replicated to guarantee the durability of writes. On node restart, the state of each database partition is recovered by replaying the writes in the persisted log. A background process periodically creates a snapshot file containing the serialized state of the in-memory rowstore tables at a particular log position. This allows the recovery process to start replay from the latest snapshot's log position to limit recovery time.
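The version-chain idea behind this rowstore can be illustrated with a minimal sketch (ours, not S2DB code; the structures are simplified): each row keeps a newest-first list of versions, and a reader returns the newest committed version no newer than its read timestamp, so readers never block behind an uncommitted writer.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class RowVersion:
        value: dict                 # column values for this version
        commit_ts: Optional[int]    # None while the writing transaction is uncommitted

    @dataclass
    class SkiplistNode:             # one node per row; versions kept newest-first
        key: int
        versions: List[RowVersion] = field(default_factory=list)

    def read(node: SkiplistNode, read_ts: int) -> Optional[dict]:
        """Return the newest committed version visible at read_ts."""
        for v in node.versions:                       # newest first
            if v.commit_ts is not None and v.commit_ts <= read_ts:
                return v.value
        return None

    node = SkiplistNode(key=7, versions=[
        RowVersion({"qty": 5}, commit_ts=None),       # in-flight write, not yet visible
        RowVersion({"qty": 3}, commit_ts=100),
    ])
    assert read(node, read_ts=120) == {"qty": 3}      # the reader is not blocked by the writer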
2.1.2 Columnstore storage
The data in a columnstore table is organized into segments, where each segment stores a disjoint subset of rows as a set of data files on disk. Within a segment, each column is stored in the same row order but compressed separately. Common encodings like bit packing, dictionary encoding, run-length encoding, and LZ4 are supported for column compression. The same column can use a different encoding in each segment, optimized for the data specific to that segment. The segment metadata is stored in a durable in-memory rowstore table (described in section 2.1.1), containing information including the file locations, the encodings, and the min/max values for each column. Additionally, a bit vector is stored within the segment metadata to represent the deleted rows in the segment.
This representation is mostly optimized for OLAP; however, several considerations were made to speed up the point read operations commonly found in OLTP workloads. The column encodings are each implemented to be seekable, to allow efficient reads at a specific row offset without decoding all the rows. Storing min/max values allows segment elimination to be performed using in-memory metadata to skip fetching segments with no matching rows.
A sort key can be specified on each columnstore table to allow more efficient segment elimination. If specified, rows are fully sorted by the sort key within each segment. The sort order across segments is maintained similarly to LSM trees [8, 5] by building up sorted runs of segments. A background merger process merges the segments incrementally to maintain a logarithmic number of sorted runs.
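A minimal sketch of segment elimination as described above (illustrative only; the metadata layout and column names are assumptions): with in-memory min/max values per column, segments whose range cannot satisfy a filter are skipped without touching their data files.

    from dataclasses import dataclass

    @dataclass
    class SegmentMeta:
        files: list          # on-disk column files for this segment
        min_max: dict        # column -> (min, max), gathered when the segment is created

    def segments_to_scan(segments, column, lo, hi):
        """Skip segments whose [min, max] range cannot contain rows in [lo, hi]."""
        return [s for s in segments
                if not (s.min_max[column][1] < lo or s.min_max[column][0] > hi)]

    segments = [
        SegmentMeta(files=["seg1.col"], min_max={"ts": (0, 999)}),
        SegmentMeta(files=["seg2.col"], min_max={"ts": (1000, 1999)}),
        SegmentMeta(files=["seg3.col"], min_max={"ts": (2000, 2999)}),
    ]
    # With data sorted on "ts", a range filter needs to fetch only one segment's files.
    assert segments_to_scan(segments, "ts", 1200, 1300) == [segments[1]]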
For each columnstore table, S2DB creates a rowstore table as a write-optimized store, to hold small writes and avoid creating small sorted runs across many files. This corresponds to the level-0 storage in other LSM trees, like the MemTable in RocksDB [10]. A background flusher process periodically deletes rows from the rowstore and converts those rows into a columnstore segment in a single transaction. For read performance, this write-optimized store is kept small relative to the table size. Since the background merger and flusher processes can move rows between the rowstore and different segments, reads need to use partition-local snapshot isolation to guarantee a consistent view of the table.
Columnstore tables support vectorized execution [9] and, for some filter, group-by, and hash join operations, encoded execution [7]. S2DB's vectorized execution uses late materialization, only decoding columns if the data in them qualifies based on filters on other columns. Encoded execution can achieve large speedups on filtering and aggregation operations by operating directly on compressed data and using SIMD instructions when appropriate.

3 Separation of storage and compute
S2DB can run with and without access to a blob store for separated storage. When running without access to blob storage, S2DB behaves like a typical shared-nothing distributed database, where the nodes in the cluster are the source of truth. When running with a blob store, S2DB is atypical in that it doesn't store all persistent data on the blob store with only transient data on local storage. Instead, newly written data is persisted only on the local storage of the cluster and later moved to the blob store asynchronously. This design allows S2DB to commit transactions without the latency penalty of needing to write all the transaction data to blob storage to make it durable. By treating blob storage truly as cold storage, S2DB is able to support low-latency writes while still getting many of the benefits of separated storage (faster provisioning and scaling, storing datasets bigger than local disk, cheaper historical storage for point-in-time restores, etc.). So, in order to describe how S2DB separates storage, it's important to understand how S2DB's local, integrated durability and compute function, as that local durability mechanism works in tandem with blob storage to maintain the durability of committed transactions.
Durability is managed by the cluster on each partition using replication. The in-cluster replication is fast, and log pages can be replicated out of order and replicated early, without waiting for transaction commit. Replicating out of order allows small transactions to commit without waiting for big transactions, guaranteeing that commits have low and predictable latency. By default, data is considered committed when it is replicated in memory to at least one replica partition for every master partition involved in a transaction. This means loss of a single node will never lose data, and if replication is configured across availability zones, loss of an entire availability zone will never lose data. If all copies of a partition are lost due to concurrent node failures before any new HA replicas are provisioned, recently written data that was only present in memory will be lost, but any data synced to local disk is recoverable as long as the local disk survived the failure (e.g., a database process crash). While SingleStore supports synchronously committing to local disk as well, this tradeoff often doesn't make sense in cloud environments, where loss of a host often implies loss of the local storage attached to that host. For this reason, S2DB doesn't synchronously commit transactions to local disk by default.
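The default commit rule described above can be summarized by the following sketch (illustrative only, not S2DB source; the data structures are assumptions): a commit is acknowledged once every master partition the transaction touched has replicated the relevant log pages in memory to at least one replica, with no blob store write and no local fsync on the commit path.

    def can_acknowledge_commit(involved_partitions, replica_acks):
        """
        Sketch of the default commit rule from this section: a transaction is
        considered committed once, for every master partition it touched, at
        least one replica has received the transaction's log pages in memory.
        """
        return all(replica_acks.get(p, 0) >= 1 for p in involved_partitions)

    # The transaction touched partitions 3 and 7; each has an in-memory replica ack.
    assert can_acknowledge_commit([3, 7], {3: 1, 7: 2})
    # Partition 7's replicas have not acked yet, so the commit must wait.
    assert not can_acknowledge_commit([3, 7], {3: 1})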
Section 2 showed that the data for a partition of a columnstore table is stored in an LSM tree, where the top level is an in-memory rowstore and the lower levels are HTAP-optimized columnstore data files. To explain how the integrated durability mechanism works with the table storage, figure 1 presents the database state after a few example write operations. The bottom of figure 1 shows the persisted structures stored on disk, including the log and the physical columnstore data files, while the top shows the corresponding in-memory state at each step. Figure 1(a) shows two insert transactions made durable in the log and the rows inserted into the in-memory rowstore.

Figure 1: Example of a sequence of writes in S2DB, showing the state of the in-memory and persisted structures in each step. (a) Inserting rows 1, 2, 3 in two transactions. (b) Converting in-memory rows 1, 2, 3 to segment 1. (c) Deleting row 2 from segment 1.
When enough rows are amassed in the in-memory rowstore, they are converted into a columnstore segment by the flushing process described in section 2.1.2. As illustrated in figure 1(b), the flushing process creates data files to store the rows in a segment and deletes the converted rows from the in-memory rowstore in the same transaction. Each data file is named after the log page at which it was created, so that data files can be considered as logically existing in the log stream while physically being separate files. For larger write transactions that involve the creation of multiple data files, each file is replicated as soon as it is written on the master, without needing to wait for the transaction to commit.
Data files are immutable: to delete a row from a segment, only the segment metadata is updated to mark the row as deleted in the deleted bit vector. Figure 1(c) shows how the metadata change is logged for a delete (omitting for simplicity the move transaction described in section 4.2). The key observation is that the data file itself is immutable, but the metadata changes are logged, which lends itself well to the separation of storage and compute.
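The following sketch (ours, not S2DB code; the structures are simplified) illustrates this point: a delete leaves the column data files untouched and only flips a bit in the logged segment metadata.

    from dataclasses import dataclass, field

    @dataclass
    class Segment:
        name: str
        data_files: tuple                              # immutable once written
        deleted: list = field(default_factory=list)    # per-row bits, kept in segment metadata

    def delete_row(segment, row_offset, log):
        """Mark a row deleted: only the metadata changes, and only that change is
        logged; the column data files themselves are never rewritten."""
        segment.deleted[row_offset] = 1
        log.append(("set_deleted_bit", segment.name, row_offset))

    seg = Segment("segment_1", data_files=("col_a.bin", "col_b.bin"), deleted=[0, 0, 0])
    wal = []
    delete_row(seg, 1, wal)   # analogous to figure 1(c): delete the second row of segment 1
    assert seg.deleted == [0, 1, 0] and seg.data_files == ("col_a.bin", "col_b.bin")
    assert wal == [("set_deleted_bit", "segment_1", 1)]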
needed. This design allows new replicas to be provisioned
quickly, as they don’t need to download all data files before
they can start acknowledging transactions or servicing read
queries. This fast provisioning process allows pools of
compute called workspaces to be created over the same set of
databases as shown in Figure 2. These are discussed in more
detail in section 3.2.
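As forward-referenced from the log-chunk bullet above, the following simplified sketch (illustrative only; the function name and the chunk representation are assumptions) captures the staging policy: committed data files are pushed to blob storage eagerly, while log chunks are pushed only once they fall entirely below the fully durable and replicated log position.

    def files_to_upload(new_data_files, log_chunks, durable_replicated_lp):
        """
        Decide what to push to blob storage: committed columnstore data files are
        uploaded as soon as possible, while log chunks are uploaded only when they
        lie entirely below the fully durable and replicated log position. The tail
        of the log above that position stays only on local disk/memory.
        """
        uploads = list(new_data_files)
        uploads += [c for c in log_chunks if c["end_lp"] <= durable_replicated_lp]
        return uploads

    data_files = ["segment_42.col"]
    log_chunks = [{"start_lp": 0, "end_lp": 100}, {"start_lp": 100, "end_lp": 180}]
    # With the durable/replicated position at 150, only the first log chunk is eligible.
    assert files_to_upload(data_files, log_chunks, durable_replicated_lp=150) == \
        ["segment_42.col", {"start_lp": 0, "end_lp": 100}]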
The biggest advantage of this design compared to the designs used by cloud data warehouses is that blob store writes are not needed to commit a transaction, so write latency is low and predictable. Since data files are uploaded to the blob store asynchronously and the working set of recently used data files is kept cached on local disks, short periods of unavailability in the blob store don't affect the steady-state workload, as long as reads happen within the cached working set. This allows S2DB to not be limited by the availability guarantee of the underlying blob store, which is usually much weaker than its durability guarantee. For example, Amazon S3 guarantees 11 nines of durability but only 3 nines of availability [45]. S2DB's design contrasts with many cloud data warehouses that need data to be written to the blob store for it to be considered durable [26, 27, 30]. On the other hand, persisting to local disk and then moving the data to blob storage asynchronously does have some drawbacks compared to keeping all persistent data in the blob store. Our approach is more complex, as keeping persistent state on local disk/memory requires a local high-performance replication and recovery protocol. It's also less elastic: adding or removing hosts requires carefully moving the local data not yet in blob storage to maintain durability and availability guarantees. In the event of multiple concurrent failures, for instance the loss of all nodes which have a copy of a single partition, committed data could be lost. This can be mitigated by having sync replicas spread across multiple availability zones, so that losing data would require concurrent failures across availability zones or the loss of an entire region.
Cloud operational databases such as Amazon Aurora [28] don't use blob storage for system-of-record data at all, instead using their own separated storage or log services to make data durable and available. Blob storage is only used for backups. As a result, the maximum database size that Aurora can support is limited by what its storage service can support, currently 128 TB (it is not unlimited). It also means the expense associated with storing and accessing data is higher (Aurora storage is about 4 times as expensive as S3). This trade-off makes sense for a database targeting OLTP workloads, as the data sizes they deal with are typically smaller, and prioritizing efficient availability and durability features is more important for OLTP.
3.2 Capabilities Enabled by Separated Storage
S2DB's separated storage design gives it many of the benefits expected of systems using shared remote storage. Even though S2DB's data separation relies on storing the tail of the log on local disk/memory, the typical capabilities of having the bulk of the database's data on remote storage still apply. Some example capabilities enabled by remote storage are:
• S2DB uses faster ephemeral SSDs for local storage instead of the more expensive and slower network block storage (EBS) often used by other cloud databases [21]. Most of the data stored on local disks is cached, frequently accessed data that is persisted in the blob store; the local disks are not responsible for persisting this data. The local disks are only responsible for persisting the tail of the log not yet in blob storage, as described in sections 3 and 3.1. Ephemeral SSDs can have multiple orders of magnitude higher IOPS than network block storage, depending on the particular disks used, so this is a considerable performance boost for workloads doing a lot of concurrent reads and writes.
• S2DB can keep months of history, since it is cheap to store data at rest in blob storage. This history is used by a point-in-time restore (PITR) command to restore a database back to the state it was in at a given time in the past without the need to have taken an explicit backup at that time. The blob store acts as a continuous backup of the database. A PITR to a target time in the past runs by inspecting the versioning metadata stored in the log files in blob storage to find a transactionally consistent point in the log (called LP) for each partition that maps as closely as possible to the given PITR target wall-clock time. It then drops the existing local state of the database and does a restore up to the log position LP for each partition, in the same fashion as when recovering data from blob storage on a process restart. That is, it fetches and replays the data from the latest snapshot file before LP in the log stream and then fetches and replays any logs after the snapshot until LP is reached (a sketch of this flow follows the list). Note that today S2DB doesn't support querying at a specific point in time, sometimes called time travel querying; only a full restore of the database state via PITR is supported.
• S2DB supports the creation of read-only workspaces, which are a set of hosts that replicate recently written data asynchronously from the primary writable workspace but don't participate in acking commits for durability, as shown in figure 2. Read-only workspaces can be used to scale out the cluster on demand to handle an increase in read query concurrency by directing some of the read workload to the new workspace. Depending on the number of hosts in the workspace and the amount of data being actively queried, they can often be created within minutes. They also create an isolated environment to run heavy analytical workloads without impacting a more mission-critical read/write workload running on the primary workspace. Data files other than the recently written ones that are replicated to the workspace are read from the blob store directly rather than from the primary workspace, so that each workspace can cache its own set of data independently.
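As referenced from the PITR bullet above, the following sketch (illustrative only; the metadata structures are assumptions) outlines the restore planning step: for each partition, pick the latest transactionally consistent log position no later than the target time, then replay the latest earlier snapshot plus the log up to that position.

    def pitr_plan(partitions_metadata, target_time):
        """
        For each partition, choose the latest transactionally consistent log
        position (LP) whose wall-clock timestamp does not exceed the PITR target,
        then plan the restore as: replay the latest snapshot taken before LP,
        then replay the log from the snapshot's position up to LP.
        """
        plan = {}
        for part, meta in partitions_metadata.items():
            lp = max(p for p, ts in meta["consistent_points"] if ts <= target_time)
            snap = max(s for s in meta["snapshot_positions"] if s <= lp)
            plan[part] = {"replay_snapshot_at": snap, "replay_log_until": lp}
        return plan

    meta = {
        0: {"consistent_points": [(50, 10.0), (90, 20.0), (140, 30.0)],
            "snapshot_positions": [0, 80]},
    }
    assert pitr_plan(meta, target_time=25.0)[0] == \
        {"replay_snapshot_at": 80, "replay_log_until": 90}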
In conclusion, S2DB's design for separation of storage and compute gives it many of the durability and elasticity benefits of traditional cloud data warehouse designs: specifically, flexible options for pausing, resuming, and scaling compute, as well as access to practically unlimited durable storage that scales independently of compute and that can be used for consistent point-in-time restores. However, S2DB's integrated durability and compute means it does not sacrifice write latency, making our storage design suitable for both analytical and transactional workloads.

4 Unified table storage
As a storage engine built for HTAP, S2DB table storage needs to work well across a wide range of workloads. In many situations, we observed that choosing between rowstore and columnstore storage formats was a hard decision for users. It required users to identify whether the access patterns of each table lean OLTP or OLAP, creating friction when developing applications. This was especially true for workloads having both OLTP and OLAP aspects on the same tables, like real-time analytics use cases running analytics concurrently with high-concurrency point reads and writes. Therefore, we designed unified table storage with the foremost goal of providing a unified table type that works well for both OLTP and OLAP access. This eliminates the burden on users to choose the data layout suitable for their particular workload. Furthermore, it allows demanding HTAP workloads to work efficiently without the complexity of managing data movement across tables serving different parts of the workload.
To quantify some of the benefits we wanted to achieve with unified table storage, we sought to allow customers who were using a UNION ALL view of a rowstore (storing recent data for uniqueness enforcement) and a columnstore (storing older data for efficient analytics) to just use one unified storage table. This replaces three DDL statements with just one, and eliminates application code to move data from rowstore to columnstore as it ages. Analytical query performance also improves, often by several
S2DB stores only the hashes, not the column values, in the global hash tables. The column values are instead stored in the per-segment inverted indexes. This reduces the write cost significantly in cases with wide columns (e.g., when indexing on strings), since the per-segment inverted indexes don't go through the merges on the global index. Furthermore, the global index only stores information about the unique values in each segment, so its write cost is minimal when the indexed column contains only a few distinct values.
Compared with the per-segment filtering structure approach, this implementation has a significant advantage on point reads, since the number of lookups required is O(log(N)) (checking each hash table in the global index) instead of O(N) (checking the index or bloom filter per segment). The drawback is an extra O(log(N)) factor in the write cost, which we found to be an acceptable tradeoff for efficient index lookups.
Compared with the external index structure approach, this implementation has an advantage on reads, since it avoids the cost of performing an LSM tree lookup per matched row to find the row in the primary LSM tree storage. The main difference here is that this implementation stores the physical row offsets instead of the primary key value. This advantage is particularly significant when there are many matched rows for the same secondary key value. The drawback is extra write cost when merging happens on the primary LSM tree, since merges change the physical row offsets, which then creates a new hash table in the global index. On the other hand, this extra write cost is minimal when the primary LSM tree rarely needs to perform merges, e.g., when the table has no sort key or the rows are inserted in sort key order.
The per-segment inverted index in this design allows the simultaneous use of multiple indexes when filtering on a boolean expression over multiple indexed columns. Lookup results from different indexes can be combined efficiently by merging the postings lists [43], so that only the exact set of rows passing all index filters gets scanned. S2DB's postings list format supports forward seeking, so that sections in a long postings list can be skipped during the merge if the postings lists from the other indexes already guarantee that no match is present in that section.
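A minimal sketch of postings-list merging with forward seeking (illustrative only; this is not S2DB's on-disk postings format): sorted lists of row offsets are intersected, and each list is advanced with a seek to the current candidate offset rather than one entry at a time, so long non-matching sections are skipped.

    from bisect import bisect_left

    def intersect_postings(lists):
        """Intersect sorted postings lists of row offsets, seeking forward past
        sections that cannot contain a match instead of advancing entry by entry."""
        if not lists:
            return []
        result = []
        cursors = [0] * len(lists)
        candidate = lists[0][0] if lists[0] else None
        while candidate is not None:
            for i, plist in enumerate(lists):
                # Seek forward to the first offset >= candidate (skips whole sections).
                cursors[i] = bisect_left(plist, candidate, cursors[i])
                if cursors[i] == len(plist):
                    return result
                if plist[cursors[i]] > candidate:
                    candidate = plist[cursors[i]]   # restart the round with a larger candidate
                    break
            else:
                result.append(candidate)            # present in every postings list
                candidate += 1
        return result

    # Row offsets matching col_a = 'x' intersected with those matching col_b = 'y':
    assert intersect_postings([[2, 8, 9, 40], [8, 9, 10, 40, 77]]) == [8, 9, 40]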
4.1.1 Multi-column secondary index
To support multi-column secondary indexes while minimizing storage costs, S2DB builds a secondary index for each indexed column and allows the single-column indexes to be shared across multiple indexes referring to the same columns. For example, a secondary index on columns (a, b, c) builds the following data structures:
1. Per-segment inverted indexes on each of the columns a, b, c.
2. Global indexes on each of the columns a, b, c.
3. A global index on the tuple of indexed columns (a, b, c), mapping from the hash of each tuple (value_a, value_b, value_c) to the starting locations of the corresponding per-column postings lists for value_a, value_b, and value_c.
The per-column data structures (1) and (2) are the same as for single-column secondary indexes. Building inverted indexes per column allows S2DB to answer queries on any subset of the indexed columns by merging the postings lists from the individual per-column indexes.
Since the most selective filtering on a multi-column index happens on queries filtering on all indexed columns, we build the extra global index (3) to speed up those queries by skipping segments without a row matching all indexed columns. This is also important for uniqueness enforcement (section 4.1.2), since the unique key check by definition always matches all columns in the unique key.
The need to merge per-column postings lists in this design introduces a higher index lookup cost compared to the alternative design of building a postings list specific to each unique tuple of indexed columns. This difference is more significant if each individual column has a large number of matches, since the merging cost increases with the length of the postings lists, while the seeking cost remains constant. Despite the higher index lookup cost in non-selective cases, we believe that the flexibility of filtering on a partial index match is more important. Note that the total cost of the read also includes the cost of decoding the matched rows, which is similarly proportional to the number of matched rows and often outweighs the index lookup cost in non-selective cases.

4.1.2 Uniqueness enforcement
Most columnstore implementations don't support the enforcement of uniqueness constraints. For the few that do, it's usually done by either making the LSM tree sort on the unique key [16] or duplicating the data into a rowstore table or index [14]. Using the secondary index structure described above, the S2DB columnstore supports uniqueness constraint enforcement without forcing the sort key to be the unique key columns or duplicating the data. The idea is simple: each newly inserted row checks the secondary index for duplicates before being inserted into the table. As an optimization, each batch of ingested rows is checked together to amortize the metadata access cost of the global indexes. The following procedure is used:
1. Take locks on the unique key values for each row in the batch. An in-memory lock manager is used here to avoid concurrent inserts of the same unique key value.
2. Perform secondary index lookups on the unique key values.
3. When there are duplicate values, depending on the user-specified unique-key handling option, either report an error (default), skip the new row (SKIP DUPLICATE KEY ERRORS option), delete and then replace the conflicting rows (REPLACE command), or update the conflicting rows (ON DUPLICATE KEY UPDATE option).
In the typical case, when there is no duplicate value found during the secondary index lookup, the lookup only needs to access the global hash tables (and, rarely, the per-segment inverted indexes on a hash collision). When there are duplicates, the data segments need to be accessed at the row offsets matched by the index in the REPLACE and ON DUPLICATE KEY UPDATE cases.
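The procedure above can be sketched as follows (a simplified, single-process illustration, not S2DB source; the lock manager and index are in-memory stand-ins):

    import threading

    class UniqueEnforcer:
        """Simplified sketch of the batched unique-key check from section 4.1.2."""
        def __init__(self):
            self.lock = threading.Lock()        # stand-in for the in-memory lock manager
            self.index = {}                     # unique key value -> row location

        def insert_batch(self, rows, key_col, on_duplicate="ERROR"):
            with self.lock:                     # step 1: lock the unique key values
                inserted = []
                for row in rows:
                    key = row[key_col]
                    if key in self.index:       # step 2: secondary index lookup
                        # step 3: apply the user-specified duplicate-handling option
                        if on_duplicate == "ERROR":
                            raise ValueError(f"duplicate key {key!r}")
                        if on_duplicate == "SKIP":
                            continue
                        # REPLACE: fall through and overwrite the stored location
                    self.index[key] = row
                    inserted.append(row)
                return inserted

    t = UniqueEnforcer()
    t.insert_batch([{"id": 1}, {"id": 2}], key_col="id")
    assert t.insert_batch([{"id": 2}, {"id": 3}], key_col="id", on_duplicate="SKIP") == [{"id": 3}]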
4.2 Row-level locking
S2DB columnstore storage represents the deleted rows in a segment as a bit vector in the segment metadata. While this representation is optimized for vectorized access during analytical queries, a naive implementation would introduce a source of contention when modifying the segment metadata: a user transaction running update or delete operations would acquire the lock on the metadata row of a modified segment to install a new version of the deleted bit vector, blocking other modifications on the same segment (1 million rows) until the user transaction commits or rolls back. Furthermore, the background merging process described in section 2.1.2 runs segment merge transactions, which could also block modifications on the segments being merged.
Instead of the naive implementation of having update and delete queries update the bit vector directly, S2DB implements a row-level locking mechanism to avoid blocking during transactional workloads. Rows to be updated or deleted are first moved to the in-memory rowstore part of the table in an autonomous transaction, which we refer to as a "move transaction". Since moving the row doesn't change the logical table content, the move transaction can be committed immediately, so that the user transaction only needs to lock and modify the in-memory row. With this approach, the primary key of the in-memory rowstore acts as the lock manager, where inserting a copy of the row locks the row, preventing concurrent modifications. To ensure that the locked rows aren't modified before their copies are inserted, an extra scanning pass on newly created segments is performed after locking to find the latest versions of the locked rows.
Since a move transaction doesn't change the logical table content, it can be reordered with other move or segment merge transactions. Reordering move and segment merge transactions allows segment merges to happen without blocking update or delete queries. Furthermore, as an optimization, concurrent move transactions are combined and committed as a single transaction. To make sure that the deleted bits set by move transactions reflect the latest segment metadata, the commit process applies all segment merges between the scan timestamp and the commit timestamp of the move transaction to the deleted bits modified as part of the move.
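A simplified sketch of the move-transaction mechanism (ours, not S2DB code; the table object is a toy stand-in): the row is copied into the in-memory rowstore and its segment bit is flipped in a small, immediately committed transaction, after which the user transaction locks and modifies only the in-memory copy.

    from contextlib import contextmanager

    class MiniTable:
        """Toy stand-in for unified table storage: one segment plus an in-memory rowstore."""
        def __init__(self, segment_rows):
            self.segment = list(segment_rows)     # immutable columnstore rows
            self.deleted = [0] * len(segment_rows)
            self.rowstore = {}                    # pk -> row dict (holds the row lock)

        @contextmanager
        def autonomous_transaction(self):
            yield                                 # commits immediately in this sketch

        def update(self, row_offset, new_values):
            row = dict(self.segment[row_offset])
            # Autonomous "move transaction": copy the row to the rowstore and flip the
            # deleted bit; the table's logical content is unchanged, so it commits at once.
            with self.autonomous_transaction():
                self.rowstore[row["pk"]] = row    # inserting the pk acts as the row lock
                self.deleted[row_offset] = 1
            # The user transaction now locks and modifies only the in-memory copy.
            self.rowstore[row["pk"]].update(new_values)

    t = MiniTable([{"pk": 1, "qty": 5}, {"pk": 2, "qty": 7}])
    t.update(1, {"qty": 8})
    assert t.deleted == [0, 1] and t.rowstore[2]["qty"] == 8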
5 Adaptive query execution
Unified table storage supports multiple data access methods for transaction and analytical processing. Since hybrid workloads blur the boundary between transactional and analytical processing, for those workloads it becomes important for the query execution engine to combine different access methods and apply them in the optimal order. For example, a query may be able to use a secondary index for one filter and encoded execution for another, in which case the optimal order for applying those filters depends on the selectivity and the evaluation cost of each filter. Static decisions made by the query optimizer don't always work well for selecting data access methods, since the cost depends highly on the query parameters and the encodings used. Instead, S2DB adopts adaptive query execution to make data access decisions dynamically.
Data access on S2DB unified table storage has three high-level steps: (1) finding the list of segments to read, (2) running filters to find the rows to read from each segment, and (3) selectively decoding and outputting the rows. Each step outputs in a consistent format, which serves as the common interface used across the different data access methods. This section focuses on the first two steps to discuss how they incorporate dynamic decisions to work efficiently in HTAP workloads.

5.1 Segment skipping
Segments can be skipped using either the global secondary index structures or the min/max values stored in the segment metadata. The secondary index check is done first, because it only requires probing O(log(N)) times, and its result can reduce the number of segments to check for the min/max values. On the other hand, there can be multiple keys to look up when the index is used in cases like an IN-list or multiple filters connected by OR, which increases the index probing cost proportionally. Therefore, S2DB dynamically disables the use of a secondary index if the number of keys to look up is too high relative to the table size.
Using secondary indexes adaptively is important when running joins, since the number of keys used for join probing can vary widely. Instead of the typical representation of a nested loop join on the index, S2DB models a secondary index join as a "join index filter": similar to the bloom filters used in hash joins, it filters the larger table using the smaller of the joined tables. Compared to a bloom filter, the join index filter has no false positives, and it runs much faster (with a small joined table) by performing index probes instead of a table scan. This model allows the join index filter to be dynamically disabled, in which case the execution falls back to a hash join, scanning the larger table and probing the hash table built from the smaller table.
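The adaptive choice can be sketched as follows (illustrative only; the threshold and data structures are assumptions): when the smaller side of the join produces few keys relative to the size of the larger table, the secondary index is probed directly; otherwise the index filter is disabled and execution falls back to a scan that probes a hash table.

    def join_filter(large_table_index, large_table_rows, join_keys, max_probe_fraction=0.001):
        """
        'Join index filter' sketch: if the smaller side produces few distinct keys,
        probe the secondary index of the larger table; otherwise disable the index
        filter and fall back to a hash-join style scan of the larger table.
        """
        if len(join_keys) <= max_probe_fraction * len(large_table_rows):
            matches = []
            for key in join_keys:                       # index probes, no false positives
                matches.extend(large_table_index.get(key, []))
            return sorted(matches)
        key_set = set(join_keys)                        # hash-join fallback: scan and probe
        return [off for off, key in enumerate(large_table_rows) if key in key_set]

    rows = [i % 500 for i in range(100_000)]            # join column of the larger table
    index = {}
    for off, key in enumerate(rows):
        index.setdefault(key, []).append(off)

    assert join_filter(index, rows, join_keys=[3])[:2] == [3, 503]               # index path
    assert len(join_filter(index, rows, join_keys=list(range(400)))) == 80_000   # scan path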
5.2 Filtering
For each clause (e.g., col1 = val1) in the filter condition, there are up to four different ways to evaluate the filter, each with different tradeoffs:
Regular filter selectively decodes col1 for rows that passed previous filters, then executes the filter on the decoded values.
Encoded filter executes directly on the compressed values. For example, when dictionary encoding is used, it evaluates the filter on all possible values in the dictionary for col1, then looks up the results based on the dictionary index without decoding the column. Compared with a regular filter, this strategy is ideal with a small set of possible values, but it can be worse if the dictionary size is greater than the number of rows that passed the previous filters.
Group filter decodes all filtered columns and runs the entire filter condition, instead of running the filter clauses separately. Compared with a regular filter, running a group filter is better if most rows pass each individual filter clause, since it avoids the cost of combining results from individual clauses. On the other hand, a regular filter is better if some clauses can filter out and skip further filter evaluation on most of the rows.
Secondary index filter reads the postings list for val1 stored in the index to find the filtered row offsets. Using a secondary index is usually better than a regular filter. However, it can still be worse if the other clauses have already filtered the result down to a few rows.
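A minimal sketch of the encoded-filter strategy described above (illustrative only; the dictionary layout is an assumption): the predicate is evaluated once per distinct dictionary value, and rows are then answered by looking up their codes, without ever decoding the column.

    def encoded_filter(dictionary, codes, predicate):
        """
        Evaluate a filter directly on dictionary-encoded data: run the predicate
        once per distinct dictionary value, then map each row's code through the
        resulting bitmap. Rows are never decoded to their original values.
        """
        passes = [predicate(value) for value in dictionary]   # one check per distinct value
        return [row for row, code in enumerate(codes) if passes[code]]

    dictionary = ["ca", "ny", "wa"]            # distinct values stored once per segment
    codes = [0, 1, 1, 2, 0, 1]                 # per-row dictionary indexes
    assert encoded_filter(dictionary, codes, lambda v: v == "ny") == [1, 2, 5]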
To select the optimal evaluation strategy, S2DB costs each different method of filter evaluation by timing it on a small batch of data at the beginning of each segment. Doing the costing per segment ensures that the cost is aware of the data encoding and of the data's correlation with the sort key. For filters using an index, costing is done using the postings list size stored in the index, since there is no per-row evaluation cost beyond reading the postings list. Costing is skipped if the filter condition is a conjunction with a selective index filter, since costing in this case would be more expensive than running the filters on the rows output by the index.
Furthermore, S2DB dynamically reorders filter clauses using estimated per-row evaluation costs and filter selectivities. Consider a filtering condition A AND B with two clauses. Let cost(X) be the cost of evaluating clause X, and P(X) be its selectivity. It's better to evaluate A first if the following inequality holds:

cost(A) + P(A) * cost(B) ≤ cost(B) + P(B) * cost(A)

The above inequality is equivalent to the following (by dividing both sides by cost(A) * cost(B) and rearranging the terms):

(1 − P(B)) / cost(B) ≤ (1 − P(A)) / cost(A)

Therefore, the optimal evaluation order can be found by sorting the clauses in decreasing order of (1 − P(X)) / cost(X), under the assumption that the filter clauses are independent. Similar reordering can be done for clauses connected by OR by tracking the ratio of rows not selected by each filter clause instead of the selected rows. S2DB represents the filter condition as a tree and reorders each intermediate AND/OR node in the tree separately. The ordering decision is made per block, using the selectivities from previous blocks, to ensure that the selectivity estimates reflect the data distribution in the nearby blocks.
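The reordering rule can be sketched as follows (illustrative only; the cost and selectivity numbers are made up): clauses of a conjunction are sorted in decreasing order of (1 − P(X)) / cost(X), so cheap, selective clauses run first.

    def order_and_clauses(clauses):
        """
        Order the clauses of a conjunction using the rule derived above: evaluate
        clauses with the largest (1 - selectivity) / cost first, i.e. cheap clauses
        that eliminate many rows go before expensive, non-selective ones.
        """
        return sorted(clauses, key=lambda c: (1.0 - c["p"]) / c["cost"], reverse=True)

    clauses = [
        {"name": "regex_on_comment", "cost": 50.0, "p": 0.30},
        {"name": "status = 'open'",  "cost": 1.0,  "p": 0.10},
        {"name": "amount > 100",     "cost": 1.0,  "p": 0.60},
    ]
    assert [c["name"] for c in order_and_clauses(clauses)] == \
        ["status = 'open'", "amount > 100", "regex_on_comment"]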
6 Experimental Results
We used benchmarks derived from the industry-standard TPC-H and TPC-C benchmarks to evaluate S2DB against other leading cloud databases and data warehouses. The results below show that S2DB achieves leading-edge performance on both TPC-H, an OLAP benchmark, and TPC-C, an OLTP benchmark. We also ran the CH-benCHmark against S2DB, which runs a mixed workload derived from running TPC-C and TPC-H simultaneously.
We compared S2DB with three other products: two cloud data warehouses we refer to as CDW1 and CDW2, and a cloud operational database we refer to as CDB. As the results below show, S2DB had good performance and cost-performance on the analytical benchmark TPC-H compared to the cloud data warehouses, and on the transactional benchmark TPC-C compared to CDB. On the other hand, CDW1 and CDW2 only support data warehousing and cannot run TPC-C. CDB can run both benchmarks, but our results, as well as previous results [47], show it performs orders of magnitude worse than the cloud data warehouses on TPC-H.
We ran both benchmarks on S2DB's unified table storage described in section 4. That is, both benchmarks were run using the same underlying table storage (which is the default out-of-the-box configuration; we did not force rowstore or columnstore table storage). We used indexes, sort keys, and shard keys appropriate for each benchmark, and used similar features across all products where those options were available. The schemas, data loading commands, and queries used for testing are published online [17].
We used the schemas, data, and queries of the benchmarks as defined by the TPC. However, this is not an official TPC benchmark.
We compared the products on TPC-H at the 1 TB scale factor, and TPC-C at 1,000 warehouses. Note that we have previously published results on these benchmarks, as well as TPC-DS, at much larger scale factors [19, 50], demonstrating that S2DB scales well. Here, we chose to run TPC-H instead of TPC-DS because it required fewer modifications to run the same benchmark on CDW1 and CDW2.

Product | vCPU | Size (warehouses) | Throughput (tpmC) | Throughput (% of max)
CDB     | 32   | 1,000             | 12,582            | 97.8%
S2DB    | 32   | 1,000             | 12,556            | 97.7%
S2DB    | 256  | 10,000            | 121,432           | 94.4%
Table 1: TPC-C results (higher is better, up to the limit of 12.86 tpmC/warehouse)

Product | Cluster price per hour | TPC-H geomean (sec) | TPC-H geomean cost (cents) | TPC-H throughput (QPS)
S2DB    | $16.50                 | 8.57 s              | 3.92 ¢                     | 0.078
CDW1    | $16.00                 | 10.31 s             | 4.58 ¢                     | 0.069
CDW2    | $16.30                 | 10.06 s             | 4.55 ¢                     | 0.082
CDB     | $13.92                 | Did not finish within 24 hours
Table 2: Summary of TPC-H (1TB) results

We did the comparisons on Amazon EC2, on cluster sizes that were chosen to be as similar in price as possible. Information about the cluster configurations we used is in Tables 1 and 2.
For TPC-H, we measured the runtime of each query and computed the cost of each query by multiplying the runtime by the price per second of the product configuration. We then computed the geomean of the results across the queries. The results are shown in Table 2 and Figure 4. We performed one cold run of each query to allow for query compilation and data caching, and then measured the average runtime of multiple warm runs of each query, with result caching disabled. S2DB, CDW1, and CDW2 all have competitive performance. We also tested CDB on the largest available size ($13.92 per hour), and it performed orders of magnitude worse: most queries failed to complete within 1 hour, and running all the benchmark queries once failed to complete within 24 hours (compared to about 5 minutes for the cloud data warehouses).
For TPC-C, we compared S2DB against CDB, as shown in Table 1. Note that all S2DB results were on our columnar-based unified table storage, and they were competitive with CDB, which is a rowstore-based operational database. CDW1 and CDW2 do not support running TPC-C. We measured the throughput (tpmC) as defined by the TPC-C benchmark. We compared against results previously published by CDB. Note that TPC-C specifies a maximum possible tpmC of 12.86 per warehouse, and both S2DB and CDB essentially reach this maximum at 1,000 warehouses, with similarly priced clusters. We also tested S2DB on TPC-C at 10,000 warehouses, and it continues to scale linearly.
These results, summarized in Figure 5, demonstrate that S2DB's unified table storage is able to achieve state-of-the-art performance competitive with leading operational databases as well as analytical databases on benchmarks specific to each workload. In contrast, cloud operational databases like CDB have orders of magnitude worse performance on TPC-H, because of their use of a row-oriented storage format and single-host query execution for complex query operations; cloud warehouses like CDW1 and CDW2 are unable to support TPC-C, due to the lack of enforced unique constraints, granular locking, and efficient seeks under high concurrency. S2DB can meet workload requirements that previously required using multiple specialized database systems.
each other. Test case 4 introduces a read-only workspace with 2 leaves in it that is used to run AWs. This new workspace replicates the workload from the primary writable workspace that runs TWs, as described in section 3.1, effectively doubling the compute available to the cluster. This new configuration doesn't impact TWs throughput when compared to test case 1 without the read-only workspace. AWs throughput is dramatically improved versus case 3, where it shared a single workspace with TWs. This is not too surprising, as the AWs have their own dedicated compute resources in test case 4. The AWs QPS was impacted by ~20% compared to running the AWs workload without any TWs at all (test case 2), as S2DB needed to do some extra work to replicate the live TWs transactions in this case, which used up some CPU. Regarding the replication lag, the AWs workspace had on average less than 1 ms of lag, being only a handful of transactions behind the TWs workspace. Test case 5 was run with blob storage disabled (all data is stored on local disks), and the performance was very close to the equivalent test case with blob storage enabled (test case 4). This shows that asynchronously uploading to blob storage doesn't use up noticeable hardware resources.
Several established databases have secondary indexes that are less integrated into their columnstore than S2DB. For example, SQL Server requires a mapping index to map between secondary key rows and the position of the row in the clustered columnstore. S2DB stores offsets directly in its secondary keys, avoiding this indirection and improving the performance of secondary key lookups. SQL Server also needs to do secondary B-tree index maintenance during bulk loading. S2DB's secondary indexes are broken into segments similar to how columns are stored, and are merged in the background, which improves load performance by moving most of the index maintenance work out of the bulk data loading code path.
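A minimal Python sketch of the difference, using a toy two-segment columnstore; the structures are illustrative and are not SQL Server's or S2DB's actual index formats.

```python
# Columnstore data laid out in segments; a row is addressed by (segment, offset).
segments = {0: ["alice", "bob"], 1: ["carol", "dave"]}

# Style 1: secondary key -> logical row id, plus a mapping index row id -> position.
sec_index_rowid = {"bob": 101, "carol": 102}
mapping_index = {101: (0, 1), 102: (1, 0)}

def lookup_with_mapping(key):
    seg, off = mapping_index[sec_index_rowid[key]]  # two index probes
    return segments[seg][off]

# Style 2: secondary key -> (segment, offset) stored directly in the index entry.
sec_index_direct = {"bob": (0, 1), "carol": (1, 0)}

def lookup_direct(key):
    seg, off = sec_index_direct[key]                # one probe, then a seek
    return segments[seg][off]

assert lookup_with_mapping("carol") == lookup_direct("carol") == "carol"
```

The direct form removes one probe from every secondary key lookup at the cost of tying index entries to physical positions.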
TiDB [23] was initially built as a distributed, highly available rowstore and later added the capability to transparently replicate data into columnstore format to improve the performance of analytical queries. This design allows OLTP queries to target the rowstore and OLAP queries to target the columnstore replicas, at the cost of having to store the data twice in two different formats. Using OLAP replicas in this fashion has some limitations that S2DB's unified table design doesn't have. The replica design can't support both OLTP writes and OLAP reads within the same transaction because the OLTP writes won't be replicated to the OLAP store yet. Forcing all writes through an OLTP-optimized store also means TiDB is unable to gain the bulk data loading performance benefits of keeping the data only in highly compressed columnstore format. TiDB has separate storage and query nodes within a cluster, but it doesn't make use of blob storage as a shared disk, which means it misses out on the durability, elasticity and cost benefits mentioned in section 3.3.
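To illustrate the visibility gap, the toy Python model below captures the general pattern of asynchronous replication into a read-optimized replica; it is a sketch of the limitation described above, not TiDB's actual mechanism, and all names are hypothetical.

```python
# Toy model: writes go to a rowstore and are replicated to a columnstore
# replica asynchronously, so an OLAP read routed to the replica inside the
# same transaction cannot see them yet.
rowstore, columnstore_replica, replication_queue = {}, {}, []

def oltp_write(key, value):
    rowstore[key] = value
    replication_queue.append((key, value))  # applied later, asynchronously

def apply_replication():
    while replication_queue:
        key, value = replication_queue.pop(0)
        columnstore_replica[key] = value

def olap_read(key):
    return columnstore_replica.get(key)     # OLAP queries target the replica

oltp_write("order:1", "pending")
print(olap_read("order:1"))   # None: the write is not yet visible to OLAP
apply_replication()
print(olap_read("order:1"))   # "pending": visible only after replication
```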
Janus [48] also uses a write-optimized rowstore for OLTP and a read-optimized columnstore for OLAP, with a transactionally consistent data movement pipeline moving data from the rowstore to the columnstore. Janus is unique in that it allows the read- and write-optimized stores to have different partitioning schemes, and its data movement pipeline batches up transactions so they can be applied more efficiently on the columnstore. This design picks up most of the same limitations mentioned above for TiDB as far as requiring data to be stored in two different formats.

There are several databases built from scratch to support HTAP. HyPer [20, 25] supports a high performance in-memory hybrid rowstore and columnstore engine. It supports running OLAP queries on a snapshot of the OLTP data, but only operates on datasets that fit into main memory. SAP HANA [24] supports in-memory rowstores and in-memory columnstore tables among other storage engines. Developers choose which table type they prefer for each table. It also supports replicating data from a rowstore into a columnstore so the same data can be stored in both formats. This has similar disadvantages to the TiDB OLAP replica design mentioned above. Neither HyPer nor SAP HANA support using blob storage as a shared disk.

Most cloud data warehouses today use a blob store for persistent storage and keep only frequently queried data cached on the hosts they use to run queries. Snowflake [26], Redshift [27] and Databricks [30] all follow this pattern. These systems force new data to be written to blob storage before it is considered durable, which limits their ability to support low latency, high throughput transactional write workloads. These systems also don't support the fine-grained indexing and seekable compression schemes of S2DB's unified table storage that are needed to run low latency point queries for OLTP.
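A back-of-the-envelope sketch of why this matters for write latency, using assumed (not measured) latencies for a local log write and a blob PUT; the point is only the relative shape of the two commit paths.

```python
# Assumed latencies for illustration only.
LOCAL_LOG_WRITE_MS = 0.5   # local SSD log write
BLOB_PUT_MS = 50.0         # blob storage PUT round trip

def commit_latency_blob_first():
    # Durable only once the synchronous blob write finishes.
    return BLOB_PUT_MS

def commit_latency_local_first():
    # Commit on local durable storage; blob upload happens asynchronously,
    # off the commit path.
    return LOCAL_LOG_WRITE_MS

print(commit_latency_blob_first() / commit_latency_local_first())  # ~100x
```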
Wildfire [29] is a database that adds HTAP capabilities to Apache Spark. It commits transactions on local SSDs before converting the data into columnstore format and moving it to blob storage asynchronously. Unlike S2DB, it doesn't ship its transaction logs to blob storage, so it doesn't support point-in-time restores from blob storage. Wildfire also builds an LSM tree during data ingestion to allow index lookups and can spill this index to the blob store if needed. The performance of Wildfire on point updates and deletes was not evaluated. The WiSer project later added stronger transactional guarantees to Wildfire [38].

Procella [11] is a database system built by Google that powers the real-time analytical features of YouTube. It has many similar design goals to S2DB as far as support for low-latency selective queries over columnstore data via inverted indexes and seekable compression schemes. It also supports separated storage by making use of Google's internal blob storage service. Although Procella is designed for low latency analytics and streaming ingestion, it doesn't support OLTP. It has special APIs for data ingestion but no support for low latency point read and write queries with millisecond latencies.

8 Conclusion
S2DB was designed to handle transactional and analytical workloads with strong performance. Its use of blob storage enables the cost, durability and elasticity benefits of shared-disk databases such as cloud data warehouses without impacting its ability to run low-latency, high-throughput write transactions. S2DB only stores cold data in blob storage. It never writes to the blob store to commit transactions. S2DB's unified table storage uses a combination of an in-memory rowstore and an on-disk columnstore that supports secondary and unique keys via inverted indexes. This design has the fast scan performance of a traditional columnstore while enabling efficient point queries via indexing. The set of design trade-offs we have chosen has been validated by the successful use of S2DB for a varied set of workloads by our customers, often meeting application requirements that previously required using multiple specialized databases.

ACKNOWLEDGMENTS
We would like to thank the SingleStore engineering team for their efforts over the years in making the various ideas in this paper a reality. Their ingenuity and hard work are a key part of the success of S2DB. We also need to call out our customers who took the time to share their feedback, expectations, and use cases with us. They played a large role in molding the product into what it is today.

REFERENCES
[1] AWS Cloud Databases (2021). https://aws.amazon.com/products/databases/
[2] Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, USA, 2–11. DOI: https://doi.org/10.1109/ICDE.2005.1
[3] Amazon S3 (2021). https://aws.amazon.com/s3/
[4] Amazon EC2 (2021). https://aws.amazon.com/ec2
[5] A. Skidanov, A. J. Papito and A. Prout, "A column store engine for real-time streaming analytics," 2016 IEEE 32nd International Conference on Data Engineering (ICDE), 2016, pp. 1287-1297, doi: 10.1109/ICDE.2016.7498332.
[6] A. Prout, The Story Behind SingleStore's Skiplist Indexes (2019). https://www.singlestore.com/blog/what-is-skiplist-why-skiplist-index-for-memsql/
[7] Michal Nowakiewicz, Eric Boutin, Eric Hanson, Robert Walzer, and Akash Katipally. 2018. BIPie: Fast Selection and Aggregation on Encoded Data using Operator Specialization. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 1447–1459. DOI: https://doi.org/10.1145/3183713.3190658
[8] O'Neil, P., Cheng, E., Gawlick, D. et al. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 351–385 (1996). https://doi.org/10.1007/s002360050048
[9] Peter A. Boncz, Marcin Zukowski, and Niels Nes. MonetDB/X100: Hyper-Pipelining Query Execution. Proc. of the 2005 CIDR Conf.
[10] Dong, S., Callaghan, M.D., Galanis, L., Borthakur, D., Savor, T., & Strum, M. (2017). Optimizing Space Amplification in RocksDB. CIDR.
[11] Chattopadhyay, B., Dutta, P., Liu, W., Tinn, O., McCormick, A., Mokashi, A., Harvey, P., Gonzalez, H., Lomax, D., Mittal, S., Ebenstein, R., Mikhaylin, N., Lee, H., Zhao, X., Xu, T., Perez, L., Shahmohammadi, F., Bui, T., Mckay, N., Aya, S., Lychagina, V., & Elliott, B. (2019). Procella: Unifying serving and analytical data at YouTube. Proc. VLDB Endow., 12, 2022-2034.
[12] Luo, C., & Carey, M.J. (2019). LSM-based storage techniques: a survey. The VLDB Journal, 29, 393-418.
[13] Indexing with SSTable attached secondary indexes (SASI). https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useSASIIndexConcept.html
[14] P. Larson, A. Birka, E. N. Hanson, W. Huang, M. Nowakiewicz, and V. Papadimos. Real-Time Analytical Processing with SQL Server. PVLDB, 8(12):1740–1751, 2015.
[15] InnoDB Clustered and Secondary Indexes. https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html
[16] Lipcon, Todd et al. "Kudu: Storage for Fast Analytics on Fast Data." (2016).
[17] Benchmarking code. https://github.com/memsql/benchmarks-tpc
[18] DB-Engines Ranking. https://db-engines.com/en/ranking
[19] SingleStore Unofficial TPC Benchmarking. https://www.singlestore.com/blog/memsql-tpc-benchmarks/
[20] Kemper, A., & Neumann, T. (2011). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. 2011 IEEE 27th International Conference on Data Engineering, 195-206.
[21] Amazon RDS DB instance storage. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html
[22] T. Lahiri, S. Chavan, M. Colgan, D. Das, A. Ganesh, et al. Oracle Database In-Memory: A dual format in-memory database. In ICDE, pages 1253–1258. IEEE Computer Society, 2015.
[23] Huang, D., Liu, Q., Cui, Q., Fang, Z., Ma, X., Xu, F., Shen, L., Tang, L., Zhou, Y., Huang, M., Wei, W., Liu, C., Zhang, J., Li, J., Wu, X., Song, L., Sun, R., Yu, S., Zhao, L., Cameron, N., Pei, L., & Tang, X. (2020). TiDB: A Raft-based HTAP Database. Proc. VLDB Endow., 13, 3072-3084.
[24] J. Lee, S. Moon, K. H. Kim, D. H. Kim, S. K. Cha, W. Han, C. G. Park, H. J. Na, and J. Lee. Parallel Replication across Formats in SAP HANA for Scaling Out Mixed OLTP/OLAP Workloads. PVLDB, 10(12):1598–1609, 2017.
[25] Lang, H., Mühlbauer, T., Funke, F., Boncz, P.A., Neumann, T., & Kemper, A. (2016). Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation. Proceedings of the 2016 International Conference on Management of Data.
[26] Dageville, B., Cruanes, T., Zukowski, M., Antonov, V.N., Avanes, A., Bock, J., Claybaugh, J., Engovatov, D., Hentschel, M., Huang, J., Lee, A.W., Motivala, A., Munir, A., Pelley, S., Povinec, P., Rahn, G., Triantafyllis, S., & Unterbrunner, P. (2016). The Snowflake Elastic Data Warehouse. Proceedings of the 2016 International Conference on Management of Data.
[27] Ippokratis Pandis: The evolution of Amazon Redshift. Proc. VLDB Endow. 14(12): 3162-3163 (2021)
[28] Verbitski, A., Gupta, A., Saha, D., Brahmadesam, M., Gupta, K.K., Mittal, R., Krishnamurthy, S., Maurice, S., Kharatishvili, T., & Bao, X. (2017). Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. Proceedings of the 2017 ACM International Conference on Management of Data.
[29] Shekar, K. & Bhoomeshwar, B. (2020). Evolving Database for New Generation Big Data Applications. 10.1007/978-981-15-1632-0_26.
[30] Armbrust, M., Das, T., Paranjpye, S., Xin, R., Zhu, S., Ghodsi, A., Yavuz, B., Murthy, M., Torres, J., Sun, L., Boncz, P.A., Mokhtar, M., Hovell, H.V., Ionescu, A., Luszczak, A., Switakowski, M., Ueshin, T., Li, X., Szafranski, M., Senster, P., & Zaharia, M. (2020). Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proceedings of the VLDB Endowment, 13, 3411-3424.
[31] Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E.J., O'Neil, P.E., Rasin, A., Tran, N., & Zdonik, S.B. (2005). C-Store: A Column-oriented DBMS. VLDB.
[32] Optimizing Schema Design for Cloud Spanner. https://cloud.google.com/spanner/docs/whitepapers/optimizing-schema-design
[33] Avinash Lakshman, Prashant Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Oper. Syst. Rev. 44(2): 35-40 (2010)
[34] Chang, F.W., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., & Gruber, R.E. (2008). Bigtable: A Distributed Storage System for Structured Data. TOCS.
[35] Stonebraker, M. (1985). The Case for Shared Nothing. IEEE Database Eng. Bull.
[36] WiredTiger: Schema, Columns, Column Groups, Indices and Projections. https://source.wiredtiger.com/2.5.2/schema.html
[37] Luo, C., & Carey, M.J. (2019). LSM-based storage techniques: a survey. The VLDB Journal, 29, 393-418.
[38] Barber, Ronald & Garcia-Arellano, Christian & Grosman, Ronen & Lohman, Guy & Mohan, C. & Mueller, Rene & Pirahesh, Hamid & Raman, Vijayshankar & Sidle, Richard & Storm, Adam & Tian, Yuanyuan & Tozun, Pinar & Wu, Yingjun. (2019). WiSer: A Highly Available HTAP DBMS for IoT Applications.
[39] E. Hanson, SingleStore's Patented Universal Storage, 2021. https://www.singlestore.com/blog/singlestore-universal-storage-episode-4/
[40] Lattner, C., & Adve, V.S. (2004). LLVM: a compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization, 2004. CGO 2004, 75-86.
[41] Neumann, T. (2011). Efficiently Compiling Efficient Query Plans for Modern Hardware. Proc. VLDB Endow., 4, 539-550.
[42] Özcan, F., Tian, Y., & Tözün, P. (2017). Hybrid Transactional/Analytical Processing: A Survey. Proceedings of the 2017 ACM International Conference on Management of Data.
[43] Sanders, P., & Transier, F. (2007). Intersection in Integer Inverted Indices. ALENEX.
[44] Amazon Elastic Block Store (EBS). https://aws.amazon.com/ebs/
[45] Amazon S3 FAQs. https://aws.amazon.com/s3/faqs/
[46] Chen, Jack & Jindel, Samir & Walzer, Robert & Sen, Rajkumar & Jimsheleishvilli, Nika & Andrews, Michael. (2016). The MemSQL query optimizer: a modern optimizer for real-time analytics in a distributed database. Proceedings of the VLDB Endowment. 9. 1401-1412. 10.14778/3007263.3007277.
[47] Performance comparison of HeatWave with Snowflake, Amazon Redshift, Amazon Aurora, and Amazon RDS for MySQL. https://www.oracle.com/mysql/heatwave/performance
[48] Arora, Vaibhav & Nawab, Faisal & Agrawal, Divyakant & Abbadi, Amr. (2017). Janus: A Hybrid Scalable Multi-Representation Cloud Datastore. IEEE Transactions on Knowledge and Data Engineering. PP. 1-1. 10.1109/TKDE.2017.2773607
[49] Eric Boutin, How Careful Engineering Led to Processing Over a Trillion Rows Per Second (2018). https://www.singlestore.com/blog/how-to-process-trillion-rows-per-second-ad-hoc-analytic-queries/
[50] G. LaLonde, J. Cheng, S. Wang, TPC Benchmarking Results (2021). https://www.singlestore.com/blog/tpc-benchmarking-results/