POLARIS: The Distributed SQL Engine in Azure Synapse

Josep Aguilar-Saborit, Raghu Ramakrishnan, Krish Srinivasan


Kevin Bocksrocker, Ioannis Alagiannis, Mahadevan Sankara, Moe Shafiei
Jose Blakeley, Girish Dasarathy, Sumeet Dash, Lazar Davidovic, Maja Damjanic, Slobodan Djunic, Nemanja Djurkic, Charles Feddersen, Cesar
Galindo-Legaria, Alan Halverson, Milana Kovacevic, Nikola Kicovic, Goran Lukic, Djordje Maksimovic, Ana Manic, Nikola Markovic, Bosko Mihic,
Ugljesa Milic, Marko Milojevic, Tapas Nayak, Milan Potocnik, Milos Radic, Bozidar Radivojevic, Srikumar Rangarajan, Milan Ruzic, Milan Simic,
Marko Sosic, Igor Stanko, Maja Stikic, Sasa Stanojkov, Vukasin Stefanovic, Milos Sukovic, Aleksandar Tomic, Dragan Tomic, Steve Toscano,
Djordje Trifunovic, Veljko Vasic, Tomer Verona, Aleksandar Vujic, Nikola Vujic, Marko Vukovic, Marko Zivanovic
Microsoft Corp

ABSTRACT

In this paper, we describe the Polaris distributed SQL query engine in Azure Synapse. It is the result of a multi-year project to re-architect the query processing framework in the SQL DW parallel data warehouse service, and addresses two main goals: (i) converge data warehousing and big data workloads, and (ii) separate compute and state for cloud-native execution.

From a customer perspective, these goals translate into many useful features, including the ability to resize live workloads, deliver predictable performance at scale, and to efficiently handle both relational and unstructured data. Achieving these goals required many innovations, including a novel "cell" data abstraction, and flexible, fine-grained task monitoring and scheduling capable of handling partial query restarts and PB-scale execution. Most importantly, while we develop a completely new scale-out framework, it is fully compatible with T-SQL and leverages decades of investment in the SQL Server single-node runtime and query optimizer. The scalability of the system is highlighted by a 1PB scale run of all 22 TPC-H queries; to our knowledge, this is the first reported run with scale larger than 100TB.

PVLDB Reference Format:
Josep Aguilar-Saborit, Raghu Ramakrishnan, et al. POLARIS: The Distributed SQL Engine in Azure Synapse. PVLDB, 13(12): 3204-3216, 2020.
DOI: https://doi.org/10.14778/3415478.3415545

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. 12, ISSN 2150-8097.
DOI: https://doi.org/10.14778/3415478.3415545

1. INTRODUCTION
Relational data warehousing has long been the enterprise approach to data analytics, in conjunction with multi-dimensional business-intelligence (BI) tools such as Power BI and Tableau. The recent explosion in the number and diversity of data sources, together with the interest in machine learning, real-time analytics and other advanced capabilities, has made it necessary to extend traditional relational DBMS based warehouses. In contrast to the traditional approach of carefully curating data to conform to standard enterprise schemas and semantics, data lakes focus on rapidly ingesting data from many sources and give users flexible analytic tools to handle the resulting data heterogeneity and scale.

A common pattern is that data lakes are used for data preparation, and the results are then moved to a traditional warehouse for the phase of interactive analysis and reporting. While this pattern bridges the lake and warehouse paradigms and allows enterprises to benefit from their complementary strengths, we believe that the two approaches are converging, and that the full relational SQL tool chain (spanning data movement, catalogs, business analytics and reporting) must be supported directly over the diverse and large datasets stored in a lake; users will not want to migrate all their investments in existing tool chains.

In this paper, we present the Polaris interactive relational query engine, a key component for converging warehouses and lakes in Azure Synapse [1], with a cloud-native scale-out architecture that makes novel contributions in the following areas:
• Cell data abstraction: Polaris builds on the abstraction of a data "cell" to run efficiently on a diverse collection of data formats and storage systems. The full SQL tool chain can now be brought to bear over files in the lake with on-demand interactive performance at scale, eliminating the need to move files into a warehouse. This reduces costs, simplifies data governance, and reduces time to insight. Additionally, in conjunction with a re-designed storage manager (Fido [2]) it supports the full range of query and transactional performance needed for Tier 1 warehousing workloads.
• Fine-grained scale-out: The highly-available micro-service architecture is based on (1) a careful packaging of data and query processing into units called "tasks" that can be readily moved across compute nodes and re-started at the task level; (2) widely-partitioned data with a flexible distribution model; (3) a task-level "workflow-DAG" that is novel in spanning multiple queries, in contrast to [3, 4, 5, 6]; and (4) a framework for fine-grained monitoring and flexible scheduling of tasks.
• Combining scale-up and scale-out: Production-ready scale-up SQL systems offer excellent intra-partition parallelism and have been tuned for interactive queries with deep enhancements to query optimization and vectorized processing of columnar data partitions, careful control flow, and exploitation of tiered data caches. While Polaris has a new scale-out distributed query processing architecture inspired by big data query execution frameworks, it is unique in how it combines this with SQL Server's scale-up features at each node; we thus benefit from both scale-up and scale-out.
• Flexible service model: Polaris has a concept of a session, which supports a spectrum of consumption models, ranging from "serverless" ad-hoc queries to long-standing pools or clusters. Leveraging the Polaris session architecture, Azure Synapse is unique among cloud services in how it brings together serverless and reserved pools with online scaling. All data (e.g., files in the lake, as well as managed data in Fido [2]) are accessible from any session, and multiple sessions can access all underlying data concurrently. Fido supports efficient transactional updates with data versioning.
1.1 Related Systems
The most closely related cloud services are AWS Redshift [7], Athena [8], Google Big Query [9, 10], and Snowflake [11]. Of course, on-premise data warehouses such as Exadata [12] and Teradata [13] and big data systems such as Hadoop [3, 4, 14, 15], Presto [16, 17] and Spark [5] target similar workloads (increasingly migrating to the cloud) and have architectural similarities.
• Converging data lakes and warehouses. Polaris represents data using a "cell" abstraction with two dimensions: distributions (data alignment) and partitions (data pruning). Each cell is self-contained with its own statistics, used for both global and local QO. This abstraction is the key building block enabling Polaris to abstract data stores. Big Query and Snowflake support a sort key (partitions) but not distribution alignment; we discuss this further in Section 4.
• Service form factor. On one hand, we have reserved-capacity services such as AWS Redshift, and on the other serverless offerings such as Athena and Big Query. Snowflake and Redshift Spectrum are somewhere in the middle, with support for online scaling of the reserved capacity pool size. Leveraging the Polaris session architecture, Azure Synapse is unique in supporting both serverless and reserved pools with online scaling; the pool form factor represents the next generation of the current Azure SQL DW service, which is subsumed as part of Synapse. The same data can simultaneously be operated on from both serverless SQL and SQL pools.
• Distributed cost-based query optimization over the data lake. Related systems such as Snowflake [11], Presto [17, 18] and LLAP [14] do query optimization, but they have not gone through the years of fine-tuning of SQL Server, whose cost-based selection of distributed execution plans goes back to the Chrysalis project [19]. A novel aspect of Polaris is how it carefully re-factors the optimizer framework in SQL Server and enhances it to be cell-aware, in order to fully leverage the Query Optimizer (QO), which implements a rich set of execution strategies and sophisticated estimation techniques. We discuss Polaris query optimization in Section 5; this is key to the performance reported in Section 10.
• Massive scale-out of a state-of-the-art scale-up query processor. Polaris has the benefit of building on one of the most sophisticated scale-up implementations in SQL Server, and the scale-out framework is designed expressly to achieve this—tasks at each node are delegated to SQL Server instances—by carefully re-factoring SQL Server code.
• Global resource-aware scheduling. The fine-grained representation of tasks across all queries in the Polaris workflow-graph is inspired by big data task graphs [3, 4, 5, 6], and enables much better resource utilization and concurrency than traditional data warehouses. Polaris advances existing big data systems in the flexibility of its task orchestration framework, and in maintaining a global view of multiple queries to do resource-aware cross-query scheduling. This improves both resource utilization and concurrency. In future, we plan to build on this global view with autonomous workload management features. See Section 6.
• Multi-layered data caching model. Hive LLAP [14] showed the value of caching and pre-fetching of column store data for big data workloads. Caching is especially important in cloud-native architectures that separate state from compute (Section 2), and Polaris similarly leverages SQL Server buffer pools and SSD caching. Local nodes cache columnar data in buffer pools, complemented by caching of distributed data in SSD caches.

2. SEPARATING COMPUTE AND STATE
Figure 1 shows the evolution of data warehouse architectures over the years, illustrating how state has been coupled with compute.

[Figure 1. Decoupling state from compute: (a) stateful compute, covering the on-premises architecture (caches, metadata, transaction log and data all live with compute) and the storage-separation architecture (only data is externalized); (b) stateless compute, the state-separation architecture in which caches remain with compute while metadata, transaction log and data are externalized.]

To drive the end-to-end life cycle of a SQL statement with transactional guarantees and top tier performance, engines maintain state, comprised of cache, metadata, transaction logs, and data. On the left side of Figure 1, we see the typical shared-nothing on-premises architecture where all state is in the compute layer. This approach relies on small, highly stable and homogeneous clusters with dedicated hardware for Tier-1 performance, and is expensive, hard to maintain, and cluster capacity is bounded by machine sizes because of the fixed topology; hence, it has scalability limits.

The shift to the cloud moves the dial towards the right side of Figure 1 and brings key architectural changes. The first step is the decoupling of compute and storage, providing more flexible resource scaling. Compute and storage layers can scale up and down independently, adapting to user needs; storage is abundant and cheaper than compute, and not all data needs to be accessed at all times. The user does not need compute to hold all data, and only pays for the compute needed to query a working subset of it.

Decoupling of compute and storage is not, however, the same as decoupling compute and state. If any of the remaining state held in compute cannot be reconstructed from external services, then compute remains stateful. In stateful architectures, state for in-flight transactions is stored in the compute node and is not hardened into persistent storage until the transaction commits. As such, when a compute node fails, the state of non-committed transactions is lost, and there is no alternative but to fail in-flight transactions. Stateful architectures often also couple metadata describing data distributions and mappings to compute nodes, and thus a compute node effectively owns responsibility for processing a subset of the data, and its ownership cannot be transferred without a cluster re-start. In summary, resilience to compute node failure and elastic assignment of data to compute are not possible in stateful architectures. Several cloud services and on-prem data warehouse architectures fall into this category, including Redshift, SQL DW, Teradata, Oracle, etc.
Stateless compute architectures require that compute nodes hold no state information, i.e., all data, transactional logs and metadata need to be externalized. This allows the application to partially restart the execution of queries in the event of compute node failures, and to adapt to online changes of the cluster topology without failing in-flight transactions. Caches need to be as close to the compute as possible, and since they can be lazily reconstructed from persisted data they don't necessarily need to be decoupled from compute. Therefore, the coupling of caches and compute does not make the architecture stateful.

Polaris is a cloud-native distributed analytics system that follows a stateless architecture. In the remainder of the paper we go through the technical highlights of the architecture, and finally, we present results of running all 22 TPC-H queries at 1PB scale on Azure.

3. THE POLARIS DATA ABSTRACTION
A key objective for Polaris is to be a scale-out query engine for relational data as well as heterogeneous datasets stored in distributed file systems such as HDFS. The Polaris data model is therefore designed with the following considerations in mind:
• Abstraction from the data format. Polaris, as an analytical query engine over the data lake, must be able to query any data, relational or unstructured, whether in a transactionally updatable managed store or an unmanaged file system. Hence, we need a clean abstraction over the underlying data type and format, capturing just what's needed for efficiently parallelizing data processing. A dataset in Polaris is logically abstracted as a collection of cells that can be arbitrarily assigned to compute nodes to achieve parallelism. The Polaris distributed query processing framework (DQP) operates at the cell level and is agnostic to the details of the data within a cell. Data extraction from a cell is the responsibility of the (single node) query execution engine, which is primarily SQL Server, and is extensible for new data types.
• Wide distribution. For scale-out processing, each dataset must be distributed across thousands of buckets, or subsets of data objects, such that they can be processed in parallel across nodes. In Polaris, this can be expressed as the requirement that a dataset must be uniformly distributed across a large number of cells.
3.1 Data Cells
As shown in Figure 2, a collection (e.g., table) of data objects (e.g., rows) in Polaris can be logically abstracted as a collection of cells Cij containing all objects r such that p(r) = i and h(r) = j.

The hash-distribution h(r) is a system-defined function applied to (a user-defined composite key c of) r that returns the hash bucket number, or distribution, that r belongs to. The hash-distribution h is used to map cells to compute nodes, and the system chooses h to hash datasets across a large number of buckets so that cells (and thus, computation) can be distributed across as many compute nodes as needed. Further, computationally expensive operations such as joins and vector aggregation can be performed at the cell level without incurring data movement if either the join keys or grouping keys are aligned on the hash-distribution key.

The partitioning function p(r) is a user-defined function that takes as input an object r and returns the partition i in which r is positioned. This is useful for aggressive partition pruning when range or equality predicates are defined over the partitioning key. (If the user does not specify p for a dataset, the partition pruning optimization is not applicable.)

Cells can be grouped physically in storage however we choose (examples of groupings are shown as dotted rectangles in Figure 2), so long as we can efficiently access Cij. Queries can selectively reference either cell dimension or even individual cells depending on predicates and type of operations present in the query.

[Figure 2. Polaris Data Model: a dataset is a grid of data cells Cij, with M user partitions (rows, defined by p(r)) and N hash distributions (columns, defined by h(r)); dotted rectangles illustrate possible physical groupings of cells in storage.]

Flexible Assignment of Cells to Compute
Query processing across thousands of machines requires query resilience to node failures. For this, the data model needs to support a flexible allocation of cells to compute, such that upon node failure or topology change, we can re-assign cells of the lost node to the remainder of the topology. This flexible assignment of cells to compute is ensured by maintaining metadata state (specifically, the assignment of cells to compute nodes at any given time) in a durable manner outside the compute nodes.

[Figure 3. Store Abstraction via Data Cells: a Polaris pool's distributed query processing layer operates on datasets abstracted as collections of data cells (partitions by hash distributions), over analytical stores such as ADLS, Fido and U-Parquet and transactional stores such as Socrates and Cosmos DB.]

Storage Abstraction
Polaris abstracts distributed query processing from the underlying store via data cells. As shown in Figure 3, any dataset can be mapped to a collection of cells, which allows Polaris to do distributed query processing over data in diverse formats, and in any underlying store, as long as efficient access to individual cells is provided by the storage server. As such, Polaris can perform highly scalable distributed query processing over analytical stores such as ADLS [20], Fido [2], and Delta [21], as well as transactional stores such as Socrates [22] and Cosmos DB [23]. Of course, when data is stored in columnar formats tailored for vectorized processing, this further improves relational query performance.
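To make the storage abstraction concrete, the following is a minimal sketch (in Python) of the kind of contract a store must satisfy for Polaris-style processing: enumerate cells and fetch an individual cell Cij efficiently, independent of the underlying format. The interface and class names are illustrative assumptions, not the actual Polaris API.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, Tuple

class CellStore(ABC):
    """Hypothetical store abstraction: any store that can enumerate and
    fetch individual cells C[i][j] can participate in distributed QP."""

    @abstractmethod
    def list_cells(self, dataset: str) -> Iterable[Tuple[int, int]]:
        """Return the (partition i, distribution j) coordinates of all cells."""

    @abstractmethod
    def read_cell(self, dataset: str, partition: int, distribution: int) -> Iterator[dict]:
        """Stream the objects of cell C[partition][distribution]."""

class ParquetLakeStore(CellStore):
    """Sketch of a lake-backed implementation: one file group per cell."""
    def __init__(self, files_by_cell):
        self.files_by_cell = files_by_cell            # {(i, j): [file paths]}

    def list_cells(self, dataset):
        return self.files_by_cell.keys()

    def read_cell(self, dataset, partition, distribution):
        for path in self.files_by_cell[(partition, distribution)]:
            yield {"file": path}                      # real code would parse rows
```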
A Note on Queries
In this paper, we mostly focus on relational queries (with the exception of Section 10.4). Data objects are assumed to have attributes required by relational operators to which they are input. That said, the generality of the data abstraction underlying Polaris's query processing means that we can handle datasets represented in diverse formats and stored in different repositories. For example, Polaris can run directly over data in HDFS and in managed transactional stores. Further, different objects in a dataset could differ in the attributes attached to them, and objects could have additional uninterpreted attributes.
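As a concrete illustration of the data model of Section 3.1, here is a minimal sketch (in Python, with hypothetical names and a toy partitioning key; the paper does not prescribe an implementation) of how objects map to cells via the user-defined partitioning function p and the system-defined hash-distribution h, and how distributions can be flexibly assigned to compute nodes:

```python
import hashlib
from collections import defaultdict

N_DISTRIBUTIONS = 8                      # thousands of buckets in practice

def h(composite_key, n=N_DISTRIBUTIONS):
    """System-defined hash-distribution: composite key -> distribution j."""
    digest = hashlib.sha1(repr(composite_key).encode()).hexdigest()
    return int(digest, 16) % n

def p(obj):
    """User-defined partitioning function: object -> partition i (hypothetical key)."""
    return obj["sale_year"]

def build_cells(objects, key_columns):
    """Group objects into cells C[(i, j)] with i = p(r) and j = h(c(r))."""
    cells = defaultdict(list)
    for r in objects:
        c = tuple(r[k] for k in key_columns)   # user-defined composite key c
        cells[(p(r), h(c))].append(r)
    return cells

def assign_distributions_to_nodes(nodes, n=N_DISTRIBUTIONS):
    """Flexible cell-to-compute mapping, kept in durable metadata outside compute."""
    return {j: nodes[j % len(nodes)] for j in range(n)}
```

Because the distribution-to-node mapping lives in durable metadata outside the compute nodes, re-assigning the cells of a failed node amounts to updating this mapping.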

4. MAPPING CELLS TO COMPUTE
A fundamental aspect in distributed execution is how we map cells (of source datasets as well as intermediate results) to compute nodes for various operations involved in the execution of a query. As noted above, we map cells to nodes using the hash-distribution h. We now discuss this in more detail.

4.1 Distribution Properties
As discussed above, data objects (e.g., tuples or rows) in a cell are hash aligned, i.e., if c is the composite key, all objects that hash to the same cell have the same hash value or distribution h(c). Further, if two objects hash to different distribution values, they must differ on the composite key c. As degenerate cases, objects may be distributed round-robin or mapped to a single cell. We introduce the following notation for how objects in a dataset are hashed (or not) across cells:
1. h[c]: objects in a dataset P are mapped to cells using a hash-distribution on column c. Also denoted as P^[c].
2. All objects in the dataset are hashed to the same value, i.e., there is a single hash-bucket: P^1.
3. Objects in dataset P are not hash-distributed across cells; this situation arises sometimes for intermediate results. Also denoted as P^∅.

The above distribution properties are used by the Polaris Distributed Query Optimizer (DQO) for two fundamental purposes: (1) to guarantee functional correctness of parallel execution of operations such as joins and vector aggregations, and (2) as interesting properties used by the DQO while enumerating physical distributed alternatives in the search space.

Distribution Properties as Correctness Filters
The input distribution properties of a relational operator are used to guarantee functional correctness when enumerating the physical execution alternatives across multiple compute nodes. For instance, an inner join requires both of its inputs to be hash aligned on the join column, or one input to be mapped to a single hash-bucket, in order to return the correct results while operating only on input cells available locally at each node:

P ⋈a=b Q: {{P^[a] ∧ Q^[b]} ∨ {P^1} ∨ {Q^1}}

We refer to such correctness criteria on inputs as required distribution properties. During the enumeration of the alternative physical distributed plans in the search space, the DQO uses required distribution properties on operators to discard alternatives. The list of required properties for each relational algebra operation is listed in the appendix of this paper.

Distribution Properties as "Interesting Properties"
System R [24] introduced the concept of interesting properties, namely physical properties (e.g., sort order) such that the best plan for producing (intermediate) tables with each interesting property is saved during the enumeration of the search space. Thus, the cheapest plan for producing an intermediate table in sorted order by the first column would be saved even if there is a cheaper plan to produce the same table unsorted or in a different sort order. Similarly, in the distributed search space, the Polaris DQO uses the required distribution properties of relational algebra operators as interesting properties. When enumerating the physical plan alternatives bottom-up, the best plan for each property and the best plan overall based on cost are kept.

4.2 Data Move Enforcers
Polaris provides physical operators called data move enforcers that can read data from a source dataset and produce a target dataset with different distribution properties:
• Hash operator, Hd. Re-distributes every object (in every cell of the dataset) by hashing on column d. The number of cells in the output dataset can differ from the input.
  Hd(P^[c]) = P^[d]
  Hd(P^1) = P^[d]
  Hd(P^∅) = P^[d]
• Broadcast operator, B. Maps the input dataset to a single cell and replicates it across multiple locations.
  B(P^[c]) = P^1
  B(P^∅) = P^1

[Figure 4. Enumeration of the search space for an inner join P ⋈a=b Q over inputs P^[a] and Q^[c]: the scans produce P^[a] and Q^[c], enforcers add the alternatives B(P) = P^1, B(Q) = Q^1 and Hb(Q) = Q^[b], join alternatives that satisfy no required distribution property are discarded, and the best plan per interesting property is kept.]
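The required-property check and the move enforcers can be sketched as simple functions over distribution properties (Python; an illustrative simplification, since the actual DQO works inside the SQL Server optimizer framework):

```python
# Distribution properties, written as in Section 4.1:
# ("hash", "a") for P^[a], ("single",) for P^1, ("none",) for P^∅.

def hash_enforcer(prop, d):
    """Hd: re-distribute on column d, whatever the input property."""
    return ("hash", d)

def broadcast_enforcer(prop):
    """B: map the dataset to a single cell and replicate it."""
    return ("single",)

def inner_join_is_correct(p_prop, q_prop, a, b):
    """Required distribution properties of P join(a=b) Q: both inputs hash
    aligned on the join columns, or one input mapped to a single bucket."""
    return ((p_prop == ("hash", a) and q_prop == ("hash", b))
            or p_prop == ("single",)
            or q_prop == ("single",))

# Example from Figure 4: P^[a] join(a=b) Q^[c] is discarded, but joining with
# Hb(Q) or B(Q) satisfies the required properties.
assert not inner_join_is_correct(("hash", "a"), ("hash", "c"), "a", "b")
assert inner_join_is_correct(("hash", "a"), hash_enforcer(("hash", "c"), "b"), "a", "b")
assert inner_join_is_correct(("hash", "a"), broadcast_enforcer(("hash", "c")), "a", "b")
```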
As an example, Figure 4 shows the enumeration of the alternative distributed physical execution plans for an inner join P ⋈a=b Q, where P and Q are (say, files in a data lake or tables in a managed distributed relational store) hashed on a and c respectively (P^[a] and Q^[c]). The enumeration of physical alternatives starts with the scans of P and Q, shown in the bottom-most part of the figure. Q is hash distributed on column c, hence Q^[c] is the first alternative generated. Replication and hash distribution on b are interesting properties pushed top-down, leading to the enumeration of sub-plans Q^1 and Q^[b] respectively. P is hash distributed on column a, generating P^[a] as the first alternative. Replication and hash distribution on a are also interesting properties pushed top-down; since we already satisfy hash distribution on a via P^[a], we only need to produce P^1. The plan node in the top half of Figure 4 shows the enumeration of plans for the join operation; this is a permutation of the alternatives produced by its children at the bottom of the figure. During the enumeration, correctness filters are applied, thereby eliminating P^[a] ⋈a=b Q^[c] from the search space, since it does not satisfy any of the distribution properties required by an inner join. For the remaining alternatives, only the best plan for each interesting property is kept:

P^[a] ∧ Q^[b]: P^[a] ⋈a=b Q^[b]
P^1: P^1 ⋈a=b Q^[c]
Q^1: P^[a] ⋈a=b Q^1

Finally, the best distributed query plan will be chosen based on the cheapest of the three options. Data move enforcers are expensive operators due to the cost of data re-distribution; hence, the cheapest plan is the one that minimizes data movement, as explained in [19].
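The "best plan per interesting property" bookkeeping that drives this enumeration can be sketched as follows (Python; a simplified illustration assuming a single additive cost metric, not the actual optimizer code):

```python
def keep_best_plans(alternatives):
    """alternatives: list of (plan, satisfied_property, cost).
    Keep the cheapest plan per interesting property and the cheapest overall."""
    best_by_property = {}
    best_overall = None
    for plan, prop, cost in alternatives:
        if prop not in best_by_property or cost < best_by_property[prop][1]:
            best_by_property[prop] = (plan, cost)
        if best_overall is None or cost < best_overall[1]:
            best_overall = (plan, cost)
    return best_by_property, best_overall

# The three surviving join alternatives above, with made-up costs in which
# broadcasting the smaller input happens to be cheapest:
alts = [
    ("P[a] JOIN Hb(Q)", "P[a] and Q[b]", 120.0),
    ("B(P) JOIN Q[c]",  "P1",             80.0),
    ("P[a] JOIN B(Q)",  "Q1",             95.0),
]
per_property, overall = keep_best_plans(alts)
print(overall)   # ('B(P) JOIN Q[c]', 80.0)
```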
5. FROM QUERIES TO TASK DAGS
A fundamentally new aspect of Polaris is its fine-grained representation and tracking of query execution. In this section, we describe how a query is compiled and optimized into an executable DAG of tasks that correspond to units of distributed execution.

5.1 Polaris Tasks
A key challenge in Polaris was how to essentially re-architect distributed query processing while leveraging as much of existing SQL Server capabilities as possible, and ensuring that the resulting system was a faithful implementation of all user-visible semantics.

To this end, all incoming queries in Polaris are compiled in two phases. The first phase of the compilation stage leverages the SQL Server Cascades QO to generate the logical search space, or MEMO [25, 26]. The MEMO contains all logically equivalent alternative plans to execute the query. A second phase performs distributed cost-based QO to enumerate all physical distributed implementations of these logical plans and picks one with the least estimated cost. The outcome is a good distributed query plan that takes data movement cost into account, as explained in [19].

When enumerating the physical space during the second phase of the QO process, a query plan in the MEMO is seen as a directed acyclic graph (DAG) of physical operators, each corresponding to an algebraic sub-expression E in the query. For simplicity, we use E to denote both the expression and its instantiation as an operator in the MEMO. Operator E has a degree of partitioned parallelism N that defines the number of instances of E that run in parallel, each on a partition of the input. We denote the distributed execution of E as ⋃(i=1..N) Ei, where Ei represents the execution of E over the i-th hash-distribution of its inputs, and N is the degree of parallelism.

[Figure 5. Execution Model: task generation in the DQP. The MEMO expression P^[a] ⋈a=b Q^[b] over cells that are N-way hash-distributed (and M-way user-partitioned) expands into the union of tasks Ti = Pi ⋈a=b Qi, one per hash-distribution pair of input cells.]

We illustrate the notation by means of an example. Figure 5 depicts an expression that consists of a hash-aligned join between two input relations, P and Q. As shown on the left, the cell representation of user files over the lake is captured during MEMO generation by SQL Server—the first stage of QO pulls metadata from external services such as remote meta-stores that contain information on the collection of files/tables, partitions and distributions.

For this example, the input data cells are N-way hash-distributed such that the parallel distributed query plan is represented through the union of the join operation on each hash-distribution pair; (in contrast to the example of the previous section) P and Q are already hash-aligned on the join column, satisfying the required distribution properties of the join operator.
The same notation can be extended to represent more complex relational expressions and distribution variations, but we omit the details.

Next, we introduce the notion of a task Ti as the physical execution of an operator E on the i-th hash-distribution of its inputs. Tasks are instantiated templates of (the code executing) expression E that run in parallel across N hash-distributions of the inputs, as illustrated in Figure 5 with blue triangles. A task has three components:
• Inputs. Collections of cells for each input's data partition. These cells can be stored either in highly available remote storage, or in temporary local disks.
• Task template. Code to execute on the compute nodes, representing the operator expression E.
• Output. Output dataset represented as a collection of cells produced by the task. The output of a task is either an intermediate result for another task to consume or the final results to return to the user, and is distributed across several nodes corresponding to the consuming task's degree of parallelism.
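A minimal rendering of a task and its instantiation from a task template might look like the sketch below (Python; the field names and the T-SQL payload are illustrative assumptions, not Polaris internals):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Cell = Tuple[int, int]          # (partition i, distribution j)

@dataclass
class Task:
    inputs: Dict[str, List[Cell]]   # cells per input dataset
    template: str                   # code to run on the node (T-SQL text in Polaris)
    output_cells: List[Cell]        # cells produced for the consumer (or final results)

@dataclass
class TaskTemplate:
    expression: str                              # algebraic expression E
    degree_of_parallelism: int                   # N
    tasks: List[Task] = field(default_factory=list)

    def instantiate(self, cells_by_input: Dict[str, Dict[int, List[Cell]]]):
        """One task per hash-distribution i of the inputs (Ti = E over distribution i)."""
        for i in range(self.degree_of_parallelism):
            self.tasks.append(Task(
                inputs={name: dists.get(i, []) for name, dists in cells_by_input.items()},
                template=f"/* instance {i} of */ {self.expression}",
                output_cells=[],
            ))
        return self.tasks
```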
operators.
5.2 The Query Task DAG
In general, the distributed query plan is represented as a directed acyclic graph (DAG) (of operators or tasks) rather than a single node, to capture the structure of sub-expressions in the query, including data-flow dependencies and the required distribution properties of the corresponding operators.

Each vertex contains an operator corresponding to an expression E in the query and has a corresponding task template, instantiated across multiple nodes over hash-distributions of the inputs for the vertex. Edges represent dataflow dependencies and, if the consuming vertex E does not support pipelining, induce precedence constraints over the "consumer tasks" created by instantiating E across compute nodes over the hash-distributed inputs of E. That is, "consumer" tasks cannot start until the corresponding tasks of the producer vertexes of the edge have completed.

Precedence constraints are inherently blocking and define changes of the distribution properties of the data cells consumed by parent tasks. As explained earlier, the DQO injects changes of distribution properties via data move enforcers to achieve correctness, or a better distributed alternative plan to speed up query execution. Therefore, the subtree of physical operators rooted on a move enforcer defines the input and output boundaries of a task. Data move enforcers are blocking operators, such that all their output data cells are persisted in local storage before they can be processed by the consumer task.

[Figure 6. The Query Task DAG: a physical distributed plan for a three-way join of P, Q and R with two move enforcers (Hb(Q) and Hc(R)) and the corresponding query task DAG of three tasks, where the enforcer tasks T2 and T3 feed the final hash-aligned join task T1.]

Tasks in the DAG without precedence constraints can execute in parallel, thereby achieving independent parallelism between different tasks of a query. Figure 6 expands on the example in Figure 4 with an additional join. The left-hand side of the figure illustrates the physical distributed query plan, which has two move enforcers so that the joins between the three relations are hash aligned into a final task, resulting in a query DAG with a total of three tasks.
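The precedence structure can be sketched as a small dependency check (Python; hypothetical names, intended only to show how blocking move enforcers induce task-template dependencies):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TemplateNode:
    name: str
    depends_on: List["TemplateNode"] = field(default_factory=list)
    done: bool = False

    def ready(self) -> bool:
        """A template is ready once every producer it depends on has completed."""
        return all(dep.done for dep in self.depends_on)

# Figure 6: T2 = Hb(Q) and T3 = Hc(R) have no dependencies and can run in parallel;
# the final join template T1 is blocked until both enforcers complete.
t2 = TemplateNode("T2: Hb(Q)")
t3 = TemplateNode("T3: Hc(R)")
t1 = TemplateNode("T1: P JOIN Q JOIN R", depends_on=[t2, t3])

assert t2.ready() and t3.ready() and not t1.ready()
t2.done = t3.done = True
assert t1.ready()
```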
5.3 SQL Server Scale-up for Task Execution
The example in Figure 6 also illustrates an additional optimization carried out in the second phase of cost-based distributed query optimization. Observe how vertexes in the MEMO corresponding to two join operators have been combined into a single vertex that carries out both joins—this is because all three input datasets (P, Q, and R) are hash aligned on the same column by the preceding move enforcer operations. Thus, in general, the template for a task can include code for an algebra expression involving multiple operators.

While we could perform the three-way join in this example in two sequential tasks, we intentionally seek to make tasks be maximal units of work. This allows us to more effectively leverage the sophisticated scale-up columnar query processor in SQL Server. At each compute node, the task template of the algebraic expression E corresponding to the task is encoded back into T-SQL and executed natively in SQL Server. In this approach, the blocking nature of the boundaries of a task actually helps SQL Server to optimize the template code of a task with fresh stats from intermediate inputs.

6. TASK ORCHESTRATION
Arguably the biggest engineering challenge in Polaris is the orchestration of tasks.
▪ The scale is daunting—the amount of data could be petabytes, leading to millions of cells; the number of compute nodes used in a single query could be in the thousands; and the number of tasks could be in the millions.
▪ Execution must be robust to transient failures of nodes, network, storage, and other components (e.g., metadata micro-services), must guarantee that all precedence constraints are satisfied, and all distributed decisions must have quorum.
▪ Tasks must be automatically re-startable on any node, for auto-scaling and fault-tolerance.

In Polaris, we introduce a model of the execution of a query as a novel hierarchical composition of finite state machines. As explained in previous sections, at run time a query is transformed into a query task DAG, which consists of a set of tasks with precedence constraints.

We refer to each of the following aspects of a query as an entity: the query DAG, the task templates and the tasks. A leaf-level task template can be instantiated into tasks on its hash-distributed inputs; in this case, we say that the task template entity is composed of the instantiated task entities. A non-leaf task template has precedence constraints on other task templates; in this case, the non-leaf task template entity is composed of the entities for the task templates on which it depends. For each entity, we refer to the entities of which it is composed as its dependencies.

The execution state of each entity is tracked using an associated state machine with a finite set of states and state transitions. The state of an entity is a composition of the state of the entities of which it is composed.
States can be either composite or simple. Simple states are used to denote success, failure, or readiness of a task template. Composite states denote (1) an instantiated task template, or (2) a blocked task template. (Note that an instantiated task will succeed or fail but cannot be blocked; tasks are only instantiated when their inputs are ready.)

A composite state differs from a simple state in that its transition to another state is defined by the result of the execution of its dependencies. It has a collection of peer states, one for each dependency, and a termination policy that aggregates meta-data on the execution of dependencies and captures how to interpret the outcome of dependencies and how to act on other peer states.

The Polaris state machine, through its hierarchical composition, captures the execution intent, and it is in this aspect that it differs from other distributed query engines. In other DAG execution frameworks [5, 6, 14], composition is inherent in the execution. In Polaris, the state machine provides a template that is used to orchestrate the execution. The advantage it offers is the ability to formalize how we recover from failures and to use the state machine recorder (a log) to observe and replay execution history. Further, for a given set of workloads in the system, the execution history combined with the rules governing legal transitions can be used to reorder workload executions and explore different execution sequences by forking and resuming execution from selected points in the recorded history; this is future work.

Figure 7 illustrates the entities and state machines for the example in Figure 6. As we can see, the distributed query execution of the query task DAG is modelled as a hierarchical set of state machines. The root query DAG entity starts in the Run composite state and instantiates the state machine for the entity corresponding to the (task template T1 representing the) join of P, Q and R. This state machine starts in a (composite) Blocked state because it has dependencies on the entities corresponding to (task templates T2 and T3 for) the move enforcers on Q and R; these task templates are now placed in the scheduler queue. Their state is initialized to Ready since they have no dependencies, and they are eventually picked to run by the scheduler.

The state machines for task templates T2 and T3 are instantiated and initialized to the Run state. This in turn instantiates tasks for the task templates. If any of these tasks fail, their state machine transitions to the Failed state; the failure is detected and the failed task is restarted automatically if the reason is a transient failure (as indicated by the task state machine transition in Figure 7); otherwise the parent state machine retries at a coarser granularity. The state of T2 and T3 becomes Success when all of their task dependencies succeed. When both move enforcer entities succeed, the root entity T1 is unblocked and placed in the scheduler queue. When it is picked to run, i.e., becomes active, it is instantiated as join tasks on the hash-partitioned inputs.

In more detail, a state machine in Failure triggers an analysis of the type of failure for all dependencies that we classify as retriable, e.g., transient failures caused by node failure. If retriable, it can transition back to Blocked; otherwise, the state machine in Failure returns control to the state machine of its parent, which will try to re-schedule execution using additional resources or in turn propagate the failure up the control chain. This is an example of how, in contrast to other systems such as [10, 15, 18], Polaris orchestration gives us flexibility in handling different types of failures by allowing us to specify behavior on the termination of a composite state.

To summarize, when in Ready state, a task template waits in the queue for the scheduler to pick its turn to execute, then it transitions to Run. This is when task entities are instantiated and the tasks' state machines are executed. The task template's transition from Run to a terminating state (Failed or Success) depends on the resulting execution of the instantiated tasks. Note that any entity can transition from Failed to Run if the failure is transient. The failure is propagated to higher entities only if it is deemed not retriable within the entity's state machine.
[Figure 7. Hierarchical composition of state machines for distributed query execution: the query DAG entity is composed of task template entities (T1 blocked on the move enforcer templates T2 and T3), each of which is composed of task entities; composite states (Run, Blocked) and simple states (Ready, Failed, Success), with their transitions, drive the step-by-step state-machine-driven execution shown on the right.]
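A compact sketch of this hierarchical composition is shown below (Python; the state names follow the paper, but the retry logic and class layout are illustrative assumptions):

```python
from enum import Enum, auto

class State(Enum):
    BLOCKED = auto(); READY = auto(); RUN = auto(); SUCCESS = auto(); FAILED = auto()

class Entity:
    """An entity (query DAG, task template, or task) whose state composes
    the states of its dependencies."""
    def __init__(self, name, dependencies=()):
        self.name, self.dependencies = name, list(dependencies)
        self.state = State.BLOCKED if self.dependencies else State.READY
        self.log = []                      # the "state machine recorder"
        self.transient_failure = False

    def transition(self, new_state):
        self.log.append((self.state, new_state))
        self.state = new_state

    def on_dependency_terminated(self):
        """Termination policy: unblock when all dependencies succeed; a retriable
        failure sends the entity back to Blocked, otherwise the failure propagates."""
        if all(d.state is State.SUCCESS for d in self.dependencies):
            self.transition(State.READY)
        elif any(d.state is State.FAILED and not d.transient_failure
                 for d in self.dependencies):
            self.transition(State.FAILED)   # propagate up the control chain
        else:
            self.transition(State.BLOCKED)  # retriable: wait for dependency restart
```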

Modelling the distributed query execution of queries via hierarchical state machines has the following goals:
▪ Satisfy precedence constraints. The execution of the query task DAG is carried out top-down in a topological sort order such that every task with precedence constraints is blocked on completion of its input tasks. For example, as shown in the right-hand side of Figure 7, the root task is blocked (Step 1) until its two dependencies are completed (Step 6).
▪ Reliable execution. We use the state machines to have fine-grained control at the task level and define a predictable model for recovering from failures. Completion and failure propagation are done bottom-up using the compositional nature of states. Step 3 illustrates a case where, on container failure during the execution of a task, the error propagates to the parent task template, which retries its execution.
▪ Reproducibility at scale. States and transitions are logged by all entities. This allows for predictability and reproducibility regardless of the complexity of the workload and the scale. This is also a fundamental building block for debugging and resumable execution upon failover.
▪ Concurrency. Fine-grained control at large scale often comes with large memory requirements and thread contention due to many subroutines running concurrently. Hierarchical state machines allow us to track the state of all entities in the workload with a low memory overhead: there is only one state machine for a task entity, and all instantiations run through its states and transitions. Also, the Polaris query processor has been built from scratch using .NET's task asynchronous programming model to eliminate the need for blocking synchronization primitives across subroutines, thus minimizing thread contention and maximizing OS thread utilization. The gains are seen in Section 10.2.

7. WORKLOAD AWARE SCHEDULING
Polaris must handle highly concurrent workloads, ranging from dashboarding scenarios running thousands of lightweight queries to reporting scenarios executing a set of highly complex analytical queries. There are potentially millions of tasks to be orchestrated for execution by the Polaris DQP. In the previous section we described how hierarchical state machines enable us to efficiently handle distributed task orchestration at very large scale. In this section we cover how Polaris schedules tasks for high concurrency.

Task scheduling in Polaris is based on a global view of all active queries called the workload graph, generalizing the representation of a single query as a DAG of tasks to represent the entire workload by combining the task DAGs of all active queries. Each task in the workload graph has an associated resource demand that is an extension of the model in Ganguly [27] to d-dimensional preemptable resources, as proposed in [28, 29]. We define a d-dimensional resource vector that has time- and space-shared constraints, where each dimension specifies an aspect of resource consumption. Fungible resources such as memory and CPU can be sliced across tasks at a low cost, and each task's requirement for a given resource can be stretched at execution time. On the other hand, more rigid resources such as temp space on local disks must also be satisfied. Stretching temp space across independent tasks is prohibitively expensive since it would require swapping pages in and out from/to remote storage.

The resource demand for each task is computed as a function of the inputs and outputs of each physical operator in the template code for the task. Analogously, Polaris also models each compute node as a d-dimensional bin of resources, such that placement of tasks to containers is based on policies that can be autonomously tuned based on resource consumption profiles across all nodes.

[Figure 8. Workload aware resource scheduling algorithm: the workload graph (WG) combines the task DAGs of two queries (task templates T1-T10), each annotated with its resource demand; the workload scheduler and resource governor operate on this graph, with the scheduler pseudocode shown at the bottom of the figure.]

The representation of the workload as a global graph of tasks with resource demands allows us to redefine the multi-query scheduling problem as a task scheduling problem with precedence constraints: the goal is scheduling d-dimensional tasks on d-dimensional containers to complete in the minimum amount of time possible while ensuring that, at all times, we are within all d dimensions of resources available to us. Figure 8 shows the representation of the workload graph for two query DAGs. In green circles we represent the resource demand for each task template. For simplicity, in this example we normalize to just one number, and not the multi-dimensional resource vector used in Polaris. The workload scheduler and the resource governor operate on the workload graph.

The pseudocode of the scheduler is shown at the bottom of the figure. The scheduler waits asynchronously for work and, when awoken, adds all task templates in the workload graph that are in Ready state to the scheduler queue. Task templates are then dequeued in the order specified by the scheduling policy. Currently supported policies include (combinations of): FIFO, sorted by resource demand (min to max or max to min), and sorted by proximity to the root. Intuitively, sorting by proximity to the root biases towards tasks from jobs that are closer to completion (so that their shared resources can be released sooner).
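The scheduler loop and the resource-fit check described here and in the following paragraphs can be sketched as follows (Python; a simplified, single-threaded illustration with hypothetical policy names, whereas the real scheduler is asynchronous and governs multi-dimensional resources per node):

```python
def fits(demand, available):
    """d-dimensional fit: every resource dimension must be satisfied."""
    return all(demand[dim] <= available.get(dim, 0) for dim in demand)

def schedule_ready_templates(workload_graph, nodes, policy="max_demand_first"):
    """One wake-up of the workload scheduler: queue Ready templates, order them
    by policy, and transition to Run those whose tasks fit on their target nodes."""
    queue = [t for t in workload_graph if t["state"] == "Ready"]
    if policy == "max_demand_first":
        queue.sort(key=lambda t: -t["demand"]["cpu"])
    elif policy == "proximity_to_root":
        queue.sort(key=lambda t: t["depth"])          # closer to the root first

    for template in queue:
        targets = [(task, nodes[task["target_node"]]) for task in template["tasks"]]
        if all(fits(task["demand"], node) for task, node in targets):
            for task, node in targets:                 # reserve capacity and run
                for dim, amount in task["demand"].items():
                    node[dim] -= amount
            template["state"] = "Run"
        else:
            break   # wait for running tasks to finish and free capacity
```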
For the next task template in order, the resource governor examines each task to be instantiated. If all these tasks fit in their target location (i.e., each task's resource demand can be accommodated given current local capacity), then the task template is removed from the scheduler queue and transitioned to the Run state. Otherwise, we break out of the loop and wait for other tasks to complete so that the task template can fit. Note that the target location of a task is fixed by data affinity to exploit cache locality. This novel approach to multi-query workloads—generalizing task scheduling from big data systems to consider tasks across all active queries—can improve concurrency for the following reasons:
▪ A task template is the unit of scheduling. The scheduling order applies to the task template entity and not a query. A finer-grained unit of scheduling allows for better packing strategies, helping maximize resource utilization.
▪ Weighted policies for resource governance. The placement of a task on the target compute server is based on resource fit, to maximize load while avoiding over-provisioning. For this we use a weighted policy to pack tasks into the compute capacity available at a node. The policy has two variations, one that caps the amount of resources that can be granted to a task, and another that does not. If the task does not fit in the available compute, it is put back into the queue until tasks complete and capacity is freed.
▪ Increased flexibility in task ordering. Scheduling policies define the order in which tasks are executed as they become ready for execution. By looking at ready tasks across all queries, and taking into account resource pressure in the system, we are able to pick orderings that would not be permissible otherwise. For instance, consider the example in Figure 8, applying a max-to-min scheduling policy. The scheduler queue SchQ starts with {T1, T2, T3, T4, T5} with scheduling order {T1, T2, T4, T5, T3}. As we go through the loop, T3 does not fit, so only four out of the five task templates transition to the Run state. Next, T4 and T5 complete and the scheduler is awoken. SchQ now contains {T8, T3} with scheduling order {T3, T8}. Now, if the workload manager detects pressure in the system because of disk resources held by previously completed task templates, it can choose to swap the scheduling policy to sort by proximity to the root to release pressure in the system. In this example, the scheduling order would change to be {T8, T3}. The study of scheduling and resource management policies that consider SLAs and avoid starvation is out of scope for this paper and will be addressed in future work.
▪ Resource driven query admission control. Back pressure can be driven by a ratio of capacity (demand vs. available). Concurrency is only limited by available capacity, and the admission of a query is only denied when we cannot guarantee SLAs due to a capacity crunch.

8. SERVICE ARCHITECTURE
Figure 9 illustrates the architecture of the query service for two Polaris pools sharing the same centralized (metadata and transaction) services. There are two important aspects to note:
▪ Stateless architecture within a pool. The Polaris architecture falls into the stateless service architecture of Figure 1.b, as discussed in Section 2. All services within a pool are stateless: (i) data is stored durably in remote storage and is abstracted via data cells, and (ii) the metadata and transactional log state is off-loaded to centralized services. (We do not go into the architecture of the centralized services in detail; briefly, they are built for HA and performance using Azure SQL DB.)
▪ Multiple pools. Placing the state in centralized services, coupled with a stateless micro-service architecture within a pool, means multiple compute pools can transactionally access the same logical database.

[Figure 9. Polaris service architecture: two Polaris pools share centralized metadata and transaction services; within each pool, a SQL Server Front End drives a Distributed QP over a set of compute servers, each running an Execution Service and a SQL Server instance with a local cache, over data sets abstracted as collections of data cells (partitions by hash distributions); control flow and data channels connect the services.]

8.1 Stateless micro-service architecture
A Polaris pool consists of a set of micro-services, each with well-defined responsibilities. The SQL Server Front End (SQL-FE) is the service responsible for compilation, authorization, authentication, and metadata. Metadata is used by the compiler to generate the search space (the MEMO) for incoming queries and bind metadata to data cells. The Distributed Query Processor (DQP) is responsible for distributed query optimization, distributed query execution, query execution topology management and workload management (WLM). Finally, a Polaris pool consists of a set of compute servers that are, simply, an abstraction of a host provided by the compute fabric, each with a dedicated set of resources (disk, CPU and memory). Each compute server runs two micro-services: (a) an Execution Service (ES) that is responsible for tracking the life span of tasks assigned to a compute container by the DQP, and (b) a SQL Server instance that is used as the back-bone for execution of the template query for a given task and for holding a cache on top of local SSDs (in addition to in-memory caching of hot data). Data can be transferred from one compute server to another via dedicated data channels. The data channel is also used by the compute servers to send results to the SQL FE, which returns the results to the user. The life cycle of a query is tracked via control flow channels from the SQL FE to the DQP, and from the DQP to the ES.

As explained in Section 2, no essential state is held by any micro-service in Polaris. While caches are stored by the compute servers, upon fail-over they can be easily re-constructed.
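As a rough illustration of how a query flows through these micro-services, consider the sketch below (Python; purely illustrative, since the service boundaries follow the text above but the function names and call sequence are assumptions):

```python
def run_query(sql_text, sql_fe, dqp, compute_servers):
    """Hypothetical end-to-end life cycle of a query across a pool's micro-services."""
    # 1. SQL-FE: authentication/authorization, compilation, metadata binding -> MEMO.
    memo = sql_fe.compile(sql_text)

    # 2. DQP: distributed cost-based optimization and workload-aware scheduling
    #    of the resulting query task DAG over the current topology.
    task_dag = dqp.optimize(memo)
    for template in dqp.schedule(task_dag):
        for task in template.tasks:
            es = compute_servers[task.target_node].execution_service
            # 3. ES: track the task's life span; the local SQL Server instance
            #    executes the task template over its input cells.
            es.execute(task)

    # 4. Results flow back over data channels to the SQL-FE, which returns them.
    return sql_fe.collect_results(task_dag)
```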
[Figure 10. Elastic Compute Scenarios: three topologies of compute servers connected to the DQP via control and data channels, illustrating auto-scale (adding nodes), resilience to failure (re-starting a failed node's tasks on healthy nodes), and hot-spot recovery (re-distributing tasks away from an overloaded node).]

8.2 Service form factors
The separation of state and compute, coupled with the auto-scaling capabilities of a pool (explained in the next section), allows us to support very high concurrency levels within each pool, as well as enabling all of the following user-facing service form factors:
▪ Serverless. One system-managed Polaris pool with auto-scale ranging from 0 to N compute nodes, where N is constrained only by capacity within Azure compute.
▪ Capacity reservation. A dedicated Polaris pool with a minimum reservation of capacity and auto-scale capacity up to a maximum user-specified size.
▪ Multiple pools. Multiple Polaris pools with capacity reservation. Pool sizes can either be defined by the user or they can grow and shrink dynamically.

9. ELASTIC QUERY PROCESSING
The infrastructure of a cloud is inherently elastic in that compute containers (e.g., VMs, k8s containers) can be obtained or released nearly instantly. This means nodes can be added to and removed from a query processing compute topology in a matter of seconds. With appropriate telemetry, the system can auto-scale up or down proactively based on workload needs, or react to unexpected events such as faulty nodes and infrastructure upgrades.

In any of these scenarios, the Polaris query processor must ensure that tasks can be flexibly assigned to compute nodes in dynamically changing query execution topologies. We achieve this objective by leveraging several aspects of the Polaris framework:
▪ Separation of state and compute.
▪ Flexible abstraction of datasets as cells.
▪ Task inputs defined in terms of cells.
▪ Fine-grained orchestration of tasks using state machines.
Figure 10 depicts examples of key scenarios we unlock with this architecture. We explain each one in the following sections.

9.1 Auto-Scale
The Polaris DQP requests more containers from the underlying compute fabric to adjust to peaks in the workload, and re-distributes tasks to transparently leverage the new containers. Note that in-flight tasks in the previous topology continue running, while new queries get the new compute power with appropriate load balancing. In Figure 10, we show a doubling of compute capacity; however, we can add capacity in increments of just one node. The Polaris DQP can also autonomously scale down the compute node topology (in increments of one or more nodes) when utilization drops sufficiently.

Resilience to Node Failures
Figure 10 also illustrates how the Polaris DQP recovers from node failures while tasks are running. If a server fails, the DQP rebalances the tasks of the failed node across the rest of the healthy topology. The fault tolerance model is built into the hierarchical state machine discussed in Section 6. A node failure transitions the execution tasks in a container into the Failed state. Then the parent task template state machine reacts appropriately—tasks previously assigned to the faulty node are restarted on healthy nodes. This feature is essential for scaling to very large queries, since the probability of node failure increases with the number of nodes involved.

9.2 Skewed Computations
Figure 10 also shows how skewed computations, or hot spots, are handled. The Polaris DQP and the ES in the compute servers implement a feedback loop that tracks the life span of execution tasks on a node. If the DQP detects that a node is overloaded (e.g., the yellow node in the figure), it can decide to re-schedule a subset of the tasks assigned to that compute node amongst other nodes where the load is lower. If this does not mitigate the hot spot, we fall back on the auto-scale feature to add more nodes to the topology and rebalance the load appropriately. Skewed computations are handled using runtime feedback loops, and our query optimizer does not currently take data skew into account. How to handle skew during query optimization is future work.

9.3 Affinitizing Tasks to Compute
As explained in Section 8, the SQL Server service in the compute server extends caching of hot data to its local SSDs. Accessing data from remote storage is an expensive operation, and therefore the elastic features of the Polaris DQP try to minimize the impact on the cache through consistent assignment of data cells to tasks, with our scheduler assigning tasks to compute based on data collocation, thus preserving caches upon topology changes. In particular, (a) on topology shrinkage, only the caches of the nodes that are no longer part of the topology are lost, and (b) on topology growth, all the caches from the existing topology are preserved, and only the caches for the new nodes need to be populated.
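The cache-preserving re-assignment described in Sections 9.1-9.3 can be sketched as a consistent cell-to-node assignment (Python; an illustrative assumption, since Polaris does not publish its exact placement function):

```python
import hashlib

def owner(cell, nodes):
    """Rendezvous-style assignment: a cell keeps the same owner as long as that
    node remains in the topology, so its cached data stays useful."""
    ranked = sorted(nodes, key=lambda n: hashlib.sha1(f"{cell}:{n}".encode()).hexdigest())
    return ranked[0]          # highest-affinity healthy node

def reassign_on_topology_change(cells, old_nodes, new_nodes):
    """Return only the cells whose owner changes; everything else keeps its cache."""
    moved = {}
    for cell in cells:
        before, after = owner(cell, old_nodes), owner(cell, new_nodes)
        if before != after:
            moved[cell] = (before, after)
    return moved

# Growing the topology: existing nodes keep most of their cells (and caches);
# only the cells re-homed to the new node must warm a cold cache.
cells = [(i, j) for i in range(4) for j in range(8)]
moved = reassign_on_topology_change(cells, ["n1", "n2", "n3"], ["n1", "n2", "n3", "n4"])
print(len(moved), "of", len(cells), "cells move to the new node")
```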
Figure 11. Results for 5k concurrent queries
10. PERFORMANCE EVALUATION
10.1 Goals
Our goal is to understand the performance and concurrency characteristics of Polaris on a single pool over structured and unstructured data. We break our experiments down along three dimensions.

Concurrency. We want to stress the DQP with a concurrent workload. The global graph in this scenario consists of thousands of tasks from thousands of different queries, showcasing the resource-driven scheduling capabilities of the WLM and its autonomy around capacity management, access control, and resource governance under heavy load. For this we use the TPC-DS [30] workload to run a multi-user environment executing five thousand queries simultaneously.

Single-query performance at PB scale over the data lake. We ran all TPC-H [31] queries at 1PB scale across hundreds of machines on Azure public compute. The goal of this experiment is to stress the scalability, elasticity, and fault-tolerance capabilities of the service. Note that this is not a validated TPC-H benchmark; the only intent is to demonstrate that we can run all queries at a scale that has not been attempted before.

Querying heterogeneous data. To illustrate that Polaris can run on heterogeneous data, we ran all TPC-H [31] queries at 1TB scale on a dataset consisting of a variety of data files, ranging from raw CSV files to Parquet files with nested attributes. The test was executed using fewer than 100 cores on Azure public compute. The experiment emphasizes raw file parsing and query optimization capabilities over joins between plain text files and Parquet files with nested attributes.

10.2 Concurrency
The set-up
We used the TPC-DS dbgen utility to generate 1TB of raw data and then converted it into Parquet files stored in Windows Azure Storage Blob (WASB), the Azure Data Lake. Rows do not follow any particular distribution, since we are not focusing on single-query performance but on concurrency. The application spans five thousand concurrent sessions, each executing one distinct TPC-DS query. For this we generated the TPC-DS queries 50 times with different predicate ranges and assigned one query to each session.

The compute topology
Since the goal of this experiment is to stress the DQP component, we chose a rather small compute topology with 10 compute nodes. The hardware configuration of each node consists of 2x20-core Intel processors, 520GB of RAM, and 4 SSDs of 1TB each. The network topology is 40Gb throughout: 40Gb NIC, 40Gb TOR, and 40Gb CSP.

Results
Figure 11. Results for 5k concurrent queries
Figure 11 shows the task execution summary and the resource utilization in the backend nodes. The 5k queries run simultaneously, generating a workload graph of over 50k task templates that, as they are scheduled, expand to an aggregate of ~550k instantiated tasks. As task templates are scheduled for execution, tasks are instantiated; the chart on the left of the figure shows the number of actively executing tasks and the cumulative number of completed tasks at any point in time over the duration of the test. The chart on the right shows the average CPU and memory utilization of the compute servers. As the charts show, cluster utilization is good for the duration of the test. For this experiment we used a FIFO scheduling order for task templates; we believe both resource utilization and elapsed time can be improved with more sophisticated policies, but experiments with different scheduling orders are out of the scope of this paper and will be carried out in the near future. The main observation is that Polaris is able to handle high concurrency for a complex workload such as TPC-DS, packing up to 9k tasks onto the 10 available compute servers and completing approximately 550k tasks.
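As a rough illustration of the template-to-task expansion and the FIFO policy described above, the following sketch queues task templates per query and instantiates one task per input cell when a template is dispatched. The per-template cell count, the 9k-task capacity, and the one-task-per-cell expansion are simplifying assumptions for illustration only, not a description of the Polaris scheduler.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class TaskTemplate:
    query_id: int
    operator: str
    cells: int                      # number of input cells the template covers

    def instantiate(self):
        # One task per input cell (simplifying assumption).
        return [f"q{self.query_id}:{self.operator}:cell{c}" for c in range(self.cells)]

@dataclass
class FifoScheduler:
    capacity: int                                   # max tasks running across the pool
    queue: deque = field(default_factory=deque)     # FIFO queue of task templates
    running: list = field(default_factory=list)

    def submit(self, template: TaskTemplate) -> None:
        self.queue.append(template)

    def dispatch(self) -> None:
        # Expand templates strictly in FIFO order while capacity remains.
        while self.queue and len(self.running) + self.queue[0].cells <= self.capacity:
            self.running.extend(self.queue.popleft().instantiate())

scheduler = FifoScheduler(capacity=9000)            # ~9k concurrent tasks, as in the test
for q in range(5000):                               # 5k concurrent queries
    scheduler.submit(TaskTemplate(q, "scan", cells=8))
scheduler.dispatch()
print(len(scheduler.running), "tasks running,", len(scheduler.queue), "templates queued")

A smarter policy could reorder the queue based on resource estimates; as noted above, exploring such policies is left for future work.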
10.3 Query Performance at Petabyte Scale
The set-up
We used the TPC-H dbgen utility to generate a PB of raw data and then converted it into Parquet files stored in WASB. The Parquet files were organized using the data model from Section 3, with both hash partitions and user partitions. The total number of Parquet files is ~120k, with a total compressed size of 360TB.

The compute topology
We deployed a Polaris pool on Azure consisting of one SQL FE compute instance, one DQP, and 420 compute execution services (ES). Each node has 2x12-core Intel processors, 192GB of RAM, and 4 SSDs of 480GB. The network topology is 40Gb throughout: 40Gb NIC, 40Gb TOR, and 40Gb CSP.

Results
Figure 12. 1PB TPC-H single query performance
Figure 12 shows the execution time for all 22 TPC-H queries at 1PB scale. To the best of our knowledge, this is the first time results have been published at PB scale. Remarkably, some queries (Q6, Q12, Q15 and Q16) run extremely fast through partition elimination and distribution alignment of expensive joins, taking advantage of the Polaris data model (Section 3). TPC-H has a few queries that stress the processing limits of any system, since they join across all sources with low selectivity and include very heavy joins between large dimension tables and the fact table: Q9 and Q21 are good examples. Polaris processes these queries at PB scale in under two hours across 420 machines, demonstrating scalability and resilience.

10.4 Querying Heterogeneous Data
The set-up
We used the TPC-H dbgen utility to generate a TB of raw data in CSV format and then converted the files for the lineitem, customer, supplier, and nation tables into Parquet files. Conversion into Parquet for the customer and supplier files was done by organizing contact information (the name, address, nationkey, and phone columns) as nested types in Parquet. The lineitem and nation Parquet files were organized with simple types, without nested structure. Files for the orders, partsupp, part, and region tables were kept in raw CSV format. All files for a single entity were stored in a single folder in WASB.

The compute topology
We deployed a Polaris pool with one SQL FE compute instance, one DQP, and several execution services (ES).

Results
Figure 13. 1TB TPC-H querying heterogeneous data
Figure 13 shows the query execution time for all 22 TPC-H queries at 1TB scale, combining queries over CSV files for some entities and Parquet files with and without nested types for other entities. Polaris executes all 22 queries and produces good plans even for the most complex queries, which join across a variety of files (CSV, Parquet with simple types, and Parquet with nested types). This demonstrates the robustness of the system in handling heterogeneous data sources.

11. CONCLUSIONS
In this paper, we presented Polaris, a novel distributed query processing framework in Azure Synapse that seeks to support both big data and relational warehouse workloads, going beyond current systems of either kind in its flexibility and scalability. The architecture is inspired by scale-out techniques from big data systems and extends them in many ways, notably in the cell abstraction of data, the flexible task orchestration framework, and the global workload task graph. Polaris is also notable for how it carefully refactors SQL Server's complex codebase in order to leverage its query optimizer and scale-up single-node engine, both of which reflect many years of refinement, while completely rewriting the distributed execution framework. Polaris is also cloud-native, completely separating compute from both storage and transactional state in order to support agile provisioning and scaling of compute pools. Azure Synapse is unique among cloud services in supporting both serverless and provisioned form factors, with multiple serverless and provisioned SQL sessions able to operate concurrently on the same datasets, across both lake and managed data.

Appendix
Required properties
The following table contains the required properties for the most common algebraic operators. Columns are treated as equivalence classes (transitive closures) when testing the required properties for algebraic correctness. When join predicates have multiple equality conjuncts, correctness holds if the hash key of each input is a subset of the columns in the conjuncts from that input. For Group-By, correctness holds if the hash key of the input is a subset of the grouping columns. The distributed query processor also supports decomposing aggregations and Top-N into local-global forms, which allows the optimizer to push selective local operators below data movement enforcers. P[a] subsumes P∅.

Operator     Required Properties
Inner Join   P ⋈a=b Q:   {{P[a] ∧ Q[b]} ∨ {P1} ∨ {Q1}}
Outer Join   P →a=b Q:   {{P[a] ∧ Q[b]} ∨ {Q1}}
Semi-Join    P ⋉a=b Q:   {{P[a] ∧ Q[b]} ∨ {P1 ∧ Q[b]} ∨ {Q1}}
Anti-Join    P −a=b Q:   {{P[a] ∧ Q[b]} ∨ {Q1}}
Group-By     GB(P, a):   {{P[a]} ∨ {P1}}
Project      Π(P):       {true}
Select       σ(P):       {true}
Top          Top(P):     {P1}
Union-All    P ⊎ Q:      {{P∅ ∧ Q∅} ∨ {P1 ∧ Q1}}
Union        P ∪ Q:      {{P[a] ∧ Q[b]} ∨ {P1 ∧ Q1}}
Apply        P Apply Q:  {Q1}
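As an illustration of how such required properties might be tested, the sketch below encodes distribution properties and checks the Inner Join row of the table: the join is correct without a data movement enforcer if both inputs are hash-distributed on (a subset of) their join columns, or if either input satisfies the P1/Q1 alternative, which we read here, as an assumption, as the input being available in a single (serial or replicated) distribution. The encoding and names are our own and are not taken from the Polaris implementation.

from dataclasses import dataclass

@dataclass(frozen=True)
class Distribution:
    kind: str                       # "hash", "singleton", or "any" (no useful property)
    columns: frozenset = frozenset()

def inner_join_satisfied(p: Distribution, q: Distribution,
                         p_join_cols: set, q_join_cols: set) -> bool:
    # Inner Join required properties: {P[a] AND Q[b]} OR {P1} OR {Q1}.
    # p_join_cols / q_join_cols are each input's columns in the equality
    # conjuncts, assumed already collapsed into equivalence classes.
    if p.kind == "singleton" or q.kind == "singleton":
        return True
    co_aligned = (p.kind == "hash" and q.kind == "hash"
                  and p.columns <= p_join_cols
                  and q.columns <= q_join_cols)
    return co_aligned

lineitem = Distribution("hash", frozenset({"l_orderkey"}))
orders = Distribution("hash", frozenset({"o_orderkey"}))

# Aligned hash distributions on the join keys: no data movement enforcer needed.
print(inner_join_satisfied(lineitem, orders, {"l_orderkey"}, {"o_orderkey"}))              # True
# Unaligned inputs: the optimizer must insert a data movement enforcer.
print(inner_join_satisfied(lineitem, Distribution("any"), {"l_orderkey"}, {"o_custkey"}))  # False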
12. REFERENCES
[1]. Azure Synapse Analytics. https://azure.microsoft.com/en-us/services/synapse-analytics/
[2]. Microsoft Report. FIDO: A Cloud-Native Versioned Store With Concurrent Transactional Updates. 2020.
[3]. Ashish Thusoo et al. Hive – A Petabyte Scale Data Warehouse Using Hadoop. Long Beach, California, USA: ICDE Conference, 2010.
[4]. R. Chaiken et al. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. Auckland, New Zealand: VLDB Conference, 2008.
[5]. Michael Armbrust et al. Spark SQL: Relational Data Processing in Spark. Melbourne, Victoria, Australia: SIGMOD Conference, 2015.
[6]. Michael Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Lisboa, Portugal: EuroSys, 2007.
[7]. Anurag Gupta et al. Amazon Redshift and the Case for Simpler Data Warehouses. Melbourne, Victoria, Australia: SIGMOD Conference, 2015.
[8]. AWS Athena. https://aws.amazon.com/athena/
[9]. An Inside Look at Google BigQuery. https://cloud.google.com/files/BigQueryTechnicalWP.pdf
[10]. Sergey Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. VLDB Endowment, 2010.
[11]. Benoit Dageville et al. The Snowflake Elastic Data Warehouse. San Francisco, California, USA: SIGMOD Conference, 2016.
[12]. Oracle Exadata. https://www.oracle.com/technetwork/database/exadata/exadata-technical-whitepaper-134575.pdf
[13]. Teradata. https://www.teradata.com/
[14]. Apache Hive LLAP. https://cwiki.apache.org/confluence/display/Hive/LLAP
[15]. Marcel Kornacker et al. Impala: A Modern, Open-Source SQL Engine for Hadoop. Asilomar, California, USA: CIDR, 2015.
[16]. Presto. https://prestodb.io/
[17]. R. Sethi et al. Presto: SQL on Everything. Macao, Macao: ICDE Conference, 2019.
[18]. Introduction to the Presto Cost-Based Optimizer. https://prestosql.io/blog/2019/07/04/cbo-introduction.html
[19]. Srinath Shankar et al. Query Optimization in Microsoft SQL Server PDW. Scottsdale, Arizona, USA: SIGMOD Conference, 2012.
[20]. Raghu Ramakrishnan et al. Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics. Chicago, IL, USA: SIGMOD Conference, 2017.
[21]. Delta Lake. https://delta.io/
[22]. Panagiotis Antonopoulos et al. Socrates: The New SQL Server in the Cloud. Amsterdam, Netherlands: SIGMOD Conference, 2019.
[23]. Dharma Shukla et al. Schema-Agnostic Indexing with Azure DocumentDB. Kohala Coast, Hawaii: VLDB Conference, 2015.
[24]. Morton M. Astrahan et al. System R: Relational Approach to Database Management. ACM Transactions on Database Systems, 1976, pp. 16-36.
[25]. Goetz Graefe and William J. McKenna. The Volcano Optimizer Generator: Extensibility and Efficient Search. Vienna, Austria: IEEE International Conference on Data Engineering, 1993.
[26]. Goetz Graefe. The Cascades Framework for Query Optimization. Data Engineering Bulletin, Vol. 18, 1995.
[27]. Dorit S. Hochbaum and David B. Shmoys. Using Dual Approximation Algorithms for Scheduling Problems: Theoretical and Practical Results. Portland, OR, USA: 26th Annual Symposium on Foundations of Computer Science (SFCS 1985), IEEE, 1985.
[28]. Minos N. Garofalakis and Yannis E. Ioannidis. Parallel Query Scheduling and Optimization with Time- and Space-Shared Resources. Athens, Greece: VLDB Conference, 1997.
[29]. Minos N. Garofalakis and Yannis E. Ioannidis. Multi-dimensional Resource Scheduling for Parallel Queries. Montreal, Canada: SIGMOD Conference, 1996.
[30]. The TPC-DS Benchmark. [Online] http://www.tpc.org/tpcds/
[31]. The TPC-H Benchmark. [Online] http://www.tpc.org/tpch/