POLARIS-Distributed Query Engine in Synapse
access all underlying data concurrently. Fido supports efficient transactional updates with data versioning.

1.1 Related Systems
The most closely related cloud services are AWS Redshift [7], Athena [8], Google Big Query [9, 10], and Snowflake [11]. Of course, on-premises data warehouses such as Exadata [12] and Teradata [13], and big data systems such as Hadoop [3, 4, 14, 15], Presto [16, 17] and Spark [5], target similar workloads (increasingly migrating to the cloud) and have architectural similarities.

• Converging data lakes and warehouses. Polaris represents data using a "cell" abstraction with two dimensions: distributions (data alignment) and partitions (data pruning). Each cell is self-contained with its own statistics, used for both global and local QO. This abstraction is the key building block enabling Polaris to abstract data stores. Big Query and Snowflake support a sort key (partitions) but not distribution alignment; we discuss this further in Section 4.

• Service form factor. On one hand, we have reserved-capacity services such as AWS Redshift, and on the other, serverless offerings such as Athena and Big Query. Snowflake and Redshift Spectrum are somewhere in the middle, with support for online scaling of the reserved capacity pool size. Leveraging the Polaris session architecture, Azure Synapse is unique in supporting both serverless and reserved pools with online scaling; the pool form factor represents the next generation of the current Azure SQL DW service, which is subsumed as part of Synapse. The same data can simultaneously be operated on from both serverless SQL and SQL pools.

• Distributed cost-based query optimization over the data lake. Related systems such as Snowflake [11], Presto [17, 18] and LLAP [14] do query optimization, but they have not gone through the years of fine-tuning of SQL Server, whose cost-based selection of distributed execution plans goes back to the Chrysalis project [19]. A novel aspect of Polaris is how it carefully re-factors the optimizer framework in SQL Server and enhances it to be cell-aware, in order to fully leverage the Query Optimizer (QO), which implements a rich set of execution strategies and sophisticated estimation techniques. We discuss Polaris query optimization in Section 5; this is key to the performance reported in Section 10.

• Massive scale-out of a state-of-the-art scale-up query processor. Polaris has the benefit of building on one of the most sophisticated scale-up implementations in SQL Server, and the scale-out framework is designed expressly to achieve this—tasks at each node are delegated to SQL Server instances—by carefully re-factoring SQL Server code.

• Global resource-aware scheduling. The fine-grained representation of tasks across all queries in the Polaris workload graph is inspired by big data task graphs [3, 4, 5, 6], and enables much better resource utilization and concurrency than traditional data warehouses. Polaris advances existing big data systems in the flexibility of its task orchestration framework, and in maintaining a global view of multiple queries to do resource-aware cross-query scheduling. This improves both resource utilization and concurrency. In future, we plan to build on this global view with autonomous workload management features. See Section 6.

• Multi-layered data caching model. Hive LLAP [14] showed the value of caching and pre-fetching of column-store data for big data workloads. Caching is especially important in cloud-native architectures that separate state from compute (Section 2), and Polaris similarly leverages SQL Server buffer pools and SSD caching. Local nodes cache columnar data in buffer pools, complemented by caching of distributed data in SSD caches.

2. SEPARATING COMPUTE AND STATE
Figure 1 shows the evolution of data warehouse architectures over the years, illustrating how state has been coupled with compute.

Figure 1. Decoupling state from compute: (a) stateful compute (the on-prem architecture keeps caches, transaction state, and data in compute; the storage-separation architecture moves data to remote storage but keeps caches and transaction state in compute); (b) stateless compute (the state-separation architecture keeps only caches with compute).

To drive the end-to-end life cycle of a SQL statement with transactional guarantees and top-tier performance, engines maintain state, comprised of cache, metadata, transaction logs, and data. On the left side of Figure 1, we see the typical shared-nothing on-premises architecture where all state is in the compute layer. This approach relies on small, highly stable and homogeneous clusters with dedicated hardware for Tier-1 performance; it is expensive, hard to maintain, and cluster capacity is bounded by machine sizes because of the fixed topology, so it has scalability limits.

The shift to the cloud moves the dial towards the right side of Figure 1 and brings key architectural changes. The first step is the decoupling of compute and storage, providing more flexible resource scaling. Compute and storage layers can scale up and down independently, adapting to user needs; storage is abundant and cheaper than compute, and not all data needs to be accessed at all times. The user does not need compute to hold all data, and only pays for the compute needed to query a working subset of it.

Decoupling of compute and storage is not, however, the same as decoupling compute and state. If any of the remaining state held in compute cannot be reconstructed from external services, then compute remains stateful. In stateful architectures, state for in-flight transactions is stored in the compute node and is not hardened into persistent storage until the transaction commits. As such, when a compute node fails, the state of non-committed transactions is lost, and there is no alternative but to fail in-flight transactions. Stateful architectures often also couple metadata describing data distributions and mappings to compute nodes; thus, a compute node effectively owns responsibility for processing a subset of the data, and its ownership cannot be transferred without a cluster restart. In summary, resilience to compute node failure and elastic assignment of data to compute are not possible in stateful architectures. Several cloud services and on-prem data warehouse architectures fall into this category, including Redshift, SQL DW, Teradata, Oracle, etc.
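To make the distinction concrete, here is a minimal, purely illustrative Python sketch (the class and category names are assumptions, not Polaris APIs) of where each kind of state lives on the two sides of Figure 1; the check mirrors the rule that compute is stateless only if every kind of state that cannot be rebuilt from external services is externalized.

```python
from dataclasses import dataclass, field

# Categories of engine state named in the text: cache, metadata,
# transaction logs, and data.
STATE_KINDS = ("cache", "metadata", "transaction_log", "data")

@dataclass
class StatePlacement:
    """Maps each kind of state to the layer that holds it (illustrative)."""
    placement: dict = field(default_factory=dict)

    def is_stateless_compute(self) -> bool:
        # Compute is stateless iff everything except the cache lives outside
        # compute; caches are exempt because they can be lazily rebuilt
        # from persisted data.
        return all(
            self.placement[kind] != "compute"
            for kind in STATE_KINDS
            if kind != "cache"
        )

# (a) Stateful compute: all state, including in-flight transaction state
# and metadata, is held in the compute layer.
stateful = StatePlacement({kind: "compute" for kind in STATE_KINDS})

# (b) Stateless compute: data in remote storage, metadata and transaction
# log in centralized services; only the rebuildable cache stays local.
stateless = StatePlacement({
    "cache": "compute",
    "metadata": "centralized_service",
    "transaction_log": "centralized_service",
    "data": "remote_storage",
})

assert not stateful.is_stateless_compute()
assert stateless.is_stateless_compute()
```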
Stateless compute architectures require that compute nodes hold no state information, i.e., all data, transactional logs and metadata need to be externalized. This allows the application to partially restart the execution of queries in the event of compute node failures, and to adapt to online changes of the cluster topology without failing in-flight transactions. Caches need to be as close to the compute as possible, and since they can be lazily reconstructed from persisted data they do not necessarily need to be decoupled from compute. Therefore, the coupling of caches and compute does not make the architecture stateful.

Polaris is a cloud-native distributed analytics system that follows a stateless architecture. In the remainder of the paper we go through the technical highlights of the architecture, and finally, we present results of running all 22 TPC-H queries at 1PB scale on Azure.

• Abstraction from the data format. Polaris, as an analytical query engine over the data lake, must be able to query any data, relational or unstructured, whether in a transactionally updatable managed store or an unmanaged file system. Hence, we need a clean abstraction over the underlying data type and format, capturing just what is needed for efficiently parallelizing data processing. A dataset in Polaris is logically abstracted as a collection of cells that can be arbitrarily assigned to compute nodes to achieve parallelism. The Polaris distributed query processing framework (DQP) operates at the cell level and is agnostic to the details of the data within a cell. Data extraction from a cell is the responsibility of the (single-node) query execution engine, which is primarily SQL Server, and is extensible for new data types.

• Wide distribution. For scale-out processing, each dataset must be distributed across thousands of buckets, or subsets of data objects, such that they can be processed in parallel across compute nodes.

The partitioning function p(r) is a user-defined function that takes as input an object r and returns the partition i in which r is positioned. This is useful for aggressive partition pruning when range or equality predicates are defined over the partitioning key. (If the user does not specify p for a dataset, the partition pruning optimization is not applicable.)

Cells can be grouped physically in storage however we choose (examples of groupings are shown as dotted rectangles in Figure 2), so long as we can efficiently access Cij. Queries can selectively reference either cell dimension, or even individual cells, depending on the predicates and type of operations present in the query.

Figure 2. Polaris Data Model: a dataset as a collection of data cells Cij, with example cell groupings shown as dotted rectangles.

Flexible Assignment of Cells to Compute
Query processing across thousands of machines requires query resilience to node failures. For this, the data model needs to support a flexible allocation of cells to compute, such that upon node failure or topology change, we can re-assign cells of the lost node to the remainder of the topology. This flexible assignment of cells to compute is ensured by maintaining metadata state (specifically, the assignment of cells to compute nodes at any given time) in a durable manner outside the compute nodes.
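A minimal Python sketch of the cell abstraction and its flexible assignment to compute (all names are illustrative assumptions, not the Polaris implementation): cells are addressed by hash distribution and user partition, the user-defined p(r) enables partition pruning, and the cell-to-node assignment is plain metadata that can be recomputed whenever the topology changes.

```python
from collections import defaultdict

class CellDataset:
    """A dataset as a collection of cells C[i][j]:
    i = hash distribution (data alignment), j = user partition (data pruning)."""

    def __init__(self, num_distributions, partition_fn=None):
        self.n = num_distributions        # N hash distributions
        self.partition_fn = partition_fn  # user-defined p(r); optional
        self.cells = defaultdict(list)    # (i, j) -> objects in that cell

    def insert(self, obj, hash_key):
        i = hash(hash_key) % self.n
        j = self.partition_fn(obj) if self.partition_fn else 0
        self.cells[(i, j)].append(obj)

    def prune(self, wanted_partitions=None):
        """Cells a query needs; partition pruning applies only when a
        partitioning function was specified."""
        return {
            (i, j): objs
            for (i, j), objs in self.cells.items()
            if wanted_partitions is None or j in wanted_partitions
        }

def assign_cells(cells, compute_nodes):
    """Cell-to-compute assignment is durable metadata kept outside the
    compute nodes, so cells of a failed node can be re-assigned to the
    remaining topology without a cluster restart."""
    return {
        cell: compute_nodes[k % len(compute_nodes)]
        for k, cell in enumerate(sorted(cells))
    }

# Example: objects partitioned by year and hash-distributed on an id.
ds = CellDataset(num_distributions=4, partition_fn=lambda r: r["year"])
ds.insert({"id": 1, "year": 2019}, hash_key=1)
ds.insert({"id": 2, "year": 2020}, hash_key=2)
cells_2020 = ds.prune(wanted_partitions={2020})       # partition pruning
plan = assign_cells(cells_2020, ["node-a", "node-b"])  # flexible assignment
```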
The cell abstraction enables highly scalable distributed query processing over analytical stores such as ADLS [20], Fido [2], and Delta [21], as well as transactional stores such as Socrates [22] and Cosmos DB [23]. Of course, when data is stored in columnar formats tailored for vectorized processing, this further improves relational query performance.

A Note on Queries
In this paper, we mostly focus on relational queries (with the exception of Section 10.4). Data objects are assumed to have the attributes required by the relational operators to which they are input. That said, the generality of the data abstraction underlying Polaris's query processing means that we can handle datasets represented in diverse formats and stored in different repositories. For example, Polaris can run directly over data in HDFS and in managed transactional stores. Further, different objects in a dataset could differ in the attributes attached to them, and objects could have additional uninterpreted attributes.

When enumerating physical distributed plans in the search space, the DQO uses the required distribution properties on operators to discard alternatives. The list of required properties for each relational algebra operation is given in the appendix of this paper.

Distribution Properties as "Interesting Properties"
System R [24] introduced the concept of interesting properties, namely physical properties (e.g., sort order) such that the best plan for producing (intermediate) tables with each interesting property is saved during the enumeration of the search space. Thus, the cheapest plan for producing an intermediate table in sorted order by the first column would be saved even if there is a cheaper plan to produce the same table unsorted or in a different sort order. Similarly, in the distributed search space, the Polaris DQO uses the required distribution properties of relational algebra operators as interesting properties. When enumerating the physical plan alternatives bottom-up, the best plan for each property and the best plan overall based on cost are kept.
Figure 5. Execution Model: task generation in the DQP, where the MEMO plan for P ⋈a=b Q is instantiated over the cells of P and Q (organized into user partitions (M) and hash distributions (N)), and the cells-to-tasks mapping, with task Ti = Pi ⋈a=b Qi and the distributed join expressed as ⋃_{i=1}^{N} Pi ⋈a=b Qi.
As an example, Figure 4 shows the enumeration of the alternative distributed physical execution plans for an inner join, P ⋈a=b Q, where P and Q are (say, files in a data lake or tables in a managed distributed relational store) hashed on a and c respectively (P^[a] and Q^[c]). The enumeration of physical alternatives starts with the scans of P and Q, shown in the bottom-most part of the figure. Q is hash distributed on column c; hence, Q^[c] is the first alternative generated. Replication and hash distribution on b are interesting properties pushed top-down, leading to the enumeration of sub-plans Q^1 and Q^[b] respectively. P is hash distributed on column a, generating P^[a] as the first alternative. Replication and hash distribution on a are also interesting properties pushed top-down. Since we already satisfy hash distribution on a via P^[a], we only need to produce P^1. The plan node in the top half of Figure 4 shows the enumeration of plans for the join operation; this is a permutation of the alternatives produced by its children at the bottom of the figure. During the enumeration, correctness filters are applied, thereby eliminating P^[a] ⋈a=b Q^[c] from the search space, since it does not satisfy any of the distribution properties required by an inner join. For the remaining alternatives, only the best plan for each interesting property is kept:

P^[a] ∧ Q^[b]: P^[a] ⋈a=b Q^[b]
P^1: P^1 ⋈a=b Q^[c]
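To make the retention of best-per-property plans concrete, here is a small Python sketch of this style of enumeration (a simplified illustration with invented costs, not the actual DQO code): alternatives are generated per input, combinations that satisfy no required distribution property of the inner join are discarded by the correctness filter, and only the cheapest plan per satisfied property survives.

```python
from itertools import product

def child_alternatives(name, base_col, join_col, base_cost, move_cost):
    """Keep the input as stored, re-hash it on the join column via a data
    move enforcer, or replicate it. Each alternative is (distribution, cost)."""
    alts = {f"{name}^[{base_col}]": (("hash", base_col), base_cost)}
    if join_col != base_col:
        alts[f"{name}^[{join_col}]"] = (("hash", join_col), base_cost + move_cost)
    alts[f"{name}^1"] = (("replicated", None), base_cost + 3 * move_cost)
    return alts

def satisfied_property(left_dist, right_dist, left_key, right_key):
    """Which required distribution property of the inner join (if any)
    does this pair of child distributions satisfy?"""
    if left_dist == ("hash", left_key) and right_dist == ("hash", right_key):
        return "hash-aligned on join keys"
    if left_dist == ("replicated", None) or right_dist == ("replicated", None):
        return "one side replicated"
    return None  # correctness filter: eliminate this combination

# P is stored hashed on a (already the join key); Q is stored hashed on c.
P = child_alternatives("P", base_col="a", join_col="a", base_cost=10, move_cost=5)
Q = child_alternatives("Q", base_col="c", join_col="b", base_cost=10, move_cost=5)

best = {}  # interesting property -> (cost, plan)
for (pn, (pd, pc)), (qn, (qd, qc)) in product(P.items(), Q.items()):
    prop = satisfied_property(pd, qd, "a", "b")
    if prop is None:
        continue                       # e.g. P^[a] join Q^[c] is discarded
    cost = pc + qc
    if prop not in best or cost < best[prop][0]:
        best[prop] = (cost, f"{pn} JOIN(a=b) {qn}")

for prop, (cost, plan) in best.items():
    print(f"{prop}: {plan} (cost {cost})")
```

Run as-is, the sketch retains one hash-aligned plan and one replication-based plan, in the spirit of the alternatives listed above.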
5.1 Polaris Tasks
A key challenge in Polaris was how to essentially re-architect distributed query processing while leveraging as much of the existing SQL Server capabilities as possible, and ensuring that the resulting system was a faithful implementation of all user-visible semantics.

To this end, all incoming queries in Polaris are compiled in two phases. The first phase of the compilation stage leverages the SQL Server Cascades QO to generate the logical search space, or MEMO [25, 26]. The MEMO contains all logically equivalent alternative plans to execute the query. A second phase performs distributed cost-based QO to enumerate all physical distributed implementations of these logical plans and picks the one with the least estimated cost. The outcome is a good distributed query plan that takes data movement cost into account, as explained in [19].

When enumerating the physical space during the second phase of the QO process, a query plan in the MEMO is seen as a directed acyclic graph (DAG) of physical operators, each corresponding to an algebraic sub-expression E in the query. For simplicity, we use E to denote both the expression and its instantiation as an operator in the MEMO. Operator E has a degree of partitioned parallelism N that defines the number of instances of E that run in parallel, each on a partition of the input. We denote the distributed execution of E as ⋃_{i=1}^{N} E_i, where E_i represents the execution of E over the i-th hash-distribution of its inputs, which must satisfy the required distribution properties of the join operator.
The same notation can be extended to represent more complex relational expressions and distribution variations, but we omit the details.

Next, we introduce the notion of a task Ti as the physical execution of an operator E on the i-th hash-distribution of its inputs. Tasks are instantiated templates of (the code executing) expression E that run in parallel across the N hash-distributions of the inputs, as illustrated in Figure 5 with blue triangles. A task has three components:

• Inputs. Collections of cells for each input's data partition. These cells can be stored either in highly available remote storage or in temporary local disks.
• Task template. Code to execute on the compute nodes, representing the operator expression E.
• Output. The output dataset, represented as a collection of cells produced by the task. The output of a task is either an intermediate result for another task to consume or the final result to return to the user, and is distributed across several nodes corresponding to the consuming task's degree of parallelism.
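A minimal Python sketch of these three components (the field names and the T-SQL snippet are illustrative assumptions, not the actual Polaris structures; in the real system the template is encoded as T-SQL for the local SQL Server, as Section 5.3 explains):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (hash distribution i, user partition j)

@dataclass
class Task:
    """Physical execution of operator expression E on the i-th
    hash-distribution of its inputs."""
    task_id: int
    # Inputs: collections of cells per input, held in highly available
    # remote storage or in temporary local disks.
    inputs: Dict[str, List[Cell]]
    # Task template: code representing E, executed on a compute node.
    template: str
    # Output: a collection of cells, consumed by another task or returned
    # to the user, distributed across the consuming task's nodes.
    output_cells: List[Cell]
    consumer_parallelism: int = 1

# Task T3 of the join example: it joins the 3rd hash distribution of P
# with the 3rd hash distribution of Q (hypothetical values throughout).
t3 = Task(
    task_id=3,
    inputs={"P": [(3, j) for j in range(4)], "Q": [(3, j) for j in range(4)]},
    template="SELECT ... FROM P_3 JOIN Q_3 ON P_3.a = Q_3.b",
    output_cells=[(3, 0)],
    consumer_parallelism=8,
)
```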
5.2 The Query Task DAG
In general, the distributed query plan is represented as a directed acyclic graph (DAG) of operators or tasks, rather than a single node, to capture the structure of sub-expressions in the query, including data-flow dependencies and the required distribution properties of the corresponding operators.

Figure 6. The Query Task DAG: the physical query plan (left) for a three-way join of P, Q and R with move enforcers Hb(Q) and Hc(R), and the corresponding query task DAG (right) with tasks T1, T2 and T3.

Each vertex contains an operator corresponding to an expression E in the query and has a corresponding task template, instantiated across multiple nodes over hash-distributions of the inputs for the vertex. Edges represent dataflow dependencies and, if the consuming vertex E does not support pipelining, induce precedence constraints over the "consumer tasks" created by instantiating E across compute nodes over the hash-distributed inputs of E. That is, "consumer" tasks cannot start until the corresponding tasks of the producer vertexes of the edge have completed.

Precedence constraints are inherently blocking and define changes of the distribution properties of the data cells consumed by parent tasks. As explained earlier, the DQO injects changes of distribution properties via data move enforcers to achieve correctness, or a better distributed alternative plan to speed up query execution. Therefore, the subtree of physical operators rooted on a move enforcer defines the input and output boundaries of a task. Data move enforcers are blocking operators, such that all their output data cells are persisted in local storage before they can be processed by the consumer task.
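The following Python sketch illustrates how move enforcers carve a physical plan into tasks with precedence constraints (an illustrative reconstruction; the operator names and plan shape are assumptions in the spirit of Figure 6):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Op:
    name: str
    is_move_enforcer: bool = False
    children: List["Op"] = field(default_factory=list)

@dataclass
class TaskTemplate:
    ops: List[str]  # operators fused into this template
    depends_on: List["TaskTemplate"] = field(default_factory=list)

def cut_into_tasks(root: Op) -> TaskTemplate:
    """Each subtree rooted at a move enforcer becomes a producer task; the
    operators above it (up to the next enforcer) form the consumer task,
    which cannot start until its producers have materialized their output."""
    template = TaskTemplate(ops=[])
    stack = [root]
    while stack:
        op = stack.pop()
        if op.is_move_enforcer:
            producer = cut_into_tasks(op.children[0])    # blocking boundary
            producer.ops.append(op.name)                 # the data move itself
            template.depends_on.append(producer)         # precedence constraint
        else:
            template.ops.append(op.name)
            stack.extend(op.children)
    return template

# A three-relation join in the spirit of Figure 6: move enforcers on Q and R
# feed a final, hash-aligned join task over P, Q and R.
plan = Op("join(a=b, a=c)", children=[
    Op("scan P"),
    Op("H_b(Q)", is_move_enforcer=True, children=[Op("scan Q")]),
    Op("H_c(R)", is_move_enforcer=True, children=[Op("scan R")]),
])
root_task = cut_into_tasks(plan)
print(len(root_task.depends_on), "producer tasks feed the final join task")  # -> 2
```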
Tasks in the DAG without precedence constraints can execute in parallel, thereby achieving independent parallelism between different tasks of a query. Figure 6 expands on the example in Figure 4 with an additional join. The left-hand side of the figure illustrates the physical distributed query plan, which has two move enforcers such that the joins between the three relations are hash aligned into a final task, resulting in a query DAG with a total of three tasks.

5.3 SQL Server Scale-up for Task Execution
The example in Figure 6 also illustrates an additional optimization carried out in the second phase of cost-based distributed query optimization. Observe how vertexes in the MEMO corresponding to two join operators have been combined into a single vertex that carries out both joins—this is because all three input datasets (P, Q, and R) are hash aligned on the same column by the preceding move enforcer operations. Thus, in general, the template for a task can include code for an algebra expression involving multiple operators.

While we could perform the three-way join in this example in two sequential tasks, we intentionally seek to make tasks maximal units of work. This allows us to more effectively leverage the sophisticated scale-up columnar query processor in SQL Server. At each compute node, the task template of the algebraic expression E corresponding to the task is encoded back into T-SQL and executed natively in SQL Server. In this approach, the blocking nature of the boundaries of a task actually helps SQL Server to optimize the template code of a task with fresh stats from intermediate inputs.

6. TASK ORCHESTRATION
Arguably the biggest engineering challenge in Polaris is the orchestration of tasks.

▪ The scale is daunting—the amount of data could be petabytes, leading to millions of cells; the number of compute nodes used in a single query could be in the thousands; and the number of tasks could be in the millions.
▪ Execution must be robust to transient failures of nodes, network, storage, and other components (e.g., metadata micro-services), and must guarantee that all precedence constraints are satisfied and all distributed decisions have quorum.
▪ Tasks must be automatically re-startable on any node, for auto-scaling and fault-tolerance.

In Polaris, we introduce a model of the execution of a query as a novel hierarchical composition of finite state machines. As explained in previous sections, at run time a query is transformed into a query task DAG, which consists of a set of tasks with precedence constraints.

We refer to each of the following aspects of a query as an entity: the query DAG, the task templates, and the tasks. A leaf-level task template can be instantiated into tasks on its hash-distributed inputs; in this case, we say that the task template entity is composed of the instantiated task entities. A non-leaf task template has precedence constraints on other task templates; in this case, the non-leaf task template entity is composed of the entities for the task templates on which it depends. For each entity, we refer to the entities of which it is composed as its dependencies.

The execution state of each entity is tracked using an associated state machine with a finite set of states and state transitions.
The state of an entity is a composition of the state of the entities of which it is composed. States can be either composite or simple. Simple states are used to denote success, failure, or readiness of a task template. Composite states denote (1) an instantiated task template, or (2) a blocked task template. (Note that an instantiated task will succeed or fail but cannot be blocked; tasks are only instantiated when their inputs are ready.)

A composite state differs from a simple state in that its transition to another state is defined by the result of the execution of its dependencies. It has a collection of peer states, one for each dependency, and a termination policy that aggregates metadata on the execution of the dependencies and captures how to interpret the outcome of the dependencies and how to act on the other peer states.

The Polaris state machine, through its hierarchical state machine composition, captures the execution intent, and it is in this aspect that it differs from other distributed query engines. In other DAG execution frameworks [5, 6, 14], composition is inherent in the execution. In Polaris, the state machine provides a template that is used to orchestrate the execution. The advantage it offers is the ability to formalize how we recover from failures and to use the state machine recorder (a log) to observe and replay execution history. Further, for a given set of workloads in the system, the execution history combined with the rules governing legal transitions can be used to reorder workload executions and explore different execution sequences by forking and resuming execution from selected points in the recorded history; this is future work.

Figure 7 illustrates the entities and state machines for the example in Figure 6. As we can see, the distributed query execution of the query task DAG is modelled as a hierarchical set of state machines. The root query DAG entity starts in the Run composite state and instantiates the state machine for the entity corresponding to the (task template T1 representing the) join of P, Q and R. This state machine starts in a (composite) Blocked state because it has dependencies on the entities corresponding to (task templates T2 and T3 for) the move enforcers on Q and R; these task templates are now placed in the scheduler queue. Their state is initialized to Ready since they have no dependencies, and they are eventually picked to run by the scheduler.

The state machines for task templates T2 and T3 are instantiated and initialized to the Run state. This in turn instantiates tasks for the task templates. If any of these tasks fail, their state machine transitions to the Failed state, the failure is detected, and the failed task is restarted automatically if the reason is a transient failure (as indicated by the task state machine transition in Figure 7); otherwise the parent state machine retries at a coarser granularity. The state of T2 and T3 becomes Success when all their task dependencies succeed. When both move enforcer entities succeed, the root entity T1 is unblocked and placed in the scheduler queue. When it is picked to run, i.e., becomes active, it is instantiated as join tasks on the hash-partitioned inputs.

In more detail, a state machine in Failure triggers an analysis of the type of failure for all dependencies that we classify as retriable, e.g., transient failures caused by node failure. If retriable, then it can transition back to Blocked; otherwise, the state machine in Failure returns control to the state machine for its parent, which will try to re-schedule execution using additional resources or in turn propagate the failure up the control chain. This is an example of how, in contrast to other systems such as [10, 15, 18], Polaris orchestration gives us flexibility in handling different types of failures by allowing us to specify behavior on termination of a composite state.

To summarize, when in the Ready state, a task template waits in the queue for the scheduler to pick its turn to execute, then it transitions to Run. This is when task entities are instantiated and the tasks' state machines are executed. The task template's transition from Run to a terminating state (Failed or Success) depends on the resulting execution of the instantiated tasks. Note that any entity can transition from Failed to Run if the failure is transient. The failure is propagated to higher entities only if it is deemed not retriable within the entity's state machine.

Figure 7. Hierarchical state machines and state-machine-driven query execution: composite states (Run, Blocked) and simple states (Ready, Failed, Success) for the query DAG, task template (T2), and task entities, with the execution of composite entities shown in steps 1–8.
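A compact Python sketch of this hierarchical composition (the state names follow the text; the execution and retry logic is a simplified stand-in for the actual Polaris policies):

```python
import enum
import random

class State(enum.Enum):
    BLOCKED = "blocked"
    READY = "ready"
    RUN = "run"
    SUCCESS = "success"
    FAILED = "failed"

class Entity:
    """An entity (query DAG, task template, or task) tracked by a state
    machine; composite entities derive their state from their dependencies."""

    def __init__(self, name, dependencies=None):
        self.name = name
        self.dependencies = dependencies or []
        self.state = State.BLOCKED if self.dependencies else State.READY
        self.log = []  # the state machine recorder

    def transition(self, new_state):
        self.log.append((self.name, self.state, new_state))
        self.state = new_state

    def execute(self, run_leaf, max_retries=1):
        # Composite part: run dependencies first; unblock only when all succeed.
        for dep in self.dependencies:
            dep.execute(run_leaf, max_retries)
            if dep.state is not State.SUCCESS:
                self.transition(State.FAILED)     # propagate up the hierarchy
                return
        if self.state is State.BLOCKED:
            self.transition(State.READY)
        # Simple part: run this entity, retrying transient failures locally.
        for _attempt in range(max_retries + 1):
            self.transition(State.RUN)
            ok, transient = run_leaf(self.name)
            if ok:
                self.transition(State.SUCCESS)
                return
            self.transition(State.FAILED)
            if not transient:
                return                            # parent decides what to do

def flaky_runner(name):
    # Succeed most of the time; occasionally fail transiently.
    return (random.random() > 0.2, True)

# T1 (the join of P, Q and R) is blocked on the move-enforcer templates T2, T3.
t1 = Entity("T1", dependencies=[Entity("T2"), Entity("T3")])
t1.execute(flaky_runner, max_retries=2)
print(t1.state, len(t1.log), "recorded transitions")
```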
Modelling the distributed query execution of queries via hierarchical state machines has the following goals:

▪ Satisfy precedence constraints. The execution of the query task DAG is carried out top-down in a topological sort order such that every task with precedence constraints is blocked on completion of its input tasks. For example, as shown in the right-hand side of Figure 7, the root task is blocked (Step 1) until its two dependencies are completed (Step 6).
▪ Reliable execution. We use the state machines to have fine-grained control at the task level and to define a predictable model for recovering from failures and for tracking the completion and failure of execution.
▪ Reproducibility at scale. States and transitions are logged by all entities. This allows for predictability and reproducibility regardless of the complexity of the workload and the scale. This is also a fundamental building block for debugging and resumable execution upon failover.
▪ Concurrency. Fine-grained control at large scale often comes with large memory requirements and thread contention due to many subroutines running concurrently. Hierarchical state machines allow us to track the state of all entities in the workload with a low memory overhead: there is only one state machine for a task entity, and all instantiations run through its states and transitions. Also, the Polaris query processor has been built from scratch using .NET's task asynchronous programming model to eliminate the need for blocking synchronization primitives across subroutines, thus minimizing thread contention and maximizing OS thread utilization. The gains are seen in Section 10.2.

7. WORKLOAD AWARE SCHEDULING
Polaris must handle highly concurrent workloads, ranging from dashboarding scenarios running thousands of lightweight queries to reporting scenarios executing a set of highly complex analytical queries. There are potentially millions of tasks to be orchestrated for execution by the Polaris DQP. In the previous section we described how hierarchical state machines enable us to efficiently handle distributed task orchestration at very large scale. In this section we cover how Polaris schedules tasks for high concurrency.

Task scheduling in Polaris is based on a global view of all active queries called the workload graph, generalizing the representation of a single query as a DAG of tasks to represent the entire workload by combining the task DAGs of all active queries. Each task in the workload graph has an associated resource demand that is an extension of the model in Ganguly [27] to the d-dimensional preemptable resources proposed in [28, 29]. We define a d-dimensional resource vector with time- and space-shared constraints, where each dimension specifies an aspect of resource consumption. Fungible resources such as memory and CPU can be sliced across tasks at a low cost, and each task's requirement for a given resource can be stretched at execution time. On the other hand, more rigid resources such as temp space on local disks must also be satisfied. Stretching temp space across independent tasks is prohibitively expensive, since it would require swapping pages in and out from/to remote storage.

The resource demand for each task is computed as a function of the inputs and outputs of each physical operator in the template code for the task. Analogously, Polaris also models each compute node as a d-dimensional bin of resources, such that the placement of tasks to containers is based on policies that can be autonomously tuned based on resource consumption profiles across all nodes.

Figure 8. Workload-aware resource scheduling algorithm: the workload graph (WG) for two query DAGs with per-task resource demands, the workload scheduler, and resource governance.

The representation of the workload as a global graph of tasks with resource demands allows us to redefine the multi-query scheduling problem as a task scheduling problem with precedence constraints: the goal is to schedule d-dimensional tasks on d-dimensional containers so as to complete in the minimum amount of time possible, while ensuring that at all times we are within all d dimensions of the resources available to us. Figure 8 shows the representation of the workload graph for two query DAGs. In green circles we represent the resource demand for each task template. For simplicity, in this example we normalize the demand to just one number, rather than the multi-dimensional resource vector used in Polaris. The workload scheduler and the resource governor operate on the workload graph.
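A small Python sketch of such a workload graph (the structures, demand numbers, and policy implementations are illustrative assumptions): each task template carries a d-dimensional demand vector, the task DAGs of all active queries are merged into one global graph, and scheduling policies simply order the templates that are currently ready.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Demand = Tuple[float, float, float]  # e.g. (cpu_cores, memory_gb, temp_gb)

@dataclass
class TemplateNode:
    name: str
    demand: Demand
    depends_on: List[str] = field(default_factory=list)
    state: str = "ready"  # ready / run / success / failed

@dataclass
class WorkloadGraph:
    """Global view over all active queries: the union of their task DAGs."""
    nodes: Dict[str, TemplateNode] = field(default_factory=dict)

    def add_query_dag(self, templates: List[TemplateNode]) -> None:
        for t in templates:
            self.nodes[t.name] = t

    def ready(self) -> List[TemplateNode]:
        return [
            t for t in self.nodes.values()
            if t.state == "ready"
            and all(self.nodes[d].state == "success" for d in t.depends_on)
        ]

# Scheduling policies order the ready templates (cf. Figure 8).
POLICIES = {
    "fifo": lambda ts: ts,
    "max_demand_first": lambda ts: sorted(ts, key=lambda t: t.demand, reverse=True),
    # Crude stand-in for "proximity to the root": templates with more
    # (already satisfied) dependencies sit closer to their query's root.
    "closest_to_root_first": lambda ts: sorted(ts, key=lambda t: len(t.depends_on), reverse=True),
}

wg = WorkloadGraph()
wg.add_query_dag([
    TemplateNode("T3", (2, 8, 10)),
    TemplateNode("T2", (4, 16, 15)),
    TemplateNode("T1", (8, 32, 20), depends_on=["T2", "T3"]),
])
wg.add_query_dag([TemplateNode("T5", (4, 16, 15)), TemplateNode("T4", (4, 16, 15))])
print([t.name for t in POLICIES["max_demand_first"](wg.ready())])
```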
The pseudocode of the scheduler is shown at the bottom of the figure. The scheduler asynchronously waits for work, and when awoken it adds all task templates in the workload graph that are in the Ready state to the scheduler queue. Task templates are then dequeued in the order specified by the scheduling policy. Currently supported policies include (combinations of): FIFO, sorted by resource demand (min to max or max to min), and sorted by proximity to the root. Intuitively, sorting by proximity to the root biases towards tasks from jobs that are closer to completion (so that their shared resources can be released sooner).
For the next task template in order, the resource governor examines each task to be instantiated. If all these tasks fit in their target location (i.e., each task's resource demand can be accommodated given current local capacity), then the task template is removed from the scheduler queue and transitioned to the Run state. Otherwise, we break out of the loop and wait for other tasks to complete so the task template can fit. Note that the target location of a task is fixed by data affinity to exploit cache locality. This novel approach to multi-query workload management relies on the following strategies, helping to maximize resource utilization:

▪ Weighted policies for resource governance. The placement of the task in the target compute server is based on resource fit, to maximize load while avoiding over-provisioning. For this we use a weighted policy to pack tasks into the compute capacity available at a node. The policy has two variations, one that caps the amount of resources that can be granted to a task, and another that does not. If the task does not fit in the available compute, it is put back into the queue until tasks complete and capacity is freed.
▪ Increased flexibility in task ordering. Scheduling policies define the order in which tasks are executed as they become ready for execution. By looking at ready tasks across all queries, taking into account resource pressure in the system, we are able to pick orderings that would not be permissible otherwise. For instance, consider the example in Figure 8, applying a max-to-min scheduling policy. The scheduler queue SchQ starts with {T1, T2, T3, T4, T5} with scheduling order {T1, T2, T4, T5, T3}. As we go through the loop, T3 does not fit, so only four out of the five task templates transition to the Run state. Next, T4 and T5 complete and the scheduler is awoken. SchQ now contains {T8, T3} with scheduling order {T3, T8}. Now, if the workload manager detects pressure in the system because of disk resources held by previously completed task templates, it can choose to swap the scheduling policy to sort by proximity to the root to release pressure in the system. In this example, the scheduling order would change to be {T8, T3}. The study of scheduling and resource management policies that consider SLAs and avoid starvation is out of scope for this paper and will be addressed in future work.
▪ Resource driven query admission control. Back pressure can be driven by a ratio of capacity (demand vs. available). Concurrency is only limited by available capacity, and the admission of a query is only denied when we cannot guarantee SLAs due to a capacity crunch.
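A minimal sketch of the scheduling loop and resource-governor fit check just described (an assumed reconstruction in Python, not the actual pseudocode of Figure 8): templates are taken in policy order, every task of a template must fit the d-dimensional capacity remaining at its target node, which is fixed by data affinity, and a template that does not fit stays queued until capacity is freed.

```python
def fits(demand, used, capacity):
    """d-dimensional fit check: every dimension must stay within capacity."""
    return all(u + d <= c for d, u, c in zip(demand, used, capacity))

def schedule(ready_templates, tasks_of, task_demand, target_node_of, nodes):
    """One wake-up of the scheduler. ready_templates is already ordered by
    the active policy (FIFO, by demand, or by proximity to the root);
    nodes maps node -> {'capacity': vec, 'used': vec}. Names are illustrative."""
    started = []
    for template in ready_templates:
        placements = []
        for task in tasks_of[template]:
            node = nodes[target_node_of[task]]    # fixed by data affinity
            if not fits(task_demand[task], node["used"], node["capacity"]):
                placements = None
                break
            placements.append((task, node))
        if placements is None:
            break          # does not fit: leave it queued and wait for capacity
        for task, node in placements:             # reserve the resources
            node["used"] = tuple(u + d for u, d in zip(node["used"], task_demand[task]))
        started.append(template)                  # template transitions to Run
    return started

# Two compute nodes with capacity (cpu, memory_gb); one template, two tasks.
nodes = {n: {"capacity": (16, 64.0), "used": (0, 0.0)} for n in ("node-a", "node-b")}
tasks_of = {"T3": ["T3.0", "T3.1"]}
target_node_of = {"T3.0": "node-a", "T3.1": "node-b"}
task_demand = {"T3.0": (4, 16.0), "T3.1": (4, 16.0)}
print(schedule(["T3"], tasks_of, task_demand, target_node_of, nodes))  # -> ['T3']
```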
8. SERVICE ARCHITECTURE
Figure 9 illustrates the architecture of the query service for two Polaris pools sharing the same centralized (metadata and transaction) services. There are two important aspects to note:

▪ Stateless architecture within a pool. The Polaris architecture falls into the stateless service architecture from Figure 1(b), as discussed in Section 2. All services within a pool are stateless: (i) data is stored durably in remote storage and is abstracted via data cells, and (ii) the metadata and transactional log state is off-loaded to centralized services. (We do not go into the architecture of the centralized services in detail; briefly, they are built for HA and performance using Azure SQL DB.)
▪ Multiple pools. Placing the state in centralized services, coupled with a stateless micro-service architecture within a pool, means multiple compute pools can transactionally access the same logical database.

Figure 9. Polaris service architecture: two Polaris pools, each with a Distributed QP and compute servers (each running an Execution Service and a SQL Server instance with a cache), share centralized metadata and transaction services; control flow and data channels connect the components, and a data set is a collection of data cells organized by partitions and hash distributions.

8.1 Stateless micro-service architecture
A Polaris pool consists of a set of micro-services, each with well-defined responsibilities. The SQL Server Front End (SQL-FE) is the service responsible for compilation, authorization, authentication, and metadata. Metadata is used by the compiler to generate the search space (the MEMO) for incoming queries and to bind metadata to data cells. The Distributed Query Processor (DQP) is responsible for distributed query optimization, distributed query execution, query execution topology management and workload management (WLM). Finally, a Polaris pool consists of a set of compute servers that are, simply, an abstraction of a host provided by the compute fabric, each with a dedicated set of resources (disk, CPU and memory). Each compute server runs two micro-services: (a) an Execution Service (ES) that is responsible for tracking the life span of tasks assigned to a compute container by the DQP, and (b) a SQL Server instance that is used as the backbone for execution of the template query for a given task and holds a cache on top of local SSDs (in addition to in-memory caching of hot data). Data can be transferred from one compute server to another via dedicated data channels. The data channel is also used by the compute servers to send results to the SQL-FE, which returns the results to the user. The life cycle of a query is tracked via control flow channels from the SQL-FE to the DQP, and from the DQP to the ES.

As explained in Section 2, no essential state is held by any micro-service in Polaris. While caches are stored by the compute servers, upon fail-over they can easily be re-constructed.
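The composition of services can be summarized in a short, purely illustrative Python sketch (the component roles follow the text; all names, types, and sizes are assumptions):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComputeServer:
    execution_service: str = "ES"     # tracks the life span of assigned tasks
    sql_server: str = "SQL Server"    # executes the T-SQL task template
    ssd_cache_gb: int = 0             # rebuildable on fail-over, so not "state"

@dataclass
class PolarisPool:
    sql_frontend: str = "SQL-FE"      # compilation, authorization, authentication, metadata
    dqp: str = "DQP"                  # distributed QO, execution, topology, WLM
    compute_servers: List[ComputeServer] = field(default_factory=list)

@dataclass
class CentralizedServices:
    metadata_store: str = "Azure SQL DB"
    transaction_store: str = "Azure SQL DB"
    pools: List[PolarisPool] = field(default_factory=list)

# Two pools sharing the same centralized metadata/transaction services, as in
# Figure 9; both can transactionally access the same logical database.
shared = CentralizedServices(pools=[
    PolarisPool(compute_servers=[ComputeServer(ssd_cache_gb=480) for _ in range(4)]),
    PolarisPool(compute_servers=[ComputeServer(ssd_cache_gb=480) for _ in range(2)]),
])
```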
Concurrency. This experiment stresses the resource-driven scheduling capabilities of the WLM, and its autonomy around capacity management, access control and resource governance under heavy load. For this we use the TPC-DS [30] workload to run a multi-user environment executing five thousand queries simultaneously.

Single query performance at PB scale over the data lake. We ran all TPC-H [31] queries at one PB scale across hundreds of machines on Azure public compute. The goal of this experiment is to stress the scalability, elasticity, and fault tolerance capabilities of the service. Note that this is not a validated TPC-H benchmark; the only intent is to demonstrate that we can run all queries at a scale that has not been done before.

Querying heterogeneous data. To illustrate that Polaris can run on heterogeneous data, we ran all TPC-H [31] queries at 1 TB scale on a dataset consisting of a variety of data files, ranging from raw CSV files to Parquet files with nested attributes. The test was executed using less than 100 cores in Azure public compute. The experiment emphasizes raw file parsing and query optimization capabilities over joins between plain text files and Parquet files with nested attributes.

10.2 Concurrency
The setup
We used the TPC-DS dbgen utility to generate 1TB of raw data and then converted it into Parquet files that were stored in Windows Azure Storage Blob (WASB), the Azure Data Lake. Rows do not follow any particular distribution, since we are not focusing on single-query performance but on stressing concurrency. The application spans five thousand concurrent sessions, each executing one distinct TPC-DS query. For this we generated the TPC-DS queries 50 times with different predicate ranges and assigned one query to each session.

The compute topology
Since the goal of this experiment is to stress the DQP component, we chose a rather small compute topology with 10 compute nodes. The hardware configuration of each node consists of 2x20-core Intel processors, 520GB of RAM and 4 SSDs of 1TB each. The network topology is 40Gb throughout: 40Gb NIC, 40Gb TOR, and 40Gb CSP.

Results
Figure 11 shows the task execution summary and the resource utilization in the backend nodes. The 5k queries run simultaneously, generating a workload graph of over 50k task templates that, as they are scheduled, expand to an aggregated total of ~550k instantiated tasks. As task templates are scheduled for execution, tasks are instantiated; the chart on the left of the figure shows the number of actively executing tasks and the aggregated completed tasks at any given point in time for the duration of the test. The chart on the right of the figure shows the average resource utilization of the compute servers for the CPU and memory dimensions. As we can see, we have good utilization of the cluster for the duration of the tests. For this experiment we used a FIFO scheduling order of task templates, and we think both the resource utilization and the elapsed time can be improved by using more sophisticated policies; experiments using different scheduling order policies are out of the scope of this paper and will be carried out in the near future. The main thing to observe is that Polaris is able to handle high concurrency for a complex workload such as TPC-DS, packing up to 9k tasks on the 10 compute servers available and completing approximately 550k tasks.

Figure 11. Results for 5k concurrent queries: actively executing and completed tasks over time (left) and average CPU and memory utilization (right).

10.3 Query Performance at Petabyte Scale
The setup
We used the TPC-H dbgen utility to generate a PB of raw data and then converted it into Parquet files that were stored in WASB. The Parquet files were organized using the data model from Section 3, with both hash partitions and user partitions. The total number of Parquet files is ~120k, with a total compressed size of 360TB.

The compute topology
We deployed a Polaris pool on Azure, consisting of one SQL FE compute instance, one DQP and 420 compute execution services (ES). Each node is a 2x12-core Intel processor with 192GB of RAM and 4 SSDs of 480GB. The network topology is 40Gb throughout: 40Gb NIC, 40Gb TOR, and 40Gb CSP.

Results
Figure 12 shows the execution time for all 22 TPC-H queries at 1PB scale. To the best of our knowledge, this is the first time results have been published at a PB scale. Remarkably, some queries (Q6, Q12, Q15 and Q16) run extremely fast, through partition elimination and distribution alignment of expensive joins, taking advantage of the Polaris data model (Section 3). TPC-H has a few queries that stress the processing limits of any system, since they join across all sources with low selectivity and have very heavy joins between large dimension tables and the fact table: Q9 and Q21 are good examples. Polaris manages to process these queries at PB scale in under two hours across 420 machines, demonstrating scalability and resilience.
Figure 12. 1PB TPC-H single query performance.

10.4 Querying Heterogeneous Data
The setup
We used the TPC-H dbgen utility to generate a TB of raw data in CSV format and then converted the files for the lineitem, customer, supplier, and nation tables into Parquet files. Conversion into Parquet for the customer and supplier files was done by organizing contact information (the name, address, nationkey, and phone columns) as nested types in Parquet. The lineitem and nation Parquet files were organized with simple types, without nested structure. Files for orders, partsupp, part and region were kept in raw CSV format. All files for a single entity were stored in a single folder in WASB.

Results
All TPC-H queries ran over this mix of formats (raw CSV, Parquet with simple types, and Parquet with nested types). This demonstrates the robustness of the system in handling heterogeneous data sources.

11. Conclusions
In this paper, we presented Polaris, a novel distributed query processing framework in Azure Synapse that seeks to support both big data and relational warehouse workloads, going beyond current systems of either kind in its flexibility and scalability. The architecture is inspired by scale-out techniques from big data systems. It extends these techniques in many ways, notably in the cell abstraction of data, the flexible task orchestration framework, and the global workload task graph. Polaris is also notable for how it carefully refactors SQL Server's complex codebase in order to leverage its query optimizer and scale-up single-node engine—both of which reflect many years of refinement—while completely rewriting the distributed execution framework. Polaris is also cloud-native, completely separating compute from both storage and transactional state in order to support agile provisioning and scaling of compute pools. Azure Synapse is unique among cloud services in supporting both serverless and provisioned form factors, with multiple serverless and provisioned SQL sessions able to concurrently operate on the same datasets, across both lake and managed data.

Appendix
Required properties
The following table contains the required properties for the most common algebraic operators. The columns are treated as equivalence classes (transitive closures) when testing the required properties for algebraic correctness. When join predicates have multiple equality conjuncts, correctness holds if the hash key of each input is a subset of the columns in the conjuncts from that input. For Group-By, correctness holds if the hash key of the input is a subset of the grouping columns. The distributed query processor also supports decomposing aggregations and Top-N into local-global forms, which allows the optimizer to push selective local operators before data movement enforcers. P^[a] subsumes P^∅.
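Since the full table is not reproduced here, the following Python sketch encodes only the correctness rules stated above (an approximation; the equivalence-class handling is omitted):

```python
def join_hash_keys_ok(hash_key_cols, conjunct_cols):
    """For join predicates with multiple equality conjuncts, correctness holds
    if the hash key of each input is a subset of the columns that input
    contributes to the conjuncts."""
    return set(hash_key_cols).issubset(set(conjunct_cols))

def group_by_ok(hash_key_cols, grouping_cols):
    """For Group-By, the input's hash key must be a subset of the grouping
    columns, so each group is wholly contained in one distribution."""
    return set(hash_key_cols).issubset(set(grouping_cols))

def satisfies(actual_hash_cols, required_hash_cols):
    """required_hash_cols == None means no requirement (P^∅); any specific
    hash distribution subsumes it, e.g. P^[a] subsumes P^∅."""
    if required_hash_cols is None:
        return True
    return set(actual_hash_cols).issubset(set(required_hash_cols))

assert join_hash_keys_ok(["a"], ["a", "x"])   # P^[a] is correct for a=b AND x=y
assert not group_by_ok(["a"], ["b", "c"])     # would require a re-shuffle
assert satisfies(["a"], None)                 # P^[a] subsumes P^∅
```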
12. REFERENCES
[1] Azure Synapse Analytics. https://azure.microsoft.com/en-us/services/synapse-analytics/
[2] Microsoft Report. FIDO: A Cloud-Native Versioned Store With Concurrent Transactional Updates. 2020.
[3] Ashish Thusoo et al. Hive – A Petabyte Scale Data Warehouse Using Hadoop. ICDE Conference, Long Beach, California, USA, 2010.
[4] R. Chaiken et al. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. VLDB Conference, Auckland, New Zealand, 2008.
[5] Michael Armbrust et al. Spark SQL: Relational Data Processing in Spark. SIGMOD Conference, Melbourne, Victoria, Australia, 2015.
[6] Michael Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. EuroSys, Lisboa, Portugal, 2007.
[7] Anurag Gupta et al. Amazon Redshift and the Case for Simpler Data Warehouses. SIGMOD Conference, Melbourne, Victoria, Australia, 2015.
[8] AWS Athena. https://aws.amazon.com/athena/
[9] An Inside Look at Google BigQuery. https://cloud.google.com/files/BigQueryTechnicalWP.pdf
[10] Sergey Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. VLDB Endowment, 2010.
[11] Benoit Dageville et al. The Snowflake Elastic Data Warehouse. SIGMOD Conference, San Francisco, California, USA, 2016.
[12] Oracle Exadata. https://www.oracle.com/technetwork/database/exadata/exadata-technical-whitepaper-134575.pdf
[20] Raghu Ramakrishnan et al. Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics. SIGMOD Conference, Chicago, IL, USA, 2017.
[21] Delta Lake. https://delta.io/
[22] Panagiotis Antonopoulos et al. Socrates: The New SQL Server in the Cloud. SIGMOD Conference, Amsterdam, Netherlands, 2019.
[23] Dharma Shukla et al. Schema-Agnostic Indexing with Azure DocumentDB. VLDB Conference, Kohala Coast, Hawaii, 2015.
[24] Morton M. Astrahan et al. System R: Relational Approach to Database Management. ACM Transactions on Database Systems, 1976.
[25] Goetz Graefe and William J. McKenna. The Volcano Optimizer Generator: Extensibility and Efficient Search. IEEE International Conference on Data Engineering, Vienna, Austria, 1993.
[26] Goetz Graefe. The Cascades Framework for Query Optimization. Data Engineering Bulletin, Vol. 18, 1995.
[27] Dorit S. Hochbaum and David B. Shmoys. Using Dual Approximation Algorithms for Scheduling Problems: Theoretical and Practical Results. 26th Annual Symposium on Foundations of Computer Science (SFCS 1985), Portland, OR, USA, 1985.
[28] Minos N. Garofalakis and Yannis E. Ioannidis. Parallel Query Scheduling and Optimization with Time- and Space-Shared Resources. VLDB Conference, Athens, Greece, 1997.
[29] Minos N. Garofalakis and Yannis E. Ioannidis. Multi-dimensional Resource Scheduling for Parallel Queries. SIGMOD Conference, Montreal, Canada, 1996.