2021 - Adaptive Code Generation for Data-Intensive Analytics
taken advantage of (or avoided) by the compiler that generates machine code from this loop. Performance diversity can be caused by variations in the data distribution and/or physical data ordering, influencing branch-related and cache-related stalls. Different algorithmic choices (reordering of operations, use of SIMD instructions) can improve or worsen performance. These opportunities and pitfalls are discussed at more length in Section 3.

Our work builds on the dynamic query execution scheme pioneered by Vectorwise [55] (discussed in Section 2). Multiple plans are precompiled for a particular operation. As the operation progresses over a very large data set, performance information from the early stages of execution can be used to guide the choice of plan for later stages. Plan switching allows for robustness in the face of errors in query cost estimation, and also allows a dynamic change of plans if the data distribution changes within the dataset. Details of our adaptive code generation are given in Section 4.

Existing query optimization techniques for in-memory processing are limited in several ways: (a) they are not extensively used outside relational database management systems; (b) they are limited to a handful of relational operators, and do not cover access patterns or dynamically-defined functions found in other data-analysis scenarios; (c) they treat the underlying compiler as a black box, with unpredictable performance depending on which compiler is used with which compiler settings; (d) they often bake in design choices that may be appropriate for usage within a particular DBMS, but not for more general cases. We address these challenges by optimizing data-analysis-style queries expressed as tight loops in a conventional imperative programming language.

We extend an open-source compiler (the GraalVM compiler [65] and Truffle [64]) with both known and novel optimization techniques that can automatically be applied whenever the compiler identifies that a loop is time-consuming. GraalVM is an ecosystem and shared runtime offering performance advantages for a variety of programming languages [45]. Interpreted code is automatically transformed into compiled code when the system detects a performance hot-spot. The GraalVM compiler is a dynamic just-in-time (JIT) compiler that performs sophisticated code analysis and optimization. The Truffle API allows programming languages to be combined in a shared runtime using an abstract syntax tree representation. Interpreted code is associated with nodes in the abstract syntax tree, and the Graal compiler automatically compiles the performance-critical parts of the code to speed up execution. Details of the Graal/Truffle implementation are provided in Section 5.

Integration into the compiler enables many applications to efficiently process large data sets. The system supports dynamic queries involving user-defined functions and arbitrary access patterns. Database-style and compiler optimizations co-exist, eliminating some of the mismatches that happen when the compiler is used as a black box by a DBMS. The system tunes a variety of run-time execution parameters automatically, with minimal guidance from the programmer.

We evaluate our system using the TPC-H benchmark, weather visualization, and microbenchmark queries, over datasets with various kinds of ordering/clustering properties. The experimental evaluation (Section 6) shows that:

• Our system can dynamically respond to changes in the data distribution, choosing the best plan for the current data.
• Our system can invoke SIMD optimizations for code, even though they do not always improve performance. In our system, the SIMD version will be used if it is better, and the scalar version will be used otherwise.
• The system can select a small but representative set of plans that cover the search space well enough to respond to various parameter combinations that may not have been known at query compilation time.
• It is possible to dynamically achieve a balance between exploration (trying out a variety of plans) and exploitation (maximally employing the best plan).

2 BACKGROUND

2.1 Prior Work on Compiling Query Plans

Our work builds upon the dynamic query execution scheme developed as part of the Vectorwise system [55]. The Vectorwise implementers observed that query performance could vary significantly due to low-level performance effects. Different query plans might perform best in different regions of parameter space, yet the parameter values may not be known at compile time. Different compilers for the same programming language might give better or worse results, depending on the query. Data distribution effects (which may change as the system progresses through the data) may affect query performance, so that one plan is best for parts of the data, while another plan is best for other parts.

The Vectorwise team also observed that it is hard to estimate the cost function, and not just because of the data distribution effects and parameter estimation inaccuracies mentioned above. Different run-time platforms may have different performance characteristics, such as the relative cost of a SIMD instruction versus a scalar instruction, or the relative impact of a branch misprediction. Further, the overlapping of various latencies (e.g., cache misses) makes it hard to identify their true impact on elapsed time. Rather than estimate the cost, Vectorwise chose to measure the actual cost.

In Vectorwise, data chunks of about 1000 rows are processed as a unit. A key innovation in Vectorwise is the analysis of the actual running time over recent chunks of data using several different candidate query plans in turn [55]. Each plan contributes to the final result, but might take more or less time depending on data and machine parameters. The plan that takes the least time is scheduled to run for an extended number of chunks. After that, all candidate plans are run again within a certain window to see if the data has changed to the point that a different plan is best. The best plan is then scheduled for an extended period, and the process repeats.

To summarize, the advantages of the approach pioneered by Vectorwise are: (a) optimization happens on the basis of actual time rather than predicted time, reducing the reliance on complex and potentially inaccurate cost modeling; (b) most of the execution will use the best plan among the candidates; (c) over time, as the data changes, the chosen plan can adapt to those changes. Despite these advantages, the Vectorwise approach has several limitations that we will discuss next.

2.2 Limitations of the Vectorwise Approach

The first and most obvious limitation of the Vectorwise approach is that the implementation effort has no wider impact beyond uses of
the Vectorwise system itself. It might be possible for a competing DBMS to mimic the implementation described by Vectorwise, but applications of the techniques beyond in-memory relational DBMSs are unclear. In contrast, our approach embeds the optimization/execution decision making at the programming language level, making the techniques broadly applicable to a wide variety of applications.

A second limitation is that the Vectorwise approach uses a few hand-crafted code fragments that cover only the essential DBMS operators. These code fragments are precompiled at DBMS build time. Code fragments with in-lined user-defined code are not considered. Access patterns in which there is interaction between consecutive rows are common in applications such as time-series analysis, but are essentially absent in a relational DBMS. We compile code fragments at query time, allowing user-defined code and arbitrary access patterns that might not match a handful of predefined templates.

The paper describing the Vectorwise system reports that they used several different compilers, with different optimization settings, and observed varying performance results. The results were so unpredictable that they were forced to compile multiple variants of each code fragment: two compilers and two optimization settings would require four compiled code variants to cover all of the cases. The Vectorwise authors remarked that they resisted the temptation to investigate why the compilers had such different behaviors [55], presumably because they had no control to effect a change even if they could identify an inefficiency. In our method, the compiler is not an external black box. Instead, because DB-style optimizations and traditional compiler optimizations happen in the same framework, we can control code generation. If the compiler is unsure whether an optimization helps or not, two variants of the code fragment can be generated internally, by the compiler itself.

The Vectorwise system chooses somewhat arbitrary values for parameters such as the window size to run the current best plan, and the window size within which other candidate plans are run. While these settings may have been adequate for the limited set of operators considered by Vectorwise, it is not clear that such choices would be optimal in the broader contexts considered in this paper. We investigate principled ways of setting such parameters, allowing them to vary based on the performance feedback generated so far. Section 6 shows an experiment where the choice of window size matters.

3 PERFORMANCE DIVERSITY AND REWRITING OPTIONS

Let us return to the loop introduced in Section 1, in which we first specialize and in-line the definitions of interesting and combine. The user has specified that an ID is interesting if its latitude is greater than 30, and has stated that the way to combine rainfall readings is to sum the rain amounts grouped by zipcode. zip[id] represents the zipcode where the sensor having identifier id is located, and lat[id] and long[id] represent the latitude and longitude of the sensor.

In database terminology this query applies two selection conditions, performs two foreign-key joins to the lat and zip “tables”, and performs a grouped SUM aggregate of the rainfall. We assume that total is very large, so that optimizing the loop is likely to have a big performance impact. The hot-spot compiler will be triggered relatively quickly to compile the code rather than continuing to run it in interpreted mode. We describe some of the performance-related choices that need to be made below.

Condition Ordering and Non-Branching Plans. Selection condition order is important for in-memory query processing [54]. Branch misprediction effects contribute significantly to query processing costs. Among the plans considered are plans that avoid branches altogether by converting control dependencies to data dependencies. For example, the plan above might be rewritten as follows to avoid branches:

    for (i = 0; i < total; i++) {
        // & rather than &&; no branches
        test = (time[i] > start & lat[ID[i]] > 30);
        // -1 = 0xFFFFFFFF; -0 = 0
        mask = -test;
        // 0 mask means add 0, i.e., no-op
        accum[zip[ID[i]]] += (mask & rain[i]);
    }

While branch-free code eliminates the branch misprediction overhead, it is not always the best choice. For example, if a condition is very selective, so that it fails most of the time, then executing the condition early is good because (a) it avoids unnecessary work for most tuples, and (b) conditions that fail most of the time are relatively well predicted by modern processors. When several conditions are present, the best ordering of those conditions depends both on the selectivity of each condition and the cost of testing it [23, 54]. These kinds of alternative rewritings are used in the hand-generated templates of the Vectorwise system [55]. Our system automatically generates candidate plans at query time using each kind of rewriting (details in later sections).

Cache Misses. Accesses to the arrays time, ID and rain are sequential, and prefetching is likely to be effective in minimizing cache latency for those accesses. lat and zip are accessed non-sequentially, and may generate cache misses whose latency may be significant (tens of cycles for an L2 miss, about 100 cycles for an L3 miss). These costs also influence the ordering of selections, since a cache miss might make a condition like lat[ID[i]] > 30 expensive to test. Whether lat[ID[i]] > 30 generates a cache miss on lat depends on: (a) how many IDs there are in total and how compactly they are allocated in the lat array (e.g., are sensor IDs re-used when a sensor is taken out of service?); (b) how many IDs are likely to be registering rain at the same time (depends on sensor placement and weather patterns); (c) how likely it is that a sensor that registers rain at time 𝑡 also registers rain at time 𝑡 + 1 (affects temporal locality, and depends on weather patterns). Given the complexity of predicting cache behavior, we circumvent the problem by considering a very limited number of scenarios. For example, we might just consider two extreme scenarios, one in which we expect an L3 cache miss and one in which we expect an L1 cache hit.

SIMD. SIMD instructions can be applied to both the conditions and actions of the code above. Let 𝑤 be the number of SIMD lanes. The condition lat[ID[i]] > 30 might be evaluated on 𝑤 consecutive i values by (a) loading a SIMD register with 𝑤 consecutive ID values; (b) using a SIMD gather instruction to look up 𝑤 different addresses within the lat array; and (c) comparing the results with
Figure 1: (a) Varying skew; (b) Varying selectivity.

a SIMD register pre-loaded with 𝑤 copies of the value 30. The resulting booleans can then be ANDed with other boolean conditions, or used as a mask for other actions.

The update of the accum array can similarly use SIMD gather operations to load the current running sums, SIMD add instructions to perform the updates, and SIMD scatter operations to write out the results. Special SIMD instructions detect conflicts (e.g., updates to a common memory address) across SIMD lanes and serialize them in the same sequence as the input.

SIMD processing has the potential to speed up processing if the workload is not memory-bound, by using fewer instructions to do the same work. It is not always clear that SIMD optimization is desirable because (a) similarly to no-branch plans, it does the entire work even if the first condition would have led to a quick rejection; (b) under conditions of skew, the conflict resolution step of the SIMD scatters may dominate the cost, making the SIMD option slower than the scalar option. Rather than trying to estimate skew and determine whether the exact cost of the SIMD option is optimal for the current data, we simply generate SIMD plans as additional candidates to be considered at run-time.

Performance Diversity. Figure 1 illustrates two cases of performance diversity alluded to in the previous discussion. Figure 1(a) shows the performance of a grouped aggregation, where the grouping column may be skewed according to a Zipf factor shown on the x-axis. The SIMD code is faster than scalar code under low skew, but slower under high skew due to the high cost of conflict resolution as described above [67]. Scalar code is fastest at high skew because the grouping cardinality is small and so the aggregates fit in the L1 cache. Figure 1(b) shows three plans for a query having two selection conditions, using plans of the kind described in [54]. Each of the three plans is best in some selectivity range. Because the selectivity may not be known in advance, or may vary within the dataset, our approach will be to include multiple plans and to choose the best plan according to the recent performance history.

4 ADAPTIVE CODE GENERATION

So far we have suggested that we will be generating multiple plans, running each for chunks of data during a testing phase, and then selecting the fastest plan to run for an extended period. Unlike the Vectorwise system, where an arbitrary number of plans might be precompiled in advance, we aim to generate plans at query time. This choice allows for more general plans, including in-lined user-defined functions that are not known in advance. Nevertheless, this choice is challenging because it makes query compilation itself part of the observable response time. Our preliminary observations using the Graal compiler (Section 5) suggest that a plan can be compiled in tens of milliseconds. Thus, if we were performing a large scan taking several seconds, say, we could probably not afford to compile more than 10 plans. Beyond that, the overhead of compilation may outweigh the benefits of adaptive query processing/optimization.

4.1 On-Line Analysis

First, we optimize abstractions of the loop components. For example, the cost estimate for a SIMD computation may depend on the skew in the group-by values (Figure 1(a)). We may simply optimize under two abstracted conditions: no-skew and high-skew. As a second example, the cost estimate for a condition-testing plan may depend on the selectivity (Figure 1(b)) and cache behavior of the data. Rather than estimating a selectivity for a condition, we impose a selectivity on that condition as a way of making sure we cover an appropriate subregion of the optimization space. A condition may be given selectivities that are “small,” “medium,” or “large” (say 0.05, 0.5, 0.95 respectively).

4.2 Off-Line Analysis

There is an implicit bias in our on-line analysis, because our relatively coarse abstractions of parameters may be far from either (a) the true parameters, or (b) the critical values of the parameters for which the choice of plans would change. We therefore supplement our on-line analysis with an off-line analysis for common query patterns. For example, we imagine that loops containing if-statements that test any number of conditions may be common in practice. We therefore perform a more detailed off-line analysis of 𝑐-condition loops for all 𝑐 below some moderately large threshold (at least 10). Although this off-line analysis is expensive, it would happen once for a target hardware environment before the compiler is released, or during a calibration step when the compiler is installed. After the off-line analysis, the system stores the generated candidate plans as a summary to use for adaptive code generation online (Section 5).

For each 𝑐, we use a more fine-grained approach to compute a cost estimate of candidate plans for 𝑐 conditions based on the cost
Table 1: Candidate plans

#  plans (exhaustive)        ratio   plans (local)            ratio
1  { C0 & C1 & C2 }          9.77    IF ( C0 ) { C1 & C2 }    12.64
2  IF ( C0 ) { C1 & C2 }     5.40    IF ( C0 ) { C1 & C2 }    12.07
   IF ( C1 & C2 ) { C0 }             IF ( C1 ) { C0 & C2 }
3  IF ( C0 ) { C1 & C2 }     3.25    IF ( C0 ) { C1 & C2 }     3.25
   IF ( C1 ) { C0 & C2 }             IF ( C1 ) { C0 & C2 }
   IF ( C2 ) { C0 & C1 }             IF ( C2 ) { C0 & C1 }
4  { C0 & C1 & C2 }          1.97    { C0 & C1 & C2 }          1.97
   IF ( C0 ) { C1 & C2 }             IF ( C0 ) { C1 & C2 }
   IF ( C1 ) { C0 & C2 }             IF ( C1 ) { C0 & C2 }
   IF ( C2 ) { C0 & C1 }             IF ( C2 ) { C0 & C1 }
5  { C0 & C1 & C2 }          1.79    { C0 & C1 & C2 }          1.97
   IF ( C0 ) { C1 & C2 }             IF ( C0 ) { C1 & C2 }
   IF ( C1 && C0 ) { C2 }            IF ( C0 & C1 ) { C2 }
   IF ( C1 & C2 ) { C0 }             IF ( C1 ) { C0 & C2 }
   IF ( C2 && C0 ) { C1 }            IF ( C2 ) { C0 & C1 }

formulas of [54]. For example, for 3 conditions, we try all 6 orders as well as all logical-and, bitwise-and, and no-branch plans. Since we do not know the selectivity and cost of each condition (and the cost of the body part) in advance of query execution, we develop a large number of configurations in an offline analysis. For every condition, we test 20 selectivities ranging multiplicatively from 0.0001 to 0.9999. We test 10 cost values from 1 to 1024 cycles, again multiplicatively. Then, for each of these 20×10 configurations, we compute the cost of all different plans [54].

We then compute a summary of the best plans to use during online exploration. Suppose we can afford to use 𝑘 plans for exploration. Our metric for evaluating the quality of a set of 𝑘 plans is based on the worst-case ratio of estimated performance across all configurations:

    max over {configurations} of (the best cost among the 𝑘 plans) / (the best cost among all plans)

Then we would like to choose the set of 𝑘 plans that minimizes this ratio. An exhaustive search would be too costly (exponential in 𝑘), so we propose the following heuristic method.

(1) Every plan is considered as a valid candidate, and every configuration is mapped to the plan that minimizes its cost (which we record as the baseline cost for the configuration, to be used in the denominator of the formula above). Any plan to which no configuration is mapped at this point is eliminated.

(2) While there are still too many plans, consider each plan 𝑃 in turn as follows: (a) Map each configuration previously assigned to 𝑃 to the next-best plan, and compute the ratio of the new estimated cost to the baseline cost. Record the highest cost ratio as the score for 𝑃. (b) Remove the plan with the lowest score, and re-assign its configurations to their next-best plans.

We eliminate the plan with the lowest score because its elimination makes the smallest incremental difference to the overall ratio we are trying to minimize. In other words, the next-best plans are almost as good as the elimination candidate.

Table 1 shows how this algorithm performs for 3 conditions (𝑐 = 3) and up to 5 plans (1 ≤ 𝑘 ≤ 5). For comparison, we also show the results of an exhaustive search. In general, the best set of 𝑘 − 1 plans may not be a subset of the best set of 𝑘 candidate plans, but our heuristic algorithm does choose 𝑘 − 1 plans from among the best 𝑘 plans. We observe that the heuristic performs reasonably well when 𝑘 ≥ 3, which is likely in our application domain.

For small 𝑘, an exhaustive search is feasible, and it does not miss the best plans that the above heuristic could prune. Therefore we use a hybrid approach: generate the best 10 plans using the heuristic, and then search exhaustively among them for the best pair of plans. This approach is more accurate for a small number of candidate plans.

We used the maximum performance ratio as our heuristic function, but we could alternatively have used the average performance ratio. We argue that the average can be biased depending on how the selectivity and cost values are chosen. For example, averages would give extra weight to the regions of parameter space that were more heavily sampled. In contrast, the max ratio is relatively stable, and focuses the optimization on the part of the parameter space where it matters most.

Table 1 shows that there are diminishing returns in reducing the max-ratio metric as we choose more candidate plans. During online execution, the best plan among the candidate plans is chosen. Assuming that there is enough data for exploitation (so that the exploration cost is negligible), it is in theory better to choose from more candidate plans, but the marginal benefit is decreasing (as is the metric). As we demonstrate in the experiments, the performance stabilizes as we increase the value of 𝑘. In practice, a reasonable heuristic for 𝑘 conditions is to use at least 𝑘 candidate plans so that every condition can be the first condition in some plan.

4.3 Measuring Execution

We follow the Vectorwise approach by measuring actual times and choosing plans based on their recent history of execution times. The Graal/Truffle system already instruments interpreted code with counters to observe events like a branch being taken. When the interpreted code is identified as a hot-spot and compiled, that information is used to inform the subsequent compilation phase. The counter instrumentation is omitted from the compiled code to minimize overhead.

For the execution of compiled code, we divide the entire execution into a series of alternating exploration and exploitation periods. During an exploration period, a number of candidate plans are tested over input chunks and their execution times are compared. In the following exploitation period, the best plan is maximally employed over a larger number of chunks. We keep a recent history of chunk execution performance, so that the system can react to changes by comparing the current execution with previous executions. Two heuristics are used for dynamically setting parameters:

• Dynamic exploitation (DE). For consecutive exploration periods, if the best plan does not change, this suggests that the data is behaving consistently, so we double the size of the exploitation period; otherwise, the data distribution is likely to have changed between the two explorations, so we reduce the size to half of the original exploitation period.
Figure 2: (a) AND plan; (b) Reorder; (c) No-branch.

• Early exploration (EE). When we observe that a chunk takes significantly longer to execute (more than double the average of recent chunks), it is a strong indication that the underlying data has changed, so we start exploration using additional plans starting from the next chunk.

In practice, combining these two heuristics works well for our experimental datasets (Section 6.2).

5 IMPLEMENTATION

We use the Truffle language implementation framework [64] to develop the adaptive execution framework. Truffle is an open-source library that simplifies the development of language execution engines and data processing engines using self-optimizing abstract syntax trees (ASTs) in the GraalVM ecosystem. Each node in the AST represents an operation (e.g., a comparison, an evaluation of an AND condition, an arithmetic computation, etc.) that is compiled to machine code by the Graal compiler. During the execution, an AST node can make use of runtime information and change its internals to specialized versions that have better performance. Node rewriting and JIT compilation are automatically handled by the Graal compiler.

In this paper, we focus on JavaScript programs with a for-loop like the example in Section 1. Users can write a pragma directly above the for-loop they wish to perform adaptive execution on:

    var input0 = ... // initialize data arrays
    var input1 = ...
    var count = 0;

    "adaptive execution"; // adaptive execution pragma
    for (i = 0; i < 1000000000; ++i)
        if (input0[i] < 20 && input1[i] < 50)
            count++;

By using the pragma, the user is (a) certifying that the predicates in the if-statement can be reordered, and (b) hinting that adaptive execution should be applied to the for-loop.

5.1 Preprocessing

Upon execution of the JavaScript program written by the user, a custom script first rewrites the program source code to use the Polyglot API. Polyglot allows different languages implemented with Truffle to interoperate with each other. In our implementation, we use Polyglot to access variables in JavaScript, and make the following changes to the source code: (1) The for-loop itself is transformed into a string; (2) Values of all variables that are used in the for-loop, but defined outside of the for-loop, are stored in a dictionary; (3) The variable dictionary and the for-loop string are passed to the code generation framework via the Polyglot API.

To control the adaptive code generation, we implement a set of AST nodes extending from Truffle Nodes, including value nodes (e.g., constants), arithmetic nodes (e.g., Addition), and condition nodes (e.g., LessThan). When the rewritten source code is executed and the adaptive execution framework is invoked, control is handed over to the root node, a special Truffle AST node that handles the execution of the loop and measures the performance. We use a custom parser built with ANTLR to parse the for-loop string into the Truffle expression nodes we implemented, and generate multiple ASTs representing the candidate plans according to the summary obtained from offline analysis (Section 4.2). The variable values stored in the dictionary are written to the procedure stack, so that they can be accessed and modified during the adaptive execution.

Under the root node of the loop, a TopLevelCondition node represents the if-statement. For conjunctive conditions (AndCondition), a candidate plan specifies the ordering of the conditions as well as a mode indicating how the conditions are computed and combined together (LogicalAnd, BitwiseAnd, or NoBranch). The ordering and the node properties are stored as internal variables of an AndCondition. The body part of the if-statement (true branch) is a generic AST node if all conditions have been evaluated. If there are remaining conditions to be evaluated as no-branch conditions, then the body part is rewritten to an AndCondition node with NoBranch mode. The body also uses a mask to determine whether the result is written to output. Multiple assignment statements are permitted in the body.

Depending on the number of conditions (i.e., the structure of the code), the root node chooses a set of candidate plans from a summary with matching conditions. For each candidate plan, the root node constructs an AST as shown in Figure 2. An AndCondition node has conditions as its child nodes, which are basic conditions like LessThan comparisons. Figure 2 shows three example plans with the same semantics. By reordering the conditions (C0, C1 and C2), the AST in Figure 2(a) is rewritten to Figure 2(b) and thus executed differently. Either logical or bitwise AND can be used depending on the mode set in the AndCondition. If a no-branch plan is used, then an AndCondition with the no-branch mode is used to rewrite the plan into Figure 2(c), where only condition C0 is executed with a branching if-condition. We then invoke the Graal compiler backend to compile the AST into callable machine code. When there are no if-conditions, as in the example in Section 6.1, then the body part is just an AST representing the assignment
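As a toy illustration of the three condition modes discussed above, the following plain-JavaScript evaluator mimics what the generated code does for each mode. This is our own simplification: the real system evaluates compiled Truffle AST nodes, not closures.

```javascript
// Toy evaluator for the three AndCondition modes (LogicalAnd, BitwiseAnd,
// NoBranch). conds is an array of per-row predicate functions; body(i, mask)
// applies the action, masked so that a 0 mask makes it a no-op.
function runPlan(mode, conds, body, n) {
  for (let i = 0; i < n; i++) {
    if (mode === "LogicalAnd") {
      let ok = true;
      // Short-circuit evaluation: each condition is a branch.
      for (const c of conds) { if (!c(i)) { ok = false; break; } }
      if (ok) body(i, 1);
    } else if (mode === "BitwiseAnd") {
      let ok = 1;
      // All conditions are evaluated; only the final test branches.
      for (const c of conds) ok &= c(i) ? 1 : 0;
      if (ok) body(i, 1);
    } else { // "NoBranch": no if at all; the body itself is masked.
      let mask = 1;
      for (const c of conds) mask &= c(i) ? 1 : 0;
      body(i, mask);
    }
  }
}
```

All three modes produce the same result; they differ only in how many conditions are evaluated per row and how many branches the generated code contains, which is exactly what makes them perform differently on different data.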
Table 2: Time breakdown (s), 10^9 rows, median of 10 runs
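Looking back at the off-line analysis of Section 4.2, the greedy elimination that produces the candidate-plan summary can be sketched as follows. The cost matrix here is made up for illustration; the real analysis derives costs from the formulas of [54] over the 20×10 configuration grid.

```javascript
// Sketch of the Section 4.2 heuristic: keep k candidate plans, greedily
// dropping the plan whose removal increases the worst-case cost ratio least.
// costs[p][c] = estimated cost of plan p under configuration c.
function selectPlans(costs, k) {
  const nConfigs = costs[0].length;
  // Baseline: best cost per configuration over all plans (the denominator).
  const baseline = [];
  for (let c = 0; c < nConfigs; c++)
    baseline.push(Math.min(...costs.map(row => row[c])));
  // Step 1: every configuration maps to its cheapest plan; plans that are
  // cheapest for no configuration are eliminated immediately.
  let alive = [...new Set(
    Array.from({ length: nConfigs },
               (_, c) => costs.findIndex(row => row[c] === baseline[c])))];
  // Step 2: score each remaining plan P by the highest ratio
  // (next-best cost / baseline) over the configurations assigned to P,
  // then drop the plan with the lowest score and repeat.
  while (alive.length > k) {
    let bestScore = Infinity, toDrop = alive[0];
    for (const p of alive) {
      const rest = alive.filter(q => q !== p);
      let score = 1;
      for (let c = 0; c < nConfigs; c++) {
        const cur = Math.min(...alive.map(q => costs[q][c]));
        if (costs[p][c] !== cur) continue;            // c not assigned to p
        const next = Math.min(...rest.map(q => costs[q][c]));
        score = Math.max(score, next / baseline[c]);
      }
      if (score < bestScore) { bestScore = score; toDrop = p; }
    }
    alive = alive.filter(q => q !== toDrop);
  }
  return alive.sort((x, y) => x - y);
}
```

For small 𝑘, the paper refines this with a hybrid: take the top 10 plans from the heuristic, then search exhaustively among them.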
data, whereas a sequence of local storms might generate regions of data skew. A close examination of Figure 4(b) shows that at the beginning of each skewed region, there is a small period during which the inferior plan is being run. The system has not yet reached the next window where it re-evaluates plans; it continues executing the same plan until that happens.

Figure 4(a): 𝑧 increasing.

To understand the impact of the length of the exploitation periods, we ran several experiments with different exploitation period sizes. Figure 4(c) shows the elapsed time spent in exploration and exploitation mode separately. When the exploitation period is too small, we waste time running suboptimal plans too often: a suboptimal plan that is 5X worse than optimal and run 3% of the time will constitute a 12% overhead. When the period is too large, we do not notice a change in the data distribution until we have been running a suboptimal plan for a while. For this example, the best intermediate value for the exploitation period is around 200 chunks, and the exploration mode takes 4.3% of the time.
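The DE and EE heuristics of Section 4.3 that govern these period lengths can be sketched as a small controller. This is our own rendering: the history length and thresholds below are illustrative, and whether DE halves the current or the initial period is a detail the text leaves open (we halve the current one here).

```javascript
// Sketch of the dynamic-exploitation (DE) and early-exploration (EE)
// heuristics of Section 4.3. Constants are illustrative, not the
// system's actual settings.
class AdaptiveController {
  constructor(initialPeriod) {
    this.period = initialPeriod; // exploitation period, in chunks
    this.lastBest = null;        // winner of the previous exploration
    this.recent = [];            // recent chunk execution times
  }
  // DE: call at the end of each exploration period with the winning plan.
  endExploration(bestPlan) {
    if (this.lastBest !== null) {
      if (bestPlan === this.lastBest) {
        this.period *= 2;        // data looks stable: exploit longer
      } else {
        this.period = Math.max(1, Math.floor(this.period / 2)); // data changed
      }
    }
    this.lastBest = bestPlan;
    return this.period;
  }
  // EE: call after each exploited chunk; true means "re-explore now".
  chunkDone(elapsed) {
    this.recent.push(elapsed);
    if (this.recent.length > 16) this.recent.shift();
    const avg = this.recent.reduce((s, t) => s + t, 0) / this.recent.length;
    return elapsed > 2 * avg;    // chunk took over double the recent average
  }
}
```

A stable data distribution thus drives the exploitation period up geometrically, while either a changed winner or an anomalously slow chunk pulls the system back into exploration quickly.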
Figure 5: (a) Varying plans.
Figure 6: (a) Different plans on unsorted data.

chunks, chunk size is 1000 tuples). The selectivity of a dataset changes randomly every 1K (10K, 100K, . . . ) tuples, and the best exploitation period also changes correspondingly.

Figure 5(c) shows that the dynamic heuristics of Section 4.3 are able to achieve the performance of the best exploitation period (on the 1M dataset). Using one heuristic alone can avoid the worst-case exploitation period, as shown in Figure 5(b), and using both heuristics together we can achieve performance similar to that of the best exploitation period found.

6.3 TPC-H Queries

We now show results using code that implements TPC-H queries Q6 and Q19 [1]. We chose those queries because they have interesting condition structures that might benefit from our approach.

6.3.1 Query 6. Query 6 quantifies the amount of revenue increase that would have resulted from eliminating certain companywide discounts in a given percentage range in a given year. The query is written in a JavaScript program as an if-statement with five different conditions (range predicates).

    for (i = 0; i < N; ++i)
        if (shipdate[i] >= DATE_MIN &&
            shipdate[i] < DATE_MAX &&
            discount[i] >= DISCOUNT_MIN &&
            discount[i] <= DISCOUNT_MAX &&
            quantity[i] < QUANTITY)
            sum += price[i] * discount[i];

Figure 6 shows the performance of TPC-H query Q6. Figure 6(a) shows the results on unsorted data, as generated by the benchmark data generator. The running times are clustered by the first condition used during evaluation. The two best clusters correspond to the two range predicates on the shipdate column. From left to right, the clusters are:

• 6 points: DateMin, DateMax as the first two conditions
• 18 points: DateMin first, other conditions second
937
• 6 points: DateMax, DateMin as the first two conditions Table 3: TPC-H Q19 time (us) on unsorted data
• 18 points: DateMax first, other conditions second
• remainder: neither DateMin nor DateMax first Plan (first condition) Conj1 Conj2 Conj3
Since the first condition has the most impact on performance, BRAND 2767167 2721858 2513244
the compiler uses the following ordering of predicates in the six CONTAINER 2820004 3051187 2902889
candidate plans for adaptive execution: QUANTITY MIN 2968445 3611849 2503752
(1) DateMin, DateMax, DiscountMin, DiscountMax, Quantity QUANTITY MAX 3498422 4088774 3975201
(2) DateMax, DateMin, DiscountMin, DiscountMax, Quantity SIZE 3549345 4767787 3360388
(3) DiscountMin, DateMin, DateMax, DiscountMax, Quantity SHIPMODE 3125625 2654871 2409472
(4) DiscountMax, DateMin, DateMax, DiscountMin, Quantity SHIPINSTRUCT 3826366 2665140 2425727
(5) Quantity, DateMin, DateMax, DiscountMin, DiscountMax
(6) no-branch plan (order unimportant)
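As a sketch of what the six Q6 candidate plans look like (our illustration, not the generated code; the predicate constants and helper names are hypothetical), the five branching plans differ only in the short-circuit order of the same predicates, while the no-branch plan evaluates all predicates unconditionally:

```javascript
// Hypothetical sketch of the six Q6 candidate plans. `d` holds the
// column arrays; the constants stand in for the query parameters.
const DATE_MIN = 8766, DATE_MAX = 9131;
const DISCOUNT_MIN = 0.05, DISCOUNT_MAX = 0.07, QUANTITY = 24;

const preds = {
  DateMin:     (d, i) => d.shipdate[i] >= DATE_MIN,
  DateMax:     (d, i) => d.shipdate[i] < DATE_MAX,
  DiscountMin: (d, i) => d.discount[i] >= DISCOUNT_MIN,
  DiscountMax: (d, i) => d.discount[i] <= DISCOUNT_MAX,
  Quantity:    (d, i) => d.quantity[i] < QUANTITY,
};

// Plans 1-5: short-circuit the predicates in a given order.
function branchingPlan(order) {
  return (d, n) => {
    let sum = 0;
    row: for (let i = 0; i < n; ++i) {
      for (const p of order) if (!preds[p](d, i)) continue row;
      sum += d.price[i] * d.discount[i];
    }
    return sum;
  };
}

// Plan 6: the no-branch plan evaluates every predicate and folds the
// 0/1 results together, trading extra work for branch-free execution.
function noBranchPlan(d, n) {
  let sum = 0;
  for (let i = 0; i < n; ++i) {
    let pass = 1;
    for (const p in preds) pass &= preds[p](d, i) ? 1 : 0;
    sum += pass * d.price[i] * d.discount[i];
  }
  return sum;
}
```

All six plans compute the same sum, so the adaptive runtime only has to time them on sample chunks and keep the fastest for the current exploitation period.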
Each of the first five plans has a different first condition to be evaluated. Under adaptive execution, the compiler automatically chooses the best plan to process most of the data, no matter how the program is written.

Figure 6(b) shows the results on data sorted by shipdate, which is likely to be the typical case in the real world. We show the performance of each of the six plans, as well as the adaptive execution results. In this figure, we plot the average running time per chunk for every 200 chunks of input data in the exploitation period. Because the data is sorted by the shipdate column, there is a discontinuity at around 20,000 input chunks, corresponding to the lower bound DATE_MIN specified in the query. Before the discontinuity, the adaptive execution chooses Plan 1; during the immediately following period, it chooses the no-branch plan, because evaluating the Date conditions is extra work (they always succeed) and Plans 1 and 2 become the most expensive; after the data crosses the DATE_MAX threshold, the system chooses Plan 2.

Figure 6(c) shows the total running time on the sorted data for different plans. The compiled code automatically chooses the best variant among the six candidate plans, and the total execution time of the adaptive method is reduced compared with any single fixed plan.

6.3.2 Query 19. Query 19 reports the gross discounted revenue attributed to the sale of selected parts handled in a particular manner. The query's where clause is a disjunction of three conjunctions. Each of the three conjunctions has the same structure of predicates, but the predicates have different parameter values. For this experiment, we preprocessed the text data so that the parameters are numeric values supported by our current implementation. We manually implemented adaptive execution for this query because our current full compilation pipeline handles only conjunctive expressions. Written as a JavaScript program, the foreign key join is executed as an index lookup into the referencing array.

for (i = 0; i < N; ++i)
    if ((brand[partkey[i]] == BRAND1 &&
         container[partkey[i]] == CONTAINER1 &&
         quantity[i] >= QUANTITY1 &&
         quantity[i] <= QUANTITY1 + 10 &&
         psize[partkey[i]] <= SIZE1 &&
         shipmode[i] == SHIPMODE1 &&
         shipinstruct[i] == SHIPINSTRUCT1) || // Conj1
        (...) || // Conj2
        (...))   // Conj3
        sum += price[i] * (1 - discount[i]);

For each of the seven predicates in Conj1, we could include at least one plan that checks the predicate first, for a total of seven plans. The same observation holds for Conj2 and Conj3. Since the conjunctions are quite selective, no-branch plans for evaluating the conjunction are excluded because they are likely to perform badly. A naive application of our approach would then need to generate 7³ combined plans in order to cover all of the important cases. Instead, we observe that because the conjunctions are relatively selective, and combined by disjunction, all of the conjunctions are likely to be executed for most rows. In other words, it is unlikely that a positive result from testing one of the conditions would be effective at short-circuiting the evaluation to avoid the other conditions. (We also verified that all 6 orderings of the three conjunctions have roughly the same running time.) If all three conjunctions are going to be executed almost all of the time anyway, we should optimize them independently. As a result we get 7 ∗ 3 = 21 plans rather than 7³ plans. For this particular query, the three conjunctions have the same structure, and so 7 plans (with three instances of each) would suffice. However, the compiler does not know that the conjunctions have similar structure, and so it cannot share plans in this way. Instead of one exploration period, we now use three exploration periods, one for each of the conjunctions. In each exploration period, we select the best plan for one of the conjunctions.

Table 3 shows the running time on unsorted data. The table shows the first predicate of the plan for each of the three conjunctions, when the other two conjunctions each use BRAND as the first predicate. In adaptive execution, Conj1 chooses the BRAND plan, while Conj2 and Conj3 each choose the SHIPMODE plan. As a result the adaptive execution takes about 2.4 seconds to compute the revenue loss no matter how the program is written (i.e., the ordering of the conditions by the programmer does not matter due to adaptive execution). This performance is about twice as good as the worst plan in Table 3, which is probably not the worst plan overall. This experiment shows that the advantage of our approach includes robust performance even for complex conditions involving conjunctions and disjunctions.

When the data is sorted by SHIPMODE, there is a region of data where the equality predicate on SHIPMODE is always satisfied. We find that each conjunction either uses the SHIPMODE plan to quickly filter out the unqualified data, or, when the SHIPMODE equality predicate is satisfied, uses the BRAND plan, since it is the most selective and thus has the best performance. Figure 7 shows the runtime profiling of the BRAND plan, the SHIPMODE plan, and the adaptive execution. The other plans behave similarly to the BRAND plan, but they are slower; for clarity, their profiles are omitted in the figure. The profile of the adaptive execution overlaps with the BRAND plan when the SHIPMODE test is true, demonstrating that the execution switches to a different plan when the underlying data changes. As a result, the adaptive plan takes 2.25 seconds to complete the computation, compared with a fixed BRAND plan taking 2.81 seconds and a fixed SHIPMODE plan taking 2.46 seconds.

[Figure 7: Performance on Q19 (sorted)]
[Figure 8: Performance of varying TIME_MIN]
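The plan-space reduction for Q19 (7 ∗ 3 = 21 explored plans instead of 7³ combined plans) can be sketched as follows; the harness and cost callback are our hypothetical illustration, not the system's actual interface:

```javascript
// Pick the best first predicate for each conjunction independently,
// instead of exploring all combinations jointly. `measure(conj, label)`
// stands in for timing a chunk with `label` tested first in conjunction
// `conj` while the other conjunctions keep their current plans.
function pickBestPlanPerConjunction(candidates, measure) {
  return candidates.map((labels, conj) => {
    let best = labels[0];
    let bestCost = measure(conj, best);
    for (const label of labels.slice(1)) {
      const cost = measure(conj, label); // one exploration step
      if (cost < bestCost) { best = label; bestCost = cost; }
    }
    return best; // exploited for the rest of this conjunction's chunks
  });
}
```

With 7 candidate first predicates per conjunction, this performs 7 measurements per conjunction (21 in total) rather than enumerating all 7³ = 343 combined orderings.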
FPS is up to 1.4x. Changing the TIME_MAX or DATE_MAX conditions results in similar observations.

Since the dataset is not globally ordered by any single dimension of the nine conditions, during the adaptive execution of the program in the experiments of Figures 8 and 9, seven out of the nine candidate plans were exploited at least once. This observation emphasizes the need for a diversity of plans to handle runtime configurations that are difficult to predict, and the ability of our system to dynamically choose an appropriate plan.

7 RELATED WORK

Column-oriented execution [42] and cache-conscious operators [43] were proposed before the advent of multi-core CPUs. Block-at-a-time execution [8] and query-dependent code generation [21, 38, 44] are both state-of-the-art designs for analytical query engines [33]. The present work has features from both block-at-a-time execution and query-dependent code generation.

SIMD optimizations have been applied to a variety of database operators including joins [3, 4, 6, 27, 34, 58], sorting [10, 26, 50, 56], scans [69] and compression [40, 52, 63]. Advanced SIMD optimizations [49, 51] include non-linear-access operators. SIMD optimizations work best when data is cache-resident [68], but there are trade-offs between scalar and SIMD code, as we demonstrated in Section 3.

Adaptive query processing aims to refine a query plan at runtime on the basis of statistics gathered at intermediate stages of the query computation [2, 14]. Multiple sub-plans could be compiled into a query, with a choice to be determined based on partial computations such as the size of an intermediate table. Alternatively, when a departure from the predicted behavior occurs, another round of query optimization could be performed at run-time. Early work on this topic instrumented query code with counters to gather statistics that inform such choices [13, 28]. More recent work using in-memory databases uses hardware performance counters to gather such statistics without any performance overhead [66].

We use a limited number of query plans based on an analysis of regions of parameter space. The Picasso database query optimizer visualizer allows one to visually inspect optimal plan choices for different regions of the parameter space [15, 22]. Our choice of a small number of plans is analogous to how Picasso would create a "reduced diagram" with a bounded reduction in overall performance. Empirically, the authors find that ten plans are almost always sufficient to cover the parameter space with at most a 20% degradation in the plan cost at any point in the space [15]. PlanBouquets [17] incrementally discovers actual selectivity at runtime in order to identify an appropriate plan to execute, and recent work [31] has improved its significant compile-time overheads. Our plans are likely to be simpler than the ones considered by Picasso, so fewer than ten plans may typically be sufficient.

To deal with arbitrary user-defined functions, [12] compiles a high-level query workflow into a distributed program. UDFs are compiled with LLVM into intermediate representations and then linked with the workflow program into binary executables. A different approach proposed recently is to compile UDFs into plain SQL queries [16, 24], where arbitrary control flows are translated into recursive expressions.

Database and programming-language compilers have a common goal, namely to generate efficient machine code for queries/programs written in a high-level language. Recent query compilers resemble programming-language compilers, sharing some of the low-level infrastructure such as LLVM [12, 44]. The programming-language community has built hot-spot compilers [46] that initially interpret (and profile) code sections. When the interpreter determines that a code section is a hot-spot, it pauses, compiles the code section in real time, and executes the remainder of the code section using the compiled code. This choice balances compilation and execution time, and similar innovations have recently been described for database query compilation [36]. While database compilers have adopted programming-language innovations such as LLVM and hot-spot compilation, our method shows that there is also an opportunity for technology transfer in the opposite direction.

Our system extends the Truffle framework [64] and the Graal compiler [65]. Using Graal as the host compiler, Truffle is particularly well-suited for languages with very dynamic semantics and whose execution depends heavily on the size, layout and contents of the input data. Truffle offers numerous primitives for collecting information about the observed data types and program behavior. Additionally, so-called assumptions allow for non-local optimizations, where the point that uses optimized code based on a specific assumption is only loosely connected to the points that potentially invalidate this assumption. Leveraging this speculative just-in-time compilation based on implicit schemas that are discovered at run-time, Truffle has also been used to develop efficient parsers for JSON and CSV data [9], and to accelerate data de-serialization [57]. The existing profiling and assumption mechanisms in Truffle are based on heuristics; they are local, behavior-centric, and strictly stabilizing (always moving towards the most generic version). This paper extends them with a dynamic mechanism, directly observing the actual performance of different but semantically equal algorithms.

8 CONCLUSIONS

We studied optimization techniques for data-analysis style queries expressed as tight loops in a conventional imperative programming language. Since the data distribution often strongly affects query performance, it is important to make the code generation and execution adaptive to the underlying data. To adapt to this performance diversity, we built upon an open-source compiler to generate code that efficiently processes large data sets with varying data distributions and predicate selectivities. By using a learning framework with alternating exploration and exploitation periods, we enabled code generation using different plans and SIMD options. We showed that the system could tune run-time execution parameters automatically, with minimal guidance from the programmer. As a result, we achieved robust query performance in both microbenchmark and TPC-H queries. When the underlying data changes, the adaptive code generation and execution can in fact achieve better performance.

ACKNOWLEDGMENTS

This research was supported in part by a gift to Columbia University from Oracle Corp, and by NSF grant IIS-2008295.
REFERENCES
[1] [n.d.]. The TPC-H Benchmark. http://www.tpc.org/tpch.
[2] Ron Avnur and Joseph M. Hellerstein. 2000. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. 261–272.
[3] Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M. Tamer Ozsu. 2013. Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. PVLDB 7, 1 (Sept. 2013), 85–96.
[4] Cagri Balkesen, Jens Teubner, Gustavo Alonso, and M. Tamer Ozsu. 2013. Main-memory Hash Joins on Multi-core CPUs: Tuning to the Underlying Hardware. In ICDE. 362–373.
[5] Ronald Barber, Peter Bendel, Marco Czech, Oliver Draese, Frederick Ho, Namik Hrle, Stratos Idreos, Min-Soo Kim, Oliver Koeth, Jae-Gil Lee, Tianchao Tim Li, Guy M. Lohman, Konstantinos Morfonios, René Müller, Keshava Murthy, Ippokratis Pandis, Lin Qiao, Vijayshankar Raman, Richard Sidle, Knut Stolze, and Sandor Szabo. 2012. Business Analytics in (a) Blink. IEEE Data Eng. Bull. 35, 1 (2012), 9–14.
[6] Spyros Blanas, Yinan Li, and Jignesh Patel. 2011. Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. In SIGMOD. 37–48.
[7] Peter A. Boncz, Martin L. Kersten, and Stefan Manegold. 2008. Breaking the Memory Wall in MonetDB. Commun. ACM 51, 12 (Dec. 2008), 77–85. https://doi.org/10.1145/1409360.1409380
[8] Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-pipelining query execution. In CIDR.
[9] Daniele Bonetta and Matthias Brantner. 2017. FAD.js: fast JSON data access using JIT-based speculative optimizations. Proceedings of the VLDB Endowment 10, 12 (2017), 1778–1789.
[10] Jatin Chhugani et al. 2008. Efficient implementation of sorting on multi-core SIMD CPU architecture. In VLDB. 1313–1324.
[11] Confluent Inc. 2019. Streaming SQL for Apache Kafka. https://www.confluent.io/product/ksql.
[12] Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Carsten Binnig, Ugur Çetintemel, and Stan Zdonik. 2015. An Architecture for Compiling UDF-centric Workflows. PVLDB 8, 12 (2015), 1466–1477.
[13] Amol Deshpande and Joseph M. Hellerstein. 2004. Lifting the Burden of History from Adaptive Query Processing. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30 (Toronto, Canada) (VLDB ’04). VLDB Endowment, 948–959. http://dl.acm.org/citation.cfm?id=1316689.1316771
[14] Amol Deshpande, Zachary Ives, and Vijayshankar Raman. 2007. Adaptive query processing. Now Publishers Inc.
[15] Harish Doraiswamy, Pooja N. Darera, and Jayant R. Haritsa. 2007. On the Production of Anorexic Plan Diagrams. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007. 1081–1092. http://www.vldb.org/conf/2007/papers/research/p1081-d.pdf
[16] Christian Duta, Denis Hirn, and Torsten Grust. 2019. Compiling PL/SQL Away. arXiv preprint arXiv:1909.03291 (2019).
[17] Anshuman Dutt and Jayant R. Haritsa. 2014. Plan bouquets: query processing without selectivity estimation. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 1039–1050.
[18] Franz Faerber, Alfons Kemper, Per-Åke Larson, Justin J. Levandoski, Thomas Neumann, and Andrew Pavlo. 2017. Main Memory Database Systems. Foundations and Trends in Databases 8, 1-2 (2017), 1–130. https://doi.org/10.1561/1900000058
[19] Franz Färber, Sang Kyun Cha, Jürgen Primsch, Christof Bornhövd, Stefan Sigg, and Wolfgang Lehner. 2012. SAP HANA Database: Data Management for Modern Business Applications. SIGMOD Rec. 40, 4 (Jan. 2012), 45–51. https://doi.org/10.1145/2094114.2094126
[20] Martin Grund, Jens Krüger, Hasso Plattner, Alexander Zeier, Philippe Cudre-Mauroux, and Samuel Madden. 2010. HYRISE: A Main Memory Hybrid Storage Engine. Proc. VLDB Endow. 4, 2 (Nov. 2010), 105–116. http://dl.acm.org/citation.cfm?id=1921071.1921077
[21] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the Case for Simpler Data Warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 1917–1923. https://doi.org/10.1145/2723372.2742795
[22] Jayant R. Haritsa. 2010. The Picasso Database Query Optimizer Visualizer. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 1517–1520. https://doi.org/10.14778/1920841.1921027
[23] Joseph M. Hellerstein. 1998. Optimization techniques for queries with expensive methods. ACM Transactions on Database Systems 23, 2 (June 1998), 113–157.
[24] Denis Hirn and Torsten Grust. 2020. PL/SQL Without the PL. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2677–2680.
[25] InfluxData Inc. 2019. Time series database (TSDB) explained. https://www.influxdata.com/time-series-database.
[26] Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu, and Toshio Nakatani. 2007. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In PACT. 189–198.
[27] Saurabh Jha, Bingsheng He, Mian Lu, Xuntao Cheng, and Huynh Phung Huynh. 2015. Improving Main Memory Hash Joins on Intel Xeon Phi Processors: An Experimental Approach. PVLDB 8, 6 (Feb. 2015), 642–653.
[28] Navin Kabra and David J. DeWitt. 1998. Efficient Mid-query Re-optimization of Sub-optimal Query Execution Plans. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (Seattle, Washington, USA) (SIGMOD ’98). ACM, New York, NY, USA, 106–117. https://doi.org/10.1145/276304.276315
[29] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. 2008. H-Store: a High-Performance, Distributed Main Memory Transaction Processing System. Proc. VLDB Endow. 1, 2 (2008), 1496–1499. https://doi.org/10.1145/1454159.1454211
[30] Michael Kalloniatis and Charles Luu. [n.d.]. Temporal Resolution. https://webvision.med.utah.edu/book/part-viii-psychophysics-of-vision/temporal-resolution/.
[31] Srinivas Karthik, Jayant R. Haritsa, Sreyash Kenkre, and Vinayaka Pandit. 2018. A concave path to low-overhead robust query processing. Proceedings of the VLDB Endowment 11, 13 (2018), 2183–2195.
[32] Alfons Kemper and Thomas Neumann. 2011. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE ’11). IEEE Computer Society, Washington, DC, USA, 195–206. https://doi.org/10.1109/ICDE.2011.5767867
[33] Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter Boncz. 2018. Everything You Always Wanted to Know About Compiled and Vectorized Queries but Were Afraid to Ask. Proc. VLDB Endow. 11, 13 (Sept. 2018), 2209–2222. https://doi.org/10.14778/3275366.3275370
[34] Changkyu Kim et al. 2009. Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs. PVLDB 2, 2 (Aug. 2009), 1378–1389.
[35] Yannis Klonatos, Christoph Koch, Tiark Rompf, and Hassan Chafi. 2014. Building Efficient Query Engines in a High-level Language. Proc. VLDB Endow. 7, 10 (June 2014), 853–864. https://doi.org/10.14778/2732951.2732959
[36] André Kohn, Viktor Leis, and Thomas Neumann. 2018. Adaptive Execution of Compiled Queries. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018. 197–208.
[37] Konstantinos Krikellas, Stratis Viglas, and Marcelo Cintra. 2010. Generating code for holistic query evaluation. In Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA. 613–624. https://doi.org/10.1109/ICDE.2010.5447892
[38] Konstantinos Krikellas, Stratis Viglas, and Marcelo Cintra. 2010. Generating code for holistic query evaluation. In ICDE. 613–624.
[39] Tirthankar Lahiri, Marie-Anne Neimat, and Steve Folkman. 2013. Oracle TimesTen: An In-Memory Database for Enterprise Applications. IEEE Data Eng. Bull. 36, 2 (2013), 6–13.
[40] Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann, and Alfons Kemper. 2016. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016. 311–326. https://doi.org/10.1145/2882903.2882925
[41] Per-Åke Larson, Mike Zwilling, and Kevin Farlee. 2013. The Hekaton Memory-Optimized OLTP Engine. IEEE Data Eng. Bull. 36, 2 (2013), 34–40. http://sites.computer.org/debull/A13june/Hekaton1.pdf
[42] Stefan Manegold, Peter Boncz, and Martin Kersten. 2000. Optimizing database architecture for the new bottleneck: memory access. J. VLDB 9, 3 (2000), 231–246.
[43] Stefan Manegold, Peter Boncz, and Martin Kersten. 2002. Optimizing Main-Memory Join on Modern Hardware. TKDE 14, 4 (July 2002), 709–730.
[44] Thomas Neumann. 2011. Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB 4, 9 (June 2011), 539–550.
[45] Oracle Corp. 2019. GraalVM. https://www.graalvm.org/.
[46] Michael Paleczny, Christopher Vick, and Cliff Click. 2001. The Java HotSpot™ Server Compiler. In Proceedings of the 2001 Symposium on Java™ Virtual Machine Research and Technology Symposium - Volume 1 (Monterey, California) (JVM’01). USENIX Association, Berkeley, CA, USA, 1–1. http://dl.acm.org/citation.cfm?id=1267847.1267848
[47] Jignesh M. Patel, Harshad Deshmukh, Jianqiao Zhu, Navneet Potti, Zuyu Zhang, Marc Spehlmann, Hakan Memisoglu, and Saket Saurabh. 2018. Quickstep: A Data Platform Based on the Scaling-Up Approach. PVLDB 11, 6 (2018), 663–676.
[48] Andrew Pavlo, Gustavo Angulo, Joy Arulraj, Haibin Lin, Jiexi Lin, Lin Ma, Prashanth Menon, Todd Mowry, Matthew Perron, Ian Quah, Siddharth Santurkar, Anthony Tomasic, Skye Toor, Dana Van Aken, Ziqi Wang, Yingjun Wu, Ran Xian, and Tieying Zhang. 2017. Self-Driving Database Management Systems. In CIDR 2017, Conference on Innovative Data Systems Research. http://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf
[49] Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. 2015. Rethinking SIMD Vectorization for In-Memory Databases. In SIGMOD. 1493–1508.
[50] Orestis Polychroniou and Kenneth A. Ross. 2014. A Comprehensive Study of Main-Memory Partitioning and Its Application to Large-scale Comparison- and Radix-sort. In SIGMOD. 755–766.
[51] Orestis Polychroniou and Kenneth A. Ross. 2014. Vectorized Bloom Filters for Advanced SIMD Processors. In DaMoN. Article 6.
[52] Orestis Polychroniou and Kenneth A. Ross. 2015. Efficient Lightweight Compression Alongside Fast Scans. In DaMoN. Article 9.
[53] Vijayshankar Raman, Gopi Attaluri, Ronald Barber, Naresh Chainani, David Kalmuk, Vincent KulandaiSamy, Jens Leenstra, Sam Lightstone, Shaorong Liu, Guy M. Lohman, Tim Malkemus, Rene Mueller, Ippokratis Pandis, Berni Schiefer, David Sharpe, Richard Sidle, Adam Storm, and Liping Zhang. 2013. DB2 with BLU Acceleration: So Much More Than Just a Column Store. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1080–1091. https://doi.org/10.14778/2536222.2536233
[54] Kenneth A. Ross. 2004. Selection Conditions in Main Memory. ACM Transactions on Database Systems 29, 1 (2004), 132–161.
[55] Bogdan Răducanu, Peter Boncz, and Marcin Zukowski. 2013. Micro Adaptivity in Vectorwise. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (New York, New York, USA) (SIGMOD ’13). ACM, New York, NY, USA, 1231–1242. https://doi.org/10.1145/2463676.2465292
[56] Nadathur Satish et al. 2010. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In SIGMOD. 351–362.
[57] Filippo Schiavio, Daniele Bonetta, and Walter Binder. 2020. Dynamic speculative optimizations for SQL compilation in Apache Spark. Proceedings of the VLDB Endowment 13, 5 (2020), 754–767.
[58] Stefan Schuh, Xiao Chen, and Jens Dittrich. 2016. An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory. In SIGMOD. 1961–1976.
[59] Doug Simon. [n.d.]. libgraal: GraalVM compiler as a precompiled GraalVM native image. https://medium.com/graalvm/libgraal-graalvm-compiler-as-a-precompiled-graalvm-native-image-26e354bee5c.
[60] Michael Stonebraker, Paul Brown, Alex Poliakov, and Suchi Raman. 2011. The Architecture of SciDB. In Proceedings of the 23rd International Conference on Scientific and Statistical Database Management (Portland, OR) (SSDBM’11). Springer-Verlag, Berlin, Heidelberg, 1–16. http://dl.acm.org/citation.cfm?id=2032397.2032399
[61] Tableau Inc. 2019. Tableau. https://www.tableau.com.
[62] Sandeep Tata. 2007. Declarative Querying for Biological Sequences. Ph.D. Dissertation. Ann Arbor, MI, USA. Advisor(s) Patel, Jignesh M. AAI3276308.
[63] Thomas Willhalm et al. 2009. SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units. PVLDB 2, 1 (Aug. 2009), 385–394.
[64] Christian Wimmer and Thomas Würthinger. 2012. Truffle: a self-optimizing runtime system. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity. 13–14.
[65] Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. 2013. One VM to rule them all. In Proceedings of the 2013 ACM international symposium on New ideas, new paradigms, and reflections on programming & software. 187–204.
[66] Steffen Zeuch, Holger Pirk, and Johann-Christoph Freytag. 2016. Non-invasive Progressive Optimization for In-memory Databases. Proc. VLDB Endow. 9, 14 (Oct. 2016), 1659–1670. https://doi.org/10.14778/3007328.3007332
[67] Wangda Zhang and Kenneth A. Ross. 2020. Exploiting data skew for improved query performance. IEEE Transactions on Knowledge and Data Engineering (2020).
[68] Wangda Zhang and Kenneth A. Ross. 2020. Permutation Index: Exploiting Data Skew for Improved Query Performance. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1982–1985.
[69] Jingren Zhou and Kenneth A. Ross. 2002. Implementing Database Operations Using SIMD Instructions. In Proceedings of SIGMOD Conference.
[70] M. Zukowski, M. van de Wiel, and P. Boncz. 2012. Vectorwise: A Vectorized Analytical DBMS. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on. 1349–1350. https://doi.org/10.1109/ICDE.2012.148