2021 - Adaptive Code Generation for Data-Intensive Analytics
taken advantage of (or avoided) by the compiler that generates machine code from this loop. Performance diversity can be caused by variations in the data distribution and/or physical data ordering, influencing branch-related and cache-related stalls. Different algorithmic choices (reordering of operations, use of SIMD instructions) can improve or worsen performance. These opportunities and pitfalls are discussed at more length in Section 3.

Our work builds on the dynamic query execution scheme pioneered by Vectorwise [55] (discussed in Section 2). Multiple plans are precompiled for a particular operation. As the operation progresses over a very large data set, performance information from the early stages of execution can be used to guide the choice of plan for later stages. Plan switching allows for robustness in the face of errors in query cost estimation, and also allows a dynamic change of plans if the data distribution changes within the dataset. Details of our adaptive code generation are given in Section 4.

Existing query optimization techniques for in-memory processing are limited in several ways: (a) they are not extensively used outside relational database management systems; (b) they are limited to a handful of relational operators, and do not cover access patterns or dynamically-defined functions found in other data-analysis scenarios; (c) they treat the underlying compiler as a black box, with unpredictable performance depending on which compiler is used with which compiler settings; (d) they often bake in design choices that may be appropriate for usage within a particular DBMS, but not for more general cases. We address these challenges by optimizing data-analysis-style queries expressed as tight loops in a conventional imperative programming language.

We extend an open-source compiler (the GraalVM compiler [65] and Truffle [64]) with both known and novel optimization techniques that can automatically be applied whenever the compiler identifies that a loop is time-consuming. GraalVM is an ecosystem and shared runtime offering performance advantages for a variety of programming languages [45]. Interpreted code is automatically transformed into compiled code when the system detects a performance hot-spot. The GraalVM compiler is a dynamic just-in-time (JIT) compiler that performs sophisticated code analysis and optimization. The Truffle API allows programming languages to be combined in a shared runtime using an abstract syntax tree representation. Interpreted code is associated with nodes in the abstract syntax tree, and the Graal compiler automatically compiles the performance-critical parts of the code to speed up execution. Details of the Graal/Truffle implementation are provided in Section 5.

Integration into the compiler enables many applications to efficiently process large data sets. The system supports dynamic queries involving user-defined functions and arbitrary access patterns. Database-style and compiler optimizations co-exist, eliminating some of the mismatches that happen when the compiler is used as a black box by a DBMS. The system tunes a variety of run-time execution parameters automatically, with minimal guidance from the programmer.

We evaluate our system using the TPC-H benchmark, weather visualization, and microbenchmark queries, over datasets with various kinds of ordering/clustering properties. The experimental evaluation (Section 6) shows that:

• Our system can dynamically respond to changes in the data distribution, choosing the best plan for the current data.
• Our system can invoke SIMD optimizations for code, even though they do not always improve performance. In our system, the SIMD version will be used if it is better, and the scalar version will be used otherwise.
• The system can select a small but representative set of plans that cover the search space well enough to respond to various parameter combinations that may not have been known at query compilation time.
• It is possible to dynamically achieve a balance between exploration (trying out a variety of plans) and exploitation (maximally employing the best plan).

2 BACKGROUND

2.1 Prior Work on Compiling Query Plans

Our work builds upon the dynamic query execution scheme developed as part of the Vectorwise system [55]. The Vectorwise implementers observed that query performance could vary significantly due to low-level performance effects. Different query plans might perform best in different regions of parameter space, yet the parameter values may not be known at compile time. Different compilers for the same programming language might give better or worse results, depending on the query. Data distribution effects (which may change as the system progresses through the data) may affect query performance, so that one plan is best for parts of the data, while another plan is best for other parts.

The Vectorwise team also observed that it is hard to estimate the cost function, and not just because of the data distribution effects and parameter estimation inaccuracies mentioned above. Different run-time platforms may have different performance characteristics, such as the relative cost of a SIMD instruction versus a scalar instruction, or the relative impact of a branch misprediction. Further, the overlapping of various latencies (e.g., cache misses) makes it hard to identify their true impact on elapsed time. Rather than estimate the cost, Vectorwise chose to measure the actual cost.

In Vectorwise, data chunks of about 1000 rows are processed as a unit. A key innovation in Vectorwise is the analysis of the actual running time over recent chunks of data using several different candidate query plans in turn [55]. Each plan contributes to the final result, but might take more or less time depending on data and machine parameters. The plan that takes the least time is scheduled to run for an extended number of chunks. After that, all candidate plans are run again within a certain window to see if the data has changed to the point that a different plan is best. The best plan is then scheduled for an extended period, and the process repeats.

To summarize, the advantages of the approach pioneered by Vectorwise are: (a) optimization happens on the basis of actual time rather than predicted time, reducing the reliance on complex and potentially inaccurate cost modeling; (b) most of the execution will use the best plan among the candidates; (c) over time, as the data changes, the chosen plan can adapt to those changes. Despite these advantages, the Vectorwise approach has several limitations that we will discuss next.

2.2 Limitations of the Vectorwise Approach

The first and most obvious limitation of the Vectorwise approach is that the implementation effort has no wider impact beyond uses of
the Vectorwise system itself. It might be possible for a competing DBMS to mimic the implementation described by Vectorwise, but applications of the techniques beyond in-memory relational DBMSs are unclear. In contrast, our approach embeds the optimization/execution decision making at the programming language level, making the techniques broadly applicable to a wide variety of applications.

A second limitation is that the Vectorwise approach uses a few hand-crafted code fragments that cover only the essential DBMS operators. These code fragments are precompiled at DBMS build time. Code fragments with in-lined user-defined code are not considered. Access patterns in which there is interaction between consecutive rows are common in applications such as time-series analysis, but are essentially absent in a relational DBMS. We compile code fragments at query time, allowing user-defined code and arbitrary access patterns that might not match a handful of predefined templates.

The paper describing the Vectorwise system reports that they used several different compilers, with different optimization settings, and observed varying performance results. The results were so unpredictable that they were forced to compile multiple variants of each code fragment: two compilers and two optimization settings would require four compiled code variants to cover all of the cases. The Vectorwise authors remarked that they resisted the temptation to investigate why the compilers had such different behaviors [55], presumably because they had no control to effect a change even if they could identify an inefficiency. In our method, the compiler is not an external black box. Instead, because DB-style optimizations and traditional compiler optimizations happen in the same framework, we can control code generation. If the compiler is unsure whether an optimization helps or not, two variants of the code fragment can be generated internally, by the compiler itself.

The Vectorwise system chooses somewhat arbitrary values for parameters such as the window size to run the current best plan, and the window size within which other candidate plans are run. While these settings may have been adequate for the limited set of operators considered by Vectorwise, it is not clear that such choices would be optimal in the broader contexts considered in this paper. We investigate principled ways of setting such parameters, allowing them to vary based on the performance feedback generated so far. Section 6 shows an experiment where the choice of window size matters.

3 PERFORMANCE DIVERSITY AND REWRITING OPTIONS

Let us return to the loop introduced in Section 1, in which we first specialize and in-line the definitions of interesting and combine. The user has specified that an ID is interesting if its latitude is greater than 30, and has stated that the way to combine rainfall readings is to sum the rain amounts grouped by zipcode. zip[id] represents the zipcode where the sensor having identifier id is located, and lat[id] and long[id] represent the latitude and longitude of the sensor.

In database terminology this query applies two selection conditions, performs two foreign-key joins to the lat and zip “tables”, and performs a grouped SUM aggregate of the rainfall. We assume that total is very large, so that optimizing the loop is likely to have a big performance impact. The hot-spot compiler will be triggered relatively quickly to compile the code rather than continuing to run it in interpreted mode. We describe some of the performance-related choices that need to be made below.

Condition Ordering and Non-Branching Plans. Selection condition order is important for in-memory query processing [54]. Branch misprediction effects contribute significantly to query processing costs. Among the plans considered are plans that avoid branches altogether by converting control dependencies to data dependencies. For example, the plan above might be rewritten as follows to avoid branches:

    for (i = 0; i < total; i++) {
        // & rather than &&; no branches
        test = (time[i] > start & lat[ID[i]] > 30);
        // -1 = 0xFFFFFFFF; -0 = 0
        mask = -test;
        // 0 mask means add 0, i.e., no-op
        accum[zip[ID[i]]] += (mask & rain[i]);
    }

While branch-free code eliminates the branch misprediction overhead, it is not always the best choice. For example, if a condition is very selective, so that it fails most of the time, then executing the condition early is good because (a) it avoids unnecessary work for most tuples, and (b) conditions that fail most of the time are relatively well predicted by modern processors. When several conditions are present, the best ordering of those conditions depends both on the selectivity of each condition and the cost of testing it [23, 54]. These kinds of alternative rewritings are used in the hand-generated templates of the Vectorwise system [55]. Our system automatically generates candidate plans at query time using each kind of rewriting (details in later sections).

Cache Misses. Accesses to the arrays time, ID and rain are sequential, and prefetching is likely to be effective in minimizing cache latency for those accesses. lat and zip are accessed non-sequentially, and may generate cache misses whose latency may be significant (tens of cycles for an L2 miss, about 100 cycles for an L3 miss). These costs also influence the ordering of selections, since a cache miss might make a condition like lat[ID[i]] > 30 expensive to test. Whether lat[ID[i]] > 30 generates a cache miss on lat depends on: (a) how many IDs there are in total and how compactly they are allocated in the lat array (e.g., are sensor IDs re-used when a sensor is taken out of service?); (b) how many IDs are likely to be registering rain at the same time (depends on sensor placement and weather patterns); (c) how likely it is that a sensor that registers rain at time 𝑡 also registers rain at time 𝑡 + 1 (affects temporal locality, and depends on weather patterns). Given the complexity of predicting cache behavior, we circumvent the problem by considering a very limited number of scenarios. For example, we might just consider two extreme scenarios, one in which we expect an L3 cache miss and one in which we expect an L1 cache hit.

SIMD. SIMD instructions can be applied to both the conditions and actions of the code above. Let 𝑤 be the number of SIMD lanes. The condition lat[ID[i]] > 30 might be evaluated on 𝑤 consecutive i values by (a) loading a SIMD register with 𝑤 consecutive ID values; (b) using a SIMD gather instruction to look up 𝑤 different addresses within the lat array; and (c) comparing the results with
Figure 1: (a) Varying skew; (b) Varying selectivity.

a SIMD register pre-loaded with 𝑤 copies of the value 30. The resulting booleans can then be ANDed with other boolean conditions, or used as a mask for other actions.

The update of the accum array can similarly use SIMD gather operations to load the current running sums, SIMD add instructions to perform the updates, and SIMD scatter operations to write out the results. Special SIMD instructions detect conflicts (e.g., updates to a common memory address) across SIMD lanes and serialize them in the same sequence as the input.

SIMD processing has the potential to speed up processing if the workload is not memory-bound, by using fewer instructions to do the same work. It is not always clear that SIMD optimization is desirable because (a) similarly to no-branch plans, it does the entire work even if the first condition would have led to a quick rejection; (b) under conditions of skew, the conflict resolution step of the SIMD scatters may dominate the cost, making the SIMD option slower than the scalar option. Rather than trying to estimate skew and determine whether the exact cost of the SIMD option is optimal for the current data, we simply generate SIMD plans as additional candidates to be considered at run-time.

Performance Diversity. Figure 1 illustrates two cases of performance diversity alluded to in the previous discussion. Figure 1(a) shows the performance of a grouped aggregation, where the grouping column may be skewed according to a Zipf factor shown on the x-axis. The SIMD code is faster than scalar code under low skew, but slower under high skew due to the high cost of conflict resolution as described above [67]. Scalar code is fastest at high skew because the grouping cardinality is small and so the aggregates fit in the L1 cache. Figure 1(b) shows three plans for a query having two selection conditions, using plans of the kind described in [54]. Each of the three plans is best in some selectivity range. Because the selectivity may not be known in advance, or may vary within the dataset, our approach will be to include multiple plans and to choose the best plan according to the recent performance history.

4 ADAPTIVE CODE GENERATION

So far we have suggested that we will be generating multiple plans, running each for chunks of data during a testing phase, and then selecting the fastest plan to run for an extended period. Unlike the Vectorwise system, where an arbitrary number of plans might be precompiled in advance, we aim to generate plans at query time. This choice allows for more general plans, including in-lined user-defined functions that are not known in advance. Nevertheless, this choice is challenging because it makes query compilation itself part of the observable response time. Our preliminary observations using the Graal compiler (Section 5) suggest that a plan can be compiled in tens of milliseconds. Thus, if we were performing a large scan taking several seconds, say, we could probably not afford to compile more than 10 plans. Beyond that, the overhead of compilation may outweigh the benefits of adaptive query processing/optimization.

4.1 On-Line Analysis

First, we optimize abstractions of the loop components. For example, the cost estimate for a SIMD computation may depend on the skew in the group-by values (Figure 1(a)). We may simply optimize under two abstracted conditions: no-skew and high-skew. As a second example, the cost estimate for a condition-testing plan may depend on the selectivity (Figure 1(b)) and cache behavior of the data. Rather than estimating a selectivity for a condition, we impose a selectivity on that condition as a way of making sure we cover an appropriate subregion of the optimization space. A condition may be given selectivities that are “small,” “medium,” or “large” (say 0.05, 0.5, 0.95 respectively).

4.2 Off-Line Analysis

There is an implicit bias in our on-line analysis, because our relatively coarse abstractions of parameters may be far from either (a) the true parameters, or (b) the critical values of the parameters for which the choice of plans would change. We therefore supplement our on-line analysis with an off-line analysis for common query patterns. For example, we imagine that loops containing if-statements that test any number of conditions may be common in practice. We therefore perform a more detailed off-line analysis of 𝑐-condition loops for all 𝑐 below some moderately large threshold (at least 10). Although this off-line analysis is expensive, it would happen once for a target hardware environment before the compiler is released, or during a calibration step when the compiler is installed. After the off-line analysis, the system stores the generated candidate plans as a summary to use for adaptive code generation online (Section 5).

For each 𝑐, we use a more fine-grained approach to compute a cost estimate of candidate plans for 𝑐 conditions based on the cost
Table 1: Candidate plans

#  plans (exhaustive)        ratio   plans (local)            ratio
1  { C0 & C1 & C2 }          9.77    IF ( C0 ) { C1 & C2 }    12.64
2  IF ( C0 ) { C1 & C2 }     5.40    IF ( C0 ) { C1 & C2 }    12.07
   IF ( C1 & C2 ) { C0 }             IF ( C1 ) { C0 & C2 }
3  IF ( C0 ) { C1 & C2 }     3.25    IF ( C0 ) { C1 & C2 }     3.25
   IF ( C1 ) { C0 & C2 }             IF ( C1 ) { C0 & C2 }
   IF ( C2 ) { C0 & C1 }             IF ( C2 ) { C0 & C1 }
4  { C0 & C1 & C2 }          1.97    { C0 & C1 & C2 }          1.97
   IF ( C0 ) { C1 & C2 }             IF ( C0 ) { C1 & C2 }
   IF ( C1 ) { C0 & C2 }             IF ( C1 ) { C0 & C2 }
   IF ( C2 ) { C0 & C1 }             IF ( C2 ) { C0 & C1 }
5  { C0 & C1 & C2 }          1.79    { C0 & C1 & C2 }          1.97
   IF ( C0 ) { C1 & C2 }             IF ( C0 ) { C1 & C2 }
   IF ( C1 && C0 ) { C2 }            IF ( C0 & C1 ) { C2 }
   IF ( C1 & C2 ) { C0 }             IF ( C1 ) { C0 & C2 }
   IF ( C2 && C0 ) { C1 }            IF ( C2 ) { C0 & C1 }

formulas of [54]. For example, for 3 conditions, we try all 6 orders as well as all logical-and, bitwise-and, and no-branch plans. Since we do not know the selectivity and cost of each condition (and the cost of the body part) in advance of query execution, we develop a large number of configurations in an offline analysis. For every condition, we test 20 selectivities ranging multiplicatively from 0.0001 to 0.9999. We test 10 cost values from 1 to 1024 cycles, again multiplicatively. Then, for each of these 20×10 configurations, we compute the cost of all different plans [54].

We then compute a summary of the best plans to use during online exploration. Suppose we can afford to use 𝑘 plans for exploration. Our metric for evaluating the quality of a set of 𝑘 plans is based on the worst-case ratio of estimated performance across all configurations:

    max over {configurations} of (the best cost among the 𝑘 plans) / (the best cost among all plans)

Then we would like to choose the set of 𝑘 plans that minimizes this ratio. An exhaustive search would be too costly (exponential in 𝑘), so we propose the following heuristic method.

(1) Every plan is considered as a valid candidate, and every configuration is mapped to the plan that minimizes its cost (which we record as the baseline cost for the configuration, to be used in the denominator of the formula above). Any plan to which no configuration is mapped at this point is eliminated.

(2) While there are still too many plans, consider each plan 𝑃 in turn as follows: (a) Map each configuration previously assigned to 𝑃 to the next-best plan, and compute the ratio of the new estimated cost to the baseline cost. Record the highest cost ratio as the score for 𝑃. (b) Remove the plan with the lowest score, and re-assign its configurations to their next-best plans.

We eliminate the plan with the lowest score because its elimination makes the smallest incremental difference to the overall ratio we are trying to minimize. In other words, the next-best plans are almost as good as the elimination candidate.

Table 1 shows how this algorithm performs for 3 conditions (𝑐 = 3) and up to 5 plans (1 ≤ 𝑘 ≤ 5). For comparison, we also show the results of an exhaustive search. In general, the best set of 𝑘 − 1 plans may not be a subset of the best set of 𝑘 candidate plans, but our heuristic algorithm does choose 𝑘 − 1 plans from among the best 𝑘 plans. We observe that the heuristic performs reasonably well when 𝑘 ≥ 3, which is likely in our application domain.

For small 𝑘, an exhaustive search is feasible, and it does not miss the best plans that the above heuristic could prune. Therefore we use a hybrid approach: generate the best 10 plans using the heuristic, and then search exhaustively among them for the best pair of plans. This approach is more accurate for a small number of candidate plans.

We used the maximum performance ratio as our heuristic function, but we could alternatively have used the average performance ratio. We argue that the average can be biased depending on how the selectivity and cost values are chosen. For example, averages would give extra weight to the regions of parameter space that were more heavily sampled. In contrast, the max ratio is relatively stable, and focuses the optimization on the part of the parameter space where it matters most.

Table 1 shows that there are diminishing returns in reducing the max-ratio metric as we choose more candidate plans. During online execution, the best plan among the candidate plans is chosen. Assuming that there is enough data for exploitation (so that the exploration cost is negligible), it is in theory better to choose from more candidate plans, but the marginal benefit is decreasing (as is the metric). As we demonstrate in the experiments, the performance stabilizes as we increase the value of 𝑘. In practice, a reasonable heuristic for 𝑘 conditions is to use at least 𝑘 candidate plans so that every condition can be the first condition in some plan.

4.3 Measuring Execution

We follow the Vectorwise approach by measuring actual times and choosing plans based on their recent history of execution times. The Graal/Truffle system already instruments interpreted code with counters to observe events like a branch being taken. When the interpreted code is identified as a hot-spot and compiled, that information is used to inform the subsequent compilation phase. The counter instrumentation is omitted from the compiled code to minimize overhead.

For the execution of compiled code, we divide the entire execution into a series of alternating exploration and exploitation periods. During an exploration period, a number of candidate plans are tested over input chunks and their execution times are compared. In the following exploitation period, the best plan is maximally employed over a larger number of chunks. We keep a recent history of chunk execution performance, so that the system can react to changes by comparing the current execution with previous executions. Two heuristics are used for dynamically setting parameters:

• Dynamic exploitation (DE). For consecutive exploration periods, if the best plan does not change, this suggests that the data is behaving consistently, so we double the size of the exploitation period; otherwise, the data distribution is likely to have changed between the two explorations, so we reduce the size to half of the original exploitation period.
Figure 2: (a) AND plan; (b) Reorder; (c) No-branch.

• Early exploration (EE). When we observe that a chunk takes significantly longer to execute (more than double the average of recent chunks), it is a strong indication that the underlying data has changed, so we start exploration using additional plans starting from the next chunk.

In practice, combining these two heuristics works well for our experimental datasets (Section 6.2).

5 IMPLEMENTATION

We use the Truffle language implementation framework [64] to develop the adaptive execution framework. Truffle is an open-source library that simplifies the development of language execution engines and data processing engines using self-optimizing abstract syntax trees (ASTs) in the GraalVM ecosystem. Each node in the AST represents an operation (e.g., a comparison, an evaluation of an AND condition, an arithmetic computation, etc.) that is compiled to machine code by the Graal compiler. During the execution, an AST node can make use of runtime information and change its internals to specialized versions that have better performance. Node rewriting and JIT compilation are automatically handled by the Graal compiler.

In this paper, we focus on JavaScript programs with a for-loop like the example in Section 1. Users can write a pragma directly above the for-loop they wish to perform adaptive execution on:

    var input0 = ... // initialize data arrays
    var input1 = ...
    var count = 0;

    "adaptive execution"; // adaptive execution pragma
    for (i = 0; i < 1000000000; ++i)
        if (input0[i] < 20 && input1[i] < 50)
            count++;

By using the pragma, the user is (a) certifying that the predicates in the if-statement can be reordered, and (b) hinting that adaptive execution should be applied to the for-loop.

5.1 Preprocessing

Upon execution of the JavaScript program written by the user, a custom script first rewrites the program source code to use the Polyglot API. Polyglot allows different languages implemented with Truffle to interoperate with each other. In our implementation, we use Polyglot to access variables in JavaScript, and make the following changes to the source code: (1) The for-loop itself is transformed into a string; (2) Values of all variables that are used in the for-loop, but defined outside of the for-loop, are stored in a dictionary; (3) The variable dictionary and the for-loop string are passed to the code generation framework via the Polyglot API.

To control the adaptive code generation, we implement a set of AST nodes extending from Truffle Nodes, including value nodes (e.g., constants), arithmetic nodes (e.g., Addition), and condition nodes (e.g., LessThan). When the rewritten source code is executed and the adaptive execution framework is invoked, control is handed over to the root node, a special Truffle AST node that handles the execution of the loop and measures the performance. We use a custom parser built with ANTLR to parse the for-loop string into the Truffle expression nodes we implemented, and generate multiple ASTs representing the candidate plans according to the summary obtained from offline analysis (Section 4.2). The variable values stored in the dictionary are written to the procedure stack, so that they can be accessed and modified during the adaptive execution.

Under the root node of the loop, a TopLevelCondition node represents the if-statement. For conjunctive conditions (AndCondition), a candidate plan specifies the ordering of the conditions as well as a mode indicating how the conditions are computed and combined together (LogicalAnd, BitwiseAnd, or NoBranch). The ordering and the node properties are stored as internal variables of an AndCondition. The body part of the if-statement (true branch) is a generic AST node if all conditions have been evaluated. If there are remaining conditions to be evaluated as no-branch conditions, then the body part is rewritten to an AndCondition node with NoBranch mode. The body also uses a mask to determine whether the result is written to output. Multiple assignment statements are permitted in the body.

Depending on the number of conditions (i.e., the structure of the code), the root node chooses a set of candidate plans from a summary with matching conditions. For each candidate plan, the root node constructs an AST as shown in Figure 2. An AndCondition node has conditions as its child nodes, which are basic conditions like LessThan comparisons. Figure 2 shows three example plans with the same semantics. By reordering the conditions (C0, C1 and C2), the AST in Figure 2(a) is rewritten to Figure 2(b) and thus executed differently. Either logical or bitwise AND can be used depending on the mode set in the AndCondition. If a no-branch plan is used, then an AndCondition with the no-branch mode is used to rewrite the plan into Figure 2(c), where only condition C0 is executed with a branching if-condition. We then invoke the Graal compiler backend to compile the AST into callable machine code. When there are no if-conditions, as in the example in Section 6.1, then the body part is just an AST representing the assignment
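As a toy illustration of the three condition modes discussed above, the following plain-JavaScript evaluator mimics what the generated code does for each mode. This is our own simplification: the real system evaluates compiled Truffle AST nodes, not closures.

```javascript
// Toy evaluator for the three AndCondition modes (LogicalAnd, BitwiseAnd,
// NoBranch). conds is an array of per-row predicate functions; body(i, mask)
// applies the action, masked so that a 0 mask makes it a no-op.
function runPlan(mode, conds, body, n) {
  for (let i = 0; i < n; i++) {
    if (mode === "LogicalAnd") {
      let ok = true;
      // Short-circuit evaluation: each condition is a branch.
      for (const c of conds) { if (!c(i)) { ok = false; break; } }
      if (ok) body(i, 1);
    } else if (mode === "BitwiseAnd") {
      let ok = 1;
      // All conditions are evaluated; only the final test branches.
      for (const c of conds) ok &= c(i) ? 1 : 0;
      if (ok) body(i, 1);
    } else { // "NoBranch": no if at all; the body itself is masked.
      let mask = 1;
      for (const c of conds) mask &= c(i) ? 1 : 0;
      body(i, mask);
    }
  }
}
```

All three modes produce the same result; they differ only in how many conditions are evaluated per row and how many branches the generated code contains, which is exactly what makes them perform differently on different data.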
Table 2: Time breakdown (s), 10^9 rows, median of 10 runs
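Looking back at the off-line analysis of Section 4.2, the greedy elimination that produces the candidate-plan summary can be sketched as follows. The cost matrix here is made up for illustration; the real analysis derives costs from the formulas of [54] over the 20×10 configuration grid.

```javascript
// Sketch of the Section 4.2 heuristic: keep k candidate plans, greedily
// dropping the plan whose removal increases the worst-case cost ratio least.
// costs[p][c] = estimated cost of plan p under configuration c.
function selectPlans(costs, k) {
  const nConfigs = costs[0].length;
  // Baseline: best cost per configuration over all plans (the denominator).
  const baseline = [];
  for (let c = 0; c < nConfigs; c++)
    baseline.push(Math.min(...costs.map(row => row[c])));
  // Step 1: every configuration maps to its cheapest plan; plans that are
  // cheapest for no configuration are eliminated immediately.
  let alive = [...new Set(
    Array.from({ length: nConfigs },
               (_, c) => costs.findIndex(row => row[c] === baseline[c])))];
  // Step 2: score each remaining plan P by the highest ratio
  // (next-best cost / baseline) over the configurations assigned to P,
  // then drop the plan with the lowest score and repeat.
  while (alive.length > k) {
    let bestScore = Infinity, toDrop = alive[0];
    for (const p of alive) {
      const rest = alive.filter(q => q !== p);
      let score = 1;
      for (let c = 0; c < nConfigs; c++) {
        const cur = Math.min(...alive.map(q => costs[q][c]));
        if (costs[p][c] !== cur) continue;            // c not assigned to p
        const next = Math.min(...rest.map(q => costs[q][c]));
        score = Math.max(score, next / baseline[c]);
      }
      if (score < bestScore) { bestScore = score; toDrop = p; }
    }
    alive = alive.filter(q => q !== toDrop);
  }
  return alive.sort((x, y) => x - y);
}
```

For small 𝑘, the paper refines this with a hybrid: take the top 10 plans from the heuristic, then search exhaustively among them.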
data, whereas a sequence of local storms might generate regions of data skew. A close examination of Figure 4(b) shows that at the beginning of each skewed region, there is a small period during which the inferior plan is being run. The system has not yet reached the next window where it re-evaluates plans; it continues executing the same plan until that happens.

Figure 4(a): 𝑧 increasing.

To understand the impact of the length of the exploitation periods, we ran several experiments with different exploitation period sizes. Figure 4(c) shows the elapsed time spent in exploration and exploitation mode separately. When the exploitation period is too small, we waste time running suboptimal plans too often: a suboptimal plan that is 5X worse than optimal and run 3% of the time will constitute a 12% overhead. When the period is too large, we do not notice a change in the data distribution until we have been running a suboptimal plan for a while. For this example, the best intermediate value for the exploitation period is around 200 chunks, and the exploration mode takes 4.3% of the time.
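The DE and EE heuristics of Section 4.3 that govern these period lengths can be sketched as a small controller. This is our own rendering: the history length and thresholds below are illustrative, and whether DE halves the current or the initial period is a detail the text leaves open (we halve the current one here).

```javascript
// Sketch of the dynamic-exploitation (DE) and early-exploration (EE)
// heuristics of Section 4.3. Constants are illustrative, not the
// system's actual settings.
class AdaptiveController {
  constructor(initialPeriod) {
    this.period = initialPeriod; // exploitation period, in chunks
    this.lastBest = null;        // winner of the previous exploration
    this.recent = [];            // recent chunk execution times
  }
  // DE: call at the end of each exploration period with the winning plan.
  endExploration(bestPlan) {
    if (this.lastBest !== null) {
      if (bestPlan === this.lastBest) {
        this.period *= 2;        // data looks stable: exploit longer
      } else {
        this.period = Math.max(1, Math.floor(this.period / 2)); // data changed
      }
    }
    this.lastBest = bestPlan;
    return this.period;
  }
  // EE: call after each exploited chunk; true means "re-explore now".
  chunkDone(elapsed) {
    this.recent.push(elapsed);
    if (this.recent.length > 16) this.recent.shift();
    const avg = this.recent.reduce((s, t) => s + t, 0) / this.recent.length;
    return elapsed > 2 * avg;    // chunk took over double the recent average
  }
}
```

A stable data distribution thus drives the exploitation period up geometrically, while either a changed winner or an anomalously slow chunk pulls the system back into exploration quickly.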
Figure 5: (a) Varying plans.
Figure 6: (a) Different plans on unsorted data.

chunks, chunk size is 1000 tuples). The selectivity of a dataset changes randomly every 1K (10K, 100K, . . . ) tuples, and the best exploitation period also changes correspondingly.

Figure 5(c) shows that the dynamic heuristics of Section 4.3 are able to achieve the performance of the best exploitation period (on the 1M dataset). Using one heuristic alone can avoid the worst-case exploitation period, as shown in Figure 5(b), and using both heuristics together we can achieve performance similar to that of the best exploitation period found.

6.3 TPC-H Queries

We now show results using code that implements TPC-H queries Q6 and Q19 [1]. We chose those queries because they have interesting condition structures that might benefit from our approach.

6.3.1 Query 6. Query 6 quantifies the amount of revenue increase that would have resulted from eliminating certain companywide discounts in a given percentage range in a given year. The query is written in a JavaScript program as an if-statement with five different conditions (range predicates).

    for (i = 0; i < N; ++i)
        if (shipdate[i] >= DATE_MIN &&
            shipdate[i] < DATE_MAX &&
            discount[i] >= DISCOUNT_MIN &&
            discount[i] <= DISCOUNT_MAX &&
            quantity[i] < QUANTITY)
            sum += price[i] * discount[i];

Figure 6 shows the performance of TPC-H query Q6. Figure 6(a) shows the results on unsorted data, as generated by the benchmark data generator. The running times are clustered by the first condition used during evaluation. The two best clusters correspond to the two range predicates on the shipdate column. From left to right, the clusters are:

• 6 points: DateMin, DateMax as the first two conditions
• 18 points: DateMin first, other conditions second
937
• 6 points: DateMax, DateMin as the first two conditions Table 3: TPC-H Q19 time (us) on unsorted data
• 18 points: DateMax first, other conditions second
• remainder: neither DateMin nor DateMax first Plan (first condition) Conj1 Conj2 Conj3
Since the first condition has the most impact on performance, BRAND 2767167 2721858 2513244
the compiler uses the following ordering of predicates in the six CONTAINER 2820004 3051187 2902889
candidate plans for adaptive execution: QUANTITY MIN 2968445 3611849 2503752
(1) DateMin, DateMax, DiscountMin, DiscountMax, Quantity QUANTITY MAX 3498422 4088774 3975201
(2) DateMax, DateMin, DiscountMin, DiscountMax, Quantity SIZE 3549345 4767787 3360388
(3) DiscountMin, DateMin, DateMax, DiscountMax, Quantity SHIPMODE 3125625 2654871 2409472
(4) DiscountMax, DateMin, DateMax, DiscountMin, Quantity SHIPINSTRUCT 3826366 2665140 2425727
(5) Quantity, DateMin, DateMax, DiscountMin, DiscountMax
(6) no-branch plan (order unimportant)
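As a sketch of what the six Q6 candidate plans look like (our illustration, not the generated code; the predicate constants and helper names are hypothetical), the five branching plans differ only in the short-circuit order of the same predicates, while the no-branch plan evaluates all predicates unconditionally:

```javascript
// Hypothetical sketch of the six Q6 candidate plans. `d` holds the
// column arrays; the constants stand in for the query parameters.
const DATE_MIN = 8766, DATE_MAX = 9131;
const DISCOUNT_MIN = 0.05, DISCOUNT_MAX = 0.07, QUANTITY = 24;

const preds = {
  DateMin:     (d, i) => d.shipdate[i] >= DATE_MIN,
  DateMax:     (d, i) => d.shipdate[i] < DATE_MAX,
  DiscountMin: (d, i) => d.discount[i] >= DISCOUNT_MIN,
  DiscountMax: (d, i) => d.discount[i] <= DISCOUNT_MAX,
  Quantity:    (d, i) => d.quantity[i] < QUANTITY,
};

// Plans 1-5: short-circuit the predicates in a given order.
function branchingPlan(order) {
  return (d, n) => {
    let sum = 0;
    row: for (let i = 0; i < n; ++i) {
      for (const p of order) if (!preds[p](d, i)) continue row;
      sum += d.price[i] * d.discount[i];
    }
    return sum;
  };
}

// Plan 6: the no-branch plan evaluates every predicate and folds the
// 0/1 results together, trading extra work for branch-free execution.
function noBranchPlan(d, n) {
  let sum = 0;
  for (let i = 0; i < n; ++i) {
    let pass = 1;
    for (const p in preds) pass &= preds[p](d, i) ? 1 : 0;
    sum += pass * d.price[i] * d.discount[i];
  }
  return sum;
}
```

All six plans compute the same sum, so the adaptive runtime only has to time them on sample chunks and keep the fastest for the current exploitation period.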
Each of the first five plans has a different first condition to be evaluated. Under adaptive execution, the compiler automatically chooses the best plan to process most of the data, no matter how the program is written.

Figure 6(b) shows the results on data sorted by shipdate, which is likely to be the typical case in the real world. We show the performance of each of the six plans, as well as the adaptive execution results. In this figure, we plot the average running time per chunk for every 200 chunks of input data in the exploitation period. Because the data is sorted by the shipdate column, there is a discontinuity at around 20,000 input chunks, corresponding to the lower bound DATE_MIN specified in the query. Before the discontinuity, the adaptive execution chooses Plan 1; during the immediately following period, it chooses the no-branch plan, because evaluating the Date conditions is extra work (they always succeed) and Plans 1 and 2 become the most expensive; after the data crosses the DATE_MAX threshold, the system chooses Plan 2.

Figure 6(c) shows the total running time on the sorted data for different plans. The compiled code automatically chooses the best variant among the six candidate plans, and the total execution time of the adaptive method is reduced compared with any single fixed plan.

6.3.2 Query 19. Query 19 reports the gross discounted revenue attributed to the sale of selected parts handled in a particular manner. The query's where clause is a disjunction of three conjunctions. Each of the three conjunctions has the same structure of predicates, but the predicates have different parameter values. For this experiment, we preprocessed the text data so that the parameters are numeric values supported by our current implementation. We manually implemented adaptive execution for this query because our current full compilation pipeline handles only conjunctive expressions. Written as a JavaScript program, the foreign key join is executed as an index lookup into the referencing array.

for (i = 0; i < N; ++i)
    if ((brand[partkey[i]] == BRAND1 &&
         container[partkey[i]] == CONTAINER1 &&
         quantity[i] >= QUANTITY1 &&
         quantity[i] <= QUANTITY1 + 10 &&
         psize[partkey[i]] <= SIZE1 &&
         shipmode[i] == SHIPMODE1 &&
         shipinstruct[i] == SHIPINSTRUCT1) || // Conj1
        (...) || // Conj2
        (...))   // Conj3
        sum += price[i] * (1 - discount[i]);

For each of the seven predicates in Conj1, we could include at least one plan that checks the predicate first, for a total of seven plans. The same observation holds for Conj2 and Conj3. Since the conjunctions are quite selective, no-branch plans for evaluating the conjunction are excluded because they are likely to perform badly. A naive application of our approach would then need to generate 7³ combined plans in order to cover all of the important cases. Instead, we observe that because the conjunctions are relatively selective, and combined by disjunction, all of the conjunctions are likely to be executed for most rows. In other words, it is unlikely that a positive result from testing one of the conditions would be effective at short-circuiting the evaluation to avoid the other conditions. (We also verified that all 6 orderings of the three conjunctions have roughly the same running time.) If all three conjunctions are going to be executed almost all of the time anyway, we should optimize them independently. As a result we get 7 ∗ 3 = 21 plans rather than 7³ plans. For this particular query, the three conjunctions have the same structure, and so 7 plans (with three instances of each) would suffice. However, the compiler does not know that the conjunctions have similar structure, and so it cannot share plans in this way. Instead of one exploration period, we now use three exploration periods, one for each of the conjunctions. In each exploration period, we select the best plan for one of the conjunctions.

Table 3 shows the running time on unsorted data. The table shows the first predicate of the plan for each of the three conjunctions, when the other two conjunctions each use BRAND as the first predicate. In adaptive execution, Conj1 chooses the BRAND plan, while Conj2 and Conj3 each choose the SHIPMODE plan. As a result the adaptive execution takes about 2.4 seconds to compute the revenue loss no matter how the program is written (i.e., the ordering of the conditions by the programmer does not matter due to adaptive execution). This performance is about twice as good as the worst plan in Table 3, which is probably not the worst plan overall. This experiment shows that the advantage of our approach includes robust performance even for complex conditions involving conjunctions and disjunctions.

When the data is sorted by SHIPMODE, there is a region of data where the equality predicate on SHIPMODE is always satisfied. We find that each conjunction either uses the SHIPMODE plan to quickly filter out the unqualified data, or, when the SHIPMODE equality predicate is satisfied, uses the BRAND plan, since it is the most selective and thus has the best performance. Figure 7 shows the runtime profiling of the BRAND plan, the SHIPMODE plan, and the adaptive execution. The other plans behave similarly to the BRAND plan, but they are slower; for clarity, their profiles are omitted in the figure. The profile of the adaptive execution overlaps with the BRAND plan when the SHIPMODE test is true, demonstrating that the execution switches to a different plan when the underlying data changes. As a result, the adaptive plan takes 2.25 seconds to complete the computation, compared with a fixed BRAND plan taking 2.81 seconds and a fixed SHIPMODE plan taking 2.46 seconds.

[Figure 7: Performance on Q19 (sorted)]
[Figure 8: Performance of varying TIME_MIN]
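The plan-space reduction for Q19 (7 ∗ 3 = 21 explored plans instead of 7³ combined plans) can be sketched as follows; the harness and cost callback are our hypothetical illustration, not the system's actual interface:

```javascript
// Pick the best first predicate for each conjunction independently,
// instead of exploring all combinations jointly. `measure(conj, label)`
// stands in for timing a chunk with `label` tested first in conjunction
// `conj` while the other conjunctions keep their current plans.
function pickBestPlanPerConjunction(candidates, measure) {
  return candidates.map((labels, conj) => {
    let best = labels[0];
    let bestCost = measure(conj, best);
    for (const label of labels.slice(1)) {
      const cost = measure(conj, label); // one exploration step
      if (cost < bestCost) { best = label; bestCost = cost; }
    }
    return best; // exploited for the rest of this conjunction's chunks
  });
}
```

With 7 candidate first predicates per conjunction, this performs 7 measurements per conjunction (21 in total) rather than enumerating all 7³ = 343 combined orderings.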
FPS is up to 1.4x. Changing the TIME_MAX or DATE_MAX conditions results in similar observations.

Since the dataset is not globally ordered by any single dimension of the nine conditions, during the adaptive execution of the program in the experiments of Figures 8 and 9, seven out of the nine candidate plans were exploited at least once. This observation emphasizes the need for a diversity of plans to handle runtime configurations that are difficult to predict, and the ability of our system to dynamically choose an appropriate plan.

7 RELATED WORK

Column-oriented execution [42] and cache-conscious operators [43] were proposed before the advent of multi-core CPUs. Block-at-a-time execution [8] and query-dependent code generation [21, 38, 44] are both state-of-the-art designs for analytical query engines [33]. The present work has features from both block-at-a-time execution and query-dependent code generation.

SIMD optimizations have been applied to a variety of database operators including joins [3, 4, 6, 27, 34, 58], sorting [10, 26, 50, 56], scans [69] and compression [40, 52, 63]. Advanced SIMD optimizations [49, 51] include non-linear-access operators. SIMD optimizations work best when data is cache-resident [68], but there are trade-offs between scalar and SIMD code, as we demonstrated in Section 3.

Adaptive query processing aims to refine a query plan at runtime on the basis of statistics gathered at intermediate stages of the query computation [2, 14]. Multiple sub-plans could be compiled into a query, with a choice to be determined based on partial computations such as the size of an intermediate table. Alternatively, when a departure from the predicted behavior occurs, another round of query optimization could be performed at run-time. Early work on this topic instrumented query code with counters to gather statistics that inform such choices [13, 28]. More recent work using in-memory databases uses hardware performance counters to gather such statistics without any performance overhead [66].

We use a limited number of query plans based on an analysis of regions of parameter space. The Picasso database query optimizer visualizer allows one to visually inspect optimal plan choices for different regions of the parameter space [15, 22]. Our choice of a small number of plans is analogous to how Picasso would create a "reduced diagram" with a bounded reduction in overall performance. Empirically, the authors find that ten plans are almost always sufficient to cover the parameter space with at most a 20% degradation in the plan cost at any point in the space [15]. PlanBouquets [17] incrementally discovers actual selectivity at runtime in order to identify an appropriate plan to execute, and recent work [31] has improved its significant compile-time overheads. Our plans are likely to be simpler than the ones considered by Picasso, so fewer than ten plans may typically be sufficient.

To deal with arbitrary user-defined functions, [12] compiles a high-level query workflow into a distributed program. UDFs are compiled with LLVM into intermediate representations and then linked with the workflow program into binary executables. A different approach proposed recently is to compile UDFs into plain SQL queries [16, 24], where arbitrary control flows are translated into recursive expressions.

Database and programming-language compilers have a common goal, namely to generate efficient machine code for queries/programs written in a high-level language. Recent query compilers resemble programming-language compilers, sharing some of the low-level infrastructure such as LLVM [12, 44]. The programming-language community has built hot-spot compilers [46] that initially interpret (and profile) code sections. When the interpreter determines that a code section is a hot-spot, it pauses, compiles the code section in real time, and executes the remainder of the code section using the compiled code. This choice balances compilation and execution time, and similar innovations have recently been described for database query compilation [36]. While database compilers have adopted programming-language innovations such as LLVM and hot-spot compilation, our method shows that there is also an opportunity for technology transfer in the opposite direction.

Our system extends the Truffle framework [64] and the Graal compiler [65]. Using Graal as the host compiler, Truffle is particularly well-suited for languages with very dynamic semantics and whose execution depends heavily on the size, layout and contents of the input data. Truffle offers numerous primitives for collecting information about the observed data types and program behavior. Additionally, so-called assumptions allow for non-local optimizations, where the point that uses optimized code based on a specific assumption is only loosely connected to the points that potentially invalidate this assumption. Leveraging this speculative just-in-time compilation based on implicit schemas that are discovered at run-time, Truffle has also been used to develop efficient parsers for JSON and CSV data [9], and to accelerate data de-serialization [57]. The existing profiling and assumption mechanisms in Truffle are based on heuristics; they are local, behavior-centric, and strictly stabilizing (always moving towards the most generic version). This paper extends them with a dynamic mechanism, directly observing the actual performance of different but semantically equal algorithms.

8 CONCLUSIONS

We studied optimization techniques for data-analysis style queries expressed as tight loops in a conventional imperative programming language. Since the data distribution often strongly affects query performance, it is important to make the code generation and execution adaptive to the underlying data. To adapt to this performance diversity, we built upon an open-source compiler to generate code that efficiently processes large data sets with varying data distributions and predicate selectivities. By using a learning framework with alternating exploration and exploitation periods, we enabled code generation using different plans and SIMD options. We showed that the system could tune run-time execution parameters automatically, with minimal guidance from the programmer. As a result, we achieved robust query performance in both microbenchmark and TPC-H queries. When the underlying data changes, the adaptive code generation and execution can in fact achieve better performance.

ACKNOWLEDGMENTS

This research was supported in part by a gift to Columbia University from Oracle Corp, and by NSF grant IIS-2008295.
REFERENCES
[1] [n.d.]. The TPC-H Benchmark. http://www.tpc.org/tpch.
[2] Ron Avnur and Joseph M. Hellerstein. 2000. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. 261–272.
[3] Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M. Tamer Ozsu. 2013. Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. PVLDB 7, 1 (Sept. 2013), 85–96.
[4] Cagri Balkesen, Jens Teubner, Gustavo Alonso, and M. Tamer Ozsu. 2013. Main-memory Hash Joins on Multi-core CPUs: Tuning to the Underlying Hardware. In ICDE. 362–373.
[5] Ronald Barber, Peter Bendel, Marco Czech, Oliver Draese, Frederick Ho, Namik Hrle, Stratos Idreos, Min-Soo Kim, Oliver Koeth, Jae-Gil Lee, Tianchao Tim Li, Guy M. Lohman, Konstantinos Morfonios, René Müller, Keshava Murthy, Ippokratis Pandis, Lin Qiao, Vijayshankar Raman, Richard Sidle, Knut Stolze, and Sandor Szabo. 2012. Business Analytics in (a) Blink. IEEE Data Eng. Bull. 35, 1 (2012), 9–14.
[6] Spyros Blanas, Yinan Li, and Jignesh Patel. 2011. Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. In SIGMOD. 37–48.
[7] Peter A. Boncz, Martin L. Kersten, and Stefan Manegold. 2008. Breaking the Memory Wall in MonetDB. Commun. ACM 51, 12 (Dec. 2008), 77–85. https://doi.org/10.1145/1409360.1409380
[8] Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-pipelining query execution. In CIDR.
[9] Daniele Bonetta and Matthias Brantner. 2017. FAD.js: fast JSON data access using JIT-based speculative optimizations. Proceedings of the VLDB Endowment 10, 12 (2017), 1778–1789.
[10] Jatin Chhugani et al. 2008. Efficient implementation of sorting on multi-core SIMD CPU architecture. In VLDB. 1313–1324.
[11] Confluent Inc. 2019. Streaming SQL for Apache Kafka. https://www.confluent.io/product/ksql.
[12] Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Carsten Binnig, Ugur Çetintemel, and Stan Zdonik. 2015. An Architecture for Compiling UDF-centric Workflows. PVLDB 8, 12 (2015), 1466–1477.
[13] Amol Deshpande and Joseph M. Hellerstein. 2004. Lifting the Burden of History from Adaptive Query Processing. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30 (Toronto, Canada) (VLDB ’04). VLDB Endowment, 948–959. http://dl.acm.org/citation.cfm?id=1316689.1316771
[14] Amol Deshpande, Zachary Ives, and Vijayshankar Raman. 2007. Adaptive query processing. Now Publishers Inc.
[15] Harish Doraiswamy, Pooja N. Darera, and Jayant R. Haritsa. 2007. On the Production of Anorexic Plan Diagrams. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007. 1081–1092. http://www.vldb.org/conf/2007/papers/research/p1081-d.pdf
[16] Christian Duta, Denis Hirn, and Torsten Grust. 2019. Compiling PL/SQL Away. arXiv preprint arXiv:1909.03291 (2019).
[17] Anshuman Dutt and Jayant R. Haritsa. 2014. Plan bouquets: query processing without selectivity estimation. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 1039–1050.
[18] Franz Faerber, Alfons Kemper, Per-Åke Larson, Justin J. Levandoski, Thomas Neumann, and Andrew Pavlo. 2017. Main Memory Database Systems. Foundations and Trends in Databases 8, 1-2 (2017), 1–130. https://doi.org/10.1561/1900000058
[19] Franz Färber, Sang Kyun Cha, Jürgen Primsch, Christof Bornhövd, Stefan Sigg, and Wolfgang Lehner. 2012. SAP HANA Database: Data Management for Modern Business Applications. SIGMOD Rec. 40, 4 (Jan. 2012), 45–51. https://doi.org/10.1145/2094114.2094126
[20] Martin Grund, Jens Krüger, Hasso Plattner, Alexander Zeier, Philippe Cudre-Mauroux, and Samuel Madden. 2010. HYRISE: A Main Memory Hybrid Storage Engine. Proc. VLDB Endow. 4, 2 (Nov. 2010), 105–116. http://dl.acm.org/citation.cfm?id=1921071.1921077
[21] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the Case for Simpler Data Warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015. 1917–1923. https://doi.org/10.1145/2723372.2742795
[22] Jayant R. Haritsa. 2010. The Picasso Database Query Optimizer Visualizer. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 1517–1520. https://doi.org/10.14778/1920841.1921027
[23] Joseph M. Hellerstein. 1998. Optimization techniques for queries with expensive methods. ACM Transactions on Database Systems 23, 2 (June 1998), 113–157.
[24] Denis Hirn and Torsten Grust. 2020. PL/SQL Without the PL. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2677–2680.
[25] InfluxData Inc. 2019. Time series database (TSDB) explained. https://www.influxdata.com/time-series-database.
[26] Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu, and Toshio Nakatani. 2007. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In PACT. 189–198.
[27] Saurabh Jha, Bingsheng He, Mian Lu, Xuntao Cheng, and Huynh Phung Huynh. 2015. Improving Main Memory Hash Joins on Intel Xeon Phi Processors: An Experimental Approach. PVLDB 8, 6 (Feb. 2015), 642–653.
[28] Navin Kabra and David J. DeWitt. 1998. Efficient Mid-query Re-optimization of Sub-optimal Query Execution Plans. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (Seattle, Washington, USA) (SIGMOD ’98). ACM, New York, NY, USA, 106–117. https://doi.org/10.1145/276304.276315
[29] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. 2008. H-Store: a High-Performance, Distributed Main Memory Transaction Processing System. Proc. VLDB Endow. 1, 2 (2008), 1496–1499. https://doi.org/10.1145/1454159.1454211
[30] Michael Kalloniatis and Charles Luu. [n.d.]. Temporal Resolution. https://webvision.med.utah.edu/book/part-viii-psychophysics-of-vision/temporal-resolution/.
[31] Srinivas Karthik, Jayant R. Haritsa, Sreyash Kenkre, and Vinayaka Pandit. 2018. A concave path to low-overhead robust query processing. Proceedings of the VLDB Endowment 11, 13 (2018), 2183–2195.
[32] Alfons Kemper and Thomas Neumann. 2011. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE ’11). IEEE Computer Society, Washington, DC, USA, 195–206. https://doi.org/10.1109/ICDE.2011.5767867
[33] Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter Boncz. 2018. Everything You Always Wanted to Know About Compiled and Vectorized Queries but Were Afraid to Ask. Proc. VLDB Endow. 11, 13 (Sept. 2018), 2209–2222. https://doi.org/10.14778/3275366.3275370
[34] Changkyu Kim et al. 2009. Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs. PVLDB 2, 2 (Aug. 2009), 1378–1389.
[35] Yannis Klonatos, Christoph Koch, Tiark Rompf, and Hassan Chafi. 2014. Building Efficient Query Engines in a High-level Language. Proc. VLDB Endow. 7, 10 (June 2014), 853–864. https://doi.org/10.14778/2732951.2732959
[36] André Kohn, Viktor Leis, and Thomas Neumann. 2018. Adaptive Execution of Compiled Queries. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018. 197–208.
[37] Konstantinos Krikellas, Stratis Viglas, and Marcelo Cintra. 2010. Generating code for holistic query evaluation. In Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA. 613–624. https://doi.org/10.1109/ICDE.2010.5447892
[38] Konstantinos Krikellas, Stratis Viglas, and Marcelo Cintra. 2010. Generating code for holistic query evaluation. In ICDE. 613–624.
[39] Tirthankar Lahiri, Marie-Anne Neimat, and Steve Folkman. 2013. Oracle TimesTen: An In-Memory Database for Enterprise Applications. IEEE Data Eng. Bull. 36, 2 (2013), 6–13.
[40] Harald Lang, Tobias Mühlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann, and Alfons Kemper. 2016. Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016. 311–326. https://doi.org/10.1145/2882903.2882925
[41] Per-Åke Larson, Mike Zwilling, and Kevin Farlee. 2013. The Hekaton Memory-Optimized OLTP Engine. IEEE Data Eng. Bull. 36, 2 (2013), 34–40. http://sites.computer.org/debull/A13june/Hekaton1.pdf
[42] Stefan Manegold, Peter Boncz, and Martin Kersten. 2000. Optimizing database architecture for the new bottleneck: memory access. J. VLDB 9, 3 (2000), 231–246.
[43] Stefan Manegold, Peter Boncz, and Martin Kersten. 2002. Optimizing Main-Memory Join on Modern Hardware. TKDE 14, 4 (July 2002), 709–730.
[44] Thomas Neumann. 2011. Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB 4, 9 (June 2011), 539–550.
[45] Oracle Corp. 2019. GraalVM. https://www.graalvm.org/.
[46] Michael Paleczny, Christopher Vick, and Cliff Click. 2001. The Java HotSpot™ Server Compiler. In Proceedings of the 2001 Symposium on Java™ Virtual Machine Research and Technology Symposium - Volume 1 (Monterey, California) (JVM’01). USENIX Association, Berkeley, CA, USA, 1–1. http://dl.acm.org/citation.cfm?id=1267847.1267848
[47] Jignesh M. Patel, Harshad Deshmukh, Jianqiao Zhu, Navneet Potti, Zuyu Zhang, Marc Spehlmann, Hakan Memisoglu, and Saket Saurabh. 2018. Quickstep: A Data Platform Based on the Scaling-Up Approach. PVLDB 11, 6 (2018), 663–676.
[48] Andrew Pavlo, Gustavo Angulo, Joy Arulraj, Haibin Lin, Jiexi Lin, Lin Ma, Prashanth Menon, Todd Mowry, Matthew Perron, Ian Quah, Siddharth Santurkar, Anthony Tomasic, Skye Toor, Dana Van Aken, Ziqi Wang, Yingjun Wu, Ran Xian, and Tieying Zhang. 2017. Self-Driving Database Management Systems. In CIDR 2017, Conference on Innovative Data Systems Research. http://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf
[49] Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. 2015. Rethinking SIMD Vectorization for In-Memory Databases. In SIGMOD. 1493–1508.
[50] Orestis Polychroniou and Kenneth A. Ross. 2014. A Comprehensive Study of Main-Memory Partitioning and Its Application to Large-scale Comparison- and Radix-sort. In SIGMOD. 755–766.
[51] Orestis Polychroniou and Kenneth A. Ross. 2014. Vectorized Bloom Filters for Advanced SIMD Processors. In DaMoN. Article 6.
[52] Orestis Polychroniou and Kenneth A. Ross. 2015. Efficient Lightweight Compression Alongside Fast Scans. In DaMoN. Article 9.
[53] Vijayshankar Raman, Gopi Attaluri, Ronald Barber, Naresh Chainani, David Kalmuk, Vincent KulandaiSamy, Jens Leenstra, Sam Lightstone, Shaorong Liu, Guy M. Lohman, Tim Malkemus, Rene Mueller, Ippokratis Pandis, Berni Schiefer, David Sharpe, Richard Sidle, Adam Storm, and Liping Zhang. 2013. DB2 with BLU Acceleration: So Much More Than Just a Column Store. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1080–1091. https://doi.org/10.14778/2536222.2536233
[54] Kenneth A. Ross. 2004. Selection Conditions in Main Memory. ACM Transactions on Database Systems 29, 1 (2004), 132–161.
[55] Bogdan Răducanu, Peter Boncz, and Marcin Zukowski. 2013. Micro Adaptivity in Vectorwise. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (New York, New York, USA) (SIGMOD ’13). ACM, New York, NY, USA, 1231–1242. https://doi.org/10.1145/2463676.2465292
[56] Nadathur Satish et al. 2010. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In SIGMOD. 351–362.
[57] Filippo Schiavio, Daniele Bonetta, and Walter Binder. 2020. Dynamic speculative optimizations for SQL compilation in Apache Spark. Proceedings of the VLDB Endowment 13, 5 (2020), 754–767.
[58] Stefan Schuh, Xiao Chen, and Jens Dittrich. 2016. An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory. In SIGMOD. 1961–1976.
[59] Doug Simon. [n.d.]. libgraal: GraalVM compiler as a precompiled GraalVM native image. https://medium.com/graalvm/libgraal-graalvm-compiler-as-a-precompiled-graalvm-native-image-26e354bee5c.
[60] Michael Stonebraker, Paul Brown, Alex Poliakov, and Suchi Raman. 2011. The Architecture of SciDB. In Proceedings of the 23rd International Conference on Scientific and Statistical Database Management (Portland, OR) (SSDBM’11). Springer-Verlag, Berlin, Heidelberg, 1–16. http://dl.acm.org/citation.cfm?id=2032397.2032399
[61] Tableau Inc. 2019. Tableau. https://www.tableau.com.
[62] Sandeep Tata. 2007. Declarative Querying for Biological Sequences. Ph.D. Dissertation. Ann Arbor, MI, USA. Advisor(s) Patel, Jignesh M. AAI3276308.
[63] Thomas Willhalm et al. 2009. SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units. PVLDB 2, 1 (Aug. 2009), 385–394.
[64] Christian Wimmer and Thomas Würthinger. 2012. Truffle: a self-optimizing runtime system. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity. 13–14.
[65] Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. 2013. One VM to rule them all. In Proceedings of the 2013 ACM international symposium on New ideas, new paradigms, and reflections on programming & software. 187–204.
[66] Steffen Zeuch, Holger Pirk, and Johann-Christoph Freytag. 2016. Non-invasive Progressive Optimization for In-memory Databases. Proc. VLDB Endow. 9, 14 (Oct. 2016), 1659–1670. https://doi.org/10.14778/3007328.3007332
[67] Wangda Zhang and Kenneth A. Ross. 2020. Exploiting data skew for improved query performance. IEEE Transactions on Knowledge and Data Engineering (2020).
[68] Wangda Zhang and Kenneth A. Ross. 2020. Permutation Index: Exploiting Data Skew for Improved Query Performance. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1982–1985.
[69] Jingren Zhou and Kenneth A. Ross. 2002. Implementing Database Operations Using SIMD Instructions. In Proceedings of SIGMOD Conference.
[70] M. Zukowski, M. van de Wiel, and P. Boncz. 2012. Vectorwise: A Vectorized Analytical DBMS. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on. 1349–1350. https://doi.org/10.1109/ICDE.2012.148