PLDI Week 04: LLVM
Ilya Sergey
[email protected]
ilyasergey.net/CS4212/
Last Week: Directly Translating AST to Assembly
• Key Challenges:
– storing intermediate values needed to compute complex expressions
– some instructions use specific registers (e.g. shift)
One Simple Strategy
• Invariants:
– Compilation of an expression yields its result in %rax
– Argument (Xi) is stored in a dedicated operand register
– Intermediate values are pushed onto the stack
– Stack slot is popped after use (so the space is reclaimed)
• Resulting code is wrapped (e.g., with retq) to comply with cdecl calling conventions
• Alternative strategy: using a stack-machine language as an IR; see compile2 in compile.ml
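• For example, under these invariants e1 + e2 compiles to the following shape (a sketch; %rcx is an arbitrary choice of scratch register):
<code for e1>       # result in %rax
pushq %rax          # push the intermediate value onto the stack
<code for e2>       # result in %rax
popq %rcx           # pop e1's value; the stack slot is reclaimed
addq %rcx, %rax     # e1 + e2 now in %rax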
Intermediate Representations
Why do something else?
• We have seen a simple syntax-directed translation
– Input syntax uniquely determines the output; no complex analysis or code transformation is done.
– It works fine for simple languages.
But…
• The resulting code quality is poor.
• Richer source language features are hard to encode
– Structured data types, objects, first-class functions, etc.
• It’s hard to optimize the resulting assembly code.
– The representation is too concrete – e.g. it has committed to using certain registers and the stack
– Only a fixed number of registers
– Some instructions have restrictions on where the operands are located
• Control-flow is not structured:
– Arbitrary jumps from one code block to another
– Implicit fall-through makes sequences of code non-modular
(i.e. you can’t rearrange sequences of code easily)
• Retargeting the compiler to a new architecture is hard.
– Target assembly code is hard-wired into the translation
Intermediate Representations (IR’s)
• Abstract machine code: hides details of the target architecture
• AST → IR → target code (x86, Arm, Java Bytecode, …)
• Optimization happens at the IR level, independently of source and target
Multiple IR’s
• Goal: get the program closer to machine code without losing the information needed to do analysis and optimizations
• AST → HIR → MIR → Java Bytecode, with optimizations applied at each IR level
What makes a good IR?
• Easy translation target (from the level above)
• Easy to translate (to the level below)
• Narrow interface
– Fewer constructs means simpler phases/optimizations
• Example: Source language might have “while”, “for”, and “foreach” loops
(and maybe more variants)
– IR might have only “while” loops and sequencing
– Translation eliminates “for” and “foreach”
– Here the notation ⟦cmd⟧ denotes the “translation” or “compilation” of the command cmd.
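• For example, a sketch of the elimination of “for” in that notation:
⟦for (init; cond; step) body⟧ = ⟦init⟧ ; while (⟦cond⟧) { ⟦body⟧ ; ⟦step⟧ }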
IR’s at the extreme
• High-level IR’s
– Abstract syntax + new node types not generated by the parser
• e.g. Type checking information or disambiguated syntax nodes
– Typically preserves the high-level language constructs
• Structured control flow, variable names, methods, functions, etc.
• May do some simplification (e.g. convert for to while)
– Allows high-level optimizations based on program structure
• e.g. inlining “small” functions, reuse of constants, etc.
– Useful for semantic analyses like type checking
GHC Compilation Pipeline
A number of Intermediate Languages
• Haskell Source
• Core
• C--
data AltCon
= DataAlt DataCon
| LitAlt Literal
| DEFAULT
How to Get Core
• Try with:
module Mysum where
• More at:
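• One standard way to inspect Core (a GHC flag noted here for convenience, not from the original slides): compile with ghc -O2 -ddump-simpl Mysum.hs, which prints the Core produced by the simplifier.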
IR’s at the extreme
• High-level IR’s: close to the source language, as described above
• Low-level IR’s
– Machine dependent assembly code + extra pseudo-instructions
• e.g. a pseudo-instruction for interfacing with the garbage collector or memory allocator (parts of the language runtime system)
• e.g. (on x86) an imulq instruction that doesn’t restrict register usage
– Source structure of the program is lost:
• Translation to assembly code is straightforward
– Allows low-level optimizations based on target architecture
• e.g. register allocation, instruction selection, memory layout, etc.
• What’s in between?
Mid-level IR’s: Many Varieties
• Many examples:
– Triples: OP a b
• Useful for instruction selection on x86 via “graph tiling” (a way to better utilise registers)
– Quadruples: a = b OP c (RISC-like “three address form”)
– SSA: variant of quadruples where each variable is assigned exactly once
• Easy dataflow analysis for optimization
• e.g. LLVM: industrial-strength IR, based on SSA
– Stack-based:
• Easy to generate
• e.g. Java Bytecode, UCODE
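• For instance, a = (b + c) * d in SSA-style quadruples might look like (a sketch):
%t1 = add %b, %c
%a = mul %t1, %d
Each name is assigned exactly once, so every use can be traced to a unique definition.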
Growing an IR
• Idea: name intermediate values, make order of evaluation explicit.
– No nested operations.
Translation to SLL
• Given this:
Add(Add(Const 1, Var X4), Add(Const 3, Mul(Var X1, Const 5)))
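• Naming each intermediate value in order of evaluation yields (a sketch of the result):
%t1 = add 1, %x4
%t2 = mul %x1, 5
%t3 = add 3, %t2
%t4 = add %t1, %t3      ; the value of the whole expression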
• IR1: Expressions
– simple arithmetic expressions, immutable global variables
• IR2: Commands
– global mutable variables
– commands for update and sequencing
• https://github.com/cs4212/week-03-intermediate-2023
• Definitions: ir3.ml
LLVM
Low-Level Virtual Machine (LLVM)
• Frontends (like 'clang') translate source programs into a typed, SSA-like IR
• Optimisations, transformations, and analyses operate on that IR
• Backends (llc) do code generation, either ahead-of-time or via a JIT
Example LLVM Code
• LLVM offers a textual representation of its IR
– files ending in .ll
• factorial-pretty.ll (a simplified rendering of the IR for factorial64.c):
define @factorial(%n) {
  %1 = alloca
  %acc = alloca
  store %n, %1
  store 1, %acc
  br label %start
start:
  %3 = load %1
  %4 = icmp sgt %3, 0
  br %4, label %then, label %else
then:
  %6 = load %acc
  %7 = load %1
  %8 = mul %6, %7
  store %8, %acc
  %9 = load %1
  %10 = sub %9, 1
  store %10, %1
  br label %start
else:
  %12 = load %acc
  ret %12
}
Real LLVM
factorial.ll
• Decorates values with type information (e.g. i64, i64*)
• Has alignment annotations (padding for some specified number of bytes)
• Keeps track of the entry edges for each block, e.g.: preds = %5, %0

; Function Attrs: nounwind ssp
define i64 @factorial(i64 %n) #0 {
  %1 = alloca i64, align 8
  %acc = alloca i64, align 8
  store i64 %n, i64* %1, align 8
  store i64 1, i64* %acc, align 8
  br label %2

; <label>:2                      ; preds = %5, %0
  %3 = load i64* %1, align 8
  %4 = icmp sgt i64 %3, 0
  br i1 %4, label %5, label %11

; <label>:5                      ; preds = %2
  %6 = load i64* %acc, align 8
  %7 = load i64* %1, align 8
  %8 = mul nsw i64 %6, %7
  store i64 %8, i64* %acc, align 8
  %9 = load i64* %1, align 8
  %10 = sub nsw i64 %9, 1
  store i64 %10, i64* %1, align 8
  br label %2

; <label>:11                     ; preds = %2
  %12 = load i64* %acc, align 8
  ret i64 %12
}
Example Control-flow Graph
define @factorial(%n) {
entry:
  %1 = alloca
  %acc = alloca
  store %n, %1
  store 1, %acc
  br label %start
start:
  %3 = load %1
  %4 = icmp sgt %3, 0
  br %4, label %then, label %else
then:
  %6 = load %acc
  %7 = load %1
  %8 = mul %6, %7
  store %8, %acc
  %9 = load %1
  %10 = sub %9, 1
  store %10, %1
  br label %start
else:
  %12 = load %acc
  ret %12
}
LL Basic Blocks and Control-Flow Graphs
type block = {
  insns : (uid * insn) list;   (* the instructions of the block, in order *)
  term  : (uid * terminator)   (* every block ends in exactly one terminator *)
}
• A control flow graph is represented as a list of labeled basic blocks with these invariants:
– No two blocks have the same label
– All terminators mention only labels that are defined among the set of basic blocks
– There is a distinguished, unlabelled, entry block.
• Local variables:
– Defined by the instructions of the form %uid = …
– Must satisfy the static single assignment (SSA) invariant
• Each %uid appears on the left-hand side of an assignment only once in the entire control flow graph.
– The value of a %uid remains unchanged throughout its lifetime
– Analogous to “let %uid = e in …” in OCaml
• Full SSA allows richer use of local variables by taking control flow into account:
– phi functions (https://en.wikipedia.org/wiki/Static_single-assignment_form)
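• For example, a minimal sketch (names invented for illustration): a block reached from %then and %else can merge two incoming values with a phi:
merge:
  %x = phi i64 [ %a, %then ], [ %b, %else ]   ; %x is %a if control came from %then, %b if from %else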
LL Storage Model: alloca
• The alloca instruction allocates stack space and returns a reference to it.
– The returned reference is stored in a local:
%ptr = alloca typ
– The amount of space allocated is determined by the type
• The contents of the slot are accessed via the load and store instructions:
[Diagram: a pointer p referring to a record with fields x and y]
• Compiler needs to know the size of the struct at compile time to allocate the needed storage space.
• Compiler needs to know the shape of the struct at compile time to index into the structure.
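• For example, a sketch that allocates a slot, writes it, and reads it back (in the typed style of factorial.ll):
%ptr = alloca i64          ; one 8-byte slot; the size comes from the type
store i64 42, i64* %ptr    ; write into the slot
%v = load i64* %ptr        ; read it back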
Assembly-level Member Access
[Layout diagrams: square occupies consecutive slots ll.x, ll.y, lr.x, lr.y, ul.x, ul.y, ur.x, ur.y; a struct with fields x, a, b, y is shown in several layouts, with padding inserted to satisfy alignment.]
Copy-in/Copy-out
When we do a struct assignment in C, as in (a representative example) r1 = r2;, then we copy all of the elements out of the source and put them in the target, the same as doing a sequence of word-level operations.
• For really large copies, the compiler uses something like memcpy
(which is implemented using a loop in assembly).
C Procedure Calls
• Similarly, when we call a procedure, we copy arguments in, and copy results out.
– Caller sets aside extra space in its frame to store results that are bigger than will fit in %rax.
– We do the same with scalar values such as integers or doubles.
• Benefit: locality
• Problem: expensive for large records…
• Languages like Java and OCaml always pass non-word-sized objects by reference.
Call-by-Reference
void mkSquare(struct Point *ll, int elen,
              struct Rect *res) {
  res->lr = res->ul = res->ur = res->ll = *ll;
  res->lr.x += elen;
  res->ur.x += elen;
  res->ur.y += elen;
  res->ul.y += elen;
}

void foo() {
  struct Point origin = {0, 0};
  struct Rect unit_sq;   /* a Rect, matching mkSquare's result parameter */
  mkSquare(&origin, 1, &unit_sq);
}
• The caller passes in the address of the point and the address of the result (1 word each).
• Note that returning references to stack-allocated data can cause problems.
– This space might be reclaimed when foo() is done
– Need to allocate storage in the heap…
Arrays
void foo() {
  char buf[27];
  ...
}
• [Diagram: arr points to an array whose length is stored with its elements: Size=7, A[0], A[1], A[2], A[3], A[4], A[5], A[6]]
• Other possibilities:
– Pascal: only permit statically known array sizes (very unwieldy in practice)
– What about multi-dimensional arrays?
Array Bounds Checks (Implementation)
• Example: Assume %rax holds the base pointer (arr) and %ecx holds the array index i.
To read a value from the array arr[i]:
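• A sketch of the emitted check, written in the LLVM-like IR used in these slides (assuming the layout above, where the length is stored at the start of the array; the x86 version using %rax and %ecx follows the same pattern):
%len = load i64* %arr              ; length stored at the start of the array
%ok = icmp ult i64 %i, %len        ; unsigned < also rejects negative indices
br i1 %ok, label %inbounds, label %bounds_error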
• C-style strings: char *p = "foo"; p[0] = 'b'; (writing through p mutates a string literal, which is undefined behaviour in C)
• In C: enum Day {sun, mon, tue, wed, thu, fri, sat} today;
• In OCaml: type day = Sun | Mon | Tue | Wed | Thu | Fri | Sat
• OCaml datatypes can also carry data: type foo = Bar of int | Baz of int * foo
switch (e) {
case sun: s1; break;
case mon: s2; break;
…
case sat: s3; break;
}
• Each $tag1…$tagN is just a constant int tag value.
• Note: ⟦break;⟧ (within the switch branches) is: br %merge
l1: %cmp1 = icmp eq %tag, $tag1
    br %cmp1 label %b1, label %l2
b1: ⟦s1⟧
    br label %l2
l2: %cmp2 = icmp eq %tag, $tag2
    br %cmp2 label %b2, label %l3
b2: ⟦s2⟧
    br label %l3
…
lN: %cmpN = icmp eq %tag, $tagN
    br %cmpN label %bN, label %merge
bN: ⟦sN⟧
    br label %merge
merge:
Alternatives for Switch Compilation
• Nested if-then-else works OK in practice if # of branches is small
– (e.g. < 16 or so).
• For more branches, use better data structures to organise the jumps:
– Create a table of pairs (v1, branch_label) and loop through
– Or, do binary search rather than linear search
– Or, use a hash table rather than binary search
• Compilation strategy:
– “Flatten” nested patterns into matches against one constructor at a time.
– Compile the match against the tags of the datatype as for C-style switches.
– Code for each branch additionally must copy data from ⟦e⟧ to the variables bound in the patterns.
match e with
| Bar(z) -> e1
| Baz(y, tmp) ->
  (match tmp with
   | Bar(w) -> e2
   | Baz(_, _) -> e3)
• There are many opportunities for optimisations, many papers about “pattern-match compilation”
– Many of these transformations can be done at the AST level
Datatypes in LLVM IR
Structured Data in LLVM
• LLVM’s IR uses types to describe the structure of data.
t ::=
void
i1 | i8 | i64 N-bit integers
[<#elts> x t] arrays
fty function types
{t1, t2, … , tn} structures
t* pointers
%Tident named (identified) type
insn ::= …
| getelementptr t* %val, t1 idx1, t2 idx2, …
• The first index, i32 0, “steps through” the pointer to, e.g., %square, with offset 0.
• To index into a deeply nested structure, one has to “follow the pointer” by
loading from the computed pointer
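• For example (hypothetical type definitions, for illustration only):
%Point = type { i64, i64 }
%Rect = type { %Point, %Point }
%p = getelementptr %Rect* %r, i32 0, i32 1, i32 0   ; i64*: field 0 of the second %Point
%x = load i64* %p                                   ; read that field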
Compiling Data Structures via LLVM
define @foo() {
%1 = alloca %rect3 ; allocate a three-field record
%2 = bitcast %rect3* %1 to %rect2* ; safe cast
%3 = getelementptr %rect2* %2, i32 0, i32 1 ; allowed
…
}
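• The bitcast above is safe when %rect2’s fields are a prefix of %rect3’s, e.g. (hypothetical definitions, not from the original slides):
%rect2 = type { i64, i64 }
%rect3 = type { i64, i64, i64 }
• Under that assumption, the first two fields of a %rect3 have exactly the layout a %rect2 expects, so the getelementptr at index 1 is allowed.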
Demo: Compiling to LLVM
• Clone https://github.com/cs4212/week-04-llvm-demo
• LLVMLite Specification
• Overview of HW3
• Lexical Analysis