Quick primer on LLVM IR

(For those already familiar with LLVM IR, feel free to jump to the next
section).

LLVM IR is a low-level intermediate representation used by the LLVM
compiler framework. You can think of LLVM IR as a platform-independent
assembly language with an infinite number of function-local registers.

When developing compilers there are huge benefits to compiling your
source language to an intermediate representation (IR) [1] instead of
compiling directly to a target architecture (e.g. x86). As many
optimization techniques are general (e.g. dead code elimination,
constant propagation), these optimization passes may be performed
directly on the IR level and thus shared between all targets [2].

Compilers are therefore often split into three components, the
front-end, middle-end and back-end; each with a specific task that
takes IR as input and/or produces IR as output.

• Front-end: compiles source language to IR.
• Middle-end: optimizes IR.
• Back-end: compiles IR to machine code.

Example program in LLVM IR assembly

To get a glimpse of what LLVM IR assembly may look like, let’s consider
the following C program.

    int f(int a, int b) {
        return a + 2*b;
    }

    int main() {
        return f(10, 20);
    }

Using Clang [3], the above C code compiles to the following LLVM IR
assembly.

    define i32 @f(i32 %a, i32 %b) {
    ; <label>:0
        %1 = mul i32 2, %b
        %2 = add i32 %a, %1
        ret i32 %2
    }

    define i32 @main() {
    ; <label>:0
        %1 = call i32 @f(i32 10, i32 20)
        ret i32 %1
    }

By looking at the LLVM IR assembly above, we may observe a few
noteworthy details about LLVM IR, namely:

• LLVM IR is statically typed (e.g. 32-bit integer values are denoted
  with the i32 type).
• Local variables are scoped to each function (e.g. %1 in the @main
  function is different from %1 in the @f function).
• Unnamed (temporary) registers are assigned local IDs (e.g. %1, %2)
  from an incrementing counter in each function.
• Each function may use an infinite number of registers (i.e. we are
  not limited to 32 general purpose registers).
• Global identifiers (e.g. @f) and local identifiers (e.g. %a, %1) are
  distinguished by their prefix (@ and %, respectively).
• Most instructions do what you’d think: mul performs multiplication,
  add performs addition, etc.
• Line comments are prefixed with ; as is quite common for assembly
  languages.
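
To make the register-level reading of the listing concrete, here is a
small Go transliteration of the two functions, with one Go variable per
LLVM IR register. The names r1 and r2 are illustrative stand-ins for %1
and %2, and the result is printed rather than returned as an exit code;
this is a sketch of the data flow, not the output of any tool.

```go
package main

import "fmt"

// f mirrors the LLVM IR function @f, using one variable per register.
// Each variable is assigned exactly once, like its register.
func f(a, b int32) int32 {
	r1 := 2 * b  // %1 = mul i32 2, %b
	r2 := a + r1 // %2 = add i32 %a, %1
	return r2    // ret i32 %2
}

func main() {
	// Mirrors @main: %1 = call i32 @f(i32 10, i32 20)
	r1 := f(10, 20)
	fmt.Println(r1) // prints 50
}
```

Note that r1 inside f and r1 inside main are unrelated variables, just
as %1 in @f and %1 in @main are unrelated registers.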

The structure of LLVM IR assembly

The contents of an LLVM IR assembly file denote a module. A module
contains zero or more top-level entities, such as global variables and
functions.

A function declaration contains zero basic blocks and a function
definition contains one or more basic blocks (i.e. the body of the
function).

A more detailed example of an LLVM IR module is given below, including
the global definition @foo and the function definition @f containing
three basic blocks (%entry, %block_1 and %block_2).

    ; Global variable initialized to the 32-bit integer value 21.
    @foo = global i32 21

    ; f returns 42 if the condition cond is true, and 0 otherwise.
    define i32 @f(i1 %cond) {
    ; Entry basic block of function containing zero non-branching instructions and a
    ; conditional branching terminator instruction.
    entry:
        ; The conditional br terminator transfers control flow to block_1 if %cond
        ; is true, and to block_2 otherwise.
        br i1 %cond, label %block_1, label %block_2

    ; Basic block containing two non-branching instructions and a return terminator.
    block_1:
        %tmp = load i32, i32* @foo
        %result = mul i32 %tmp, 2
        ret i32 %result

    ; Basic block with zero non-branching instructions and a return terminator.
    block_2:
        ret i32 0
    }

Basic block

A basic block is a sequence of zero or more non-branching instructions
followed by a branching instruction (referred to as the terminator
instruction). The key idea behind a basic block is that if a single
instruction of the basic block is executed, then all instructions of the
basic block are executed. This notion simplifies control flow analysis.

Instruction

An instruction is a non-branching LLVM IR instruction, usually
performing a computation or accessing memory (e.g. add, load), but
not changing the control flow of the program.

Terminator instruction

A terminator instruction is at the end of each basic block, and
determines where to transfer control flow once the basic block finishes
executing. For instance, ret terminators return control flow back to
the caller function, and br terminators branch control flow either
conditionally or unconditionally.
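
The structure described in this section can be sketched as a handful of
Go types. Everything here (Module, Func, Block, IsDecl, exampleModule)
is hypothetical and chosen for illustration; instructions are kept as
plain strings, and the @abs declaration is an assumption added only to
show a function with zero basic blocks. It is not the API of any real
library.

```go
package main

import "fmt"

// Block is a hypothetical model of a basic block: zero or more
// non-branching instructions followed by exactly one terminator.
type Block struct {
	Name  string
	Insts []string // non-branching instructions
	Term  string   // terminator instruction
}

// Func is a hypothetical model of a function. A declaration has zero
// basic blocks; a definition has one or more.
type Func struct {
	Name   string
	Blocks []*Block
}

// IsDecl reports whether f is a function declaration (no body).
func (f *Func) IsDecl() bool {
	return len(f.Blocks) == 0
}

// Module is a hypothetical model of a module holding top-level entities.
type Module struct {
	Globals []string
	Funcs   []*Func
}

// exampleModule builds, by hand, the module from the listing above, plus
// an assumed declaration of @abs for contrast.
func exampleModule() *Module {
	f := &Func{Name: "@f", Blocks: []*Block{
		{Name: "entry", Term: "br i1 %cond, label %block_1, label %block_2"},
		{Name: "block_1", Insts: []string{
			"%tmp = load i32, i32* @foo",
			"%result = mul i32 %tmp, 2",
		}, Term: "ret i32 %result"},
		{Name: "block_2", Term: "ret i32 0"},
	}}
	abs := &Func{Name: "@abs"} // declaration: no basic blocks
	return &Module{
		Globals: []string{"@foo = global i32 21"},
		Funcs:   []*Func{f, abs},
	}
}

func main() {
	for _, f := range exampleModule().Funcs {
		fmt.Printf("%s: declaration=%t, blocks=%d\n", f.Name, f.IsDecl(), len(f.Blocks))
	}
}
```

Keeping the terminator in its own field, separate from the other
instructions, encodes the rule that every basic block ends with exactly
one terminator.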

Static Single Assignment form

One very important property of LLVM IR is that it is in SSA-form (Static
Single Assignment), which essentially means that each register is
assigned exactly once. This property simplifies data flow analysis.

To handle variables that are assigned more than once in the original
source code, the notion of phi instructions is used in LLVM IR.
A phi instruction essentially returns one value from a set of incoming
values, based on the control flow path taken during execution to reach
the phi instruction. Each incoming value is therefore associated with a
predecessor basic block.

For a concrete example, consider the following LLVM IR function.

    define i32 @f(i32 %a) {
    ; <label>:0
        switch i32 %a, label %default [
            i32 42, label %case1
        ]

    case1:
        %x.1 = mul i32 %a, 2
        br label %ret

    default:
        %x.2 = mul i32 %a, 3
        br label %ret

    ret:
        %x.0 = phi i32 [ %x.2, %default ], [ %x.1, %case1 ]
        ret i32 %x.0
    }

The phi instruction in the above example (phi instructions are
sometimes referred to as phi nodes) essentially models the set of
possible incoming values as distinct assignment statements, exactly one
of which is executed based on the control flow path taken to reach the
basic block of the phi instruction during execution. One way to
illustrate the corresponding data flow is as follows:
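
Sketched in Go, and only as an illustration, the same function could be
written like this. The variables x0, x1 and x2 are stand-ins for %x.0,
%x.1 and %x.2; the two assignments to x0 play the role of the two
incoming values of the phi instruction, and exactly one of them runs on
any given call.

```go
package main

import "fmt"

// f mirrors the LLVM IR function @f above. The variable x0 plays the role
// of the phi instruction: which value it receives depends on which
// predecessor block control flow arrived from.
func f(a int32) int32 {
	var x0 int32
	switch a {
	case 42:
		// case1: %x.1 = mul i32 %a, 2
		x1 := a * 2
		x0 = x1 // incoming value [ %x.1, %case1 ]
	default:
		// default: %x.2 = mul i32 %a, 3
		x2 := a * 3
		x0 = x2 // incoming value [ %x.2, %default ]
	}
	// ret: ret i32 %x.0
	return x0
}

func main() {
	fmt.Println(f(42)) // prints 84
	fmt.Println(f(5))  // prints 15
}
```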

In general, when developing compilers which translate source code into
LLVM IR, all local variables of the source code may be transformed into
SSA-form, with the exception of variables whose address is taken.

To simplify the implementation of LLVM front-ends, one recommendation
is to model local variables in the source language as memory-allocated
variables (using alloca), model assignments to local variables as store
to memory, and uses of local variables as load from memory. The reason
for this is that it may be non-trivial to directly translate a source
language into LLVM IR in SSA-form. As long as the memory accesses
follow certain patterns, we may then rely on the mem2reg LLVM
optimization pass to translate memory-allocated local variables to
registers in SSA-form (using phi nodes where necessary).
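
As a rough sketch of this recommendation, the Go program below emits
LLVM IR assembly text for the C-like statements
"int x; x = 1; x = x + 2; return x;" using only alloca, store and load
for the variable x. The emitter type, its methods, the function name @g
and the %t1, %t2, ... register names are all hypothetical helpers made
up for this illustration. The i32* pointer syntax follows the listings
in this post; newer LLVM releases spell the pointer type as ptr.

```go
package main

import (
	"fmt"
	"strings"
)

// emitter is a hypothetical helper that collects LLVM IR assembly lines
// and hands out fresh temporary register names.
type emitter struct {
	lines []string
	next  int
}

// tmp returns a fresh temporary register name (%t1, %t2, ...).
func (e *emitter) tmp() string {
	e.next++
	return fmt.Sprintf("%%t%d", e.next)
}

// emit appends one indented line of LLVM IR assembly.
func (e *emitter) emit(format string, args ...interface{}) {
	e.lines = append(e.lines, "\t"+fmt.Sprintf(format, args...))
}

// declare models a local variable declaration as a stack allocation.
func (e *emitter) declare(name string) {
	e.emit("%%%s = alloca i32", name)
}

// assign models an assignment to a local variable as a store.
func (e *emitter) assign(name, val string) {
	e.emit("store i32 %s, i32* %%%s", val, name)
}

// use models a use of a local variable as a load into a fresh register.
func (e *emitter) use(name string) string {
	t := e.tmp()
	e.emit("%s = load i32, i32* %%%s", t, name)
	return t
}

// lower emits IR for: int x; x = 1; x = x + 2; return x;
func lower() string {
	e := &emitter{}
	e.declare("x")
	e.assign("x", "1")
	t1 := e.use("x")
	t2 := e.tmp()
	e.emit("%s = add i32 %s, 2", t2, t1)
	e.assign("x", t2)
	t3 := e.use("x")
	e.emit("ret i32 %s", t3)
	return "define i32 @g() {\n" + strings.Join(e.lines, "\n") + "\n}\n"
}

func main() {
	fmt.Print(lower())
}
```

The variable x is assigned twice in the source, yet every register in
the emitted text is still assigned exactly once, because the repeated
assignment goes through memory. That is what lets the front-end skip
SSA construction and leave it to mem2reg.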

LLVM IR library in pure Go


The two main libraries for working with LLVM IR in Go are:

• llvm.org/llvm/bindings/go/llvm: the official LLVM bindings for the
  Go programming language.
• github.com/llir/llvm: a pure Go library for interacting with LLVM
  IR.

The official LLVM bindings for Go use Cgo to provide access to the rich
and powerful API of the LLVM compiler framework, while the llir/llvm
project is entirely written in Go and relies on LLVM IR to interact
with the LLVM compiler framework.

This post focuses on llir/llvm, but should generalize to working with
other libraries as well.

Why write a new library?

The primary motivation for developing a pure Go library for interacting
with LLVM IR was to make it more fun to code compilers and static
analysis tools that rely on and interact with the LLVM compiler
framework. This was in part because the compile time of projects
relying on the official LLVM bindings for Go could be quite substantial.
(Thanks to @aykevl, the author of TinyGo, there are now ways to speed
up the compile time by dynamically linking against a system-installed
version of LLVM [4].)

Another leading motivation was to try and design an idiomatic Go API
from the ground up. The main difference between the API of the LLVM
bindings for Go and llir/llvm is how LLVM values are modelled. In the
LLVM bindings for Go, LLVM values are modelled as a concrete struct
type, which essentially contains every possible method of every
possible LLVM value. My personal experience with using this API is that
it was difficult to know what subsets of methods you were allowed to
invoke for a given value. For instance, to retrieve the Opcode of an
instruction, you’d invoke the InstructionOpcode method, which is quite
intuitive. However, if you happen to invoke the Opcode method instead
(which is used to retrieve the Opcode of constant expressions), you’d
get the runtime error “cast<Ty>() argument of incompatible type!”.

The llir/llvm library was therefore designed to provide compile time
guarantees by further relying on the Go type system. LLVM values
in llir/llvm are modelled as an interface type. This approach only
exposes the minimum set of methods shared by all values, and if you
want to access more specific methods or fields, you’d use a type
switch (as illustrated in the analysis example below).
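
The design idea can be sketched with a few stand-in types. Value,
ConstInt, InstAdd and describe below are hypothetical and deliberately
tiny; they are not the actual llir/llvm types, only an illustration of
an interface that exposes the shared methods plus a type switch for the
rest.

```go
package main

import "fmt"

// Value is a hypothetical interface exposing only what every value shares.
type Value interface {
	Ident() string
}

// ConstInt is a hypothetical integer constant.
type ConstInt struct {
	Val int64
}

// InstAdd is a hypothetical add instruction with two operands.
type InstAdd struct {
	Name string
	X, Y Value
}

// Ident returns the constant as it would appear as an operand.
func (c *ConstInt) Ident() string { return fmt.Sprint(c.Val) }

// Ident returns the local identifier of the instruction result.
func (i *InstAdd) Ident() string { return "%" + i.Name }

// describe uses a type switch to reach the fields specific to each
// concrete type. ConstInt has no operands, so accessing v.Y in the
// *ConstInt case would fail to compile instead of failing at runtime.
func describe(v Value) string {
	switch v := v.(type) {
	case *ConstInt:
		return fmt.Sprintf("constant %d", v.Val)
	case *InstAdd:
		return fmt.Sprintf("%s = add %s, %s", v.Ident(), v.X.Ident(), v.Y.Ident())
	default:
		return "unknown value " + v.Ident()
	}
}

func main() {
	sum := &InstAdd{Name: "sum", X: &ConstInt{Val: 1}, Y: &ConstInt{Val: 2}}
	fmt.Println(describe(sum))   // prints "%sum = add 1, 2"
	fmt.Println(describe(sum.X)) // prints "constant 1"
}
```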

Usage examples

Now, let’s consider a few concrete usage examples. Given that we have
a library to work with, what may we wish to do with LLVM IR?

Firstly, we may want to parse LLVM IR produced by other tools, such as
Clang and the LLVM optimizer opt (see the input example below).

Secondly, we may want to process LLVM IR to perform analysis of our
own (e.g. custom optimization passes) or implement interpreters and
Just-in-Time compilers (see the analysis example below).

Thirdly, we may want to produce LLVM IR to be consumed by other
tools. This is the approach taken when developing a front-end for a
new programming language (see the output example below).

Input example - Parsing LLVM IR

    // This example program parses an LLVM IR assembly file, and prints the parsed
    // module to standard output.
    package main

    import (
        "fmt"

        "github.com/llir/llvm/asm"
    )

    func main() {
        // Parse LLVM IR assembly file.
        m, err := asm.ParseFile("foo.ll")
        if err != nil {
            panic(err)
        }
        // process, interpret or optimize LLVM IR.

        // Print LLVM IR module.
        fmt.Println(m)
    }

Analysis example - Processing LLVM IR

    // This example program analyses an LLVM IR module to produce a callgraph in
    // Graphviz DOT format.
    package main

    import (
        "bytes"
        "fmt"
        "io/ioutil"

        "github.com/llir/llvm/asm"
        "github.com/llir/llvm/ir"
    )

    func main() {
        // Parse LLVM IR assembly file.
        m, err := asm.ParseFile("foo.ll")
        if err != nil {
            panic(err)
        }
        // Produce callgraph of module.
        callgraph := genCallgraph(m)
        // Output callgraph in Graphviz DOT format.
        if err := ioutil.WriteFile("callgraph.dot", callgraph, 0644); err != nil {
            panic(err)
        }
    }

    // genCallgraph returns the callgraph in Graphviz DOT format of the given LLVM IR
    // module.
    func genCallgraph(m *ir.Module) []byte {
        buf := &bytes.Buffer{}
        buf.WriteString("digraph {\n")
        // For each function of the module.
        for _, f := range m.Funcs {
            // Add caller node.
            caller := f.Ident()
            fmt.Fprintf(buf, "\t%q\n", caller)
            // For each basic block of the function.
            for _, block := range f.Blocks {
                // For each non-branching instruction of the basic block.
                for _, inst := range block.Insts {
                    // Type switch on instruction to find call instructions.
                    switch inst := inst.(type) {
                    case *ir.InstCall:
                        callee := inst.Callee.Ident()
                        // Add edges from caller to callee.
                        fmt.Fprintf(buf, "\t%q -> %q\n", caller, callee)
                    }
                }
                // Terminator of basic block.
                switch term := block.Term.(type) {
                case *ir.TermRet:
                    // do something.
                    _ = term
                }
            }
        }
        buf.WriteString("}")
        return buf.Bytes()
    }

Output example - Producing LLVM IR

    // This example produces LLVM IR code equivalent to the following C code, which
    // implements a pseudo-random number generator.
    //
    //    int abs(int x);
    //
    //    int seed = 0;
    //
    //    // ref: https://en.wikipedia.org/wiki/Linear_congruential_generator
    //    //    a = 0x15A4E35
    //    //    c = 1
    //    int rand(void) {
    //       seed = seed*0x15A4E35 + 1;
    //       return abs(seed);
    //    }
    package main

    import (
        "fmt"

        "github.com/llir/llvm/ir"
        "github.com/llir/llvm/ir/constant"
        "github.com/llir/llvm/ir/types"
    )

    func main() {
        // Create convenience types and constants.
        i32 := types.I32
        zero := constant.NewInt(i32, 0)
        a := constant.NewInt(i32, 0x15A4E35) // multiplier of the PRNG.
        c := constant.NewInt(i32, 1)         // increment of the PRNG.

        // Create a new LLVM IR module.
        m := ir.NewModule()

        // Create an external function declaration and append it to the module.
        //
        //    int abs(int x);
        abs := m.NewFunc("abs", i32, ir.NewParam("x", i32))

        // Create a global variable definition and append it to the module.
        //
        //    int seed = 0;
        seed := m.NewGlobalDef("seed", zero)

        // Create a function definition and append it to the module.
        //
        //    int rand(void) { ... }
        rand := m.NewFunc("rand", i32)

        // Create an unnamed entry basic block and append it to the `rand` function.
        entry := rand.NewBlock("")

        // Create instructions and append them to the entry basic block.
        tmp1 := entry.NewLoad(seed)
        tmp2 := entry.NewMul(tmp1, a)
        tmp3 := entry.NewAdd(tmp2, c)
        entry.NewStore(tmp3, seed)
        tmp4 := entry.NewCall(abs, tmp3)
        entry.NewRet(tmp4)

        // Print the LLVM IR assembly of the module.
        fmt.Println(m)
    }
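
For comparison, the generator that the program above encodes as LLVM IR
can also be written directly in Go. This is only a reference sketch for
checking what the emitted instructions are meant to compute: the
function is called next here (a made-up name, to avoid confusion with
rand), and int32 is assumed to match the i32 type used in the IR.

```go
package main

import "fmt"

// seed is the generator state, matching `int seed = 0;` in the C code.
var seed int32

// abs returns the absolute value of x.
func abs(x int32) int32 {
	if x < 0 {
		return -x
	}
	return x
}

// next advances the generator: seed = seed*0x15A4E35 + 1. Go's int32
// arithmetic wraps around on overflow.
func next() int32 {
	seed = seed*0x15A4E35 + 1 // load, mul, add, store
	return abs(seed)          // call @abs, ret
}

func main() {
	fmt.Println(next()) // prints 1
	fmt.Println(next()) // prints 22695478
}
```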

Closing notes
The design and implementation of llir/llvm has been guided by a
community of people who have contributed – not only by writing code –
but through shared discussions, pair-programming sessions, bug
hunting, profiling investigations, and most of all, a curiosity for learning
and taking on exciting challenges.

One particularly challenging part of the llir/llvm project has been to
construct an EBNF grammar for LLVM IR covering the entire LLVM IR
assembly language as of LLVM v7.0. This was challenging, not because
the process itself is difficult, but because there existed no official
grammar covering the entire language. Several community projects
have attempted to define a formal grammar for LLVM IR assembly, but
these have, to the best of our knowledge, only covered subsets of the
language.

The exciting part of having a grammar for LLVM IR is that it enables a
lot of interesting projects. For instance, it can be used to generate
syntactically valid LLVM IR assembly for fuzzing tools and libraries
that consume LLVM IR (the same approach as taken by GoSmith). This
could be used for cross-validation efforts between LLVM projects
implemented in different languages, and could also help tease out
potential security vulnerabilities and bugs in implementations.
