LLVM Essentials - Sample Chapter
Building LLVM IR
LLVM Essentials
Get familiar with the LLVM infrastructure and start using LLVM
libraries to design a compiler
Suyog Sarda
Mayur Pandey
Preface
LLVM is one of the hottest topics in compiler development today. It is an open source
project with an ever-increasing number of contributors. Every programmer comes across
a compiler at some point or another while programming. Simply speaking, a compiler
converts a high-level language into machine-executable code; however, a lot of complex
algorithms are at work under the hood. LLVM is one of the simplest infrastructures with
which to start studying compilers. Written in object-oriented C++, modular in design,
and with concepts that map easily to theory, LLVM proves attractive both for experienced
compiler programmers and for novice students who are willing to learn.
As authors, we maintain that simple solutions frequently work better and are easier
to grasp than complex ones. Throughout the book we will look at various topics
that will help you enhance your skills and drive you to learn more.
We also believe that this book will be helpful for people not directly involved in
compiler development, as knowledge of compiler internals helps them write
code optimally.
The code generator, like the optimizer, also makes use of this modular design,
splitting code generation into individual passes, namely instruction selection,
register allocation, scheduling, code layout optimization, and assembly emission.
In each of the phases mentioned above, some things are common for almost every
target, such as an algorithm for assigning the available physical registers to
virtual registers, even though the set of registers varies from target to target. So,
the compiler writer can modify each of the passes mentioned above and create
custom target-specific passes. The tablegen tool helps in achieving this
using target description (.td) files for specific architectures. We will discuss how this
happens later in the book.
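To give an early flavor of what such a target description looks like, the following is a tiny, hypothetical .td fragment (the target name MyTarget and the registers R0 and R1 are invented purely for illustration; a real backend defines many more properties):
// An illustrative fragment of a register description file
include "llvm/Target/Target.td"
// Two general-purpose registers of the imaginary target
def R0 : Register<"r0">;
def R1 : Register<"r1">;
// A register class holding 32-bit integer values in those registers
def GPR : RegisterClass<"MyTarget", [i32], 32, (add R0, R1)>;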
Another capability that arises out of this design is the ability to easily pinpoint a bug to a
particular pass in the optimizer. A tool named Bugpoint makes use of this capability
to automatically reduce the test case and pinpoint the pass that is causing the bug.
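For instance, if you suspect that a particular combination of passes crashes on a bitcode file, an invocation along the following lines (the file name and passes here are only illustrative; passes are specified just as they are for opt) lets Bugpoint narrow down both the test case and the offending pass:
$ bugpoint test.bc -instcombine -licm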
An in-memory compiler IR
Now let's take an example to see what this LLVM IR looks like. We will take a small
piece of C code, convert it into LLVM IR using Clang, and try to understand the details
of the LLVM IR by mapping it back to the source language.
$ cat add.c
int globvar = 12;
int add(int a) {
  return globvar + a;
}
Use the clang frontend with the following options to convert it to LLVM IR:
$ clang -emit-llvm -c -S add.c
$ cat add.ll
; ModuleID = 'add.c'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
@globvar = global i32 12, align 4
; Function Attrs: nounwind uwtable
define i32 @add(i32 %a) #0 {
%1 = alloca i32, align 4
store i32 %a, i32* %1, align 4
%2 = load i32, i32* @globvar, align 4
%3 = load i32, i32* %1, align 4
%4 = add nsw i32 %2, %3
ret i32 %4
}
attributes #0 = { nounwind uwtable "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "unsafe-fp-math"="false" "use-soft-float"="false" }
!llvm.ident = !{!0}
Now let's look at the generated IR and see what it is all about. You can see the very
first line giving the ModuleID, which says that this defines the LLVM module for the
add.c file. An LLVM module is a top-level data structure that holds the entire contents
of the input LLVM file. It consists of functions, global variables, external function
prototypes, and symbol table entries.
The following lines show the target data layout and target triple, from which we can
tell that the target is an x86_64 processor running Linux. The datalayout string tells us
the endianness of the machine ('e' meaning little endian) and the name mangling
(m:e denotes ELF-style mangling). The individual specifications are separated by '-', and
each spec gives information about a type and its size or alignment. For
example, i64:64 says that a 64-bit integer has 64-bit alignment.
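Breaking the preceding datalayout string into its individual specifications gives the following (these meanings come from the LLVM language reference):
e                little-endian byte order
m:e              ELF-style name mangling
i64:64           64-bit integers have 64-bit alignment
f80:128          the 80-bit x87 floating-point type is aligned to 128 bits
n8:16:32:64      the native integer widths of the target are 8, 16, 32, and 64 bits
S128             the natural stack alignment is 128 bits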
Then we have the global variable globvar. In LLVM IR, all globals start with '@' and
all local variables start with '%'. There are two main reasons why variables are
prefixed with these symbols. The first is that the compiler won't have to worry
about a name clash with reserved words; the second is that the compiler can quickly
come up with a temporary name without worrying about conflicts in the symbol
table. This second property is useful for representing the IR in
static single assignment (SSA) form, where each variable is assigned exactly once
and every use of a variable is preceded by its definition. So, while converting
a normal program to SSA form, we create a new temporary name for every
redefinition of a variable and limit the live range of the earlier definition up to this redefinition.
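As a minimal sketch (this snippet is not part of the generated add.ll), a C statement such as x = x + 1 does not overwrite the old value in SSA form; instead, the redefinition gets a fresh name:
; %x.0 holds the value of x before the statement
%x.1 = add i32 %x.0, 1    ; the redefinition of x becomes a new name, %x.1
; every later use of x now refers to %x.1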
LLVM views global variables as pointers, so an explicit dereference of the global
variable using a load instruction is required. Similarly, to store a value into it, an explicit
store instruction is required.
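For example, reading @globvar and then writing a new value into it would look like the following (the constant 42 is arbitrary and used only for illustration):
%val = load i32, i32* @globvar, align 4
store i32 42, i32* @globvar, align 4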
Local variables fall into two categories:
Register allocated local variables: These are the temporaries and allocated
virtual registers. The virtual registers are assigned physical registers during
the code generation phase, which we will see in a later chapter of the book.
They are created by using a new symbol for the variable, like:
%1 = some value
Stack allocated local variables: These are created by allocating space on the stack
frame of the currently executing function using the alloca instruction (for example,
%1 = alloca i32, align 4 in the preceding output). They are accessed with explicit
load and store instructions.
Now let's see how the add function is represented in LLVM IR. define i32
@add(i32 %a) is very similar to how a function is declared in C. It specifies that the
function returns the integer type i32 and takes one integer argument. Also, the function
name is preceded by '@', meaning it has global visibility.
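To see how other signatures map onto this syntax, here is a small hypothetical function (not part of add.ll) that takes two pointer arguments and returns nothing:
; copy one i32 from %src to %dst and return nothing
define void @copy(i32* %dst, i32* %src) {
  %v = load i32, i32* %src, align 4
  store i32 %v, i32* %dst, align 4
  ret void
}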
Within the function is the actual processing for the functionality. An important thing to
note here is that LLVM uses three-address instructions, that is, data processing
instructions that have two source operands and place the result in a separate
destination operand (%4 = add nsw i32 %2, %3). Also, the code is in SSA form, that is,
each value in the IR has a single assignment which defines it. This is useful
for a number of optimizations.
The attributes string that follows in the generated IR specifies the function attributes,
which are very similar to C++ attributes. These attributes apply to the function that
has been defined; for each function defined there is a corresponding set of attributes
in the LLVM IR.
The code that follows the attributes is for the ident directive that identifies the
module and compiler version.
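On a typical setup, the metadata node referenced by !llvm.ident is simply the compiler version string; it looks something like the following (the exact version string depends on your Clang installation):
!0 = !{!"clang version 3.7.0"}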
llvm-as: This is the LLVM assembler. It takes LLVM IR in assembly form
(human readable) and converts it into the bitcode format. Use the preceding add.ll
as an example to convert it into bitcode. To know more about the LLVM
bitcode file format, refer to http://llvm.org/docs/BitCodeFormat.html
$ llvm-as add.ll -o add.bc
To view the content of this bitcode file, a tool such as hexdump can be used.
$ hexdump -c add.bc
llvm-dis: This is the LLVM disassembler. It takes a bitcode file as input and
outputs the LLVM assembly.
$ llvm-dis add.bc -o add.ll
If you check this add.ll and compare it with the earlier version, you will find
that it is the same.
llvm-link: llvm-link links two or more LLVM bitcode files and outputs one
LLVM bitcode file. To see a demo, write a main.c file that calls the function
in the add.c file.
$ cat main.c
#include<stdio.h>
extern int add(int);
int main() {
  int a = add(2);
  printf("%d\n", a);
  return 0;
}
Convert the C source code to the LLVM bitcode format using the following
command:
$ clang -emit-llvm -c main.c
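This produces main.bc. Assuming add.bc is the bitcode file generated earlier with llvm-as, the two files can now be linked into a single bitcode file; this is the output.bc used in the examples that follow:
$ llvm-link main.bc add.bc -o output.bc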
lli: lli directly executes programs in LLVM bitcode format using a just-in-time
compiler or an interpreter, if one is available for the current architecture. lli is not
a virtual machine; it cannot execute IR for a different architecture and can only
JIT-compile or interpret for the host architecture. Use the bitcode file generated by
llvm-link as input to lli. It will display the output on the standard output.
$ lli output.bc
14
llc: llc is the static compiler. It compiles LLVM input (in assembly form or
bitcode form) into assembly language for a specified architecture. In the
following example, it takes the output.bc file generated by llvm-link and
generates the assembly file output.s:
$ llc output.bc -o output.s
Let's look at the content of the output.s assembly, specifically the two
functions in the generated code, which is very similar to what a native
compiler would have generated.
Function main:
.type main,@function
main:                                   # @main
.cfi_startproc
# BB#0:
Function add:
add:                                    # @add
.cfi_startproc
# BB#0:
pushq %rbp
.Ltmp3:
.cfi_def_cfa_offset 16
.Ltmp4:
.cfi_offset %rbp, -16
movq %rsp, %rbp
.Ltmp5:
.cfi_def_cfa_register %rbp
movl %edi, -4(%rbp)
addl globvar(%rip), %edi
movl %edi, %eax
popq %rbp
retq
.Lfunc_end1:
opt: This is the modular LLVM analyzer and optimizer. It takes an input file and
runs the optimizations or analyses specified on the command line. Whether it
runs the analyzer or the optimizer depends on the command-line options:
opt [options] [input file name]
When the -analyze option is not passed, the opt tool does the actual
optimization work and tries to optimize the code depending upon the
command-line options passed. As in the preceding case, you can use
some of the optimization passes already present or write your own pass for
optimization. Some useful optimization passes can be specified as command-line
arguments, as shown in the example below.
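The original list of passes is not reproduced here, but as an illustration, a few commonly used ones are -mem2reg (promotes stack slots such as the alloca in add.ll to SSA registers), -instcombine (combines redundant instructions), and -dce (dead code elimination). For example, the following command runs mem2reg on add.ll and writes the transformed IR in human-readable form to a file of your choosing:
$ opt -mem2reg -S add.ll -o add_opt.ll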
Summary
In this chapter, we looked into the modular design of LLVM: how it is used in the
opt tool of LLVM, and how it is applicable across the LLVM core libraries. Then we took
a look at the LLVM intermediate representation, and how various entities (variables,
functions, and so on) of a language are mapped to LLVM IR. In the last section, we
discussed some of the important LLVM tools, and how they can be used to transform
LLVM IR from one form to another.
In the next chapter, we will see how we can write a frontend for a language that can
output LLVM IR using the LLVM machinery.