# BOLT

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing the application's code layout based on
an execution profile gathered by a sampling profiler, such as the Linux `perf` tool.
An overview of the ideas implemented in BOLT, along with a discussion of its
potential and current results, is available in the
[CGO'19 paper](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

## Input Binary Requirements

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the binaries
should have an unstripped symbol table, and, to get maximum performance gains,
they should be linked with relocations (`--emit-relocs` or `-q` linker flag).

BOLT disassembles functions and reconstructs the control flow graph (CFG)
before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain heuristics
to accomplish it. These heuristics have been tested on code generated with
the Clang and GCC compilers. The main requirement for C/C++ code is not to rely
on code layout properties, such as function pointer deltas.
Assembly code can be processed too. Requirements for it include a clear
separation of code and data, with data objects being placed into data
sections/segments. If indirect jumps are used for intra-function control
transfer (e.g., jump tables), the code patterns should match those
generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the `-freorder-blocks-and-partition`
compiler option. Since GCC8 enables this option by default, you have to
explicitly disable it by adding the `-fno-reorder-blocks-and-partition` flag if
you are compiling with GCC8 or above.
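
As a rough illustration of the note above, the helper below prints extra build flags for a given compiler driver. The function name `bolt_friendly_flags` and the name-based GCC detection are assumptions made for this sketch and are not part of BOLT; `-Wl,--emit-relocs` simply passes the linker flag mentioned earlier through the compiler driver.

```shell
# Hypothetical helper (not part of BOLT): print extra flags for a BOLT-friendly build.
# Assumption: the driver name tells us whether the compiler is GCC.
bolt_friendly_flags() {
  case "$1" in
    *gcc*|*g++*)
      # GCC8+ enables -freorder-blocks-and-partition by default, so turn it off.
      printf '%s ' "-fno-reorder-blocks-and-partition"
      ;;
  esac
  # Ask the linker to keep relocations so BOLT can also reorder functions.
  printf '%s\n' "-Wl,--emit-relocs"
}

# Usage sketch (main.c and myapp are placeholders):
#   gcc -O2 $(bolt_friendly_flags gcc) main.c -o myapp
```

A driver invoked as plain `cc` would not be recognized by this name check, so treat the detection as a starting point only.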

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.

## Installation

### Docker Image

You can build and use the docker image containing BOLT using our [docker file](./bolt/utils/docker/Dockerfile).
Alternatively, you can build BOLT manually using the steps below.

### Manual Build

BOLT heavily uses LLVM libraries, and by design, it is built as one of LLVM
tools. The build process is not much different from a regular LLVM build.
The following instructions assume that you are running under Linux.

Start with cloning LLVM and BOLT repos:

```
> git clone https://github.com/facebookincubator/BOLT llvm-bolt
> mkdir build
> cd build
> cmake -G Ninja ../llvm-bolt/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="clang;lld;bolt"
> ninja
```

`llvm-bolt` will be available under `bin/`. Add this directory to your path to
ensure the rest of the commands in this tutorial work.

Note that we use a specific snapshot of LLVM monorepo as we currently
rely on a set of patches that are not yet upstreamed.

## Optimizing BOLT's Performance

BOLT runs many internal passes in parallel. If you foresee heavy usage of
BOLT, you can improve the processing time by using one of the memory
allocation libraries with good support for concurrency. E.g., to use jemalloc:

```
> sudo yum install jemalloc-devel
> LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....
```
Or if you would rather use tcmalloc:
```
> sudo yum install gperftools-devel
> LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....
```

## Usage

For a complete practical guide to using BOLT, see [Optimizing Clang with BOLT](./bolt/docs/OptimizingClang.md).

### Step 0

In order to allow BOLT to re-arrange functions (in addition to re-arranging
code within functions) in your program, it needs a little help from the linker.
Add `--emit-relocs` to the final link step of your application. You can verify
the presence of relocations by checking for the `.rela.text` section in the binary.
BOLT will also report if it detects relocations while processing the binary.
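
One way to run that check from a script is sketched below. It assumes binutils `readelf` is installed; the helper name `has_text_relocs` and the binary name `./myapp` are placeholders chosen for this example.

```shell
# Hypothetical helper: reads `readelf -S -W <binary>` output on stdin and
# succeeds if a section named exactly .rela.text is listed.
has_text_relocs() {
  grep -q '[[:space:]]\.rela\.text[[:space:]]'
}

# Usage sketch (./myapp is a placeholder for your linked binary):
#   readelf -S -W ./myapp | has_text_relocs && echo "relocations present"
```

Reading from stdin keeps the helper independent of how the section listing is produced, so it can be tried on saved `readelf` output as well.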

### Step 1: Collect Profile

This step is different for different kinds of executables. If you can invoke
your program to run on a representative input from a command line, then check
the **For Applications** section below. If your program typically runs as a
server/service, then skip to the **For Services** section.

The version of the `perf` command used for the following steps has to support
the `-F brstack` option. We recommend using `perf` version 4.5 or later.

#### For Applications

This assumes you can run your program from a command line with a typical input.
In this case, simply prepend the command line invocation with `perf`:
```
$ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
```

#### For Services

Once you get the service deployed and warmed-up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```

Depending on the application, you may need more samples to be included with
your profile. It's hard to tell upfront what would be a sweet spot for your
application. We recommend the profile to cover 1B instructions as reported
by the BOLT `-dyno-stats` option. If you need to increase the number of samples
in the profile, you can either run the `sleep` command for longer or use the
`-F<N>` option with `perf` to increase the sampling frequency.
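
Both knobs can be combined in one command. The helper below only prints the command so it can be reviewed before running; the function name `perf_record_cmd` and the example values (600 seconds, 2000 samples per second) are illustrative assumptions, not recommendations from the BOLT authors.

```shell
# Hypothetical helper: print a system-wide `perf record` command line.
#   $1 = duration in seconds, $2 = optional sampling frequency (samples/sec)
perf_record_cmd() {
  secs=$1
  freq=${2:-}
  cmd="perf record -e cycles:u -j any,u -a"
  # -F sets the sampling frequency; leave it out to keep perf's default.
  if [ -n "$freq" ]; then
    cmd="$cmd -F $freq"
  fi
  printf '%s\n' "$cmd -o perf.data -- sleep $secs"
}

# Usage sketch: print the command for a 10-minute capture at 2000 samples/sec.
#   perf_record_cmd 600 2000
```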

Note that for profile collection we recommend using cycle events and not
`BR_INST_RETIRED.*`. Empirically we found it to produce better results.

If the collection of a profile with branches is not available, e.g., when you run on
a VM or on hardware that does not support it, then you can use only sample
events, such as cycles. In this case, the quality of the profile information
would not be as good, and performance gains with BOLT are expected to be lower.

#### With instrumentation

If `perf record` is not available to you, you may collect a profile by first
instrumenting the binary with BOLT and then running it.
```
llvm-bolt <executable> -instrument -o <instrumented-executable>
```

After you run instrumented-executable with the desired workload, its BOLT
profile should be ready for you in `/tmp/prof.fdata` and you can skip
**Step 2**.
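
Before moving on to **Step 3**, a script can confirm that the instrumented run actually left a profile behind. This is a minimal sketch; the helper name `profile_ready` is an assumption, and the default path is the `/tmp/prof.fdata` location mentioned above.

```shell
# Hypothetical helper: succeed only if the profile file exists and is non-empty.
# Defaults to /tmp/prof.fdata when no path is given.
profile_ready() {
  [ -s "${1:-/tmp/prof.fdata}" ]
}

# Usage sketch:
#   ./<instrumented-executable> <args> ...
#   profile_ready || echo "no profile found at /tmp/prof.fdata" >&2
```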

Run BOLT with the `-help` option and check the category "BOLT instrumentation
options" for a quick reference on instrumentation knobs.

### Step 2: Convert Profile to BOLT Format

NOTE: you can skip this step and feed `perf.data` directly to BOLT using
the experimental `-p perf.data` option.

For this step, you will need the `perf.data` file collected in the previous step and
a copy of the binary that was running. The binary has to be either
unstripped, or should have a symbol table intact (i.e., running `strip -g` is
okay).

Make sure `perf` is in your `PATH`, and execute `perf2bolt`:
```
$ perf2bolt -p perf.data -o perf.fdata <executable>
```

This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add the `-nl` flag to
the command line above.
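
The two variants of the conversion command can be captured in a small wrapper. This is only a sketch: the function name `perf2bolt_cmd` and its `lbr`/`nolbr` mode argument are invented for illustration, and the helper prints the command instead of running it.

```shell
# Hypothetical helper: print the perf2bolt command for a given binary.
#   $1 = executable, $2 = "lbr" (default) or "nolbr"
perf2bolt_cmd() {
  exe=$1
  mode=${2:-lbr}
  flags="-p perf.data -o perf.fdata"
  if [ "$mode" = "nolbr" ]; then
    flags="$flags -nl"   # profile holds plain samples only, no branch records
  fi
  printf '%s\n' "perf2bolt $flags $exe"
}

# Usage sketch (./myapp is a placeholder):
#   perf2bolt_cmd ./myapp nolbr
```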

### Step 3: Optimize with BOLT

Once you have `perf.fdata` ready, you can use it for optimizations with
BOLT. Assuming your environment is set up to include the right path, execute
`llvm-bolt`:
```
$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats
```

If you need updated debug info, then add the `-update-debug-sections` option
to the command above. The processing time will be slightly longer.

For a full list of options see `-help`/`-help-hidden` output.

The input binary for this step does not have to 100% match the binary used for
profile collection in **Step 1**. This could happen when you are doing active
development, and the source code constantly changes, yet you want to benefit
from profile-guided optimizations. However, since the binary is not precisely the
same, the profile information could become invalid or stale, and BOLT will
report the number of functions with a stale profile. The higher the
number, the less performance improvement should be expected. Thus, it is
crucial to update `.fdata` for release branches.

## Multiple Profiles

Suppose your application can run in different modes, and you can generate
multiple profiles for each one of them. To generate a single binary that can
benefit all modes (assuming the profiles don't contradict each other) you can
use the `merge-fdata` tool:
```
$ merge-fdata *.fdata > combined.fdata
```
Use `combined.fdata` for **Step 3** above to generate a universally optimized
binary.
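
Because an unmatched `*.fdata` glob is passed to the command literally by the shell, a script may want to confirm that profiles exist before merging. A minimal sketch, where the helper name `collect_profiles` and the `profiles/` directory are assumptions made for illustration:

```shell
# Hypothetical helper: print the .fdata files in a directory, one per line.
# Returns non-zero if the directory holds no .fdata files.
collect_profiles() {
  dir=$1
  found=1
  for f in "$dir"/*.fdata; do
    [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
    printf '%s\n' "$f"
    found=0
  done
  return "$found"
}

# Usage sketch (profiles/ is a placeholder directory):
#   collect_profiles profiles >/dev/null && merge-fdata profiles/*.fdata > combined.fdata
```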

## License

BOLT is licensed under the [Apache License v2.0 with LLVM Exceptions](./LICENSE.TXT).