# BOLT

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing the application's code layout based on
an execution profile gathered by a sampling profiler, such as the Linux `perf` tool.
An overview of the ideas implemented in BOLT, along with a discussion of its
potential and current results, is available in the
[CGO'19 paper](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

## Input Binary Requirements

BOLT operates on X86-64 and AArch64 ELF binaries. At the minimum, the binaries
should have an unstripped symbol table, and, to get maximum performance gains,
they should be linked with relocations (`--emit-relocs` or `-q` linker flag).

BOLT disassembles functions and reconstructs the control flow graph (CFG)
before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain heuristics
to accomplish it. These heuristics have been tested on code generated with
the Clang and GCC compilers. The main requirement for C/C++ code is not to rely
on code layout properties, such as function pointer deltas.
Assembly code can be processed too. Requirements for it include a clear
separation of code and data, with data objects being placed into data
sections/segments. If indirect jumps are used for intra-function control
transfer (e.g., jump tables), the code patterns should match those
generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the `-freorder-blocks-and-partition`
compiler option. Since GCC8 enables this option by default, you have to
explicitly disable it by adding the `-fno-reorder-blocks-and-partition` flag if
you are compiling with GCC8.

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.

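As a rough sketch of how the two build-time requirements above might be combined, the snippet below assembles a GCC command line that disables block partitioning and asks the linker to keep relocations. The file names `app.c` and `app` and the `-O2` level are placeholders, not anything BOLT prescribes; `-Wl,--emit-relocs` is simply the usual way to pass a linker flag through the compiler driver.

```shell
# Hypothetical source and output names, for illustration only.
SRC=app.c
OUT=app

# Keep GCC from partitioning hot and cold blocks (see the note above).
# The -O2 level is an assumption, not a BOLT requirement.
CFLAGS="-O2 -fno-reorder-blocks-and-partition"
# Ask the linker to keep relocations in the final binary.
LDFLAGS="-Wl,--emit-relocs"

BUILD_CMD="gcc $CFLAGS $SRC -o $OUT $LDFLAGS"
# Print the command for review before running it.
echo "$BUILD_CMD"
```
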
## Installation

### Docker Image

You can build and use the docker image containing BOLT using our [docker file](./utils/docker/Dockerfile).
Alternatively, you can build BOLT manually using the steps below.

### Manual Build

BOLT heavily uses LLVM libraries, and by design, it is built as one of the LLVM
tools. The build process is not much different from a regular LLVM build.
The following instructions assume that you are running under Linux.

Start by cloning the LLVM and BOLT repos:

```
> git clone https://github.com/llvm-mirror/llvm llvm
> cd llvm/tools
> git checkout -b llvm-bolt f137ed238db11440f03083b1c88b7ffc0f4af65e
> git clone https://github.com/facebookincubator/BOLT llvm-bolt
> cd ..
> patch -p 1 < tools/llvm-bolt/llvm.patch
```

Proceed to a normal LLVM build using a compiler with C++11 support (for GCC
use version 4.9 or later):

```
> cd ..
> mkdir build
> cd build
> cmake -G Ninja ../llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON
> ninja
```

`llvm-bolt` will be available under `bin/`. Add this directory to your path to
ensure the rest of the commands in this tutorial work.

Note that we use a specific revision of LLVM as we currently rely on a set of
patches that are not yet upstreamed.

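For instance, assuming the current directory is still the `build` directory created above, the tool directory could be prepended to `PATH` as follows. The `BOLT_BIN` variable name is only illustrative.

```shell
# Assumes the shell is still inside the `build` directory from the previous step.
BOLT_BIN="$PWD/bin"
export PATH="$BOLT_BIN:$PATH"

# Print where llvm-bolt resolves from, or a hint if it has not been built yet.
command -v llvm-bolt || echo "llvm-bolt not found in $BOLT_BIN"
```
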
## Optimizing BOLT's Performance

BOLT runs many internal passes in parallel. If you foresee heavy usage of
BOLT, you can improve the processing time by using one of the memory
allocation libraries with good support for concurrency. E.g., to preload jemalloc:

```
> sudo yum install jemalloc-devel
> LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....
```
Or, if you would rather use tcmalloc:
```
> sudo yum install gperftools-devel
> LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....
```

## Usage

For a complete practical guide to using BOLT, see [Optimizing Clang with BOLT](./docs/OptimizingClang.md).

### Step 0

In order to allow BOLT to re-arrange functions (in addition to re-arranging
code within functions) in your program, it needs a little help from the linker.
Add `--emit-relocs` to the final link step of your application. You can verify
the presence of relocations by checking for the `.rela.text` section in the binary.
BOLT will also report if it detects relocations while processing the binary.

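
One possible way to run that check from a script is sketched below. The helper name `has_text_relocs` and the binary name `./app` are hypothetical; the function only inspects a section listing passed on standard input, such as the output of `readelf -S`.

```shell
# Reads a section listing (e.g. `readelf -S` output) on stdin and
# succeeds if a .rela.text section appears in it.
has_text_relocs() {
    grep -q -F '.rela.text'
}

# Usage, with a hypothetical binary name:
#   readelf -S ./app | has_text_relocs && echo "relocations present"
```
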
### Step 1: Collect Profile

This step is different for different kinds of executables. If you can invoke
your program to run on a representative input from a command line, then check
the **For Applications** section below. If your program typically runs as a
server/service, then skip to the **For Services** section.

The version of the `perf` command used for the following steps has to support
the `-F brstack` option. We recommend using `perf` version 4.5 or later.

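
A scripted check of the recommended version could look like the sketch below. The `version_ge` helper is hypothetical and relies on `sort -V` from GNU coreutils; the commented usage assumes `perf --version` prints a line ending in the version number, which may not hold for every distribution's build.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (GNU coreutils), which is an assumption about the host.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage (assumes the version is the last field of the `perf --version` output):
#   perf_ver=$(perf --version | awk '{print $NF}')
#   version_ge "$perf_ver" 4.5 || echo "perf is older than 4.5"
```
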
#### For Applications

This assumes you can run your program from a command line with a typical input.
In this case, simply prepend the command line invocation with `perf`:
```
$ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
```

#### For Services

Once you get the service deployed and warmed-up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```

Depending on the application, you may need more samples to be included with
your profile. It's hard to tell upfront what would be a sweet spot for your
application. We recommend that the profile cover 1B instructions as reported
by BOLT's `-dyno-stats` option. If you need to increase the number of samples
in the profile, you can either run the `sleep` command for longer or use the
`-F<N>` option with `perf` to increase the sampling frequency.

Note that for profile collection we recommend using cycle events and not
`BR_INST_RETIRED.*`. Empirically we found it to produce better results.

If the collection of a profile with branches is not available, e.g., when you run on
a VM or on hardware that does not support it, then you can use only sample
events, such as cycles. In this case, the quality of the profile information
would not be as good, and performance gains with BOLT are expected to be lower.

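As an illustration, the sketch below applies both adjustments at once: a longer collection window and a higher sampling frequency. The values of `DURATION` and `FREQ` are arbitrary placeholders, not recommendations.

```shell
# Hypothetical values: collect for 10 minutes at 5000 samples per second.
DURATION=600
FREQ=5000

PERF_CMD="perf record -e cycles:u -j any,u -a -F $FREQ -o perf.data -- sleep $DURATION"
# Print the command for review before running it.
echo "$PERF_CMD"
```
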
#### With instrumentation (experimental)

If `perf record` is not available to you, you may collect a profile by first
instrumenting the binary with BOLT and then running it.
```
llvm-bolt <executable> -instrument -o <instrumented-executable>
```

After you run `<instrumented-executable>` with the desired workload, its BOLT
profile should be ready for you in `/tmp/prof.fdata` and you can skip
**Step 2**.

Run BOLT with the `-help` option and check the category "BOLT instrumentation
options" for a quick reference on instrumentation knobs. Instrumentation is
experimental and currently does not work for PIEs/SOs.

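Put together, the instrumentation workflow might look like the dry-run sketch below, which only prints the commands it would run. The function name `instrument_workflow`, the `.instr` and `.bolt` suffixes, and the choice of optimization options in the last command are assumptions made for illustration.

```shell
# Prints the three commands of the instrumentation workflow for binary $1.
# Nothing is executed; this is a dry run for review.
instrument_workflow() {
    app=$1
    # 1. Produce an instrumented copy of the binary.
    echo "llvm-bolt $app -instrument -o $app.instr"
    # 2. Run the instrumented copy on a representative workload.
    echo "$app.instr"
    # 3. Optimize the original binary with the profile written by step 2.
    echo "llvm-bolt $app -o $app.bolt -data=/tmp/prof.fdata -reorder-blocks=cache+ -reorder-functions=hfsort -dyno-stats"
}

instrument_workflow ./app
```
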
### Step 2: Convert Profile to BOLT Format

NOTE: you can skip this step and feed `perf.data` directly to BOLT using the
experimental `-p perf.data` option.

For this step, you will need the `perf.data` file collected in the previous step and
a copy of the binary that was running. The binary has to be either
unstripped, or should have a symbol table intact (i.e., running `strip -g` is
okay).

Make sure `perf` is in your `PATH`, and execute `perf2bolt`:
```
$ perf2bolt -p perf.data -o perf.fdata <executable>
```

This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add the `-nl` flag to
the command line above.

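
For the non-LBR case, collection and conversion might be paired as in the sketch below. The binary name `./app` is hypothetical, and dropping `-j` from the `perf record` line is our reading of "sample events only".

```shell
APP=./app   # hypothetical binary name

# Sample cycles only: without `-j`, no branch records are collected.
RECORD_CMD="perf record -e cycles:u -o perf.data -- $APP"
# `-nl` tells perf2bolt that the profile has no LBR data.
CONVERT_CMD="perf2bolt -p perf.data -o perf.fdata -nl $APP"

# Print the commands for review before running them.
printf '%s\n' "$RECORD_CMD" "$CONVERT_CMD"
```
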
### Step 3: Optimize with BOLT

Once you have `perf.fdata` ready, you can use it for optimizations with
BOLT. Assuming your environment is set up to include the right path, execute
`llvm-bolt`:
```
$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats
```

If you need updated debug info, add the `-update-debug-sections` option
to the command above. The processing time will be slightly longer.

For a full list of options see `-help`/`-help-hidden` output.

The input binary for this step does not have to 100% match the binary used for
profile collection in **Step 1**. This could happen when you are doing active
development, and the source code constantly changes, yet you want to benefit
from profile-guided optimizations. However, since the binary is not precisely the
same, the profile information could become invalid or stale, and BOLT will
report the number of functions with a stale profile. The higher the
number, the less performance improvement should be expected. Thus, it is
crucial to update `.fdata` for release branches.

## Multiple Profiles

Suppose your application can run in different modes, and you can generate
a profile for each one of them. To generate a single binary that can
benefit all modes (assuming the profiles don't contradict each other) you can
use the `merge-fdata` tool:
```
$ merge-fdata *.fdata > combined.fdata
```
Use `combined.fdata` for **Step 3** above to generate a universally optimized
binary.

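
The two steps can be sketched end to end as follows. The profile names `mode_a.fdata` and `mode_b.fdata` and the binary name `./app` are hypothetical; the `llvm-bolt` options are the ones from **Step 3**.

```shell
APP=./app   # hypothetical binary name

# Merge two hypothetical per-mode profiles into one.
MERGE_CMD="merge-fdata mode_a.fdata mode_b.fdata > combined.fdata"
# Optimize with the merged profile, using the Step 3 options.
BOLT_CMD="llvm-bolt $APP -o $APP.bolt -data=combined.fdata -reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats"

# Print the commands for review before running them.
printf '%s\n' "$MERGE_CMD" "$BOLT_CMD"
```
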
## License

BOLT is licensed under the [Apache License v2.0 with LLVM Exceptions](./LICENSE.TXT).