# BOLT

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing the application's code layout based on
an execution profile gathered by a sampling profiler, such as the Linux `perf` tool.
An overview of the ideas implemented in BOLT, along with a discussion of its
potential and current results, is available in the
[CGO'19 paper](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

## Input Binary Requirements

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the binaries
should have an unstripped symbol table, and, to get maximum performance gains,
they should be linked with relocations (`--emit-relocs` or `-q` linker flag).

BOLT disassembles functions and reconstructs the control flow graph (CFG)
before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain heuristics
to accomplish it. These heuristics have been tested on code generated with
the Clang and GCC compilers. The main requirement for C/C++ code is not to rely
on code layout properties, such as function pointer deltas.
Assembly code can be processed too. Requirements for it include a clear
separation of code and data, with data objects being placed into data
sections/segments. If indirect jumps are used for intra-function control
transfer (e.g., jump tables), the code patterns should match those
generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the `-freorder-blocks-and-partition`
compiler option. Since GCC 8 enables this option by default, you have to
explicitly disable it by adding the `-fno-reorder-blocks-and-partition` flag if
you are compiling with GCC 8.
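
As a minimal sketch, the two build-time requirements from this section can be combined in a small wrapper. The function name `build_for_bolt` and the `-O2` optimization level are assumptions made for illustration; the `-fno-reorder-blocks-and-partition` and `--emit-relocs` flags are the ones named above.

```shell
# build_for_bolt is a hypothetical helper, not part of BOLT.
# Usage: build_for_bolt <source.c> <output-binary>
build_for_bolt() {
    # ${CC:-gcc} uses $CC if it is set and falls back to gcc otherwise.
    "${CC:-gcc}" -O2 \
        -fno-reorder-blocks-and-partition \
        -Wl,--emit-relocs \
        -o "$2" "$1"
}
```

Leave the resulting binary unstripped, since BOLT needs its symbol table.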

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.

## Installation

### Docker Image

You can build and use the Docker image containing BOLT using our [Dockerfile](./utils/docker/Dockerfile).
Alternatively, you can build BOLT manually using the steps below.

### Manual Build

BOLT heavily uses LLVM libraries, and by design, it is built as one of the LLVM
tools. The build process is not much different from a regular LLVM build.
The following instructions assume that you are running Linux.

Start by cloning the BOLT repo:

```
> git clone https://github.com/facebookincubator/BOLT llvm-bolt
```

Proceed to a normal LLVM build using a compiler with C++11 support (for GCC
use version 4.9 or later):

```
> cd llvm-bolt
> mkdir build
> cd build
> cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS="bolt" -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON
> ninja
```

`llvm-bolt` will be available under `bin/`. Add this directory to your path to
ensure the rest of the commands in this tutorial work.
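
One way to do this for the current shell session is sketched below. It assumes you are still in the `build` directory created above; making the change permanent (for example, in a shell startup file) is left to your environment.

```shell
# Run this from the build directory created above.
export PATH="$PWD/bin:$PATH"

# Print where the shell resolves llvm-bolt, or warn if the build has not produced it yet.
command -v llvm-bolt || echo "llvm-bolt not found in $PWD/bin" >&2
```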

Note that we use a specific revision of LLVM as we currently rely on a set of
patches that are not yet upstreamed.

## Optimizing BOLT's Performance

BOLT runs many internal passes in parallel. If you foresee heavy usage of
BOLT, you can improve the processing time by using a memory
allocation library with good support for concurrency. For example, to use jemalloc:

```
> sudo yum install jemalloc-devel
> LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....
```
Or, if you would rather use tcmalloc:
```
> sudo yum install gperftools-devel
> LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....
```

## Usage

For a complete practical guide to using BOLT, see [Optimizing Clang with BOLT](./docs/OptimizingClang.md).

### Step 0

In order to allow BOLT to re-arrange functions (in addition to re-arranging
code within functions) in your program, it needs a little help from the linker.
Add `--emit-relocs` to the final link step of your application. You can verify
the presence of relocations by checking for the `.rela.text` section in the binary.
BOLT will also report if it detects relocations while processing the binary.
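
The check for `.rela.text` can be sketched with `readelf` from binutils. The helper name `has_relocs` is hypothetical and not part of BOLT; the sketch only looks for the section name in the section header listing.

```shell
# has_relocs is a hypothetical helper, not part of BOLT.
# Usage: has_relocs <binary>
# Succeeds (exit status 0) if the binary has a .rela.text section.
has_relocs() {
    # readelf -S lists section headers; -W keeps each header on one line.
    readelf -S -W "$1" | grep -q '\.rela\.text'
}
```

For example, `has_relocs ./app && echo "relocations present"` prints the message only when the section is found (`./app` is a placeholder path).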

### Step 1: Collect Profile

This step is different for different kinds of executables. If you can invoke
your program to run on a representative input from a command line, then see the
**For Applications** section below. If your program typically runs as a
server/service, then skip to the **For Services** section.

The version of the `perf` command used for the following steps has to support
the `-F brstack` option. We recommend using `perf` version 4.5 or later.

#### For Applications

This assumes you can run your program from a command line with a typical input.
In this case, simply prepend the command line invocation with `perf`:
```
$ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
```

#### For Services

Once you get the service deployed and warmed up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```

Depending on the application, you may need more samples to be included with
your profile. It's hard to tell upfront what would be a sweet spot for your
application. We recommend that the profile cover 1B instructions, as reported
by BOLT's `-dyno-stats` option. If you need to increase the number of samples
in the profile, you can either run the `sleep` command for longer or use the
`-F<N>` option with `perf` to increase the sampling frequency.
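
Both knobs can be exposed through a small wrapper around the service command above, as sketched here. The function name `collect_system_profile` is hypothetical, and the values in the usage line are arbitrary illustrations rather than recommendations.

```shell
# collect_system_profile is a hypothetical wrapper around the command shown above.
# Usage: collect_system_profile <duration-seconds> <sampling-frequency-hz>
collect_system_profile() {
    # -F sets the sampling frequency; the sleep duration sets how long perf records.
    perf record -e cycles:u -j any,u -a -F "$2" -o perf.data -- sleep "$1"
}
```

For example, `collect_system_profile 600 2000` records all processes for 10 minutes at a requested 2000 samples per second.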

Note that for profile collection we recommend using cycle events and not
`BR_INST_RETIRED.*`. Empirically, we found that cycle events produce better results.

If the collection of a profile with branches is not available, e.g., when you run on
a VM or on hardware that does not support it, then you can use only sample
events, such as cycles. In this case, the quality of the profile information
would not be as good, and performance gains with BOLT are expected to be lower.

#### With Instrumentation

If `perf record` is not available to you, you may collect a profile by first
instrumenting the binary with BOLT and then running it.
```
llvm-bolt <executable> -instrument -o <instrumented-executable>
```

After you run `<instrumented-executable>` with the desired workload, its BOLT
profile should be ready for you in `/tmp/prof.fdata` and you can skip
**Step 2**.
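
The instrumentation flow can be sketched as two helpers. The function names, the `.instrumented` and `.bolt` output suffixes, and the `LLVM_BOLT` override variable are all assumptions made for illustration; only the `-instrument`, `-o`, and `-data` options and the `/tmp/prof.fdata` path come from this document.

```shell
# Hypothetical helpers; LLVM_BOLT is an assumed override that defaults to llvm-bolt.

# Usage: instrument_binary <executable>
# Writes the instrumented copy next to the original as <executable>.instrumented.
instrument_binary() {
    "${LLVM_BOLT:-llvm-bolt}" "$1" -instrument -o "$1.instrumented"
}

# Usage: optimize_from_instrumentation <executable>
# Run this only after the instrumented binary has executed the workload
# and written its profile to /tmp/prof.fdata.
optimize_from_instrumentation() {
    "${LLVM_BOLT:-llvm-bolt}" "$1" -o "$1.bolt" -data=/tmp/prof.fdata
}
```

Between the two calls, run `<executable>.instrumented` with a representative workload. The optimization options listed in **Step 3** can be appended to the second command.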

Run BOLT with the `-help` option and check the category "BOLT instrumentation
options" for a quick reference on instrumentation knobs.

### Step 2: Convert Profile to BOLT Format

NOTE: you can skip this step and feed `perf.data` directly to BOLT using
the experimental `-p perf.data` option.

For this step, you will need the `perf.data` file collected in the previous step and
a copy of the binary that was running. The binary must either be
unstripped or have its symbol table intact (i.e., running `strip -g` is
okay).

Make sure `perf` is in your `PATH`, and execute `perf2bolt`:
```
$ perf2bolt -p perf.data -o perf.fdata <executable>
```

This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add the `-nl` flag to
the command line above.
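
A minimal sketch of a wrapper that covers both cases follows. The function name `convert_profile` and its second argument are hypothetical; the `perf2bolt` options are the ones shown above.

```shell
# convert_profile is a hypothetical wrapper, not part of BOLT.
# Usage: convert_profile <executable> [no]
# Pass "no" as the second argument if perf.data was collected without LBRs.
convert_profile() {
    if [ "${2:-yes}" = "no" ]; then
        # Profile has sample events only, so add -nl.
        perf2bolt -p perf.data -o perf.fdata -nl "$1"
    else
        # Profile has branch (LBR) data.
        perf2bolt -p perf.data -o perf.fdata "$1"
    fi
}
```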

### Step 3: Optimize with BOLT

Once you have `perf.fdata` ready, you can use it for optimizations with
BOLT. Assuming your environment is set up to include the right path, execute
`llvm-bolt`:
```
$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats
```

If you need updated debug info, add the `-update-debug-sections` option
to the command above. The processing time will be slightly longer.

For a full list of options see `-help`/`-help-hidden` output.

The input binary for this step does not have to match exactly the binary used for
profile collection in **Step 1**. This could happen when you are doing active
development, and the source code constantly changes, yet you want to benefit
from profile-guided optimizations. However, since the binary is not precisely the
same, the profile information could become invalid or stale, and BOLT will
report the number of functions with a stale profile. The higher the
number, the less performance improvement should be expected. Thus, it is
crucial to update `.fdata` for release branches.

## Multiple Profiles

Suppose your application can run in different modes, and you can generate
a separate profile for each of them. To generate a single binary that
benefits all modes (assuming the profiles don't contradict each other) you can
use the `merge-fdata` tool:
```
$ merge-fdata *.fdata > combined.fdata
```
Use `combined.fdata` for **Step 3** above to generate a universally optimized
binary.

## License

BOLT is licensed under the [Apache License v2.0 with LLVM Exceptions](./LICENSE.TXT).