# BOLT

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing the application's code layout based on
an execution profile gathered by a sampling profiler, such as the Linux `perf` tool.
An overview of the ideas implemented in BOLT, along with a discussion of its
potential and current results, is available in the
[CGO'19 paper](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

## Input Binary Requirements

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the binaries
should have an unstripped symbol table, and, to get maximum performance gains,
they should be linked with relocations (`--emit-relocs` or `-q` linker flag).

BOLT disassembles functions and reconstructs the control flow graph (CFG)
before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain heuristics
to accomplish it. These heuristics have been tested on code generated with
Clang and GCC compilers. The main requirement for C/C++ code is not to rely
on code layout properties, such as function pointer deltas.
Assembly code can be processed too. Requirements for it include a clear
separation of code and data, with data objects being placed into data
sections/segments. If indirect jumps are used for intra-function control
transfer (e.g., jump tables), the code patterns should match those
generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the `-freorder-blocks-and-partition`
compiler option. Since GCC8 enables this option by default, you have to
explicitly disable it by adding the `-fno-reorder-blocks-and-partition` flag if
you are compiling with GCC8 or above.

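As a minimal sketch of a GCC build line that satisfies the requirements above, the helper below combines the partitioning opt-out with linker relocations. The function name `build_for_bolt`, the `-O2` level, and the single-file layout are assumptions for illustration only; `-Wl,--emit-relocs` is the usual way to hand `--emit-relocs` to the linker through the compiler driver.

```shell
# Compiler driver to use; override CC to pick a specific GCC.
CC="${CC:-gcc}"

# Hypothetical helper (not part of BOLT).
# $1: C source file, $2: output binary.
build_for_bolt() {
  "$CC" -O2 -fno-reorder-blocks-and-partition -Wl,--emit-relocs -o "$2" "$1"
}
```

For example, `build_for_bolt main.c app` would produce `app` from a placeholder `main.c`.
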
NOTE2: DWARF v5 is the new debugging format generated by the latest LLVM and GCC
compilers. It offers several benefits over the previous DWARF v4. Currently, the
support for v5 is a work in progress for BOLT. While you will be able to
optimize binaries produced by the latest compilers, until the support is
complete, you will not be able to update the debug info with
`-update-debug-sections`. To temporarily work around the issue, we recommend
compiling binaries with the `-gdwarf-4` option, which forces DWARF v4 output.

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.

## Installation

### Docker Image

You can build and use the docker image containing BOLT using our [docker file](utils/docker/Dockerfile).
Alternatively, you can build BOLT manually using the steps below.

### Manual Build

BOLT heavily uses LLVM libraries, and by design, it is built as one of the LLVM
tools. The build process is not much different from a regular LLVM build.
The following instructions assume that you are running under Linux.

Start by cloning the LLVM repo:

```
> git clone https://github.com/llvm/llvm-project.git
> mkdir build
> cd build
> cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
> ninja bolt
```

`llvm-bolt` will be available under `bin/`. Add this directory to your path to
ensure the rest of the commands in this tutorial work.

## Optimizing BOLT's Performance

BOLT runs many internal passes in parallel. If you foresee heavy usage of
BOLT, you can improve the processing time by linking against, or preloading, one
of the memory allocation libraries with good support for concurrency. E.g., to use jemalloc:

```
> sudo yum install jemalloc-devel
> LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....
```
Or if you would rather use tcmalloc:
```
> sudo yum install gperftools-devel
> LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....
```

## Usage

For a complete practical guide to using BOLT, see [Optimizing Clang with BOLT](docs/OptimizingClang.md).

### Step 0

In order to allow BOLT to re-arrange functions (in addition to re-arranging
code within functions) in your program, it needs a little help from the linker.
Add `--emit-relocs` to the final link step of your application. You can verify
the presence of relocations by checking for the `.rela.text` section in the binary.
BOLT will also report if it detects relocations while processing the binary.

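The relocation check described above can be scripted. The following is a minimal sketch that assumes `readelf` from binutils is installed; `has_text_relocs` is a hypothetical helper name and not part of BOLT.

```shell
# Hypothetical helper (not part of BOLT): succeeds when the ELF file in $1
# lists a .rela.text section, which indicates it was linked with relocations.
# Assumes readelf (binutils) is on PATH; any readelf error counts as "absent".
has_text_relocs() {
  readelf -S "$1" 2>/dev/null | grep -q '\.rela\.text'
}
```

For example, `has_text_relocs ./app || echo "relink with --emit-relocs" >&2`, where `./app` stands in for your binary.
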
### Step 1: Collect Profile

This step is different for different kinds of executables. If you can invoke
your program to run on a representative input from a command line, then check
the **For Applications** section below. If your program typically runs as a
server/service, then skip to the **For Services** section.

The version of the `perf` command used for the following steps has to support
the `-F brstack` option. We recommend using `perf` version 4.5 or later.

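A quick way to compare the installed `perf` against the recommended version is sketched below. It assumes GNU `sort -V` is available and that `perf --version` prints something like `perf version 5.15.0`; both helper names are hypothetical. It only checks the version number and does not prove that `-F brstack` is supported.

```shell
# True when version $1 is at least version $2 (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: print the third field of `perf --version`,
# assumed to be the version number. Prints nothing if perf is missing.
perf_version() {
  perf --version 2>/dev/null | awk '{ print $3 }'
}
```

For example, `version_ge "$(perf_version)" 4.5 || echo "perf is missing or older than 4.5" >&2`.
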
#### For Applications

This assumes you can run your program from a command line with a typical input.
In this case, simply prepend the command line invocation with `perf`:
```
$ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
```

#### For Services

Once you get the service deployed and warmed-up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```

Depending on the application, you may need more samples to be included with
your profile. It's hard to tell upfront what would be a sweet spot for your
application. We recommend that the profile cover 1B instructions as reported
by the BOLT `-dyno-stats` option. If you need to increase the number of samples
in the profile, you can either run the `sleep` command for longer or use the
`-F<N>` option with `perf` to increase the sampling frequency.

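Both knobs can be combined in one invocation. The wrapper below is a sketch built on the system-wide command shown earlier; the function name and the `PERF` override variable are hypothetical, and suitable values depend on your workload.

```shell
# perf binary to use; override PERF to point at a specific build.
PERF="${PERF:-perf}"

# Hypothetical wrapper around the system-wide collection command.
# $1: seconds to sample, $2: sampling frequency for `perf record -F`.
record_service_profile() {
  "$PERF" record -e cycles:u -j any,u -a -F "$2" -o perf.data -- sleep "$1"
}
```

For example, `record_service_profile 600 2000` samples all processes for 10 minutes at a requested 2000 samples per second. Both numbers are illustrative, not recommendations.
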
Note that for profile collection we recommend using cycle events and not
`BR_INST_RETIRED.*`. Empirically we found it to produce better results.

If the collection of a profile with branches is not available, e.g., when you run on
a VM or on hardware that does not support it, then you can use only sample
events, such as cycles. In this case, the quality of the profile information
would not be as good, and performance gains with BOLT are expected to be lower.

#### With instrumentation

If `perf record` is not available to you, you may collect a profile by first
instrumenting the binary with BOLT and then running it.
```
llvm-bolt <executable> -instrument -o <instrumented-executable>
```

After you run `<instrumented-executable>` with the desired workload, its BOLT
profile should be ready for you in `/tmp/prof.fdata` and you can skip
**Step 2**.

Run BOLT with the `-help` option and check the category "BOLT instrumentation
options" for a quick reference on instrumentation knobs.

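Put together, the instrumentation route might look like the sketch below. The helper names, the `BOLT` override variable, and the `.instrumented` suffix are assumptions for illustration; the optimization flags are the ones used in **Step 3**, and the profile path is the default location mentioned above.

```shell
# llvm-bolt binary to use; override BOLT to point at a specific build.
BOLT="${BOLT:-llvm-bolt}"

# Produce an instrumented copy of the binary in $1.
instrument_binary() {
  "$BOLT" "$1" -instrument -o "$1.instrumented"
}

# After running the instrumented copy on your workload, optimize the
# original binary in $1 with the profile written to /tmp/prof.fdata.
optimize_from_instrumentation() {
  "$BOLT" "$1" -o "$1.bolt" -data=/tmp/prof.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold -split-eh -dyno-stats
}
```

A typical sequence would be `instrument_binary ./app`, then running `./app.instrumented` with a representative workload, then `optimize_from_instrumentation ./app`.
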
### Step 2: Convert Profile to BOLT Format

NOTE: you can skip this step and feed `perf.data` directly to BOLT using the
experimental `-p perf.data` option.

For this step, you will need the `perf.data` file collected in the previous step and
a copy of the binary that was running. The binary has to be either
unstripped or have its symbol table intact (i.e., running `strip -g` is
okay).

Make sure `perf` is in your `PATH`, and execute `perf2bolt`:
```
$ perf2bolt -p perf.data -o perf.fdata <executable>
```

This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add the `-nl` flag to
the command line above.

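The two conversion variants can be wrapped in one small function. This is a sketch: the function name, the `PERF2BOLT` override variable, and the `lbr`/`nolbr` mode argument are hypothetical conveniences, while the `perf2bolt` arguments are the ones shown above.

```shell
# perf2bolt binary to use; override PERF2BOLT to point at a specific build.
PERF2BOLT="${PERF2BOLT:-perf2bolt}"

# Hypothetical wrapper. $1: binary that was profiled,
# $2: "lbr" (default) or "nolbr" for profiles collected without LBRs.
convert_profile() {
  if [ "${2:-lbr}" = "nolbr" ]; then
    "$PERF2BOLT" -p perf.data -o perf.fdata -nl "$1"
  else
    "$PERF2BOLT" -p perf.data -o perf.fdata "$1"
  fi
}
```

For example, `convert_profile ./app nolbr` converts a profile recorded on hardware without branch sampling.
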
### Step 3: Optimize with BOLT

Once you have `perf.fdata` ready, you can use it for optimizations with
BOLT. Assuming your environment is set up to include the right path, execute
`llvm-bolt`:
```
$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats
```

If you need updated debug info, then add the `-update-debug-sections` option
to the command above. The processing time will be slightly longer.

For a full list of options see `-help`/`-help-hidden` output.

The input binary for this step does not have to 100% match the binary used for
profile collection in **Step 1**. This could happen when you are doing active
development, and the source code constantly changes, yet you want to benefit
from profile-guided optimizations. However, since the binary is not precisely the
same, the profile information could become invalid or stale, and BOLT will
report the number of functions with a stale profile. The higher the
number, the less performance improvement should be expected. Thus, it is
crucial to update `.fdata` for release branches.

## Multiple Profiles

Suppose your application can run in different modes, and you can generate
a profile for each one of them. To generate a single binary that can
benefit all modes (assuming the profiles don't contradict each other) you can
use the `merge-fdata` tool:
```
$ merge-fdata *.fdata > combined.fdata
```
Use `combined.fdata` for **Step 3** above to generate a universally optimized
binary.

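One practical detail with the glob above: if `combined.fdata` is left over from an earlier run in the same directory, `*.fdata` will pick it up as an input. The sketch below keeps the inputs in their own directory and writes the result elsewhere; the function name, the `MERGE_FDATA` override variable, and the directory layout are assumptions.

```shell
# merge-fdata binary to use; override MERGE_FDATA to point at a specific build.
MERGE_FDATA="${MERGE_FDATA:-merge-fdata}"

# Hypothetical helper. $1: directory holding the per-mode .fdata files,
# $2: output file, which should live outside $1 so that a later run does
# not feed an old combined profile back in as input.
merge_profiles() {
  "$MERGE_FDATA" "$1"/*.fdata > "$2"
}
```

For example, `merge_profiles profiles combined.fdata` merges every profile under a `profiles/` directory into `combined.fdata` in the current directory.
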
## License

BOLT is licensed under the [Apache License v2.0 with LLVM Exceptions](./LICENSE.TXT).