# BOLT

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing the application's code layout based
on an execution profile gathered by a sampling profiler, such as the Linux
`perf` tool. An overview of the ideas implemented in BOLT, along with a
discussion of its potential and current results, is available in the
[CGO'19 paper](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

## Input Binary Requirements

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the binaries
should have an unstripped symbol table, and, to get maximum performance gains,
they should be linked with relocations (`--emit-relocs` or `-q` linker flag).

BOLT disassembles functions and reconstructs the control flow graph (CFG)
before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain heuristics
to accomplish it. These heuristics have been tested on code generated by the
Clang and GCC compilers. The main requirement for C/C++ code is not to rely
on code layout properties, such as function pointer deltas.
Assembly code can be processed too. Requirements for it include a clear
separation of code and data, with data objects being placed into data
sections/segments. If indirect jumps are used for intra-function control
transfer (e.g., jump tables), the code patterns should match those
generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the `-freorder-blocks-and-partition`
compiler option. Since GCC 8 enables this option by default, you have to
explicitly disable it by adding the `-fno-reorder-blocks-and-partition` flag if
you are compiling with GCC 8 or above.

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.
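
As a minimal sketch of the GCC note above, a build script could add the flag only when the compiler is GCC 8 or newer. The helper name `gcc_bolt_cflags` is hypothetical and is not part of BOLT or GCC; it only inspects the version string passed to it.

```shell
# gcc_bolt_cflags VERSION
# Prints the extra compiler flag for GCC 8 and above, nothing otherwise.
# Hypothetical helper for illustration; VERSION is e.g. "8" or "11.4.0".
gcc_bolt_cflags() {
  major=${1%%.*}                       # "11.4.0" -> "11", "8" -> "8"
  if [ "$major" -ge 8 ] 2>/dev/null; then
    printf '%s\n' '-fno-reorder-blocks-and-partition'
  fi
}

# Possible use in a build script (assumes gcc is the compiler in use):
#   CFLAGS="$CFLAGS $(gcc_bolt_cflags "$(gcc -dumpversion)")"
```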

## Installation

### Docker Image

You can build and use the docker image containing BOLT using our [docker file](utils/docker/Dockerfile).
Alternatively, you can build BOLT manually using the steps below.

### Manual Build

BOLT heavily uses LLVM libraries, and by design, it is built as one of the LLVM
tools. The build process is not much different from a regular LLVM build.
The following instructions assume that you are running under Linux.

Start by cloning the LLVM repo:

```
> git clone https://github.com/llvm/llvm-project.git
> mkdir build
> cd build
> cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
> ninja bolt
```

`llvm-bolt` will be available under `bin/`. Add this directory to your path to
ensure the rest of the commands in this tutorial work.
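
One way to add the directory to your path without duplicating entries is sketched below. The function name `prepend_to_path` is a hypothetical helper, not something shipped with LLVM.

```shell
# prepend_to_path DIR
# Puts DIR at the front of PATH unless it is already listed there.
# Hypothetical helper for illustration.
prepend_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                          # already present, nothing to do
    *) PATH="$1:$PATH"; export PATH ;;
  esac
}

# From the build directory created above:
#   prepend_to_path "$PWD/bin"
#   command -v llvm-bolt    # prints the resolved path if the build succeeded
```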

## Optimizing BOLT's Performance

BOLT runs many internal passes in parallel. If you foresee heavy usage of
BOLT, you can improve the processing time by linking against one of the memory
allocation libraries with good support for concurrency. E.g., to use jemalloc:

```
> sudo yum install jemalloc-devel
> LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....
```
Or, if you would rather use tcmalloc:
```
> sudo yum install gperftools-devel
> LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....
```
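
Library locations differ between distributions, so a wrapper script may want to probe for whichever allocator is installed. The sketch below assumes the two paths shown above; `first_existing` is a hypothetical helper name.

```shell
# first_existing FILE...
# Prints the first argument that names an existing file; fails if none does.
# Hypothetical helper for illustration.
first_existing() {
  for f in "$@"; do
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  return 1
}

# Possible use (paths taken from the examples above, adjust for your system):
#   alloc=$(first_existing /usr/lib64/libjemalloc.so \
#                          /usr/lib64/libtcmalloc_minimal.so) &&
#     LD_PRELOAD=$alloc llvm-bolt ....
```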

## Usage

For a complete practical guide to using BOLT, see [Optimizing Clang with BOLT](docs/OptimizingClang.md).

### Step 0

In order to allow BOLT to re-arrange functions (in addition to re-arranging
code within functions) in your program, it needs a little help from the linker.
Add `--emit-relocs` to the final link step of your application. You can verify
the presence of relocations by checking for the `.rela.text` section in the binary.
BOLT will also report if it detects relocations while processing the binary.
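
A minimal sketch of that check, assuming `readelf` from binutils is available and that the link goes through a compiler driver (hence `-Wl,`). The function name `has_rela_text` and the file names `app` and `main.o` are hypothetical.

```shell
# has_rela_text
# Reads a section listing on stdin (e.g. the output of `readelf -S -W`) and
# succeeds if a section named .rela.text appears in it.
# Hypothetical helper for illustration.
has_rela_text() {
  grep -q '[[:space:]]\.rela\.text[[:space:]]'
}

# Possible use, with ./app standing in for your binary:
#   cc -o app main.o -Wl,--emit-relocs
#   readelf -S -W ./app | has_rela_text && echo "relocations present"
```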

### Step 1: Collect Profile

This step is different for different kinds of executables. If you can invoke
your program to run on a representative input from a command line, then check
the **For Applications** section below. If your program typically runs as a
server/service, then skip to the **For Services** section.

The version of the `perf` command used for the following steps has to support
the `-F brstack` option. We recommend using `perf` version 4.5 or later.
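
The version recommendation can be checked in a script. The sketch below assumes the version line has the form `perf version MAJOR.MINOR...`, which is what `perf --version` commonly prints; `perf_version_ok` is a hypothetical helper name.

```shell
# perf_version_ok "VERSION LINE"
# Succeeds if the first MAJOR.MINOR found in the line is 4.5 or later.
# Hypothetical helper for illustration.
perf_version_ok() {
  ver=$(printf '%s\n' "$1" |
    sed -n 's/^[^0-9]*\([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
  [ -n "$ver" ] || return 1          # no version number found
  set -- $ver                        # $1 = major, $2 = minor
  [ "$1" -gt 4 ] || { [ "$1" -eq 4 ] && [ "$2" -ge 5 ]; }
}

# Possible use:
#   perf_version_ok "$(perf --version)" || echo "perf is older than 4.5" >&2
```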

#### For Applications

This assumes you can run your program from a command line with a typical input.
In this case, simply prepend the command line invocation with `perf`:
```
$ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
```

#### For Services

Once you get the service deployed and warmed up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes, use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```

Depending on the application, you may need more samples to be included with
your profile. It's hard to tell upfront what would be a sweet spot for your
application. We recommend that the profile cover 1B instructions as reported
by the BOLT `-dyno-stats` option. If you need to increase the number of samples
in the profile, you can either run the `sleep` command for longer or use the
`-F<N>` option with `perf` to increase the sampling frequency.

Note that for profile collection we recommend using cycle events and not
`BR_INST_RETIRED.*`. Empirically, we found cycle events to produce better results.

If the collection of a profile with branches is not available, e.g., when you run on
a VM or on hardware that does not support it, then you can use only sample
events, such as cycles. In this case, the quality of the profile information
would not be as good, and performance gains with BOLT are expected to be lower.
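
The system-wide collection variants described above (longer duration, higher sampling frequency, no branch sampling) can be sketched as one command builder. The function name `perf_record_cmd` is hypothetical, and the frequency value in the usage comment is only an illustrative assumption, not a recommendation.

```shell
# perf_record_cmd SECONDS [FREQ] [lbr|nolbr]
# Prints a system-wide `perf record` command line for review or execution.
# Hypothetical helper for illustration.
perf_record_cmd() {
  secs=$1
  freq=${2:-}
  mode=${3:-lbr}
  cmd="perf record -e cycles:u"
  if [ "$mode" = lbr ]; then
    cmd="$cmd -j any,u"              # branch sampling, only when supported
  fi
  if [ -n "$freq" ]; then
    cmd="$cmd -F $freq"              # optional sampling frequency
  fi
  printf '%s\n' "$cmd -a -o perf.data -- sleep $secs"
}

# perf_record_cmd 180            reproduces the 3-minute command above
# perf_record_cmd 600 5000       longer run with an explicit frequency
# perf_record_cmd 180 "" nolbr   sample-only profile, e.g. on a VM
```

Printing the command rather than running it keeps the decision to execute it with the operator.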

#### With Instrumentation

If `perf record` is not available to you, you may collect a profile by first
instrumenting the binary with BOLT and then running it.
```
llvm-bolt <executable> -instrument -o <instrumented-executable>
```

After you run `<instrumented-executable>` with the desired workload, its BOLT
profile should be ready for you in `/tmp/prof.fdata` and you can skip
**Step 2**.

Run BOLT with the `-help` option and check the category "BOLT instrumentation
options" for a quick reference on instrumentation knobs.
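
The instrumentation flow can be laid out end to end as follows. This is a sketch under stated assumptions: the `.instrumented` and `.bolt` suffixes and the function name `instrumentation_plan` are hypothetical, and the final command carries only the `-data` option, so any further optimization options still need to be appended.

```shell
# instrumentation_plan EXECUTABLE [WORKLOAD ARGS...]
# Prints, in order, the commands for an instrumentation-based profile run.
# Hypothetical helper for illustration.
instrumentation_plan() {
  exe=$1
  shift
  # 1. Produce the instrumented binary.
  printf '%s\n' "llvm-bolt $exe -instrument -o $exe.instrumented"
  # 2. Run it on the workload; the profile lands in /tmp/prof.fdata.
  if [ "$#" -gt 0 ]; then
    printf '%s\n' "$exe.instrumented $*"
  else
    printf '%s\n' "$exe.instrumented"
  fi
  # 3. Feed the profile back to BOLT (optimization options omitted here).
  printf '%s\n' "llvm-bolt $exe -o $exe.bolt -data=/tmp/prof.fdata"
}

# Example: instrumentation_plan ./app --input data.txt
```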

### Step 2: Convert Profile to BOLT Format

NOTE: you can skip this step and feed `perf.data` directly to BOLT using the
experimental `-p perf.data` option.

For this step, you will need the `perf.data` file collected in the previous step and
a copy of the binary that was running. The binary has to be either
unstripped or have its symbol table intact (i.e., running `strip -g` is
okay).

Make sure `perf` is in your `PATH`, and execute `perf2bolt`:
```
$ perf2bolt -p perf.data -o perf.fdata <executable>
```

This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add the `-nl` flag to
the command line above.
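
A script that handles both kinds of profile might select the flag as sketched below. The function name `perf2bolt_cmd` and its `lbr`/`nolbr` argument are hypothetical; the script has to be told how the profile was collected, since this sketch does not inspect `perf.data`.

```shell
# perf2bolt_cmd EXECUTABLE [lbr|nolbr]
# Prints the conversion command, adding -nl for profiles without LBR data.
# Hypothetical helper for illustration.
perf2bolt_cmd() {
  exe=$1
  mode=${2:-lbr}
  flags="-p perf.data -o perf.fdata"
  if [ "$mode" = nolbr ]; then
    flags="$flags -nl"               # profile has no branch records
  fi
  printf '%s\n' "perf2bolt $flags $exe"
}

# perf2bolt_cmd ./app          same command as shown above
# perf2bolt_cmd ./app nolbr    same command with -nl added
```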

### Step 3: Optimize with BOLT

Once you have `perf.fdata` ready, you can use it for optimizations with
BOLT. Assuming your environment is set up to include the right path, execute
`llvm-bolt`:
```
$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats
```

If you need updated debug info, then add the `-update-debug-sections` option
to the command above. The processing time will be slightly longer.

For a full list of options, see the `-help`/`-help-hidden` output.

The input binary for this step does not have to 100% match the binary used for
profile collection in **Step 1**. This could happen when you are doing active
development, and the source code constantly changes, yet you want to benefit
from profile-guided optimizations. However, since the binary is not precisely the
same, the profile information could become invalid or stale, and BOLT will
report the number of functions with a stale profile. The higher the
number, the less performance improvement should be expected. Thus, it is
crucial to update `.fdata` for release branches.

## Multiple Profiles

Suppose your application can run in different modes, and you can generate
a profile for each one of them. To generate a single binary that can
benefit all modes (assuming the profiles don't contradict each other), you can
use the `merge-fdata` tool:
```
$ merge-fdata *.fdata > combined.fdata
```
Use `combined.fdata` for **Step 3** above to generate a universally optimized
binary.
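
When the merge is repeated in the same directory, the `*.fdata` glob also matches a `combined.fdata` left over from an earlier run. A minimal sketch of one way to exclude it follows; `fdata_inputs` is a hypothetical helper, and the usage line assumes file names without whitespace.

```shell
# fdata_inputs DIR OUTPUT_NAME
# Prints every *.fdata file in DIR except OUTPUT_NAME, one per line.
# Hypothetical helper for illustration.
fdata_inputs() {
  for f in "$1"/*.fdata; do
    if [ ! -e "$f" ]; then
      continue                       # the glob matched nothing
    fi
    if [ "${f##*/}" = "$2" ]; then
      continue                       # skip a previously merged profile
    fi
    printf '%s\n' "$f"
  done
}

# Possible use:
#   merge-fdata $(fdata_inputs . combined.fdata) > combined.fdata
```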

## License

BOLT is licensed under the [Apache License v2.0 with LLVM Exceptions](./LICENSE.TXT).