# Optimizing Clang: A Practical Example of Applying BOLT

## Preface

*BOLT* (Binary Optimization and Layout Tool) is designed to improve application
performance by laying out code in a manner that helps the CPU better utilize its caching and
branch prediction resources.

The most obvious candidates for BOLT optimizations
are programs that suffer from many instruction cache and iTLB misses, such as
large applications measuring over hundreds of megabytes in size. However, medium-sized
programs can benefit too. Clang, one of the most popular open-source C/C++ compilers,
is a good example of the latter. Its code size can easily be on the order of tens of megabytes.
As we will see, the Clang binary suffers from many instruction cache
misses and can be significantly improved with BOLT, even on top of profile-guided and
link-time optimizations.

In this tutorial we will first build Clang with PGO and LTO, and then show the steps to
apply BOLT optimizations to make Clang up to 15% faster. We will also analyze where
the compile-time performance gains come from, and verify that the speed-ups
carry over to building other applications.

## Building Clang

The process of getting Clang sources and performing the build is very similar to the
one described at http://clang.llvm.org/get_started.html. For completeness, we provide the detailed steps
on how to obtain and build Clang in the [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto) section.

The only difference from the standard Clang build is that we require the `-Wl,-q` flag to be present during
the final link. This option saves relocation metadata in the executable file, but does not affect
the generated code in any way.

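One way to confirm that a binary was linked with `-Wl,-q` is to look for a relocation section such as `.rela.text` in its section headers. The sketch below assumes an x86-64 ELF binary and the `readelf` tool from binutils; `has_text_relocs` is a hypothetical helper name, not something shipped with BOLT:

```bash
# Hypothetical helper: reads `readelf -S` output on stdin and succeeds
# if a .rela.text section (kept by the linker's -q / --emit-relocs) is listed.
has_text_relocs() {
  grep -q '\.rela\.text'
}

# Example usage (the binary path is an assumption):
#   readelf -S --wide "$CPATH/clang-7" | has_text_relocs && echo "relocations preserved"
```
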
## Optimizing Clang with BOLT

We will use the setup described in [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto).
Adjust the steps accordingly if you skipped that section. We will also assume that `llvm-bolt` is present in your `$PATH`.

Before we can run BOLT optimizations, we need to collect a profile for Clang, and we will use
Clang/LLVM sources for that.
Collecting an accurate profile requires running `perf` on hardware that
implements taken branch sampling (the `-b`/`-j` flag). For that reason, it may not be possible to
collect an accurate profile in a virtualized environment, e.g. in the cloud.
We do support regular sampling profiles, but the performance
improvements are expected to be more modest.

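Before starting a long recording, it can help to probe whether taken branch sampling works on the current machine. The following is a minimal sketch, assuming that a short `perf record` with the same `-j any,u` option fails when the feature is unavailable; `can_sample_branches` and the `PERF` override variable are hypothetical names introduced here:

```bash
# Hypothetical helper: succeeds if `perf` can record taken-branch samples here.
# PERF may be overridden; it defaults to the `perf` found in $PATH.
can_sample_branches() {
  local out
  out=$(mktemp) || return 1
  # Record a trivial command with the same event and branch filter as the real run.
  if "${PERF:-perf}" record -e cycles:u -j any,u -o "$out" -- true >/dev/null 2>&1; then
    rm -f "$out"
    return 0
  fi
  rm -f "$out"
  return 1
}

# Example usage:
#   can_sample_branches || echo "no taken branch sampling; expect more modest gains"
```
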
```bash
$ mkdir ${TOPLEV}/stage3
$ cd ${TOPLEV}/stage3
$ CPATH=${TOPLEV}/stage2-prof-use-lto/install/bin/
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3/install
$ perf record -e cycles:u -j any,u -- ninja clang
```

Once the last command is finished, it will create a `perf.data` file larger than 10GiB.
We will first convert this profile into a more compact aggregated
form suitable to be consumed by BOLT:
```bash
$ perf2bolt $CPATH/clang-7 -p perf.data -o clang-7.fdata -w clang-7.yaml
```
Notice that we are passing `clang-7` to `perf2bolt`, which is the real binary that
`clang` and `clang++` symlink to. The next step will optimize Clang using
the generated profile:
```bash
$ llvm-bolt $CPATH/clang-7 -o $CPATH/clang-7.bolt -b clang-7.yaml \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions \
    -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
```
The output will look similar to the one below:
```t
...
BOLT-INFO: enabling relocation mode
BOLT-INFO: 11415 functions out of 104526 simple functions (10.9%) have non-empty execution profile.
...
BOLT-INFO: ICF folded 29144 out of 105177 functions in 8 passes. 82 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 5466.69 KB of code space. Folded functions were called 2131985 times based on profile.
BOLT-INFO: basic block reordering modified layout of 7848 (10.32%) functions
...
           660155947 : executed forward branches (-2.3%)
            48252553 : taken forward branches (-57.2%)
           129897961 : executed backward branches (+13.8%)
            52389551 : taken backward branches (-19.5%)
            35650038 : executed unconditional branches (-33.2%)
           128338874 : all function calls (=)
            19010563 : indirect calls (=)
             9918250 : PLT calls (=)
          6113398840 : executed instructions (-0.6%)
          1519537463 : executed load instructions (=)
           943321306 : executed store instructions (=)
            20467109 : taken jump table branches (=)
           825703946 : total branches (-2.1%)
           136292142 : taken branches (-41.1%)
           689411804 : non-taken conditional branches (+12.6%)
           100642104 : taken conditional branches (-43.4%)
           790053908 : all conditional branches (=)
...
```
The statistics in the output are based on the LBR profile collected with `perf`, and since we were using
the `cycles` counter, their accuracy is affected. However, the relative improvement in
`taken conditional branches` is a good indication that BOLT was able to straighten out the code even after PGO.

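When comparing several BOLT runs, it can be convenient to pull a single counter out of the `-dyno-stats` output. The sketch below assumes the `count : metric (change)` line layout shown above and that the output was saved to a log file; `dyno_stat` is a hypothetical helper name:

```bash
# Hypothetical helper: reads BOLT output on stdin and prints the count
# reported for the dyno-stats metric named in the first argument.
dyno_stat() {
  awk -v name="$1" -F ' : ' '
    NF == 2 {
      metric = $2
      gsub(/^ +| +$/, "", metric)        # trim surrounding spaces
      sub(/ *\([^)]*\)$/, "", metric)    # drop the trailing "(-43.4%)" or "(=)"
      count = $1
      gsub(/ /, "", count)               # drop alignment padding
      if (metric == name) print count
    }'
}

# Example usage (bolt.log is an assumed file name):
#   dyno_stat 'taken conditional branches' < bolt.log
```
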
## Measuring Compile-time Improvement

`clang-7.bolt` can be used as a replacement for the *PGO+LTO* Clang:
```bash
$ mv $CPATH/clang-7 $CPATH/clang-7.org
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
```
Doing a new build of Clang using the new binary shows a significant overall
build time reduction on a 48-core Haswell system:
```bash
$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
$ ninja clean && /bin/time -f %e ninja clang -j48
202.72
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
$ ninja clean && /bin/time -f %e ninja clang -j48
180.11
```
That's 22.61 seconds (or 12%) faster compared to the *PGO+LTO* build.
Notice that we are measuring an improvement in the total build time, which includes the time spent in the linker.
Compilation time improvements for individual files differ, and speedups over 15% are not uncommon.
If we run BOLT on a Clang binary compiled without *PGO+LTO* (in which case the build finishes in 253.32 seconds),
the gains we see are over 50 seconds (25%),
but, as expected, the result is still slower than the *PGO+LTO+BOLT* build.

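The percentages quoted in this section are consistent with dividing the baseline time by the new time and subtracting one. That reading is an assumption, and `faster_pct` below is a hypothetical helper that applies it to two wall-clock times in seconds:

```bash
# Hypothetical helper: percentage by which the new build is faster than the
# baseline, computed as (base / cur - 1) * 100.
faster_pct() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (base / cur - 1) * 100 }'
}

faster_pct 202.72 180.11   # PGO+LTO vs. PGO+LTO+BOLT build times from above
```

For the two timings above this gives a value between 12% and 13%. Dividing the saved time by the baseline time instead would give a slightly smaller number, so it is worth stating which convention is used when reporting results.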
## Source of the Wins

We mentioned that Clang suffers from considerable instruction cache misses. This can be measured with `perf`:
```bash
$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
$ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48
...
   16,366,101,626,647      instructions
      359,996,216,537      L1-icache-misses
```
That's about 22 instruction cache misses per thousand instructions. As a rule of thumb, if the application
has over 10 misses per thousand instructions, it is a good indication that it will be improved by BOLT.
Now let's see how many misses are in the BOLTed binary:
```bash
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
$ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48
...
   16,319,818,488,769      instructions
      244,888,677,972      L1-icache-misses
```
The number of misses per thousand instructions went down from 22 to 15, significantly reducing
the number of stalls in the CPU front-end.
Notice how the number of executed instructions stayed roughly the same. That's because we didn't
run any optimizations beyond the ones affecting the code layout. Other than instruction cache misses,
BOLT also reduces branch mispredictions, iTLB misses, and misses in the L2 and L3 caches.

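The misses-per-thousand-instructions figures can be reproduced directly from the `perf stat` counters. The helper below is a small sketch (`mpki` is a hypothetical name) that accepts the counters with or without thousands separators:

```bash
# Hypothetical helper: L1 instruction cache misses per thousand instructions.
# Arguments: miss count, instruction count (commas are ignored).
mpki() {
  awk -v misses="$1" -v insns="$2" 'BEGIN {
    gsub(/,/, "", misses)
    gsub(/,/, "", insns)
    printf "%.1f\n", misses * 1000 / insns
  }'
}

mpki 359,996,216,537 16,366,101,626,647   # PGO+LTO Clang
mpki 244,888,677,972 16,319,818,488,769   # PGO+LTO+BOLT Clang
```

A result above 10 matches the rule of thumb given earlier for applications that are likely to benefit from BOLT.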
## Using Clang for Other Applications

We have collected a profile for Clang using its own source code. Would it be enough to speed up
the compilation of other projects? We picked `mysqld`, an open-source database, to do the test.

On our 48-core Haswell system, the build finished in 136.06 seconds using the *PGO+LTO* Clang, and in 126.10 seconds using the *PGO+LTO+BOLT* Clang.
That's a noticeable improvement, but not as significant as the one we saw on Clang itself.
This is partially because the number of instruction cache misses per thousand instructions is slightly lower in this scenario: 19 vs. 22.
Another reason is that Clang is run with a different set of options while building `mysqld` compared
to the training run.

Different options exercise different code paths, and
if we trained without a specific option, we may have misplaced parts of the code responsible for handling it.
To test this theory, we collected another `perf` profile while building `mysqld`, and merged it with the existing profile
using the `merge-fdata` utility that comes with BOLT. Optimized with that profile, the *PGO+LTO+BOLT* Clang was able
to perform the `mysqld` build in 124.74 seconds, i.e. 11 seconds or 9% faster compared to the *PGO+LTO* Clang.
The merged profile didn't make the original Clang compilation slower either, while the number of profiled functions in Clang increased from 11,415 to 14,025.

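The merge step described above can be sketched as follows. This assumes that `merge-fdata` takes the input `.fdata` files as arguments and writes the combined profile to standard output; `merge_profiles`, the `MERGE_FDATA` override variable, and the file name `mysqld.fdata` are hypothetical names used for illustration:

```bash
# Hypothetical wrapper: merge_profiles OUTPUT INPUT...
# MERGE_FDATA may be overridden; it defaults to `merge-fdata` from $PATH.
merge_profiles() {
  local out=$1
  shift
  "${MERGE_FDATA:-merge-fdata}" "$@" > "$out"
}

# Example usage (mysqld.fdata is an assumed name for the second profile):
#   merge_profiles clang-merged.fdata clang-7.fdata mysqld.fdata
# The merged file would then be passed to llvm-bolt in place of the original profile.
```
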
Ideally, the profile run should be done with a superset of all commonly used options. However, most of the improvement is expected with just the basic set.

## Summary

In this tutorial we demonstrated how to use BOLT to improve the
performance of the Clang compiler. Similarly, BOLT could be used to improve the performance
of GCC, or any other application suffering from a high number of instruction
cache misses.

----
# Appendix

## Bootstrapping Clang-7 with PGO and LTO

Below we describe detailed steps to build Clang, and make it ready for BOLT
optimizations. If you already have the build setup, you can skip this section,
except for the last step that adds the `-Wl,-q` linker flag to the final build.

### Getting Clang-7 Sources

Set `$TOPLEV` to the directory of your preference where you would like to do
builds, e.g. `TOPLEV=~/clang-7/`. Then run the following commands to clone the `release/7.x`
branch of the LLVM monorepo:
```bash
$ mkdir ${TOPLEV}
$ cd ${TOPLEV}
$ git clone --branch=release/7.x https://github.com/llvm/llvm-project.git
```

### Building Stage 1 Compiler

Stage 1 will be the first build we are going to do, and we will be using the
default system compiler to build Clang. If your system lacks a compiler, use
your distribution package manager to install one that supports C++11. In this
example we are going to use GCC. In addition to the compiler, you will need the
`cmake` and `ninja` packages. Note that we disable the build of certain
compiler-rt components that are known to cause build issues at release/7.x.
```bash
$ mkdir ${TOPLEV}/stage1
$ cd ${TOPLEV}/stage1
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_ASM_COMPILER=gcc \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_ENABLE_RUNTIMES="compiler-rt" \
    -DCOMPILER_RT_BUILD_SANITIZERS=OFF -DCOMPILER_RT_BUILD_XRAY=OFF \
    -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage1/install
$ ninja install
```

### Building Stage 2 Compiler With Instrumentation

Using the freshly-baked stage 1 Clang compiler, we are going to build Clang with
profile generation capabilities:
```bash
$ mkdir ${TOPLEV}/stage2-prof-gen
$ cd ${TOPLEV}/stage2-prof-gen
$ CPATH=${TOPLEV}/stage1/install/bin/
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_USE_LINKER=lld -DLLVM_BUILD_INSTRUMENTED=ON \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-gen/install
$ ninja install
```

### Generating Profile for PGO

While there are many ways to obtain the profile data, we are going to use the
source code already at our disposal, i.e. we are going to collect the profile
while building Clang itself:
```bash
$ mkdir ${TOPLEV}/stage3-train
$ cd ${TOPLEV}/stage3-train
$ CPATH=${TOPLEV}/stage2-prof-gen/install/bin
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3-train/install
$ ninja clang
```
Once the build is completed, the profile files will be saved under
`${TOPLEV}/stage2-prof-gen/profiles`. We will merge them before they can be
passed back into Clang:
```bash
$ cd ${TOPLEV}/stage2-prof-gen/profiles
$ ${TOPLEV}/stage1/install/bin/llvm-profdata merge -output=clang.profdata *
```

### Building Clang with PGO and LTO

Now the profile can be used to guide optimizations to produce better code for
our scenario, i.e. building Clang. We will also enable link-time optimizations
to allow cross-module inlining and other optimizations. Finally, we are going to
add one extra step that is useful for BOLT: a flag instructing the linker to
preserve relocations in the output binary. Note that this flag does not affect
the generated code or data used at runtime; it only writes metadata to the file
on disk:
```bash
$ mkdir ${TOPLEV}/stage2-prof-use-lto
$ cd ${TOPLEV}/stage2-prof-use-lto
$ CPATH=${TOPLEV}/stage1/install/bin/
$ export LDFLAGS="-Wl,-q"
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_ENABLE_LTO=Full \
    -DLLVM_PROFDATA_FILE=${TOPLEV}/stage2-prof-gen/profiles/clang.profdata \
    -DLLVM_USE_LINKER=lld \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-use-lto/install
$ ninja install
```
Now we have a Clang compiler that can build itself much faster. As we will see,
it builds other applications faster as well, and, with BOLT, the compile time
can be improved even further.