# Optimizing Clang: A Practical Example of Applying BOLT

## Preface

*BOLT* (Binary Optimization and Layout Tool) is designed to improve application
performance by laying out code in a manner that helps the CPU better utilize its caching and
branch-prediction resources.

The most obvious candidates for BOLT optimizations
are programs that suffer from many instruction cache and iTLB misses, such as
large applications measuring over hundreds of megabytes in size. However, medium-sized
programs can benefit too. Clang, one of the most popular open-source C/C++ compilers,
is a good example of the latter. Its code size can easily be on the order of tens of megabytes.
As we will see, the Clang binary suffers from many instruction cache
misses and can be significantly improved with BOLT, even on top of profile-guided and
link-time optimizations.

In this tutorial we will first build Clang with PGO and LTO, and then show how to
apply BOLT optimizations to make Clang up to 15% faster. We will also analyze where
the compile-time performance gains come from, and verify that the speed-ups
carry over to building other applications.

## Building Clang

The process of getting Clang sources and performing the build is very similar to the
one described at http://clang.llvm.org/get_started.html. For completeness, we provide detailed steps
on how to obtain and build Clang in the [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto) section.

The only difference from the standard Clang build is that we require the `-Wl,-q` flag to be present during
the final link. This option saves relocation metadata in the executable file, but does not affect
the generated code in any way.

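One way to confirm that the final link actually kept its relocations is to look for a `.rela.text` section among the section headers of the resulting binary. The sketch below assumes an x86-64 ELF binary, where a link with `-Wl,-q` is expected to leave such a section behind; the helper name `has_text_relocs` is made up for illustration and is not part of BOLT.

```bash
# Hypothetical helper: reads a section-header listing (for example the
# output of `readelf -S -W <binary>`) on stdin and succeeds only if a
# .rela.text section is listed.
has_text_relocs() {
    grep -q '\.rela\.text'
}
```

For example, `readelf -S -W ${TOPLEV}/stage2-prof-use-lto/install/bin/clang-7 | has_text_relocs && echo "relocations present"` should print the message for a binary linked as described in the appendix.
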
## Optimizing Clang with BOLT

We will use the setup described in [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto).
Adjust the steps accordingly if you skipped that section. We will also assume that `llvm-bolt` is present in your `$PATH`.

Before we can run BOLT optimizations, we need to collect a profile for Clang, and we will use
Clang/LLVM sources for that.
Collecting an accurate profile requires running `perf` on hardware that
implements taken branch sampling (`-b/-j` flag). For that reason, it may not be possible to
collect an accurate profile in a virtualized environment, e.g. in the cloud.
We do support regular sampling profiles, but the performance
improvements are expected to be more modest.

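Whether a given machine supports taken branch sampling can be probed before starting the long profiling run. The sketch below is one possible check, not part of BOLT: it records a trivial command with the same `-j any,u` option and reports whether `perf` accepted it. The function name and the `PERF` override variable are made up for illustration.

```bash
# Path to perf; overridable so the check can point at a specific build.
PERF=${PERF:-perf}

# Hypothetical helper: prints "lbr" if perf can record taken branches
# for a trivial command, and "no-lbr" otherwise.
branch_sampling_mode() {
    if "$PERF" record -e cycles:u -j any,u -o /dev/null -- true >/dev/null 2>&1; then
        echo lbr
    else
        echo no-lbr
    fi
}
```

Running `branch_sampling_mode` on the profiling host prints `no-lbr` when branch sampling is unavailable, in which case only a regular sampling profile can be collected.
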
```bash
$ mkdir ${TOPLEV}/stage3
$ cd ${TOPLEV}/stage3
$ CPATH=${TOPLEV}/stage2-prof-use-lto/install/bin/
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3/install
$ perf record -e cycles:u -j any,u -- ninja clang
```

Once the last command is finished, it will create a `perf.data` file larger than 10GiB.
We will first convert this profile into a more compact aggregated
form suitable to be consumed by BOLT:
```bash
$ perf2bolt $CPATH/clang-7 -p perf.data -o clang-7.fdata -w clang-7.yaml
```
Notice that we are passing `clang-7` to `perf2bolt`, which is the real binary that
`clang` and `clang++` are symlinking to. The next step will optimize Clang using
the generated profile:
```bash
$ llvm-bolt $CPATH/clang-7 -o $CPATH/clang-7.bolt -b clang-7.yaml \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions \
    -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
```
The output will look similar to the one below:
```t
...
BOLT-INFO: enabling relocation mode
BOLT-INFO: 11415 functions out of 104526 simple functions (10.9%) have non-empty execution profile.
...
BOLT-INFO: ICF folded 29144 out of 105177 functions in 8 passes. 82 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 5466.69 KB of code space. Folded functions were called 2131985 times based on profile.
BOLT-INFO: basic block reordering modified layout of 7848 (10.32%) functions
...
          660155947 : executed forward branches (-2.3%)
           48252553 : taken forward branches (-57.2%)
          129897961 : executed backward branches (+13.8%)
           52389551 : taken backward branches (-19.5%)
           35650038 : executed unconditional branches (-33.2%)
          128338874 : all function calls (=)
           19010563 : indirect calls (=)
            9918250 : PLT calls (=)
         6113398840 : executed instructions (-0.6%)
         1519537463 : executed load instructions (=)
          943321306 : executed store instructions (=)
           20467109 : taken jump table branches (=)
          825703946 : total branches (-2.1%)
          136292142 : taken branches (-41.1%)
          689411804 : non-taken conditional branches (+12.6%)
          100642104 : taken conditional branches (-43.4%)
          790053908 : all conditional branches (=)
...
```
The statistics in the output are based on the LBR profile collected with `perf`, and since we were using
the `cycles` counter, their accuracy is affected. However, the relative improvement in
`taken conditional branches` is a good indication that BOLT was able to straighten out the code even after PGO.

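When comparing several BOLT runs, it can help to pull a single figure out of the `-dyno-stats` report. The sketch below assumes the `llvm-bolt` output was saved to a log file in the format shown above and extracts the relative change in taken conditional branches; the helper name is made up for illustration.

```bash
# Hypothetical helper: reads llvm-bolt output on stdin and prints the
# parenthesized delta from the "taken conditional branches" line.
# The "non-taken conditional branches" line does not match the pattern.
taken_cond_delta() {
    awk -F'[()]' '/: taken conditional branches/ { print $2 }'
}
```

A call such as `taken_cond_delta < bolt.log` would then print only the delta, where `bolt.log` is a hypothetical file holding the saved output.
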
## Measuring Compile-time Improvement

`clang-7.bolt` can be used as a replacement for *PGO+LTO* Clang:
```bash
$ mv $CPATH/clang-7 $CPATH/clang-7.org
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
```
Doing a new build of Clang using the new binary shows a significant overall
build time reduction on a 48-core Haswell system:
```bash
$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
$ ninja clean && /bin/time -f %e ninja clang -j48
202.72
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
$ ninja clean && /bin/time -f %e ninja clang -j48
180.11
```
That's 22.61 seconds (or 12%) faster compared to the *PGO+LTO* build.
Notice that we are measuring an improvement in the total build time, which includes the time spent in the linker.
Compilation time improvements for individual files differ, and speedups over 15% are not uncommon.
If we run BOLT on a Clang binary compiled without *PGO+LTO* (in which case the build finishes in 253.32 seconds),
the gains we see are over 50 seconds (25%),
but, as expected, the result is still slower than the *PGO+LTO+BOLT* build.

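The comparison can be reproduced from the two wall-clock timings. Because a percentage can be stated relative to either build, the sketch below prints both the share of build time saved and the corresponding throughput gain; the helper name is made up for illustration.

```bash
# Hypothetical helper: $1 = baseline seconds, $2 = optimized seconds.
build_delta() {
    awk -v base="$1" -v opt="$2" 'BEGIN {
        printf "saved %.2fs, %.1f%% less time, %.1f%% higher throughput\n",
               base - opt, 100 * (base - opt) / base, 100 * (base / opt - 1)
    }'
}
```

Calling `build_delta 202.72 180.11` with the timings above reports the 22.61-second difference along with both percentages.
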
## Source of the Wins

We mentioned that Clang suffers from a considerable number of instruction cache misses. This can be measured with `perf`:
```bash
$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
$ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48
  ...
   16,366,101,626,647      instructions
      359,996,216,537      L1-icache-misses
```
That's about 22 instruction cache misses per thousand instructions. As a rule of thumb, if the application
has over 10 misses per thousand instructions, it is a good indication that it will be improved by BOLT.
Now let's see how many misses are in the BOLTed binary:
```bash
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
$ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48
  ...
   16,319,818,488,769      instructions
      244,888,677,972      L1-icache-misses
```
The number of misses per thousand instructions went down from 22 to 15, significantly reducing
the number of stalls in the CPU front-end.
Notice how the number of executed instructions stayed roughly the same. That's because we didn't
run any optimizations beyond the ones affecting the code layout. Other than instruction cache misses,
BOLT also reduces branch mispredictions, iTLB misses, and misses in L2 and L3.

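The misses-per-thousand-instructions figure is simply the miss count divided by the instruction count, scaled by 1000. A minimal sketch of that arithmetic follows; the helper name `mpki` is made up for illustration, and the counts are passed in by hand as printed by `perf stat`.

```bash
# Hypothetical helper: $1 = miss count, $2 = instruction count, both as
# printed by `perf stat` (thousands separators are removed first).
mpki() {
    misses=$(printf '%s' "$1" | tr -d ',')
    instrs=$(printf '%s' "$2" | tr -d ',')
    awk -v m="$misses" -v i="$instrs" 'BEGIN { printf "%.1f\n", 1000 * m / i }'
}
```

For the two runs above, this would be `mpki 359,996,216,537 16,366,101,626,647` for the original binary and `mpki 244,888,677,972 16,319,818,488,769` for the BOLTed one.
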
## Using Clang for Other Applications

We have collected the profile for Clang using its own source code. Would it be enough to speed up
the compilation of other projects? We picked `mysqld`, an open-source database, to do the test.

On our 48-core Haswell system, the build finished in 136.06 seconds using the *PGO+LTO* Clang, and in 126.10 seconds using the *PGO+LTO+BOLT* Clang.
That's a noticeable improvement, but not as significant as the one we saw on Clang itself.
This is partially because the number of instruction cache misses is slightly lower in this scenario: 19 vs. 22 misses per thousand instructions.
Another reason is that Clang is run with a different set of options while building `mysqld` compared
to the training run.

Different options exercise different code paths, and
if we trained without a specific option, we may have misplaced parts of the code responsible for handling it.
To test this theory, we collected another `perf` profile while building `mysqld`, and merged it with the existing profile
using the `merge-fdata` utility that comes with BOLT. Optimized with that profile, the *PGO+LTO+BOLT* Clang was able
to perform the `mysqld` build in 124.74 seconds, i.e. 11 seconds or 9% faster compared to the *PGO+LTO* Clang.
The merged profile didn't make the original Clang compilation slower either, while the number of profiled functions in Clang increased from 11,415 to 14,025.

Ideally, the profile run should be done with a superset of all commonly used options. However, most of the improvement is expected with just the basic set.

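The merge step might look like the sketch below. It assumes that `merge-fdata` accepts the input profiles as arguments and writes the combined profile to standard output; the wrapper name, the `MERGE_FDATA` override variable, and the file names in the usage note are made up for illustration.

```bash
# Path to merge-fdata; overridable if it is not in $PATH.
MERGE_FDATA=${MERGE_FDATA:-merge-fdata}

# Hypothetical wrapper: $1 = output profile, remaining args = input profiles.
merge_profiles() {
    out=$1
    shift
    "$MERGE_FDATA" "$@" > "$out"
}
```

A call such as `merge_profiles merged.fdata clang-7.fdata mysqld.fdata` would combine the Clang training profile with a second one, where `mysqld.fdata` stands for the output of `perf2bolt` on the profile collected during the `mysqld` build.
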
## Summary

In this tutorial we demonstrated how to use BOLT to improve the
performance of the Clang compiler. Similarly, BOLT could be used to improve the performance
of GCC, or any other application suffering from a high number of instruction
cache misses.

----
# Appendix

## Bootstrapping Clang-7 with PGO and LTO

Below we describe detailed steps to build Clang and make it ready for BOLT
optimizations. If you already have the build setup, you can skip this section,
except for the last step that adds the `-Wl,-q` linker flag to the final build.

### Getting Clang-7 Sources

Set `$TOPLEV` to the directory of your preference where you would like to do
builds, for example `TOPLEV=~/clang-7/`. Then clone the `release/7.x`
branch of the LLVM monorepo:
```bash
$ mkdir ${TOPLEV}
$ cd ${TOPLEV}
$ git clone --branch=release/7.x https://github.com/llvm/llvm-project.git
```

### Building Stage 1 Compiler

Stage 1 will be the first build we are going to do, and we will be using the
default system compiler to build Clang. If your system lacks a compiler, use
your distribution's package manager to install one that supports C++11. In this
example we are going to use GCC. In addition to the compiler, you will need the
`cmake` and `ninja` packages. Note that we disable the build of certain
compiler-rt components that are known to cause build issues at release/7.x.
```bash
$ mkdir ${TOPLEV}/stage1
$ cd ${TOPLEV}/stage1
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_ASM_COMPILER=gcc \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_ENABLE_RUNTIMES="compiler-rt" \
    -DCOMPILER_RT_BUILD_SANITIZERS=OFF -DCOMPILER_RT_BUILD_XRAY=OFF \
    -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage1/install
$ ninja install
```

### Building Stage 2 Compiler With Instrumentation

Using the freshly-baked stage 1 Clang compiler, we are going to build Clang with
profile generation capabilities:
```bash
$ mkdir ${TOPLEV}/stage2-prof-gen
$ cd ${TOPLEV}/stage2-prof-gen
$ CPATH=${TOPLEV}/stage1/install/bin/
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_USE_LINKER=lld -DLLVM_BUILD_INSTRUMENTED=ON \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-gen/install
$ ninja install
```

### Generating Profile for PGO

While there are many ways to obtain the profile data, we are going to use the
source code already at our disposal, i.e. we are going to collect the profile
while building Clang itself:
```bash
$ mkdir ${TOPLEV}/stage3-train
$ cd ${TOPLEV}/stage3-train
$ CPATH=${TOPLEV}/stage2-prof-gen/install/bin
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3-train/install
$ ninja clang
```
Once the build is completed, the profile files will be saved under
`${TOPLEV}/stage2-prof-gen/profiles`. We will merge them before they can be
passed back into Clang:
```bash
$ cd ${TOPLEV}/stage2-prof-gen/profiles
$ ${TOPLEV}/stage1/install/bin/llvm-profdata merge -output=clang.profdata *
```

### Building Clang with PGO and LTO

Now the profile can be used to guide optimizations to produce better code for
our scenario, i.e. building Clang. We will also enable link-time optimizations
to allow cross-module inlining and other optimizations. Finally, we are going to
add one extra step that is useful for BOLT: a flag instructing the linker to
preserve relocations in the output binary. Note that this flag does not affect
the generated code or data used at runtime; it only writes metadata to the file
on disk:
```bash
$ mkdir ${TOPLEV}/stage2-prof-use-lto
$ cd ${TOPLEV}/stage2-prof-use-lto
$ CPATH=${TOPLEV}/stage1/install/bin/
$ export LDFLAGS="-Wl,-q"
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_ENABLE_LTO=Full \
    -DLLVM_PROFDATA_FILE=${TOPLEV}/stage2-prof-gen/profiles/clang.profdata \
    -DLLVM_USE_LINKER=lld \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-use-lto/install
$ ninja install
```
Now we have a Clang compiler that can build itself much faster. As we will see,
it builds other applications faster as well, and, with BOLT, the compile time
can be improved even further.