# BOLT

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing the application's code layout based on
an execution profile gathered by a sampling profiler, such as the Linux `perf` tool.
An overview of the ideas implemented in BOLT, along with a discussion of its
potential and current results, is available in the
[CGO'19 paper](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

## Input Binary Requirements

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the binaries
should have an unstripped symbol table, and, to get maximum performance gains,
they should be linked with relocations (`--emit-relocs` or `-q` linker flag).

BOLT disassembles functions and reconstructs the control flow graph (CFG)
before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain heuristics
to accomplish it. These heuristics have been tested on code generated by the
Clang and GCC compilers. The main requirement for C/C++ code is not to rely
on code layout properties, such as function pointer deltas.
Assembly code can be processed too. Requirements for it include a clear
separation of code and data, with data objects being placed into data
sections/segments. If indirect jumps are used for intra-function control
transfer (e.g., jump tables), the code patterns should match those
generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the `-freorder-blocks-and-partition`
compiler option. Since GCC 8 enables this option by default, you have to
explicitly disable it by adding the `-fno-reorder-blocks-and-partition` flag if
you are compiling with GCC 8 or above.

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.
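For the GCC note and the relocation requirement above, a build might pass flags along these lines. This is only a sketch under assumptions: `app.c`, `app.o`, and `app` are placeholder names, `-O2` is an arbitrary optimization level, and `-Wl,--emit-relocs` assumes the link step goes through the GCC driver rather than invoking the linker directly. The block prints the commands instead of running them.

```shell
# Flags for a BOLT-friendly GCC build (sketch; adapt to your build system).
BOLT_CFLAGS="-O2 -fno-reorder-blocks-and-partition"  # disable the GCC 8+ default
BOLT_LDFLAGS="-Wl,--emit-relocs"                     # keep relocations in the output

# app.c, app.o and app are hypothetical names; the commands are only printed.
echo "gcc $BOLT_CFLAGS -c app.c -o app.o"
echo "gcc $BOLT_LDFLAGS app.o -o app"
```

In a real project these values would usually be appended to the existing `CFLAGS`/`CXXFLAGS` and `LDFLAGS` of the build system.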

## Installation

### Docker Image

You can build and use the docker image containing BOLT using our [docker file](utils/docker/Dockerfile).
Alternatively, you can build BOLT manually using the steps below.

### Manual Build

BOLT heavily uses LLVM libraries, and by design, it is built as one of the LLVM
tools. The build process is not much different from a regular LLVM build.
The following instructions assume that you are running under Linux.

Start by cloning the LLVM repo:

```
> git clone https://github.com/llvm/llvm-project.git
> mkdir build
> cd build
> cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
> ninja bolt
```

`llvm-bolt` will be available under `bin/`. Add this directory to your path to
ensure the rest of the commands in this tutorial work.
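One possible way to add the directory to your path is sketched below. The helper name `prepend_to_path` is made up for this example, and the location `$HOME/build` is an assumption about where the `build` directory from the steps above was created.

```shell
# Prepend a directory to PATH unless it is already listed.
prepend_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Assumption: the build directory created above is $HOME/build.
prepend_to_path "$HOME/build/bin"
```

Placing the same lines in a shell startup file would make the change persistent.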

## Optimizing BOLT's Performance

BOLT runs many internal passes in parallel. If you foresee heavy usage of
BOLT, you can improve the processing time by linking against a memory
allocation library with good support for concurrency. E.g., to use jemalloc:

```
> sudo yum install jemalloc-devel
> LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....
```
Or, if you would rather use tcmalloc:
```
> sudo yum install gperftools-devel
> LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....
```

## Usage

For a complete practical guide to using BOLT, see [Optimizing Clang with BOLT](docs/OptimizingClang.md).

### Step 0

In order to allow BOLT to re-arrange functions (in addition to re-arranging
code within functions) in your program, it needs a little help from the linker.
Add `--emit-relocs` to the final link step of your application. You can verify
the presence of relocations by checking for the `.rela.text` section in the binary.
BOLT will also report if it detects relocations while processing the binary.
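The check for `.rela.text` could be scripted roughly as follows. This is a sketch, not part of BOLT: `has_section` is a hypothetical helper that scans section listings in the format printed by `readelf -S -W`, and `./myapp` in the usage comments is a placeholder binary name.

```shell
# Succeeds if a section with the exact name $1 appears in the
# `readelf -S -W` listing supplied on standard input.
has_section() {
    awk -v name="$1" '
        { for (i = 1; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }
    '
}

# Usage (./myapp is a placeholder):
#   readelf -S -W ./myapp | has_section .rela.text && echo "relocations present"
#   readelf -S -W ./myapp | has_section .symtab    && echo "symbol table present"
```

Matching whole fields rather than substrings keeps a section such as `.rela.text.cold` from being mistaken for `.rela.text`.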

### Step 1: Collect Profile

This step is different for different kinds of executables. If you can invoke
your program to run on a representative input from a command line, then see the
For Applications section below. If your program typically runs as a
server/service, then skip to the For Services section.

The version of the `perf` command used for the following steps has to support
the `-F brstack` option. We recommend using `perf` version 4.5 or later.

#### For Applications

This assumes you can run your program from a command line with a typical input.
In this case, simply prepend the command line invocation with `perf`:
```
$ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
```

#### For Services

Once you get the service deployed and warmed up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes, use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```

Depending on the application, you may need more samples to be included with
your profile. It's hard to tell upfront what would be a sweet spot for your
application. We recommend the profile to cover 1B instructions as reported
by the BOLT `-dyno-stats` option. If you need to increase the number of samples
in the profile, you can either run the `sleep` command for longer or use the
`-F<N>` option with `perf` to increase the sampling frequency.

Note that for profile collection we recommend using cycle events and not
`BR_INST_RETIRED.*`. Empirically we found it to produce better results.

If the collection of a profile with branches is not available, e.g., when you run on
a VM or on hardware that does not support it, then you can use only sample
events, such as cycles. In this case, the quality of the profile information
would not be as good, and performance gains with BOLT are expected to be lower.
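To collect more samples as described above, the duration and the sampling frequency can both be made parameters of the collection command. The wrapper below is a sketch: `collect_profile` and the `DRY_RUN` variable are invented for this example, and the values 5000 and 300 are illustrative only, not a recommendation.

```shell
# Build the system-wide LBR collection command with a chosen sampling
# frequency and duration. With DRY_RUN set, print it instead of running it.
collect_profile() {
    freq=$1       # samples per second, passed to perf as -F
    seconds=$2    # how long to keep sampling
    set -- perf record -e cycles:u -j any,u -F "$freq" -a -o perf.data -- sleep "$seconds"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example values only: 5000 samples per second for 5 minutes.
DRY_RUN=1 collect_profile 5000 300
```

Without `DRY_RUN`, the same call runs `perf` and writes `perf.data` to the current directory.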

#### With instrumentation

If `perf record` is not available to you, you may collect a profile by first
instrumenting the binary with BOLT and then running it.
```
llvm-bolt <executable> -instrument -o <instrumented-executable>
```

After you run instrumented-executable with the desired workload, its BOLT
profile should be ready for you in `/tmp/prof.fdata` and you can skip
Step 2.

Run BOLT with the `-help` option and check the category "BOLT instrumentation
options" for a quick reference on instrumentation knobs.
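Put together, the instrumentation route might look like the sketch below. The function names `run` and `profile_with_instrumentation`, the `DRY_RUN` switch, the `.inst` and `.bolt` suffixes, and the `./myapp --typical-input` invocation are all assumptions made for illustration. Only the `-instrument`, `-o`, and `-data` options and the `/tmp/prof.fdata` location come from the text.

```shell
# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Instrument a binary, run the workload, then optimize with the profile.
profile_with_instrumentation() {
    exe=$1; shift                              # remaining arguments go to the workload
    run llvm-bolt "$exe" -instrument -o "$exe.inst" || return 1
    run "$exe.inst" "$@" || return 1           # profile lands in /tmp/prof.fdata
    run llvm-bolt "$exe" -o "$exe.bolt" -data=/tmp/prof.fdata
}

# Print the three commands for a hypothetical binary and workload.
DRY_RUN=1 profile_with_instrumentation ./myapp --typical-input
```

The final `llvm-bolt` call is kept minimal here; further optimization options can be appended to it.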

### Step 2: Convert Profile to BOLT Format

NOTE: you can skip this step and feed `perf.data` directly to BOLT using the
experimental `-p perf.data` option.

For this step, you will need the `perf.data` file collected in the previous step and
a copy of the binary that was running. The binary has to be either
unstripped, or should have a symbol table intact (i.e., running `strip -g` is
okay).

Make sure `perf` is in your `PATH`, and execute `perf2bolt`:
```
$ perf2bolt -p perf.data -o perf.fdata <executable>
```

This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add the `-nl` flag to
the command line above.
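The two conversion variants can be summarized in a small helper. This is a sketch with invented names: `perf2bolt_cmd` and its `lbr`/`nolbr` mode argument are not part of BOLT, and `./myapp` is a placeholder. The function only prints the command to run.

```shell
# Print the perf2bolt command for a profile collected with branch
# records ("lbr", the default) or without them ("nolbr").
perf2bolt_cmd() {
    exe=$1
    mode=${2:-lbr}
    if [ "$mode" = "nolbr" ]; then
        echo "perf2bolt -p perf.data -o perf.fdata -nl $exe"
    else
        echo "perf2bolt -p perf.data -o perf.fdata $exe"
    fi
}

# A profile without LBRs would come from a plain sampling run, e.g.
#   perf record -e cycles:u -o perf.data -- ./myapp
perf2bolt_cmd ./myapp nolbr
```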

### Step 3: Optimize with BOLT

Once you have `perf.fdata` ready, you can use it for optimizations with
BOLT. Assuming your environment is set up to include the right path, execute
`llvm-bolt`:
```
$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats
```

If you need updated debug info, then add the `-update-debug-sections` option
to the command above. The processing time will be slightly longer.

For a full list of options see the `-help`/`-help-hidden` output.

The input binary for this step does not have to 100% match the binary used for
profile collection in Step 1. This could happen when you are doing active
development, and the source code constantly changes, yet you want to benefit
from profile-guided optimizations. However, since the binary is not precisely the
same, the profile information could become invalid or stale, and BOLT will
report the number of functions with a stale profile. The higher the
number, the less performance improvement should be expected. Thus, it is
crucial to update `.fdata` for release branches.

## Multiple Profiles

Suppose your application can run in different modes, and you can generate
a profile for each one of them. To generate a single binary that can
benefit all modes (assuming the profiles don't contradict each other), you can
use the `merge-fdata` tool:
```
$ merge-fdata *.fdata > combined.fdata
```
Use `combined.fdata` for Step 3 above to generate a universally optimized
binary.

## License

BOLT is licensed under the [Apache License v2.0 with LLVM Exceptions](./LICENSE.TXT).