# Benchmarking `llvm-libc`'s memory functions

## Foreword

Microbenchmarks are valuable tools to assess and compare the performance of
isolated pieces of code. However, they don't capture all the interactions of
complex systems, so other metrics can be equally important:

- **code size** (to reduce instruction cache pressure),
- **Profile Guided Optimization** friendliness,
- **hyperthreading / multithreading** friendliness.

## Rationale

The goal here is to satisfy the [Benchmarking
Principles](https://en.wikipedia.org/wiki/Benchmark_(computing)#Benchmarking_Principles):

1. **Relevance**: Benchmarks should measure relatively vital features.
2. **Representativeness**: Benchmark performance metrics should be broadly
   accepted by industry and academia.
3. **Equity**: All systems should be fairly compared.
4. **Repeatability**: Benchmark results can be verified.
5. **Cost-effectiveness**: Benchmark tests are economical.
6. **Scalability**: Benchmark tests should measure from single server to
   multiple servers.
7. **Transparency**: Benchmark metrics should be easy to understand.

Benchmarking is a [subtle
art](https://en.wikipedia.org/wiki/Benchmark_(computing)#Challenges) and
benchmarking memory functions is no exception. Here we'll dive into the
peculiarities of designing good microbenchmarks for `llvm-libc` memory
functions.

## Challenges

As seen in [README.md](README.md#stochastic-mode), the microbenchmarking
facility should focus on measuring **low latency code**. If copying a few bytes
takes on the order of a few cycles, the benchmark should be able to **measure
accurately down to the cycle**.

### Measuring instruments

There are different sources of time in a computer (ordered from high to low
resolution):

- [Performance
  Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used to
  introspect the internals of the CPU,
- [High Precision Event
  Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
  trigger short lived actions,
- [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock): used
  to keep track of the computer's time.
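
For reference, here is a minimal sketch of reading **raw time** from user space
with `std::chrono::steady_clock` (illustrative only; the actual harness may
read time differently). Note that merely reading such a clock typically costs
on the order of tens of nanoseconds, i.e. dozens of cycles, which is far from
cycle accurate:

```cpp
#include <chrono>
#include <cstdio>

int main() {
  // steady_clock is monotonic, which makes it suitable for measuring
  // intervals (system_clock can jump when the wall clock is adjusted).
  const auto Start = std::chrono::steady_clock::now();
  // ... the code under measurement would run here ...
  const auto Stop = std::chrono::steady_clock::now();
  const auto Ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(Stop - Start);
  std::printf("elapsed: %lld ns\n", static_cast<long long>(Ns.count()));
  return 0;
}
```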

In theory **Performance Counters** provide cycle-accurate measurements via the
`cpu cycles` event. But as we'll see, they are not really practical in this
context.

### Performance counters and modern processor architecture

Modern CPUs are [out of
order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
[superscalar](https://en.wikipedia.org/wiki/Superscalar_processor); as a
consequence, it is [hard to know what is included when the counter is
read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling):
some instructions may still be **in flight**, while others may be executing
[**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
matter of fact, **on the same machine, measuring the same piece of code twice
will yield different results.**

### Performance counter semantic inconsistencies and availability

Although they have the same name, the exact semantics of performance counters
are micro-architecture dependent: **it is generally not possible to compare two
micro-architectures exposing the same performance counters.**

Each vendor decides which performance counters to implement and their exact
meaning. Although we want to benchmark `llvm-libc` memory functions for all
available [target
triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
is **no guarantee that the counter we're interested in is available.**

### Additional sources of imprecision

- Reading performance counters is done through kernel [system
  calls](https://en.wikipedia.org/wiki/System_call). The system call itself
  is costly (hundreds of cycles) and will perturb the counter's value.
- [Interrupts](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
  can occur during the measurement.
- If the system is already under monitoring (virtual machines or system-wide
  profiling) the kernel can decide to multiplex the performance counters,
  leading to lower precision or even a completely missing measurement.
- The kernel can decide to [migrate the
  process](https://en.wikipedia.org/wiki/Process_migration) to a different
  core.
- [Dynamic frequency
  scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
  in during the measurement and change the duration of a tick. **Ultimately we
  care about the amount of work over a period of time.** This removes some of
  the legitimacy of measuring cycles rather than **raw time**.

### Cycle accuracy conclusion

We have seen that performance counters are not universally available, are
semantically inconsistent across micro-architectures, and are imprecise on
modern CPUs for small snippets of code.

## Design decisions

To achieve the needed precision we have to resort to more widely available
counters and derive the time from a high number of runs: going from a single
deterministic measurement to a probabilistic one.

**To get a good signal-to-noise ratio we need the running time of the piece of
code to be orders of magnitude greater than the measurement precision.**

For instance, if the measurement precision is 10 cycles, the function runtime
needs to exceed 1000 cycles to achieve a 1%
[SNR](https://en.wikipedia.org/wiki/Signal-to-noise_ratio) (10 / 1000 = 1%).

### Repeating code N times until precision is sufficient

The algorithm is as follows:

- We measure the time it takes to run the code _N_ times (initially _N_ is 10,
  for instance).
- We deduce an approximation of the runtime of one iteration (= _runtime_ /
  _N_).
- We increase _N_ by _X%_ and repeat the measurement (geometric progression).
- We keep track of the _one iteration runtime approximation_ and build a
  weighted mean of all the samples so far (the weight is proportional to _N_).
- We stop the process when the difference between the weighted mean and the
  last estimation is smaller than _ε_, or when other stopping conditions are
  met (total runtime, maximum iterations, or maximum sample count).

This method allows us to be as precise as needed provided that the measured
runtime is proportional to _N_. Longer run times also smooth out imprecision
related to _interrupts_ and _context switches_. A sketch of the sampling loop
is given below.
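
This is a minimal sketch of the loop; the initial _N_, the growth factor, and
the stopping conditions are illustrative values, not the exact ones used by the
harness:

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Illustrative only: measure F repeated N times, grow N geometrically, and
// stop once the last estimate agrees with the weighted mean within Epsilon.
template <typename Function> double estimateIterationTime(Function &&F) {
  using Clock = std::chrono::steady_clock;
  constexpr double Epsilon = 0.01;   // 1% relative precision target.
  constexpr size_t MaxSamples = 100; // Additional stopping condition.
  size_t N = 10;                     // Initial iteration count.
  double TotalTime = 0;              // Sum of all measured runtimes.
  double TotalIterations = 0;        // Sum of all N, i.e. the weights.
  double WeightedMean = 0;
  for (size_t Sample = 0; Sample < MaxSamples; ++Sample) {
    const auto Start = Clock::now();
    for (size_t I = 0; I < N; ++I)
      F();
    const std::chrono::duration<double> Elapsed = Clock::now() - Start;
    const double Estimate = Elapsed.count() / N; // One-iteration runtime.
    TotalTime += Elapsed.count();
    TotalIterations += N;
    WeightedMean = TotalTime / TotalIterations; // Weight proportional to N.
    if (std::abs(Estimate - WeightedMean) < Epsilon * WeightedMean)
      break; // Converged: last estimate and weighted mean agree.
    N += N / 5; // Geometric progression: increase N by 20%.
  }
  return WeightedMean;
}
```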

Note: When measuring longer runtimes (e.g. copying several megabytes of data)
the above assumption doesn't hold anymore and the _ε_ precision cannot be
reached by increasing iterations. The whole benchmarking process becomes
prohibitively slow. In this case the algorithm is limited to a single sample
and repeated several times to get a decent 95% confidence interval.

### Effect of branch prediction

When measuring code with branches, repeating the same call again and again
allows the processor to learn the branching patterns and perfectly predict all
the branches, leading to unrealistic results.

**Decision: When benchmarking small buffer sizes, the function parameters
should be randomized between calls to prevent perfect branch prediction.**
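
For instance, here is a hypothetical sketch of a randomized `memcpy`
microbenchmark (a real harness must also keep the compiler from optimizing the
calls away, which is omitted here):

```cpp
#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

// Illustrative only: draw the copy size at random for every call so the
// branch predictor cannot lock onto a single size-dependent path.
void benchmarkSmallCopies(size_t MaxSize, size_t Calls) {
  std::vector<char> Src(MaxSize), Dst(MaxSize);
  std::minstd_rand Rng(12345); // Fixed seed for repeatability.
  std::uniform_int_distribution<size_t> SizeDist(0, MaxSize);
  for (size_t I = 0; I < Calls; ++I)
    std::memcpy(Dst.data(), Src.data(), SizeDist(Rng));
}
```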

### Effect of the memory subsystem

The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
`L2` and `L3` data caches.

We may be tempted to randomize data accesses widely to exercise all the caching
layers down to RAM, but the [cost of accessing lower layers of
memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
completely dominates the runtime for small sizes.

So, to respect the **Equity** and **Repeatability** principles, we should make
sure we **do not** depend on the memory subsystem.

**Decision: When benchmarking small buffer sizes, the data accessed by the
function should stay in `L1`.**
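
On Linux with glibc, the `L1` data cache size can be queried at runtime; here
is a hypothetical sketch of capping buffer sizes accordingly (the fallback
value and the fraction of `L1` kept in reserve are arbitrary):

```cpp
#include <unistd.h>

#include <cstdio>

int main() {
  // _SC_LEVEL1_DCACHE_SIZE is a glibc extension; it may return -1 or 0 when
  // the information is unavailable.
  const long L1DataCacheSize = sysconf(_SC_LEVEL1_DCACHE_SIZE);
  // Use a fraction of L1 so the stack and other hot data also stay resident.
  const long MaxBufferSize = L1DataCacheSize > 0 ? L1DataCacheSize / 2 : 16384;
  std::printf("capping benchmark buffers at %ld bytes\n", MaxBufferSize);
  return 0;
}
```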

### Effect of prefetching

For small buffer sizes,
[prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
in, but for large buffers it may introduce a bias.

**Decision: When benchmarking large buffer sizes, the data should be accessed
in a random fashion to lower the impact of prefetching between calls.**
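
Here is a hypothetical sketch of such an access pattern, copying from random
offsets inside buffers much larger than the last-level cache (the buffer size
is an arbitrary example):

```cpp
#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

// Illustrative only: each call touches a different part of the buffers so the
// hardware prefetcher cannot learn a streaming pattern across calls.
void benchmarkLargeCopies(size_t CopySize, size_t Calls) {
  const size_t BufferSize = 64 * 1024 * 1024; // Much larger than L3.
  std::vector<char> Src(BufferSize), Dst(BufferSize);
  std::minstd_rand Rng(12345); // Fixed seed for repeatability.
  std::uniform_int_distribution<size_t> Offset(0, BufferSize - CopySize);
  for (size_t I = 0; I < Calls; ++I)
    std::memcpy(Dst.data() + Offset(Rng), Src.data() + Offset(Rng), CopySize);
}
```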

### Effect of dynamic frequency scaling

Modern processors implement [dynamic frequency
scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In so-called
`performance` mode the CPU will increase its frequency and run faster than
usual within [some limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost):
_"The increased clock rate is limited by the processor's power, current, and
thermal limits, the number of cores currently in use, and the maximum frequency
of the active cores."_

**Decision: When benchmarking we want to make sure the dynamic frequency
scaling is always set to `performance`. We also want to make sure that
time-based events are not impacted by frequency scaling.**

See [README.md](README.md) on how to set this up.

### Reserved and pinned cores

Some operating systems allow [core
reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process).
This removes a set of perturbation sources such as process migration, context
switches, and interrupts. When a core is hyperthreaded, both logical cores
should be reserved.
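
On Linux, the benchmark process can additionally pin itself to such a reserved
core. Here is a minimal sketch using `sched_setaffinity` (the core number is
hypothetical and would come from the machine's configuration):

```cpp
#include <sched.h>

#include <cstdio>

// Pin the calling thread to a single core (pid 0 means "the calling thread").
bool pinToCore(int Core) {
  cpu_set_t Set;
  CPU_ZERO(&Set);
  CPU_SET(Core, &Set);
  if (sched_setaffinity(0, sizeof(Set), &Set) != 0) {
    std::perror("sched_setaffinity");
    return false;
  }
  return true;
}
```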

## Microbenchmark limitations

As stated in the Foreword section, a number of effects do play a role in
production but are not directly measurable through microbenchmarks. The code
size of the benchmark is (much) smaller than the hot code of real applications
and so **doesn't exhibit as much instruction cache pressure**.

### iCache pressure

Fundamental functions that are called frequently will occupy the L1 iCache
([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
they are too big they will prevent other hot code from staying in the cache and
incur [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the
memory functions should be as small as possible.

### iTLB pressure

The same reasoning applies to the instruction Translation Lookaside Buffer
([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)): oversized
functions incur [TLB
misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).

## FAQ

1. Why don't you use Google Benchmark directly?

   We reuse some parts of Google Benchmark (detection of frequency scaling, CPU
   cache hierarchy information), but when it comes to measuring memory
   functions Google Benchmark has a few issues:

   - Google Benchmark privileges code-based configuration via macros and
     builders, which is typically done in a static manner. In our case the
     parameters we need to set up are a mix of what's usually controlled by
     the framework (number of trials, maximum number of iterations, size
     ranges) and parameters that are more tied to the function under test
     (randomization strategies, custom values). Achieving this with Google
     Benchmark is cumbersome as it involves templated benchmarks and
     duplicated code. In the end, the configuration would be spread across
     command line flags (via the framework's options or custom flags) and code
     constants.
   - Output of the measurements is done through a `BenchmarkReporter` class,
     which makes it hard to access the parameters discussed above.