Efficient C Code for ARM Devices
Satyaki Mukherjee
Abstract
“You can make your C code better quickly, cheaply and easily. Simple
techniques are capable of yielding surprising improvements in system
performance, code size and power consumption. This session looks at how
software applications can make the most efficient use of the instruction
set, memory systems, hardware accelerators and power-saving hardware
features in order to deliver significantly lower power consumption. You
will learn tricks which you can use the day you get back to your office!”
2
There are two sides to this coin
Application
Minimal instruction count
Minimal memory access
Cache-friendly data accesses
Line length and boundary
Efficient use of stack
Parameter count
Return in regs
Task/thread partition
SIMD/Vectorization
OS
System power management
Subsystem power
DVFS
Multicore power
Power-friendly spinlocks
Sensible task scheduling
Cache configuration and usage
3
Before we begin...
Some things are so basic that we have to assume you are
doing them before we can talk about anything else...
4
...some things are vital!
In our software optimization course we show how to improve
performance of a common video codec suite by 200-250%
using a combination of
Correct compiler configuration
Software optimization
Architectural optimization
System optimization
NEON vectorization
This is impressive BUT it is dwarfed completely by the
penalty of not configuring the cache correctly
Turning on the data cache can improve performance by up to
5000%!
5
Memory use
Memory use is expensive
Takes longer than executing instructions
Consumes significantly more power
Keep it close
Access it as little as possible
TCM
Cache
6
Speed = Power
In both senses
Increasing speed increases power consumption
BUT
More efficient code completes more quickly
Therefore – optimize for SPEED
Favour computation over communication
Only be as accurate as you need – is fixed point enough?
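"Only as accurate as you need" often means replacing floating point with fixed point on integer-only cores. A minimal sketch of a Q16.16 format (the format choice and helper names are illustrative, not from the original slides):

```c
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits, 16 fractional bits */
typedef int32_t q16_16;

#define Q16_ONE (1 << 16)

/* Multiply two Q16.16 values: widen to 64 bits so the intermediate
 * product cannot overflow, then shift the binary point back */
static inline q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);
}

static inline q16_16 q16_from_int(int x) { return (q16_16)(x << 16); }
static inline int    q16_to_int(q16_16 x) { return (int)(x >> 16); }
```

On a core without hardware floating point this costs one long multiply and a shift, instead of a call into a software float library.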
7
Good coding practice
Make sure things match
8
Data types
In general, unsigned, word-sized integer types are best
Sub-word items require truncation or sign-extension
Doubleword types require higher alignment, especially when passed
as parameters
Loading signed or unsigned halfwords and signed bytes takes longer
on some cores
Loading unaligned items works but there can be a performance
penalty
The compiler can “hide” these effects in many cases ...
... but not always
9
Variable selection (size)
The ARM is a 32-bit architecture, so optimal code is
generated when working with 32-bit (word) sized variables
a = a + b

int:             ADD  r0,r0,r1

short:           ADD  r2,r0,r1
                 LSL  r2,r2,#16
                 ASR  r0,r2,#16

unsigned short:  ADD  r2,r0,r1
                 BIC  r0,r2,#0x10000
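The effect can be reproduced by compiling three variants of the same addition; the extra sign-extension or truncation code appears only for the sub-word types (function names are illustrative):

```c
#include <stdint.h>

/* Word-sized: a single ADD */
int32_t add_int(int32_t a, int32_t b)
{
    return a + b;
}

/* Signed halfword: the result must be sign-extended back to 16 bits
 * (LSL #16 / ASR #16 pair, or SXTH on ARMv6 and later) */
int16_t add_short(int16_t a, int16_t b)
{
    return (int16_t)(a + b);
}

/* Unsigned halfword: the result must be truncated to 16 bits
 * (BIC of the carry-out bit, or UXTH on ARMv6 and later) */
uint16_t add_ushort(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}
```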
10
Array element sizes
To calculate the address of an element of an array, the compiler must
multiply the size of the element by the index...
&(a[i]) ≡ a + i * sizeof(a[0])
If the element size is a power of 2, this can be done with a simple inline
shift
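One way to exploit this is to pad a struct up to a power-of-2 size, so array indexing becomes a shift instead of a multiply. A sketch with made-up struct names, assuming the usual 4-byte int32_t layout:

```c
#include <stdint.h>

/* 12 bytes on typical 32/64-bit ABIs:
 * &a[i] needs a multiply-by-12 (or shift-and-add sequence) */
struct sample {
    int32_t x, y, z;
};

/* Explicitly padded to 16 bytes:
 * &a[i] becomes a + (i << 4) */
struct sample_padded {
    int32_t x, y, z;
    int32_t pad;        /* padding to reach a power of 2 */
};
```

The trade-off is memory: the padded array is a third larger, which may itself hurt cache behaviour, so measure both.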
11
Parameter Passing
The AAPCS has rules about 64-bit types
64-bit parameters must be 8-byte aligned in memory
64-bit arguments to functions must be passed in an even +
consecutive odd register
(i.e. r0+r1 or r2+r3) or on the stack at an 8-byte aligned location
Registers or stack will be 'wasted' if arguments are listed in a sub-optimal
order
(argument slots, in order: r0, r1, r2, r3, then the stack)
Remember the hidden this argument in r0 for non-static C++ member functions
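Under these rules argument order matters. A sketch of the same function with two different signatures; the AAPCS register assignment is shown in the comments:

```c
#include <stdint.h>

/* Wasteful order: a -> r0; b needs an even/odd pair so it takes r2:r3
 * and r1 is skipped; c spills to the stack. */
int64_t sum_bad(int32_t a, int64_t b, int32_t c)
{
    return a + b + c;
}

/* Better order: b -> r0:r1, a -> r2, c -> r3; nothing on the stack. */
int64_t sum_good(int64_t b, int32_t a, int32_t c)
{
    return a + b + c;
}
```

Simply listing the 64-bit arguments first avoids the wasted register and the stack traffic.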
12
The C compiler doesn’t know everything
The compiler is constrained in many areas
Help the compiler out where you can
13
Looks __promising

void f(int *x, int n)
{
    int i;
    __promise((n > 0) && ((n % 8) == 0));
    for (i = 0; i < n; i++)
        x[i]++;    /* loop body illustrative; not in the original slide */
}

The __promise intrinsic tells the compiler that the loop count is
greater than 0 and divisible by 8, so it can vectorize without
generating tail-handling code
14
The C compiler can’t do everything
Some instructions are never automatically generated by the
C compiler
Q*, *SAT, many SIMD
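Saturating arithmetic is the classic case: plain C has no saturating operator, so QADD and friends are only reachable through intrinsics (e.g. ACLE's __qadd) or assembler. A portable reference version of a saturating 32-bit add, for comparison with the single-instruction ARM form:

```c
#include <stdint.h>

/* Saturating 32-bit add: clamps at the limits instead of wrapping.
 * On ARM this is a single QADD instruction, but the compiler will not
 * generate QADD from ordinary C arithmetic like this. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;   /* widen so the sum cannot overflow */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```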
15
Vectorization using NEON intrinsics
The following function processes an array of data in memory:

for (i = 0; i < 8; i++)
{
    dst[i] = src[i] * dst[i];
}

The compiler generates a scalar loop which executes 8 times:

     MOV  r2,#0
loop LDR  r3,[r1,r2,LSL #2]
     LDR  r4,[r0,r2,LSL #2]
     MUL  r3,r3,r4
     STR  r3,[r0,r2,LSL #2]
     ADD  r2,r2,#1
     CMP  r2,#8
     BLT  loop              ; {pc}-0x18
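Rewritten with NEON intrinsics, the loop processes four elements per operation. A sketch using the standard arm_neon.h intrinsics, with a scalar fallback so it also builds on non-NEON targets:

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* dst[i] *= src[i] for 8 int32 elements */
void mul8(int32_t *dst, const int32_t *src)
{
#if defined(__ARM_NEON)
    int i;
    for (i = 0; i < 8; i += 4) {
        int32x4_t d = vld1q_s32(dst + i);     /* VLD1: load 4 lanes  */
        int32x4_t s = vld1q_s32(src + i);
        vst1q_s32(dst + i, vmulq_s32(d, s));  /* VMUL, then VST1     */
    }
#else
    int i;                                    /* scalar fallback     */
    for (i = 0; i < 8; i++)
        dst[i] *= src[i];
#endif
}
```

Two iterations of load/multiply/store replace eight, and each NEON load and store moves a full 16 bytes, cutting both instruction count and memory transactions.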
16
Use your cache wisely
Data which is aligned to cache line boundaries increases the
effectiveness of manual and automatic preload
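With C11 you can request that alignment directly. A minimal sketch assuming a 64-byte line (Cortex-A cores use 32- or 64-byte lines, so check your core's documentation):

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Buffer starting exactly on a cache line boundary, so sequential
 * accesses and PLD preloads always touch whole lines */
static alignas(CACHE_LINE) uint8_t buffer[4 * CACHE_LINE];
```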
18
Power management
The great debate
19
Energy benefits of cache
[Chart comparing EDP (energy-delay product), IPC (instructions per
cycle) and power as cache is enabled; chart data not recoverable]
20
Should we turn it on?
FP = Floating Point Unit
I$ = Instruction Cache
[Bar chart: energy consumed at 13MHz and 104MHz for each combination of
FP off/on and I$ off/on]
21
Minimizing memory access
Memory access is expensive
22
Systems-wide power management
Implementation-dependent power controllers
Standard components e.g. NEON
Multicore
23
Power modes in ARM cores
Some variant of the following
Run
Normal operation, fully powered (but NEON could be off)
Standby
Clock stopped, still powered, restart from next instruction, transparent
to program
Dormant
Powered down, RAM retained, context save required, restart from
reset vector
Power-Off
Core, cache, RAM all powered down, full restart from reset vector
required
24
SMP Linux support for power-saving
Current versions of SMP Linux support
CPU hotplug
Load-balancing and dynamic re-prioritization
Intelligent scheduling to minimize cache migration/thrashing (strong
affinity between processes and cores)
DVFS per CPU
Individual CPU power state management
Handles interface with external power management controller
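The process/core affinity mentioned above is visible to applications through the Linux scheduler API. A minimal sketch that reads the calling process's CPU mask (Linux-specific, hence the _GNU_SOURCE feature macro):

```c
#define _GNU_SOURCE     /* for cpu_set_t, sched_getaffinity, CPU_COUNT */
#include <sched.h>

/* Returns the number of CPUs the calling process is allowed to run on,
 * or -1 if the query fails */
int allowed_cpus(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

The companion call, sched_setaffinity(), pins a task to chosen cores, complementing the kernel's automatic load balancing when you want to keep a hot working set in one core's cache.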
25
Example – Freescale i.MX51
Freescale i.MX51 contains HW NEON monitor which
automatically powers down NEON when unused
Countdown timer set by program
Decrements automatically
Reset when NEON instruction is executed
Raises interrupt on timeout, triggering state retention powerdown
Subsequent NEON instruction causes an Undef exception
Undef handler powers up and restores NEON
Return to re-execute NEON instruction
MP3/WMA decode power consumption down by 20%*
*Freescale figures
26
Example – TI OMAP 4
Dual-core Cortex™-A9 MPCore™ up to 1GHz
Supports
DVFS – Dynamic Voltage and Frequency Scaling
AVS – Adaptive Voltage Scaling
DPS – Dynamic Power Switching
SLM – Static Leakage Management
OS can select from a number of OPPs (Operating
Performance Points)
OS can drive DVFS depending on current load
Power consumption from 600 µW to 600 mW depending on
load
27
In multicore systems
System efficiency
Intelligent/dynamic task prioritization
Load balancing
Power-conscious spinlocks
Computational efficiency
Data, task and functional parallelism
Low synchronization overhead
Data efficiency
Efficient use of memory system
Efficient use of SMP caches (minimize thrashing, false sharing)
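A power-conscious spinlock, for instance, waits in a low-power state instead of hammering the lock variable. A sketch using C11 atomics; the ARM-specific part (WFE/SEV) is shown only as a comment, since on other targets a pause/yield hint plays the same role:

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag flag;
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->flag,
                                             memory_order_acquire)) {
        /* On ARM, a WFE here puts the core into standby until the
         * holder's unlock (SEV, or the releasing store on ARMv8)
         * wakes it, instead of burning power polling the line. */
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```

The contended path then costs standby-level power rather than full run power, and also stops stealing bus bandwidth from the lock holder.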
28
Conclusions
Configure properly
Tools and platform
Memory
Minimize expensive external accesses
Instruction count
Optimize for speed
Subsystems
Take every opportunity to power down as far and as often as possible
29
Thank You
30