
Efficient C Code for ARM

Devices

Satyaki Mukherjee

1
Efficient C Code for ARM Devices

Abstract
“You can make your C code better quickly, cheaply and
easily. Simple techniques are capable of yielding surprising
improvements in system performance, code size and power
consumption. This session looks at how software applications
can make most efficient use of the instruction set, memory
systems, hardware accelerators and power-saving hardware
features in order to deliver significantly lower power
consumption. You will learn tricks which you can use the day
you get back to your office!"

2
There are two sides to this coin
 Application  OS
 Minimal instruction count  System power management
 Minimal memory access  Subsystem power
 Cache-friendly data accesses  DVFS
 Line length and boundary  Multicore power
 Efficient use of stack
 Parameter count  Power-friendly spinlocks
 Return in regs  Sensible task scheduling
 Task/thread partition  Cache configuration and
usage
 SIMD/Vectorization

3
Before we begin...
 Some things are so basic that we have to assume you are
doing them before we can talk about anything else...

4
...some things are vital!
 In our software optimization course we show how to improve
performance of a common video codec suite by 200-250%
using a combination of
 Correct compiler configuration
 Software optimization
 Architectural optimization
 System optimization
 NEON vectorization
 This is impressive BUT it is dwarfed completely by the
penalty of not configuring the cache correctly
 Turning on the data cache can improve performance by up to 5000%!

5
Memory use
 Memory use is expensive
 Takes longer than executing instructions
 Consumes significantly more power
 Keep it close
 Access it as little as possible

[Diagram: memory hierarchy – TCM and cache sit close to the core, external RAM is further away]

6
Speed = Power
 In both senses
 Increasing speed increases power consumption
 BUT more efficient code completes more quickly
 Therefore – optimize for SPEED
 Favour computation over communication
 Only be as accurate as you need – is fixed point enough?
 BUT smaller code might cache more efficiently – avoiding memory accesses

7
Good coding practice
 Make sure things match
 Data types ↔ Architecture
 Code ↔ Instruction set
 Memory use ↔ Platform
 Conventions ↔ Tools
 Write sensible code
 Make sure you know what the tools are doing

8
Data types
 In general, unsigned, word-sized integer types are best
 Sub-word items require truncation or sign-extension
 Doubleword types require higher alignment, especially when passed
as parameters
 Loading signed or unsigned halfwords and signed bytes takes longer
on some cores
 Loading unaligned items works but there can be a performance
penalty
 The compiler can “hide” these effects in many cases ...
... but not always

9
Variable selection (size)
 ARM is a 32-bit architecture, so optimal code is
generated when working with 32-bit (word) sized variables

a = a + b

unsigned int a, b:
    ADD r0,r0,r1

signed short a, b:
    ADD r2,r0,r1
    LSL r2,r2,#16
    ASR r0,r2,#16

unsigned short a, b:
    ADD r2,r0,r1
    BIC r0,r2,#0x10000
10
Array element sizes
 To calculate the address of an element of an array, the compiler must
multiply the size of the element by the index...

&(a[i]) ≡ a + i * sizeof(a[0])

 If the element size is a power of 2, this can be done with a simple inline
shift

 For an array at [r3], to access the first word of element number r1


 Element size = 12:
ADD r1, r1, r1, LSL #1 ; r1 = 3 * i
LDR r0, [r3, r1, LSL #2] ; r0 = *(r3 + 4 * r1) = *(r3 + 12 * i)
 Element size = 16:
LDR r0, [r3, r1, LSL #4] ; r0 = *(r3 + 16 * r1)

11
Parameter Passing
 The AAPCS has rules about 64-bit types
 64-bit parameters must be 8-byte aligned in memory
 64-bit arguments to functions must be passed in an even +
consecutive odd register pair (i.e. r0+r1 or r2+r3) or on the
stack at an 8-byte aligned location
 Registers or stack will be 'wasted' if arguments are listed in a
sub-optimal order

                              r0   r1       r2   r3   stack
fx(int a, double b, int c)    a    unused   b    b    c
fy(int a, int c, double b)    a    c        b    b

 Remember the hidden this argument in r0 for non-static C++ member functions

12
The C compiler doesn’t know everything
 The compiler is constrained in many areas
 Help the compiler out where you can

 Some things are not ARM-specific...
 For instance, using do-while when the termination condition will
always pass on the first iteration
 ...but some are
 The ARM compiler provides helpful things like
 __pure
 __restrict (equivalent to C99 "restrict")
 __promise

13
Looks __promising
void f(int *x, int n)
{
    int i;

    __promise((n > 0) && ((n & 7) == 0));

    for (i = 0; i < n; i++)
    {
        x[i]++;
    }
}

Tells the compiler that the loop count is greater than zero and
divisible by 8, so it can unroll the loop without tail-case checks

14
The C compiler can’t do everything
 Some instructions are never automatically generated by the
C compiler
 Q*, *SAT, many SIMD
 You may need to resort to hand-coding in assembler to get
access to these
 You can get a long way with intrinsics

unsigned int SMUADop(unsigned int val1, unsigned int val2)
{
    return(__smuad(val1,val2));
}

15
Vectorization using NEON instrinsics
 The following function processes an array of data in memory

for(i = 0; i < 8; i++)
{
    dst[i] = src[i] * dst[i];
}

Compiled scalar code (loop body executes 8 times):

    MOV r2,#0
loop
    LDR r3,[r1,r2,LSL #2]
    LDR r4,[r0,r2,LSL #2]
    MUL r3,r3,r4
    STR r3,[r0,r2,LSL #2]
    ADD r2,r2,#1
    CMP r2,#8
    BLT loop

 Vectorizing this using NEON™ C intrinsics

for(i = 0; i < 8; i += 4)
{
    n_dst = vld1q_s32(dst+i);
    n_src = vld1q_s32(src+i);
    n_dst = vmulq_s32(n_dst,n_src);
    vst1q_s32(dst+i,n_dst);
}

Compiled NEON code (loop body executes twice):

    MOV r2,#0
loop
    ADD r4,r1,r2,LSL #2
    ADD r3,r0,r2,LSL #2
    ADD r2,r2,#4
    VLD1.32 {d0,d1},[r4]
    CMP r2,#8
    VLD1.32 {d2,d3},[r3]
    VMUL.I32 q0,q1,q0
    VST1.32 {d0,d1},[r3]
    BLT loop

16
Use your cache wisely
 Data which is aligned to cache line boundaries increases the
effectiveness of manual and automatic preload
 Access data in a cache-friendly manner (i.e. sequentially)

int myarray[16] __attribute__((aligned(64)));

for (i = 0; i < SIZE; i++)
{
    for (j = 0; j < SIZE; j++)
    {
        for (k = 0; k < SIZE; k++)
        {
            a[i][j] += b[i][k] * c[k][j]; /* b: sequential – GOOD; c: strided – BAD */
        }
    }
}

 Make data structures match cache line length
 Use preload: PLD, PLE
17
Base Pointers
Case 1 – a and b defined externally:

extern int a;
extern int b;

int func(void)
{
    return (a+b);
}

    LDR r0, [pc,#16]    ; address of a
    LDR r0, [r0,#0]
    LDR r1, [pc,#12]    ; address of b
    LDR r1, [r1,#0]
    ADD r0,r0,r1
    BX lr
    DCD "address of a"
    DCD "address of b"

Case 2 – a and b defined within the module in which they are used:

int a;
int b;

int func(void)
{
    return (a+b);
}

    LDR r0, [pc,#12]    ; base address of a and b
    LDR r1, [r0,#0]     ; a
    LDR r0, [r0,#4]     ; b
    ADD r0,r0,r1
    BX lr
    DCD "base address of a and b"

Note that this is not done at -O0

18
Power management
 The great debate
 Race to completion vs. Just-in-time
 Static power vs. Dynamic power
 As geometry gets smaller, static power becomes a larger
proportion – actually larger than dynamic power at 65nm
 There are no standard answers!
 Energy Delay Product (EDP)
 Metric which combines measure of energy consumption and timely
completion
 Goal is to minimize EDP

19
Energy benefits of cache

[Chart: EDP, IPC and power plotted against cache size – 8k, 16k, 32k, 64k, 128k, 256k]

20
Should we turn it on?
FP = Floating Point Unit
I$ = Instruction Cache

[Chart: energy consumed at 13MHz and 104MHz for the four
configurations – FP off/I$ off, FP off/I$ on, FP on/I$ off, FP on/I$ on]

21
Minimizing memory access
 Memory access is expensive
 Use registers as much as possible
 Minimize live variables, don't take the address of automatics
 Inlining can reduce stack usage
 Make best use of cache
 WB/WA for stack/heap/globals, WB/RA for buffers
 Multi-level: L1 WT/RA, L2 WB/WA
 Beware of false sharing in multicore systems
 MESI is a "write-invalidate" coherency protocol
 Two processes modifying the same cache line will bounce it back and
forth – the solution is to put volatile shared data in separate cache lines

22
Systems-wide power management
 Implementation-dependent power controllers
 Standard components e.g. NEON
 Multicore

23
Power modes in ARM cores
 Some variant of the following
 Run
 Normal operation, fully powered (but NEON could be off)
 Standby
 Clock stopped, still powered, restart from next instruction, transparent
to program
 Dormant
 Powered down, RAM retained, context save required, restart from
reset vector
 Power-Off
 Core, cache, RAM all powered down, full restart from reset vector
required

24
SMP Linux support for power-saving
 Current versions of SMP Linux support
 CPU hotplug
 Load-balancing and dynamic re-prioritization
 Intelligent scheduling to minimize cache migration/thrashing (strong
affinity between processes and cores)
 DVFS per CPU
 Individual CPU power state management
 Handles interface with external power management controller

25
Example – Freescale i.MX51
 Freescale i.MX51 contains HW NEON monitor which
automatically powers down NEON when unused
 Countdown timer set by program
 Decrements automatically
 Reset when NEON instruction is executed
 Raises interrupt on timeout, triggering state retention powerdown
 Subsequent NEON instruction causes an Undef exception
 Undef handler powers up and restores NEON
 Return to re-execute NEON instruction
 MP3/WMA decode power consumption down by 20%*

*Freescale figures

26
Example – TI OMAP 4
 Dual-core Cortex™-A9 MPCore™ up to 1GHz
 Supports
 DVFS – Dynamic Voltage and Frequency Scaling
 AVS – Adaptive Voltage Scaling
 DPS – Dynamic Power Switching
 SLM – Static Leakage Management
 OS can select from a number of OPPs (Operating
Performance Points)
 OS can drive DVFS depending on current load
 Power consumption from 600uW to 600mW depending on
load

27
In multicore systems
 System efficiency
 Intelligent/dynamic task prioritization
 Load balancing
 Power-conscious spinlocks
 Computational efficiency
 Data, task and functional parallelism
 Low synchronization overhead
 Data efficiency
 Efficient use of memory system
 Efficient use of SMP caches (minimize thrashing, false sharing)

28
Conclusions
 Configure properly
 Tools and platform
 Memory
 Minimize expensive external accesses
 Instruction count
 Optimize for speed
 Subsystems
 Take every opportunity to power down as far and as often as possible

29
Thank You

Please visit www.arm.com for ARM-related technical details

For any queries contact < [email protected] >

30
