
Efficient C Code for ARM

Devices

Satyaki Mukherjee

1
Efficient C Code for ARM Devices

Abstract
“You can make your C code better quickly, cheaply and
easily. Simple techniques are capable of yielding surprising
improvements in system performance, code size and power
consumption. This session looks at how software applications
can make most efficient use of the instruction set, memory
systems, hardware accelerators and power-saving hardware
features in order to deliver significantly lower power
consumption. You will learn tricks which you can use the day
you get back to your office!"

2
There are two sides to this coin
 Application  OS
 Minimal instruction count  System power management
 Minimal memory access  Subsystem power
 Cache-friendly data accesses  DVFS
 Line length and boundary  Multicore power
 Efficient use of stack
 Parameter count  Power-friendly spinlocks
 Return in regs  Sensible task scheduling
 Task/thread partition  Cache configuration and
usage
 SIMD/Vectorization

3
Before we begin...
 Some things are so basic that we have to assume you are
doing them before we can talk about anything else...

4
...some things are vital!
 In our software optimization course we show how to improve
performance of a common video codec suite by 200-250%
using a combination of
 Correct compiler configuration
 Software optimization
 Architectural optimization
 System optimization
 NEON vectorization
 This is impressive BUT it is dwarfed completely by the
penalty of not configuring the cache correctly
 Turning on the data cache can improve performance by up to 5000%!

5
Memory use
 Memory use is expensive
 Takes longer than executing instructions
 Consumes significantly more power
 Keep it close
 Access it as little as possible

[Diagram: memory hierarchy – TCM and cache sit close to the core, external RAM is further away]

6
Speed = Power
 In both senses
 Increasing speed increases power consumption
 BUT more efficient code completes more quickly
 Therefore – optimize for SPEED
 Favour computation over communication
 Only be as accurate as you need – is fixed point enough?
 BUT smaller code might cache more efficiently – avoiding memory accesses

7
Good coding practice
 Make sure things match
 Data types ↔ Architecture
 Code ↔ Instruction set
 Memory use ↔ Platform
 Conventions ↔ Tools
 Write sensible code
 Make sure you know what the tools are doing

8
Data types
 In general, unsigned, word-sized integer types are best
 Sub-word items require truncation or sign-extension
 Doubleword types require higher alignment, especially when passed
as parameters
 Loading signed or unsigned halfwords and signed bytes takes longer
on some cores
 Loading unaligned items works but there can be a performance
penalty
 The compiler can “hide” these effects in many cases ...
... but not always

9
Variable selection (size)
 ARM is a 32-bit architecture, so optimal code is
generated when working with 32-bit (word) sized variables

a = a + b

unsigned int a, b:
    ADD r0,r0,r1

signed short a, b:
    ADD r2,r0,r1
    LSL r2,r2,#16
    ASR r0,r2,#16

unsigned short a, b:
    ADD r2,r0,r1
    BIC r0,r2,#0x10000
10
Array element sizes
 To calculate the address of an element of an array, the compiler must
multiply the size of the element by the index...

&(a[i]) ≡ a + i * sizeof(a[0])

 If the element size is a power of 2, this can be done with a simple inline
shift

 For an array at [r3], to access the first word of element number r1


 Element size = 12:
ADD r1, r1, r1, LSL #1 ; r1 = 3 * i
LDR r0, [r3, r1, LSL #2] ; r0 = *(r3 + 4 * r1) = *(r3 + 12 * i)
 Element size = 16:
LDR r0, [r3, r1, LSL #4] ; r0 = *(r3 + 16 * r1)

11
Parameter Passing
 The AAPCS has rules about 64-bit types
 64-bit parameters must be 8-byte aligned in memory
 64-bit arguments to functions must be passed in an even +
consecutive odd register pair (i.e. r0+r1 or r2+r3) or on the
stack at an 8-byte aligned location
 Registers or stack will be 'wasted' if arguments are listed in a
sub-optimal order

                              r0   r1       r2   r3   stack
fx(int a, double b, int c)    a    unused   b    b    c
fy(int a, int c, double b)    a    c        b    b

 Remember the hidden this argument in r0 for non-static C++ member functions

12
The C compiler doesn’t know everything
 The compiler is constrained in many areas
 Help the compiler out where you can

 Some things are not ARM-specific...
 For instance, using do-while when the termination condition will
always pass on the first iteration
 ...but some are
 The ARM compiler provides helpful things like
 __pure
 __restrict (equivalent to C99 "restrict")
 __promise

13
Looks __promising
void f(int *x, int n)
{
    int i;

    __promise((n > 0) && ((n & 7) == 0));

    for (i = 0; i < n; i++)
    {
        x[i]++;
    }
}

Tells the compiler that the loop count is greater than zero and
divisible by 8, so it can unroll the loop without tail-case checks

14
The C compiler can’t do everything
 Some instructions are never automatically generated by the
C compiler
 Q*, *SAT, many SIMD
 You may need to resort to hand-coding in assembler to get
access to these
 You can get a long way with intrinsics

unsigned int SMUADop(unsigned int val1, unsigned int val2)
{
    return(__smuad(val1,val2));
}

15
Vectorization using NEON instrinsics
 The following function processes an array of data in memory

for(i = 0; i < 8; i++)
{
    dst[i] = src[i] * dst[i];
}

Compiled scalar code (loop body executes 8 times):

    MOV r2,#0
loop
    LDR r3,[r1,r2,LSL #2]
    LDR r4,[r0,r2,LSL #2]
    MUL r3,r3,r4
    STR r3,[r0,r2,LSL #2]
    ADD r2,r2,#1
    CMP r2,#8
    BLT loop

 Vectorizing this using NEON™ C intrinsics

for(i = 0; i < 8; i += 4)
{
    n_dst = vld1q_s32(dst+i);
    n_src = vld1q_s32(src+i);
    n_dst = vmulq_s32(n_dst,n_src);
    vst1q_s32(dst+i,n_dst);
}

Compiled NEON code (loop body executes twice):

    MOV r2,#0
loop
    ADD r4,r1,r2,LSL #2
    ADD r3,r0,r2,LSL #2
    ADD r2,r2,#4
    VLD1.32 {d0,d1},[r4]
    CMP r2,#8
    VLD1.32 {d2,d3},[r3]
    VMUL.I32 q0,q1,q0
    VST1.32 {d0,d1},[r3]
    BLT loop

16
Use your cache wisely
 Data which is aligned to cache line boundaries increases the
effectiveness of manual and automatic preload
 Access data in a cache-friendly manner (i.e. sequentially)

int myarray[16] __attribute__((aligned(64)));

for (i = 0; i < SIZE; i++)
{
    for (j = 0; j < SIZE; j++)
    {
        for (k = 0; k < SIZE; k++)
        {
            a[i][j] += b[i][k] * c[k][j]; /* b: sequential – GOOD; c: strided – BAD */
        }
    }
}

 Make data structures match cache line length
 Use preload: PLD, PLE
17
Base Pointers
Case 1 – a and b defined externally:

extern int a;
extern int b;

int func(void)
{
    return (a+b);
}

    LDR r0, [pc,#16]    ; address of a
    LDR r0, [r0,#0]
    LDR r1, [pc,#12]    ; address of b
    LDR r1, [r1,#0]
    ADD r0,r0,r1
    BX lr
    DCD "address of a"
    DCD "address of b"

Case 2 – a and b defined within the module in which they are used:

int a;
int b;

int func(void)
{
    return (a+b);
}

    LDR r0, [pc,#12]    ; base address of a and b
    LDR r1, [r0,#0]     ; a
    LDR r0, [r0,#4]     ; b
    ADD r0,r0,r1
    BX lr
    DCD "base address of a and b"

Note that this is not done at -O0

18
Power management
 The great debate
 Race to completion vs. Just-in-time
 Static power vs. Dynamic power
 As geometry gets smaller, static power becomes a larger
proportion – actually larger than dynamic power at 65nm
 There are no standard answers!
 Energy Delay Product (EDP)
 Metric which combines measure of energy consumption and timely
completion
 Goal is to minimize EDP

19
Energy benefits of cache

[Chart: EDP, IPC and power plotted against cache size – 8k, 16k, 32k, 64k, 128k, 256k]

20
Should we turn it on?
FP = Floating Point Unit
I$ = Instruction Cache

[Chart: energy consumed at 13MHz and 104MHz for the four
configurations – FP off/I$ off, FP off/I$ on, FP on/I$ off, FP on/I$ on]

21
Minimizing memory access
 Memory access is expensive
 Use registers as much as possible
 Minimize live variables, don't take the address of automatics
 Inlining can reduce stack usage
 Make best use of cache
 WB/WA for stack/heap/globals, WB/RA for buffers
 Multi-level: L1 WT/RA, L2 WB/WA
 Beware of false sharing in multicore systems
 MESI is a "write-invalidate" coherency protocol
 Two processes modifying the same cache line will bounce it back and
forth – the solution is to put volatile shared data in separate cache lines

22
Systems-wide power management
 Implementation-dependent power controllers
 Standard components e.g. NEON
 Multicore

23
Power modes in ARM cores
 Some variant of the following
 Run
 Normal operation, fully powered (but NEON could be off)
 Standby
 Clock stopped, still powered, restart from next instruction, transparent
to program
 Dormant
 Powered down, RAM retained, context save required, restart from
reset vector
 Power-Off
 Core, cache, RAM all powered down, full restart from reset vector
required

24
SMP Linux support for power-saving
 Current versions of SMP Linux support
 CPU hotplug
 Load-balancing and dynamic re-prioritization
 Intelligent scheduling to minimize cache migration/thrashing (strong
affinity between processes and cores)
 DVFS per CPU
 Individual CPU power state management
 Handles interface with external power management controller

25
Example – Freescale i.MX51
 Freescale i.MX51 contains HW NEON monitor which
automatically powers down NEON when unused
 Countdown timer set by program
 Decrements automatically
 Reset when NEON instruction is executed
 Raises interrupt on timeout, triggering state retention powerdown
 Subsequent NEON instruction causes an Undef exception
 Undef handler powers up and restores NEON
 Return to re-execute NEON instruction
 MP3/WMA decode power consumption down by 20%*

*Freescale figures

26
Example – TI OMAP 4
 Dual-core Cortex™-A9 MPCore™ up to 1GHz
 Supports
 DVFS – Dynamic Voltage and Frequency Scaling
 AVS – Adaptive Voltage Scaling
 DPS – Dynamic Power Switching
 SLM – Static Leakage Management
 OS can select from a number of OPPs (Operating
Performance Points)
 OS can drive DVFS depending on current load
 Power consumption from 600uW to 600mW depending on
load

27
In multicore systems
 System efficiency
 Intelligent/dynamic task prioritization
 Load balancing
 Power-conscious spinlocks
 Computational efficiency
 Data, task and functional parallelism
 Low synchronization overhead
 Data efficiency
 Efficient use of memory system
 Efficient use of SMP caches (minimize thrashing, false sharing)

28
Conclusions
 Configure properly
 Tools and platform
 Memory
 Minimize expensive external accesses
 Instruction count
 Optimize for speed
 Subsystems
 Take every opportunity to power down as far and as often as possible

29
Thank You

Please visit www.arm.com for ARM-related technical details

For any queries contact < [email protected] >

30
