Aapcs 32
Architecture
2023Q1
1.2 Keywords
Procedure call, function call, calling conventions, data layout
Copyright © 2003, 2005-2009, 2012, 2015, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
1.4 Licence
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of
this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866,
Mountain View, CA 94042, USA.
Grant of Patent License. Subject to the terms and conditions of this license (both the Public License and this Patent
License), each Licensor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and
otherwise transfer the Licensed Material, where such license applies only to those patent claims licensable by such
Licensor that are necessarily infringed by their contribution(s) alone or by combination of their contribution(s) with the
Licensed Material to which such contribution(s) was submitted. If You institute patent litigation against any entity
(including a cross-claim or counterclaim in a lawsuit) alleging that the Licensed Material or a contribution incorporated
within the Licensed Material constitutes direct or contributory patent infringement, then any licenses granted to You
under this license for that Licensed Material shall terminate as of the date such litigation is filed.
1.6 Contributions
Contributions to this project are licensed under an inbound=outbound model such that any such contributions are
licensed by the contributor under the same terms as those in the Licence section.
1.8 Copyright
Copyright (c) 2003, 2005-2009, 2012, 2015, 2018, 2020-2023, Arm Limited and its affiliates. All rights reserved.
Contents
1 Preamble
1.1 Abstract
1.2 Keywords
1.3 Latest release and defects report
1.4 Licence
1.5 About the license
1.6 Contributions
1.7 Trademark notice
1.8 Copyright
2 About This Document
2.1 Change Control
2.1.1 Current Status and Anticipated Changes
2.1.2 Change History
2.2 References
2.3 Terms and Abbreviations
2.4 Acknowledgements
3 Scope
4 Introduction
4.1 Design Goals
4.2 Conformance
5 Data Types and Alignment
5.1 Fundamental Data Types
5.1.1 Half-precision Floating Point
5.1.2 Containerized Vectors
5.2 Endianness and Byte Ordering
5.3 Composite Types
5.3.1 Aggregates
5.3.2 Unions
5.3.3 Arrays
5.3.4 Bit-fields
5.3.5 Homogeneous Aggregates
6 The Base Procedure Call Standard
6.1 Machine Registers
6.1.1 Core registers
6.1.2 Co-processor Registers
6.2 Processes, Memory and the Stack
6.2.1 The Stack
6.3 Subroutine Calls
6.3.1 Use of IP by the linker
6.4 Result Return
6.5 Parameter Passing
6.6 Interworking
7 The Standard Variants
7.1 VFP and SIMD vector Register Arguments
7.1.1 Mapping between registers and memory format
7.1.2 Procedure Calling
7.2 Arm Alternative Format Half-precision Floating Point values
7.3 Read-Write Position Independence (RWPI)
7.4 Variant Compatibility
7.4.1 VFP and Base Standard Compatibility
7.4.2 RWPI and Base Standard Compatibility
7.4.3 VFP and RWPI Standard Compatibility
7.4.4 Half-precision Format Compatibility
8 Arm C and C++ Language Mappings
8.1 Data Types
8.1.1 Arithmetic Types
8.1.2 Pointer Types
8.1.3 Enumerated Types
8.1.4 Additional Types
8.1.5 Volatile Data Types
8.1.6 Structure, Union and Class Layout
8.1.7 Bit-fields
8.2 Argument Passing Conventions
9 APPENDIX: Support for Advanced SIMD Extensions and MVE
9.1 Introduction
9.2 SIMD vector data types
9.2.1 C++ Mangling
2 About This Document
2.1 Change Control
2.01 5th July 2005 Added clarifying remark following Additional data types – word-sized
enumeration containers are int if possible (Enumerated Types)
2.02 4th August 2005 Clarify that a callee may modify stack space used for incoming parameters.
2.03 7th October 2005 Added notes concerning VFPv3 D16-D31 (VFP register usage conventions);
retracted requirement that plain bit-fields be unsigned by default (Bit-fields
(C mappings))
2.04 4th May 2006 Clarified when linking may insert veneers that corrupt r12 and the condition
codes (Use of IP by the linker).
2.05 19th January 2007 Update for the Advanced SIMD Extension.
2.06 2nd October 2007 Add support for half-precision floating point.
Issue Date Change
B 2nd April 2008 Simplify duplicated text relating to VFP calling and clarify that homogeneous
aggregates of containerized vectors are limited to four members in calling
convention (VFP co-processor register candidates).
C 10th October 2008 Clarify that __va_list is in namespace std. Specify containers for oversized
enums. State truth values for _Bool/bool. Clarify some wording with respect
to homogeneous aggregates and argument marshalling of VFP CPRCs.
D 16th October 2009 Re-wrote Enumerated Types to better reflect the intentions for enumerated
types in ABI-complying interfaces.
E 2.09 30th November 2012 Clarify that memory passed for a function result may be modified at any
point during the function call (Result Return (base PCS)). Changed the
illustrative source name of the half-precision float type from __f16 to __fp16
to match [ACLE] (Arithmetic Types). Re-wrote APPENDIX: Support for
Advanced SIMD Extensions and MVE to clarify requirements on Advanced
SIMD types.
F 24th October 2015 SIMD vector data types, corrected the element counts of poly16x4_t and
poly16x8_t. Added [u]int64x1_t, [u]int64x2_t, poly64x2_t. Allow
half-precision floating point types as function parameter and return types, by
specifying how half-precision floating point types are passed and returned in
registers (Result Return (base PCS), Parameter Passing (base PCS),
Mapping between registers and memory format, VFP co-processor register
candidates). Added parameter passing rules for over-aligned types
(Composite Types, Parameter Passing (base PCS)).
2018Q4 21st December 2018 In Volatile bit-fields – preserving number and width of container accesses,
relaxed the rules regarding accesses to volatile bitfield members to be
compatible with the C/C++ memory model.
In Stack probing, relaxed the rules regarding stack accesses to permit stack
probing.
In VFP register usage conventions, corrected the rules regarding the values
of the IDC and IDE bits of the FPSCR register on a public interface.
2019Q4 28th January 2020 Be more specific on the use of frame pointers and frame records. (The
Frame Pointer, Machine Registers).
Add description of half-precision Brain floating-point format (Half-precision
Floating Point, Arm Alternative Format Half-precision Floating Point values,
Arithmetic Types).
For clarity, renamed half-precision format 'Alternative' to 'Arm Alternative'
(Half-precision Floating Point, Arm Alternative Format Half-precision
Floating Point values, Half-precision Format Compatibility, Mapping of C &
C++ built-in data types).
2020Q2 1st July 2020 Correct minus signs not rendering in sections Bit-field extraction
expressions and Over-sized bit-fields.
Clarify the AAPCS rules for volatile zero length bit-fields in section Volatile
bit-fields – preserving number and width of container accesses.
2021Q1 12th April 2021 Clarify what it means for a VFP CPRC argument to be correctly aligned.
2023Q1 6th April 2023 Fix formatting of v6 cell in core registers table.
2.2 References
This document refers to, or is referred to by, the following documents.
ARMARM Arm DDI 0100E, ISBN 0 201 737191. The Arm Architecture Reference Manual, 2nd edition, edited by David
Seal, published by Addison-Wesley.
https://developer.arm.com/docs/ddi0100/latest/armv5-architecture-reference-manual
1. The specifications to which an executable must conform in order to execute in a specific execution
environment. For example, the Linux ABI for the Arm Architecture.
2. A particular aspect of the specifications to which independently produced relocatable files must conform in
order to be statically linkable and executable. For example, the C++ ABI for the Arm Architecture
[CPPABI32], the Run-time ABI for the Arm Architecture [RTABI32], the C Library ABI for the Arm
Architecture [CLIBABI32].
Arm-based
based on the Arm architecture
EABI
An ABI suited to the needs of embedded (sometimes called free standing) applications.
PCS
Procedure Call Standard.
AAPCS
Procedure Call Standard for the Arm Architecture (this standard).
APCS
Arm Procedure Call Standard (obsolete).
TPCS
Thumb Procedure Call Standard (obsolete).
ATPCS
Arm-Thumb Procedure Call Standard (precursor to this standard).
PIC / PID
Position-independent code, position-independent data.
Routine / subroutine
A fragment of program to which control can be transferred that, on completing its task, returns control to its caller
at an instruction following the call. Routine is used for clarity where there are nested calls: a routine is the caller
and a subroutine is the callee.
Procedure
A routine that returns no result value.
Function
A routine that returns a result value.
Activation stack / call-frame stack
The stack of routine activation records (call frames).
Activation record / call frame
The memory used by a routine for saving registers and holding local variables (usually allocated on a stack, once
per activation of the routine).
Argument / Parameter
The terms argument and parameter are used interchangeably. They may denote a formal parameter of a routine
given the value of the actual parameter when the routine is called, or an actual parameter, according to context.
Externally visible [interface]
[An interface] between separately compiled or separately assembled routines.
Variadic routine
A routine is variadic if the number of arguments it takes, and their type, is determined by the caller instead of the
callee.
Global register
A register whose value is neither saved nor destroyed by a subroutine. The value may be updated, but only in a
manner defined by the execution environment.
Program state
The state of the program’s memory, including values in machine registers.
Scratch register / temporary register
A register used to hold an intermediate value during a calculation (usually, such values are not named in the
program source and have a limited lifetime).
Thumb-1
The variant of the Thumb instruction set introduced in Arm v4T and used in Arm v6-M and the Arm v8-M.Baseline
variants of the architecture. It consists of instructions that are predominantly encoded with 16-bit opcodes.
Thumb-2
The variant of the Thumb instruction set introduced in Arm v6T2. It consists of a mix of instructions encoded with
16- and 32-bit opcodes.
Variable register / v-register
A register used to hold the value of a variable, usually one local to a routine, and often named in the source code.
2.4 Acknowledgements
This specification has been developed with the active support of the following organizations. In alphabetical order:
Arm, CodeSourcery, Intel, Metrowerks, Montavista, Nexus Electronics, PalmSource, Symbian, Texas Instruments, and
Wind River.
3 Scope
The AAPCS defines how subroutines can be separately written, separately compiled, and separately assembled to
work together. It describes a contract between a calling routine and a called routine that defines:
• Obligations on the caller to create a program state in which the called routine may start to execute.
• Obligations on the called routine to preserve the program state of the caller across the call.
• The rights of the called routine to alter the program state of its caller.
This standard specifies the base for a family of Procedure Call Standard (PCS) variants generated by choices that
reflect alternative priorities among:
• Code size.
• Performance.
• Functionality (for example, ease of debugging, run-time checking, support for shared libraries).
Some aspects of each variant – for example the allowable use of R9 – are determined by the execution environment.
Thus:
• It is possible for code complying strictly with the base standard to be PCS compatible with each of the variants.
• It is unusual for code complying with a variant to be compatible with code complying with any other variant.
• Code complying with a variant, or with the base standard, is not guaranteed to be compatible with an execution
environment that requires those standards. An execution environment may make further demands beyond the
scope of the procedure call standard.
This specification does not standardize the representation of publicly visible C++-language entities that are not also C
language entities (these are described in CPPABI32) and it places no requirements on the representation of language
entities that are not visible across public interfaces.
4 Introduction
The AAPCS embodies the fifth major revision of the APCS and third major revision of the TPCS. It forms part of the
complete ABI specification for the Arm Architecture.
4.2 Conformance
The AAPCS defines how separately compiled and separately assembled routines can work together. There is an
externally visible interface between such routines. It is common that not all the externally visible interfaces to
software are intended to be publicly visible or open to arbitrary use. In effect, there is a mismatch between the
machine-level concept of external visibility—defined rigorously by an object code format—and a higher level,
application-oriented concept of external visibility—which is system-specific or application-specific.
Conformance to the AAPCS requires that:
• At all times, stack limits and basic stack alignment are observed (Universal stack constraints).
• At each call where the control transfer instruction is subject to a BL-type relocation at static link time, rules on the
use of IP are observed (Use of IP by the linker).
• The routines of each publicly visible interface conform to the relevant procedure call standard variant.
• The data elements of each publicly visible interface conform to the data layout rules.
5 Data Types and Alignment
5.1 Fundamental Data Types
The following table shows the fundamental data types (Machine Types) of the machine. A NULL pointer is always
represented by all-bits-zero.
Type Class  Machine Type          Byte size  Byte alignment  Note
            Signed byte           1          1
            Unsigned half-word    2          2
            Signed half-word      2          2
            Unsigned word         4          4
            Signed word           4          4
            Unsigned double-word  8          8
            Signed double-word    8          8
The first two formats are mutually exclusive. The base standard of the AAPCS specifies use of the IEEE754-2008
variant, and a procedure call variant that uses the Arm Alternative format is permitted.
• In a little-endian view of memory the least significant byte of a data object is at the lowest byte address the data
object occupies in memory.
• In a big-endian view of memory the least significant byte of a data object is at the highest byte address the data
object occupies in memory.
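[Figures: memory layout of a 32-bit data object at address M, showing the bytes at M+0 to M+3 against bits 31 (MSB) to 0 (LSB) in the big-endian and little-endian views.]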
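As an illustration of the two views, the following Python sketch lays out a 32-bit value in both byte orders using the standard int.to_bytes method. The value 0x11223344 is an arbitrary example and is not taken from the standard; index 0 of each result models the lowest byte address (M+0).

```python
# Illustrative only: 0x11223344 is an arbitrary example word.
value = 0x11223344

# Index 0 of each bytes object models the lowest byte address (M+0).
little = value.to_bytes(4, byteorder="little")
big = value.to_bytes(4, byteorder="big")

# Little-endian view: least significant byte (0x44) at the lowest address.
print(little.hex())  # 44332211
# Big-endian view: least significant byte (0x44) at the highest address.
print(big.hex())     # 11223344
```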
5.3 Composite Types
A Composite Type is a collection of one or more Fundamental Data Types that are handled as a single entity at the
procedure call level. A Composite Type can be any of:
The definitions are recursive; that is, each of the types may contain a Composite Type as a member.
• The member alignment of an element of a composite type is the alignment of that member after the application
of any language alignment modifiers to that member.
• The natural alignment of a composite type is the maximum of the member alignments of the 'top-level'
members of the composite type, i.e. before any alignment adjustment of the entire composite is applied.
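The natural-alignment rule can be sketched in a few lines of Python. This is a minimal model, not part of the standard: the function name natural_alignment is hypothetical, and each top-level member is represented only by its member alignment in bytes.

```python
# Hypothetical model: each top-level member is described only by its member
# alignment in bytes, taken after any language alignment modifiers.
def natural_alignment(member_alignments):
    """Return the maximum member alignment of the top-level members."""
    if not member_alignments:
        raise ValueError("a composite type has at least one member")
    return max(member_alignments)

# Example: members shaped like a signed byte (1), a signed word (4)
# and a signed double-word (8).
print(natural_alignment([1, 4, 8]))  # 8
```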
5.3.1 Aggregates
5.3.2 Unions
5.3.3 Arrays
5.3.4 Bit-fields
A member of an aggregate that is a Fundamental Data Type may be subdivided into bit-fields; if there are unused
portions of such a member that are sufficient to start the following member at its natural alignment then the following
member may use the unallocated portion. For the purposes of calculating the alignment of the aggregate the type of
the member shall be the Fundamental Data Type upon which the bit-field is based. The layout of bit-fields within an
aggregate is defined by the appropriate language binding (see Arm C and C++ Language Mappings).
A Homogeneous Aggregate has a Base Type, which is the Fundamental Data Type of each Element. The overall size
is the size of the Base Type multiplied by the number of Elements; its alignment will be the alignment of the Base
Type.
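A short worked example of the size and alignment statement above, written as a Python sketch. The function name and the example Base Type (8 bytes, 8-byte aligned) are assumptions chosen for illustration.

```python
# Hypothetical helper: computes the layout stated above for a
# Homogeneous Aggregate from its Base Type and element count.
def homogeneous_aggregate_layout(base_size, base_alignment, element_count):
    """Return (size, alignment) in bytes."""
    return base_size * element_count, base_alignment

# Example: three elements of an 8-byte, 8-byte-aligned Base Type.
print(homogeneous_aggregate_layout(8, 8, 3))  # (24, 8)
```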
6 The Base Procedure Call Standard
The base standard defines a machine-level, core-registers-only calling standard common to the Arm and Thumb
instruction sets. It should be used for systems where there is no floating-point hardware, or where a high degree of
inter-working with Thumb code is required.
r10 v7 Variable-register 7.
r8 v5 Variable-register 5.
r7 v4 Variable-register 4.
r6 v3 Variable-register 3.
r5 v2 Variable-register 2.
r4 v1 Variable-register 1.
The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result value
from a function. They may also be used to hold intermediate values within a routine (but, in general, only between
subroutine calls).
Register r12 (IP) may be used by a linker as a scratch register between a routine and any subroutine it calls (for
details, see Use of IP by the linker). It can also be used within a routine to hold intermediate values between
subroutine calls.
In some variants r11 (FP) may be used as a frame pointer in order to chain frame activation records into a linked list.
The role of register r9 is platform specific. A virtual platform may assign any role to this register and must document
this usage. For example, it may designate it as the static base (SB) in a position-independent data model, or it may
designate it as the thread register (TR) in an environment with thread-local storage. The usage of this register may
require that the value held is persistent across all calls. A virtual platform that has no need for such a special register
may designate r9 as an additional callee-saved variable register, v6.
Typically, the registers r4-r8, r10 and r11 (v1-v5, v7 and v8) are used to hold the values of a routine’s local variables.
Of these, only v1-v4 can be used uniformly by the whole Thumb instruction set, but the AAPCS does not require that
Thumb code only use those registers.
A subroutine must preserve the contents of the registers r4-r8, r10, r11 and SP (and r9 in PCS variants that designate
r9 as v6).
In all variants of the procedure call standard, registers r12-r15 have special roles. In these roles they are labeled IP,
SP, LR and PC.
The CPSR is a global register with the following properties:
• The N, Z, C, V and Q bits (bits 27-31) and the GE[3:0] bits (bits 16-19) are undefined on entry to or return from a
public interface. The Q and GE[3:0] bits may only be modified when executing on a processor where these
features are present.
• On Arm Architecture 6, the E bit (bit 8) can be used in applications executing in little-endian mode, or in
big-endian-8 mode to temporarily change the endianness of data accesses to memory. An application must have
a designated endianness and at entry to and return from any public interface the setting of the E bit must match
the designated endianness of the application.
• The T bit (bit 5) and the J bit (bit 24) are the execution state bits. Only instructions designated for modifying
these bits may change them.
• The A, I, F and M[4:0] bits (bits 0-7) are the privileged bits and may only be modified by applications designed to
operate explicitly in a privileged mode.
• All other bits are reserved and must not be modified. It is not defined whether the bits read as zero or one, or
whether they are preserved across a public interface.
• A double-word sized type is passed in two consecutive registers (e.g., r0 and r1, or r2 and r3). The content of the
registers is as if the value had been loaded from memory representation with a single LDM instruction.
• A 128-bit containerized vector is passed in four consecutive registers. The content of the registers is as if the
value had been loaded from memory with a single LDM instruction.
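The "as if loaded with a single LDM" wording for double-word values can be modelled with the standard struct module. The sketch below assumes that a load-multiple fills the lower-numbered register from the lower address; the function name and the example value are hypothetical.

```python
import struct

def double_word_in_registers(value, little_endian=True):
    """Model a 64-bit value held in two consecutive registers.

    The value is written in its memory representation, then two consecutive
    words are read back, lowest address first. The first word models the
    lower-numbered register (for example r0), the second the next (r1).
    """
    order = "<" if little_endian else ">"
    memory = struct.pack(order + "Q", value)
    first_reg, second_reg = struct.unpack(order + "II", memory)
    return first_reg, second_reg

value = 0x1122334455667788  # arbitrary example value
print([hex(r) for r in double_word_in_registers(value, little_endian=True)])
# ['0x55667788', '0x11223344']
print([hex(r) for r in double_word_in_registers(value, little_endian=False)])
# ['0x11223344', '0x55667788']
```

In this model the lower-numbered register holds the least significant word in the little-endian view and the most significant word in the big-endian view.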
Note
Even though co-processor registers are not used for passing arguments, some elements of the run-time support for
a language may require knowledge of all co-processors in use in an application in order to function correctly (for
example, setjmp() in C and exceptions in C++).
• The condition code bits (28-31), the cumulative saturation (QC) bit (27) and the cumulative exception-status bits
(0-4 and 7) are not preserved across a public interface.
• The exception-control bits (8-12 and 15), rounding mode bits (22-23) and flush-to-zero bit (24) may be modified
by calls to specific support functions that affect the global state of the application.
• The length bits (16-18) must be 0b100 when using M-profile Vector Extension, 0b000 when using VFP vector
mode and otherwise preserved across a public interface.
• The stride bits (20-21) must be zero on entry to and return from a public interface.
• All other bits are reserved and must not be modified. It is not defined whether the bits read as zero or one, or
whether they are preserved across a public interface.
• The VPT mask bits (16-23) must be zero on entry to and return from a public interface.
• The predication bits (0-15) are not preserved across a public interface.
• All other bits are reserved and must not be modified. It is not defined whether the bits read as zero or one, or
whether they are preserved across a public interface.
• code (the program being executed), which must be readable, but need not be writable, by the process.
• read-only static data.
• writable static data.
• the heap.
• the stack.
Writable static data may be further sub-divided into initialized, zero-initialized and uninitialized data. Except for the
stack there is no requirement for each class of memory to occupy a single contiguous region of memory. A process
must always have some code and a stack, but need not have any of the other categories of memory.
The heap is an area (or areas) of memory that are managed by the process itself (for example, with the C malloc
function). It is typically used for the creation of dynamic data objects.
A conforming program must only execute instructions that are in areas of memory designated to contain code.
• Stack-limit ≤ SP ≤ stack-base. The stack pointer must lie within the extent of the stack.
• SP mod 4 = 0. The stack must at all times be aligned to a word boundary.
• A process may only store data in the closed interval of the entire stack delimited by [SP, stack base - 1] (where
SP is the value of register r13).
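The three constraints above can be expressed as simple predicates. The following Python sketch is a model under stated assumptions: the function names, parameter names and the example addresses are all hypothetical, and addresses are treated as plain integers.

```python
def check_stack_constraints(sp, stack_limit, stack_base):
    """Return a list of violated constraints (empty if none)."""
    violations = []
    if not (stack_limit <= sp <= stack_base):
        violations.append("SP outside [stack-limit, stack-base]")
    if sp % 4 != 0:
        violations.append("SP not word aligned")
    return violations

def may_store(address, size, sp, stack_base):
    """True if a store of `size` bytes lies within [SP, stack_base - 1]."""
    return size > 0 and sp <= address and address + size - 1 <= stack_base - 1

# Arbitrary example stack extent.
STACK_LIMIT, STACK_BASE = 0x20000000, 0x20001000

print(check_stack_constraints(0x20000FF0, STACK_LIMIT, STACK_BASE))  # []
print(check_stack_constraints(0x20000FF2, STACK_LIMIT, STACK_BASE))
# ['SP not word aligned']
print(may_store(0x20000FEC, 4, sp=0x20000FF0, stack_base=STACK_BASE))  # False
```

The last call shows a store below SP, which falls outside the permitted interval even though the address is within the overall stack extent.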
Note
This implies that instructions of the following form can fail to satisfy the stack discipline constraints, even when reg
points within the extent of the stack.
If execution of the instruction is interrupted after sp has been loaded, the stack extent will not be restored, so
restarting the instruction might violate the third constraint.
6.2.1.4 The Frame Pointer
A platform may require the construction of a list of stack frames describing the current call hierarchy in a program.
Each frame shall link to the frame of its caller by means of a Frame Record of two 32-bit values on the stack. The
frame record for the innermost frame (belonging to the most recent routine invocation) shall be pointed to by the
Frame Pointer register (FP). The lowest addressed word shall point to the previous frame record and the highest
addressed word shall contain the value passed in LR on entry to the current function. The end of the frame record
chain is indicated by the address zero in the address for the previous frame. The location of the frame record within a
stack frame is not specified. The frame pointer register must not be updated until the new frame record has been fully
constructed.
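To make the frame record layout concrete, the following Python sketch walks a chain of frame records held in a simulated memory. The dictionary-based memory model, the function name and all addresses are hypothetical; the sketch relies only on the layout described above (lowest addressed word links to the previous record, highest addressed word holds the LR value, and a zero link ends the chain).

```python
def walk_frame_chain(memory, fp):
    """Yield the saved LR of each frame record, innermost frame first.

    `memory` is a hypothetical dict mapping word addresses to 32-bit values.
    """
    while fp != 0:
        previous_record = memory[fp]   # lowest addressed word of the record
        saved_lr = memory[fp + 4]      # highest addressed word of the record
        yield saved_lr
        fp = previous_record

# Three frame records at arbitrary example addresses.
memory = {
    0x1000: 0x1010, 0x1004: 0x8004,  # innermost record
    0x1010: 0x1020, 0x1014: 0x8100,
    0x1020: 0x0000, 0x1024: 0x8200,  # outermost record: link is zero
}
print([hex(lr) for lr in walk_frame_chain(memory, fp=0x1000)])
# ['0x8004', '0x8100', '0x8200']
```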
Note
There will always be a short period during construction or destruction of each frame record during which the frame
pointer will point to the caller’s record.
A platform shall mandate the minimum level of conformance with respect to the maintenance of frame records. The
options are, in decreasing level of functionality:
• It may require the frame pointer to address a valid frame record at all times, except that small subroutines which
do not modify the link register may elect not to create a frame record
• It may require the frame pointer to address a valid frame record at all times, except that any subroutine may elect
not to create a frame record
• It may permit the frame pointer register to be used as a general-purpose callee-saved register, but provide a
platform-specific mechanism for external agents to reliably locate the chain of frame records
• It may elect not to maintain a frame chain and to use the frame pointer register as a general-purpose
callee-saved register.
Note
Unlike the APCS and its variants, the same frame pointer register is used for both the Arm and Thumb ISAs
(including the Thumb-1 variant). This ensures that the frame chain can be constructed even when generating code
that interworks between the Arm and Thumb instruction sets. It is expected that Thumb-1 code will rarely, if
ever, want to create stack frames. The choice of a high register therefore ensures that such code can conform
minimally to the requirements of having a valid value stored in the frame pointer register without noticeably reducing
the number of registers available to normal code.
The AAPCS does not specify where, within a function's stack frame record, the frame chain data structure resides.
This permits implementors the freedom to use whatever location will result in the most efficient code needed to
establish the frame chain record. As a result, even in Thumb-1, the overhead for establishing the frame will rarely
exceed three additional instructions in the function entry sequence and two additional instructions in the return
sequence.
A subroutine call can be synthesized by any instruction sequence that has the effect:
For example, in Arm-state, to call a subroutine addressed by r4 with control returning to the following instruction, do
MOV LR, PC
BX r4
...
Note
The equivalent sequence will not work from Thumb state because the instruction that sets LR does not copy the
Thumb-state bit to LR[0].
In Arm Architecture v5 both Arm and Thumb state provide a BLX instruction that calls a subroutine addressed by a
register and correctly sets the return address to the sequentially next value of the program counter.
Note
• A Half-precision Floating Point Type is returned in the least significant 16 bits of r0.
• A Fundamental Data Type that is smaller than 4 bytes is zero- or sign-extended to a word and returned in r0.
• A word-sized Fundamental Data Type (e.g., int, float) is returned in r0.
• A double-word sized Fundamental Data Type (e.g., long long, double and 64-bit containerized vectors) is
returned in r0 and r1.
• A 128-bit containerized vector is returned in r0-r3.
• A Composite Type not larger than 4 bytes is returned in r0. The format is as if the result had been stored in
memory at a word-aligned address and then loaded into r0 with an LDR instruction. Any bits in r0 that lie outside
the bounds of the result have unspecified values.
• A Composite Type larger than 4 bytes, or whose size cannot be determined statically by both caller and callee, is
stored in memory at an address passed as an extra argument when the function was called (Parameter Passing
(base PCS), Rule A.4). The memory to be used for the result may be modified at any point during the function
call.
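The result-return rules listed above can be summarized as a small decision function. This Python sketch is illustrative only: the function name, the `kind` labels and the returned strings are hypothetical, and it covers only the cases stated in the list.

```python
def result_location(kind, size=None):
    """Return where a result is placed under the rules listed above.

    kind: 'half', 'fundamental', 'vector128' or 'composite' (labels are
    assumptions of this sketch). size: size in bytes, or None when it
    cannot be determined statically by both caller and callee.
    """
    if kind == "half":
        return "r0 (least significant 16 bits)"
    if kind == "vector128":
        return "r0-r3"
    if kind == "fundamental":
        if size < 4:
            return "r0 (zero- or sign-extended to a word)"
        if size == 4:
            return "r0"
        if size == 8:
            return "r0 and r1"
        raise ValueError("unsupported fundamental size in this sketch")
    if kind == "composite":
        if size is not None and size <= 4:
            return "r0"
        return "memory (address passed as an extra argument)"
    raise ValueError("unknown kind")

print(result_location("fundamental", 8))  # r0 and r1
print(result_location("composite", 12))   # memory (address passed as an extra argument)
```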
The mapping from the source language onto the machine type is specific for each language and is described
separately (the C and C++ language bindings are described in Arm C and C++ Language Mappings). The result is an
ordered list of arguments that are to be passed to the subroutine.
In the following description there are assumed to be a number of co-processors available for passing and receiving
arguments. The co-processor registers are divided into different classes. An argument may be a candidate for at most
one co-processor register class. An argument that is suitable for allocation to a co-processor register is known as a
Co-processor Register Candidate (CPRC).
In the base standard there are no arguments that are candidates for a co-processor register class.
A variadic function is always marshaled as for the base standard.
For a caller, sufficient stack space to hold stacked arguments is assumed to have been allocated prior to marshaling;
in practice the amount of stack space required cannot be known until after the argument marshaling has been
completed. A callee can modify any stack space used for receiving parameter values from the caller.
When a Composite Type argument is assigned to core registers (either fully or partially), the behavior is as if the
argument had been stored to memory at a word-aligned (4-byte) address and then loaded into consecutive registers
using a suitable load-multiple instruction.
Stage A – Initialization
This stage is performed exactly once, before processing of the arguments commences.
A.3 The next stacked argument address (NSAA) is set to the current stack-pointer value
(SP).
A.4 If the subroutine is a function that returns a result in memory, then the address for the
result is placed in r0 and the NCRN is set to r1.
B.1 If the argument is a Composite Type whose size cannot be statically determined by
both the caller and callee, the argument is copied to memory and the argument is
replaced by a pointer to the copy.
B.2 If the argument is an integral Fundamental Data Type that is smaller than a word,
then it is zero- or sign-extended to a full word and its size is set to 4 bytes. If the
argument is a Half-precision Floating Point Type its size is set to 4 bytes as if it had
been copied to the least significant bits of a 32-bit register and the remaining bits
filled with unspecified values.
B.3.cp If the argument is a CPRC then any preparation rules for that co-processor register
class are applied.
B.4 If the argument is a Composite Type whose size is not a multiple of 4 bytes, then its
size is rounded up to the nearest multiple of 4.
B.5 If the argument is an alignment adjusted type its value is passed as a copy of the
actual value. The copy will have an alignment defined as follows.
• For a Fundamental Data Type, the alignment is the natural alignment of that
type, after any promotions.
• For a Composite Type, the copy will have 4-byte alignment if its natural
alignment is ≤ 4 and 8-byte alignment if its natural alignment is ≥ 8.
The alignment of the copy is used for applying marshaling rules.
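The size and alignment adjustments made by rules B.2, B.4 and B.5 can be illustrated with a few helper functions. This is a minimal sketch only: the function names are hypothetical, sizes and alignments are in bytes, and the alignment helper covers only the Composite Type case of B.5.

```c
#include <assert.h>
#include <stdint.h>

/* B.2 / B.4: round a size in bytes up to the next multiple of 4. */
static unsigned stage_b_padded_size(unsigned size)
{
    assert(size > 0u);
    return (size + 3u) & ~3u;
}

/* B.2: sign-extend a signed 8-bit value to a full word. */
static uint32_t stage_b_widen_s8(int8_t v)
{
    return (uint32_t)(int32_t)v;
}

/* B.2: zero-extend an unsigned 8-bit value to a full word. */
static uint32_t stage_b_widen_u8(uint8_t v)
{
    return (uint32_t)v;
}

/* B.5 (Composite Types only): alignment of the copy that is passed. */
static unsigned stage_b_composite_copy_alignment(unsigned natural_alignment)
{
    assert(natural_alignment > 0u);
    return natural_alignment >= 8u ? 8u : 4u;
}
```

For example, a 6-byte structure is treated as 8 bytes for marshaling, and a structure with 16-byte natural alignment is passed as a copy with 8-byte alignment.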
C.1.cp If the argument is a CPRC and there are sufficient unallocated co-processor registers
of the appropriate class, the argument is allocated to co-processor registers.
C.2.cp If the argument is a CPRC then any co-processor registers in that class that are
unallocated are marked as unavailable. The NSAA is adjusted upwards until it is
correctly aligned for the argument and the argument is copied to the memory at the
adjusted NSAA. The NSAA is further incremented by the size of the argument. The
argument has now been allocated.
C.3 If the argument requires double-word alignment (8-byte), the NCRN is rounded up to
the next even register number.
C.4 If the size in words of the argument is not more than r4 minus NCRN, the argument is
copied into core registers, starting at the NCRN. The NCRN is incremented by the
number of registers used. Successive registers hold the parts of the argument they
would hold if its value were loaded into those registers from memory using an LDM
instruction. The argument has now been allocated.
C.5 If the NCRN is less than r4 and the NSAA is equal to the SP, the argument is split
between core registers and the stack. The first part of the argument is copied into the
core registers starting at the NCRN up to and including r3. The remainder of the
argument is copied onto the stack, starting at the NSAA. The NCRN is set to r4 and
the NSAA is incremented by the size of the argument minus the amount passed in
registers. The argument has now been allocated.
C.7 If the argument required double-word alignment (8-byte), then the NSAA is rounded
up to the next double-word address.
C.8 The argument is copied to memory at the NSAA. The NSAA is incremented by the
size of the argument.
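The core-register and stack rules of Stage C can be traced with a small model. The following is a minimal sketch, not a reference implementation. The names pcs_state, pcs_loc, pcs_init and pcs_alloc are hypothetical. The NSAA is modelled as a byte offset from the SP at the call, so the test "NSAA is equal to the SP" becomes an offset of zero. Arguments are assumed to have been through Stage B already, and co-processor register candidates are ignored. The step labelled C.6 does not appear in the rules above; it is an assumption taken from the full standard.

```c
#include <assert.h>

typedef struct {
    unsigned ncrn;   /* next core register number, 0..4 (4 means none left) */
    unsigned nsaa;   /* next stacked argument address, as an offset from SP */
} pcs_state;

typedef struct {
    int first_reg;        /* first core register used, or -1 */
    unsigned reg_words;   /* number of core registers used */
    int stack_offset;     /* offset from SP of the stacked part, or -1 */
    unsigned stack_bytes; /* number of bytes placed on the stack */
} pcs_loc;

/* Stage A: rule A.4 reserves r0 for the address of a result in memory. */
static pcs_state pcs_init(int result_in_memory)
{
    pcs_state st;
    st.ncrn = result_in_memory ? 1u : 0u;
    st.nsaa = 0u;
    return st;
}

/* Stage C for one argument; size is in bytes, align is 4 or 8. */
static pcs_loc pcs_alloc(pcs_state *st, unsigned size, unsigned align)
{
    pcs_loc loc = { -1, 0u, -1, 0u };
    unsigned words = size / 4u;

    assert(size > 0u && size % 4u == 0u);
    assert(align == 4u || align == 8u);

    if (align == 8u)                                /* C.3 */
        st->ncrn = (st->ncrn + 1u) & ~1u;

    if (words <= 4u - st->ncrn) {                   /* C.4 */
        loc.first_reg = (int)st->ncrn;
        loc.reg_words = words;
        st->ncrn += words;
        return loc;
    }
    if (st->ncrn < 4u && st->nsaa == 0u) {          /* C.5 */
        loc.first_reg = (int)st->ncrn;
        loc.reg_words = 4u - st->ncrn;
        loc.stack_offset = 0;
        loc.stack_bytes = size - 4u * loc.reg_words;
        st->ncrn = 4u;
        st->nsaa += loc.stack_bytes;
        return loc;
    }
    st->ncrn = 4u;                                  /* C.6 (assumed) */
    if (align == 8u)                                /* C.7 */
        st->nsaa = (st->nsaa + 7u) & ~7u;
    loc.stack_offset = (int)st->nsaa;               /* C.8 */
    loc.stack_bytes = size;
    st->nsaa += size;
    return loc;
}
```

Under these assumptions, a call f(int, double, int) places the first int in r0, the double in r2 and r3 (r1 is skipped by C.3), and the last int at stack offset 0.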
It should be noted that the above algorithm makes provision for languages other than C and C++ in that it provides for
passing arrays by value and for passing arguments of dynamic size. The rules are defined in a way that allows the
caller to be always able to statically determine the amount of stack space that must be allocated for arguments that are
not passed in registers, even if the function is variadic.
Several further observations can also be made:
• The initial stack slot address is the value of the stack pointer that will be passed to the subroutine. It may
therefore be necessary to run through the above algorithm twice during compilation, once to determine the
amount of stack space required for arguments and a second time to assign final stack slot addresses.
• A double-word aligned type will always start in an even-numbered core register, or at a double-word aligned
address on the stack even if it is not the first member of an aggregate.
• Arguments are allocated first to registers and only excess arguments are placed on the stack.
• Arguments that are Fundamental Data Types can either be entirely in registers or entirely on the stack.
• At most one argument can be split between registers and memory according to Rule C.5.
• CPRCs may be allocated to co-processor registers or the stack – they may never be allocated to core registers.
• Since an argument may be a candidate for at most one class of co-processor register, then the rules for multiple
co-processors (should they be present) may be applied in any order without affecting the behavior.
• An argument may only be split between core registers and the stack if all preceding CPRCs have been allocated
to co-processor registers.
6.6 Interworking
The AAPCS requires that all subroutine call and return sequences support inter-working between Arm and Thumb
states. The implications for compiling for various Arm Architectures are as follows.
Arm v5 and Arm v6
Calls via function pointers should use one of the following, as appropriate:
Calls to functions that use bl<cond>, b, or b<cond> will need a linker-generated veneer if a state change is required,
so it may sometimes be more efficient to use a sequence that permits use of an unconditional bl instruction.
Return sequences may use load-multiple operations that directly load the PC or a suitable bx instruction.
The following traditional return must not be used if inter-working might be required.
    mov pc, Rm
Arm v4T
In addition to the constraints for Arm v5, the following additional restrictions apply to Arm v4T.
Calls using bl that involve a state change also require a linker-generated stub.
Calls via function pointers must use a sequence equivalent to the Arm-state code
    mov lr, pc
    bx Rm
However, this sequence does not work for Thumb state, so usually a bl to a veneer that does the bx instruction must
be used.
Return sequences must restore any saved registers and then use a bx instruction to return to the caller.
Arm v4
The Arm v4 Architecture supports neither Thumb state nor the bx instruction; it is therefore not strictly compatible
with the AAPCS.
It is recommended that code for Arm v4 be compiled using Arm v4T inter-working sequences but with all bx
instructions subject to relocation by an R_ARM_V4BX relocation [AAELF32]. A linker linking for Arm v4 can then
change all instances of:
    bx Rm
into:
    mov pc, Rm
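At the instruction level, the substitution amounts to rewriting one Arm-state encoding into another while keeping the condition field and the register Rm. The following is a minimal sketch of that rewrite only. The function names are hypothetical, and the actual processing of R_ARM_V4BX is defined by [AAELF32], not by this example.

```c
#include <assert.h>
#include <stdint.h>

/* True if insn is an Arm-state "bx Rm" with any condition code. */
static int is_arm_bx(uint32_t insn)
{
    return (insn & 0x0FFFFFF0u) == 0x012FFF10u;
}

/* Rewrite "bx<cond> Rm" as "mov<cond> pc, Rm".
 * Bits 31:28 (the condition) and bits 3:0 (Rm) are preserved. */
static uint32_t arm_bx_to_mov_pc(uint32_t insn)
{
    assert(is_arm_bx(insn));
    return (insn & 0xF000000Fu) | 0x01A0F000u;
}
```

For example, the encoding of bx lr (0xE12FFF1E) becomes that of mov pc, lr (0xE1A0F00E).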
7 The Standard Variants
This section applies only to non-variadic functions. For a variadic function the base standard is always used both for
argument passing and result return.
• A half precision floating point type is passed as if it were loaded from its memory format into the least significant
16 bits of a single precision register.
• A single precision floating point type is passed as if it were loaded from its memory format into a single precision
register with VLDR.
• A double precision floating point type is passed as if it were loaded from its memory format into a double
precision register with VLDR.
• A 64-bit containerized vector type is passed as if it were loaded from its memory format into a 64-bit vector
register (Dn) with VLDR.
• A 128-bit containerized vector type is passed as if it were loaded from its memory format into a 128-bit vector
register (Qn) with a single VLDM of the two component 64-bit vector registers (for example, VLDM r0,{d2,d3}
would load q1).
Note
7.1.2.2 Result return
Any result whose type would satisfy the conditions for a VFP CPRC is returned in the appropriate number of
consecutive VFP registers starting with the lowest numbered register (s0, d0, q0).
All other types are returned as for the base standard.
C.1.vfp If the argument is a VFP CPRC and there are sufficient consecutive VFP registers of the appropriate
type unallocated then the argument is allocated to the lowest-numbered sequence of such registers.
C.2.vfp If the argument is a VFP CPRC then any VFP registers that are unallocated are marked as
unavailable. The NSAA is rounded up to the next multiple of 4 if the natural alignment of the
argument is ≤ 4 or the next multiple of 8 if its natural alignment is ≥ 8 and the argument is copied to
the stack at the adjusted NSAA. The NSAA is further incremented by the size of the argument. The
argument has now been allocated.
Note that the rules require the ‘back-filling’ of unused co-processor registers that are skipped by the alignment
constraints of earlier arguments. The back-filling continues only so long as no VFP CPRC has been allocated to a slot
on the stack.
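Back-filling can be illustrated by modelling the VFP argument registers as a bitmap. The following is a minimal sketch under stated assumptions: the argument registers are taken to be s0-s15 (d0-d7, q0-q3), as in the full standard; the names vfp_state and vfp_alloc are hypothetical; the NSAA is tracked as an offset and only for VFP CPRCs; and the stack alignment is assumed to be 4 bytes for single-precision candidates and 8 bytes otherwise.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t used;   /* bit n set: s<n> is allocated or unavailable */
    unsigned nsaa;   /* next stacked argument address, as an offset */
} vfp_state;

/* Allocate `count` registers, each `unit` single-precision registers wide
 * (1 = S, 2 = D, 4 = Q). Returns the number of the first S register used,
 * or -1 if the argument went to the stack (offset stored in *stack_off). */
static int vfp_alloc(vfp_state *st, unsigned unit, unsigned count,
                     unsigned *stack_off)
{
    unsigned span = unit * count;
    unsigned base, align;

    assert(unit == 1u || unit == 2u || unit == 4u);
    assert(count >= 1u && span <= 16u);
    assert(stack_off != 0);

    /* C.1.vfp: lowest-numbered free, suitably aligned run of registers. */
    for (base = 0u; base + span <= 16u; base += unit) {
        uint16_t mask = (uint16_t)(((1u << span) - 1u) << base);
        if ((st->used & mask) == 0u) {
            st->used |= mask;
            return (int)base;
        }
    }

    /* C.2.vfp: no more VFP registers may be used after this point. */
    st->used = 0xFFFFu;
    align = (unit == 1u) ? 4u : 8u;
    st->nsaa = (st->nsaa + align - 1u) & ~(align - 1u);
    *stack_off = st->nsaa;
    st->nsaa += 4u * span;
    return -1;
}
```

With arguments (float, double, float) the model assigns s0, then d1 (s2 and s3), then back-fills s1. Once any VFP CPRC has gone to the stack, later candidates also go to the stack even if a register had been skipped earlier.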
7.4.2 RWPI and Base Standard Compatibility
Code compiled for the base standard is compatible with the RWPI calling standard if it makes no use of register r9.
However, a platform ABI may restrict further the subset of code that is usefully compatible.
8 Arm C and C++ Language Mappings
This section describes how Arm compilers map C language features onto the machine-level standard. To the extent
that C++ is a superset of the C language it also describes the mapping of C++ language features.
C/C++ Type Machine Type Notes
long double _Complex 2 double precision (IEEE 754) C99 Only. Layout is
The preferred type of wchar_t is unsigned int. However, a virtual platform may elect to use unsigned short
instead. A platform standard must document its choice.
• An enumerated type normally occupies a word (int or unsigned int). If a word cannot represent all of its
enumerated values the type occupies a double word (long long or unsigned long long).
• The type of the storage container for an enumerated type is the smallest integer type that can contain all of its
enumerated values.
When both the signed and unsigned versions of an integer type can represent all values, this ABI recommends that
the unsigned type should be preferred (in line with common practice).
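The two container choices described in the bullets above can be compared with a small size calculation. This is a minimal sketch: the function names are hypothetical, the enumerator range is given as a signed minimum and an unsigned maximum, and only the container size in bytes is computed (the preference for the unsigned type does not affect the size).

```c
#include <assert.h>

/* True if [min, max] fits a signed integer of the given width. */
static int enum_fits_signed(long long min, unsigned long long max,
                            unsigned bits)
{
    assert(bits == 8u || bits == 16u || bits == 32u);
    return min >= -(1LL << (bits - 1u)) &&
           max <= (1ULL << (bits - 1u)) - 1ULL;
}

/* True if [min, max] fits an unsigned integer of the given width. */
static int enum_fits_unsigned(long long min, unsigned long long max,
                              unsigned bits)
{
    assert(bits == 8u || bits == 16u || bits == 32u);
    return min >= 0 && max <= (1ULL << bits) - 1ULL;
}

/* First choice: a word, or a double word if a word is not enough. */
static unsigned enum_size_word(long long min, unsigned long long max)
{
    if (enum_fits_unsigned(min, max, 32u) || enum_fits_signed(min, max, 32u))
        return 4u;
    return 8u;
}

/* Second choice: the smallest integer type that holds all values. */
static unsigned enum_size_smallest(long long min, unsigned long long max)
{
    unsigned bits;
    for (bits = 8u; bits <= 32u; bits *= 2u) {
        if (enum_fits_unsigned(min, max, bits) ||
            enum_fits_signed(min, max, bits))
            return bits / 8u;
    }
    return 8u;
}
```

An enumeration with values 0 to 255 occupies 4 bytes under the first choice and 1 byte under the second, which is why a platform standard has to state which one it uses.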
Discussion
The definition of enumerated types in the C and C++ language standards does not define a binary interface and leaves
open the following questions.
• Does the container for an enumerated type have a fixed size (as expected in most OS environments) or is the
size no larger than needed to hold the values of the enumeration (as expected by most embedded users)?
• What happens when a (strictly, non-conforming) enumerated value (e.g. MAXINT+1) overflows a fixed-size (e.g.
int) container?
• Is a value of enumerated type (after any conversion required by C/C++) signed or unsigned?
In relation to the last question the C and C++ language standards state:
• [C] Each enumerated type shall be compatible with an integer type. The choice of type is
implementation-defined, but shall be capable of representing the values of all the members of the enumeration.
• [C++] An enumerated type is not an integral type but ... An rvalue of... enumeration type (7.2) can be converted
to an rvalue of the first of the following types that can represent all the values of its underlying type: int,
unsigned int, long, or unsigned long.
Under this ABI, these statements allow a header file that describes the interface to a portable binary package to force
its clients, in a portable, strictly-conforming manner, to adopt a 32-bit signed (int/long) representation of values of
enumerated type (by defining a negative enumerator, a positive one, and ensuring the range of enumerators spans
more than 16 bits but not more than 32).
Otherwise, a common interpretation of the binary representation must be established by appealing to a platform ABI or
a separate interface contract.
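The technique described above for forcing a 32-bit signed representation might look as follows in a header. This is a hedged sketch: the enumeration pkg_status and its enumerators are hypothetical names, and the helper function only checks the property on whatever compiler builds it.

```c
#include <assert.h>

enum pkg_status {
    PKG_STATUS_MIN = -1,       /* negative enumerator: container must be signed */
    PKG_STATUS_OK = 0,
    PKG_STATUS_MAX = 0x10000   /* needs more than 16 bits, fits in 32 */
};

/* Returns 1 if values of the enumeration behave as a 32-bit signed type. */
static int pkg_status_is_signed_word(void)
{
    enum pkg_status lowest = PKG_STATUS_MIN;
    assert(PKG_STATUS_MAX > 0xFFFF);   /* the range really exceeds 16 bits */
    return sizeof(enum pkg_status) == 4 && lowest < 0;
}
```

Because the range needs a sign bit and more than 16 value bits, both the fixed-size and the smallest-container interpretations arrive at a 32-bit signed container.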
va_list
Machine type: struct __va_list { void *__ap; }
Notes: A va_list may address any object in a parameter list. Consequently, the first object addressed may only have
word alignment (all objects are at least word aligned), but any double-word aligned object will appear at the correct
double-word alignment in memory. In C++, __va_list is in namespace std.
The behavior of assigning to or from an entire structure or union that contains volatile-qualified members is undefined.
Likewise, the behavior is undefined if a cast is used to change either the qualification or the size of the type.
Not all Arm architectures provide for access to types of all widths; for example, prior to Arm Architecture 4 there were
no instructions to access a 16-bit quantity, and similar issues apply to accessing 64-bit quantities. Further, the memory
system underlying the processor may have a restricted bus width to some or all of memory. The only guarantees
applying to volatile types in these circumstances are that each byte of the type shall be accessed exactly once for
each access mandated above, and that any bytes containing volatile data that lie outside the type shall not be
accessed. Nevertheless, if the compiler has an instruction available that will access the type exactly it should use it in
preference to smaller or larger accesses.
8.1.7 Bit-fields
A bit-field may have any integral type (including enumerated and bool types).
A sequence of bit-fields is laid out in the order declared using the rules below.
For each bit-field, the type of its container is:
• Its declared type if its size is no larger than the size of its declared type.
• The largest integral type no larger than its size if its size is larger than the size of its declared type (see
Over-sized bit-fields).
The container type contributes to the alignment of the containing aggregate in the same way a plain (not bit-field)
member of that type would, without exception for zero-sized or anonymous bit-fields.
Note
The C++ standard states that an anonymous bit-field is not a member, so it is unclear whether or not an anonymous
bit-field of non-zero size should contribute to an aggregate’s alignment. Under this ABI it does.
The content of each bit-field is contained by exactly one instance of its container type.
Initially, we define the layout of fields that are no bigger than their container types.
CA(F) = &(container(F));
This address will always be at the natural alignment of the container type, that is
CA(F) % sizeof(container(F)) == 0.
• For big-endian data types K(F) is the offset from the most significant bit of the container to the most significant
bit of the bit-field.
• For little-endian data types K(F) is the offset from the least significant bit of the container to the least significant
bit of the bit-field.
A bit-field can be extracted by loading its container, shifting and masking by amounts that depend on the byte order,
K(F), the container size, and the field width, then sign extending if needed.
The bit-address of F, BA(F), can now be defined as
BA(F) = CA(F) * 8 + K(F)
For a bit address BA falling in a container of width C and alignment A (≤ C) (both expressed in bits), define the
unallocated container bits (UCB) to be
UCB(BA, C, A) = C - (BA % A)
TRUNCATE(X,Y) = Y * floor(X/Y)
NCBA(BA, A) = TRUNCATE(BA + A - 1, A)
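The three definitions above translate directly into integer arithmetic. The following is a minimal sketch with hypothetical function names; all quantities are in bits, and integer division stands in for floor because the operands are non-negative.

```c
#include <assert.h>

/* UCB(BA, C, A) = C - (BA % A): unallocated bits left in the container. */
static unsigned long bf_ucb(unsigned long ba, unsigned long c, unsigned long a)
{
    assert(a > 0ul && a <= c);
    return c - (ba % a);
}

/* TRUNCATE(X, Y) = Y * floor(X / Y): round X down to a multiple of Y. */
static unsigned long bf_truncate(unsigned long x, unsigned long y)
{
    assert(y > 0ul);
    return y * (x / y);
}

/* NCBA(BA, A) = TRUNCATE(BA + A - 1, A): round BA up to a multiple of A. */
static unsigned long bf_ncba(unsigned long ba, unsigned long a)
{
    assert(a > 0ul);
    return bf_truncate(ba + a - 1ul, a);
}
```

For a 32-bit container with 32-bit alignment and a current bit address of 35, there are 29 unallocated bits left in the container and the next container starts at bit address 64.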
Note
The AAPCS does not allow exported interfaces to contain packed structures or bit-fields. However a scheme for
laying out packed bit-fields can be achieved by reducing the alignment, A, in the above rules to below that of the
natural container type. ARMCC uses an alignment of A=8 in these cases, but GCC uses an alignment of A=1.
• Load the (naturally aligned) container at byte address TRUNCATE(BA(F), C) / 8 into a register R (or two
registers if the container is 64-bits)
• Set Q = MAX(32, C)
• Little-endian, set R = (R << ((Q - W) - (BA MOD C))) >> (Q - W).
• Big-endian, set R = (R << (BA MOD C)) >> (Q - W).
The long long bit-fields use shifting operations on 64-bit quantities; it may often be the case that these expressions can
be simplified to use operations on a single 32-bit quantity (but see Volatile bit-fields – preserving number and width of
container accesses).
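The little-endian shift sequence can be written out for containers of up to 32 bits. This is a minimal sketch with hypothetical function names. It assumes the container has already been loaded into a 32-bit register image r (zero-extended if the container is narrower), so Q is 32, and it does not cover the big-endian or 64-bit container cases.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned extraction: R = (R << ((Q - W) - (BA MOD C))) >> (Q - W). */
static uint32_t bf_extract_le_u(uint32_t r, unsigned ba, unsigned c, unsigned w)
{
    unsigned q = 32u;   /* Q = MAX(32, C) with C <= 32 */

    assert(c == 8u || c == 16u || c == 32u);
    assert(w >= 1u && (ba % c) + w <= c);
    return (r << ((q - w) - (ba % c))) >> (q - w);
}

/* Signed extraction: same shifts, then sign-extend from bit W-1.
 * The sign extension is done arithmetically so that the result does not
 * depend on how the compiler shifts negative values. */
static int32_t bf_extract_le_s(uint32_t r, unsigned ba, unsigned c, unsigned w)
{
    uint32_t v = bf_extract_le_u(r, ba, c, w);
    uint32_t sign = 1u << (w - 1u);
    return (int32_t)((int64_t)(v ^ sign) - (int64_t)sign);
}
```

For example, an 8-bit field at bit offset 8 of a container holding 0x12345678 extracts as 0x56.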
• Selecting a new container width C' which is the width of the fundamental integer data type with the largest size
less than or equal to W. The alignment of this container will be A'. Note that C' ≥ C and A' ≥ A.
• If C' > UCB(CBA, C', A') setting CBA = NCBA(CBA, A'). This ensures that the bit-field will be placed at
the start of the next container type.
• Allocating a normal (undersized) bit-field using the values (C, C', A') for (W, C, A).
• Setting CBA = CBA + W - C.
Note
Although standard C++ does not have a long long data type, this is a common extension to the language. To
avoid the presence of this type changing the layout of oversized bit-fields the above rules are described in terms of
the fundamental machine types (Fundamental Data Types) where a 64-bit integer data type always exists.
Note
Any tail-padding added to a structure that immediately precedes a bit-field member is part of the structure and must
be taken into account when determining the CBA.
When a non-bit-field member follows a bit-field it is placed at the lowest acceptable address following the allocated
bit-field.
Note
When laying out fundamental data types it is possible to consider them all to be bit-fields with a width equal to the
container size. The rules in Bit-fields no larger than their container can then be applied to determine the precise
address within a structure.
8.1.7.5 Volatile bit-fields – preserving number and width of container accesses
When a volatile bit-field is read, and its container does not overlap with any non-bit-field member or any zero length
bit-field member, its container must be read exactly once using the access width appropriate to the type of the
container.
When a volatile bit-field is written, and its container does not overlap with any non-bit-field member or any zero length
bit-field member, its container must be read exactly once and written exactly once using the access width appropriate
to the type of the container. The two accesses are not atomic.
Note
This ABI does not place any restrictions on the access widths of bit-fields where the container overlaps with a
non-bit-field member or where the container overlaps with any zero length bit-field placed between two other
bit-fields. This is because the C/C++ memory model defines these as being separate memory locations, which can
be accessed by two threads simultaneously. For this reason, compilers must be permitted to use a narrower
memory access width (including splitting the access into multiple instructions) to avoid writing to a different memory
location. For example, in struct S { int a:24; char b; }; a write to a must not also write to the location
occupied by b; this requires at least two memory accesses in all current Arm architectures. In the same way, in
struct S { int a:24; int:0; int b:8; };, writes to a or b must not overwrite each other.
Multiple accesses to the same volatile bit-field, or to additional volatile bit-fields within the same container may not be
merged. For example, an increment of a volatile bit-field must always be implemented as two reads and a write.
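The required access pattern can be made explicit by operating on the container through a volatile pointer. This is a minimal sketch, assuming a 32-bit container, an unsigned field, and little-endian bit numbering; the function names are hypothetical, and it illustrates the pattern of accesses, not the code any particular compiler generates for bit-field syntax.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for a field of width w bits at bit offset k in a 32-bit container. */
static uint32_t vbf_mask(unsigned k, unsigned w)
{
    assert(w >= 1u && k + w <= 32u);
    return (w == 32u ? 0xFFFFFFFFu : ((1u << w) - 1u)) << k;
}

/* Read: the container is read exactly once. */
static uint32_t vbf_read(const volatile uint32_t *container,
                         unsigned k, unsigned w)
{
    uint32_t word = *container;
    return (word & vbf_mask(k, w)) >> k;
}

/* Write: the container is read exactly once and written exactly once. */
static void vbf_write(volatile uint32_t *container,
                      unsigned k, unsigned w, uint32_t value)
{
    uint32_t mask = vbf_mask(k, w);
    uint32_t word = *container;
    *container = (word & ~mask) | ((value << k) & mask);
}

/* Increment: one read for the value, then the read and write of the
 * update, giving two reads and a write in total. */
static void vbf_increment(volatile uint32_t *container, unsigned k, unsigned w)
{
    uint32_t value = vbf_read(container, k, w);
    vbf_write(container, k, w, value + 1u);
}
```

The read performed inside vbf_write is kept even when the field fills the whole container, matching the rule that the read must always occur.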
Note
Note the volatile access rules apply even when the width and alignment of the bit-field imply that the access could
be achieved more efficiently using a narrower type. For a write operation the read must always occur even if the
entire contents of the container will be replaced.
If the containers of two volatile bit-fields overlap then access to one bit-field will cause an access to the other. For
example, in struct S {volatile int a:8; volatile char b:2;}; an access to a will also cause an access
to b, but not vice-versa.
If the container of a non-volatile bit-field overlaps a volatile bit-field then it is undefined whether access to the
non-volatile field will cause the volatile field to be accessed.
• For C, each argument is formed from the value specified in the source code, except that an array is passed by
passing the address of its first element.
• For C++, an implicit this parameter is passed as an extra argument that immediately precedes the first user
argument. Other rules for marshalling C++ arguments are described in CPPABI32.
• For variadic functions, float arguments that match the ellipsis (...) are converted to type double.
The argument list is then processed according to the standard rules for procedure calls (see Parameter Passing (base
PCS)) or the appropriate variant.
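The conversion of float arguments that match the ellipsis has a visible consequence for the callee: such arguments must be fetched as double. The following is a minimal sketch using the standard <stdarg.h> interface; the function name sum_floats is hypothetical.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` variadic arguments that the caller wrote as float.
 * They arrive as double, so va_arg must name double, not float. */
static double sum_floats(int count, ...)
{
    va_list ap;
    double total = 0.0;
    int i;

    assert(count >= 0);
    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, double);
    va_end(ap);
    return total;
}
```

A call such as sum_floats(2, 1.5f, 2.25f) therefore passes two double values, each of which is then marshaled as a double-word aligned argument.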
9 APPENDIX: Support for Advanced SIMD Extensions and MVE
9.1 Introduction
The Advanced SIMD and M-profile Vector Extensions to the Arm architecture add support for processing short vectors.
Because the C and C++ languages do not provide standard types to represent these vectors, access to them is
provided by a vendor extension. The status of this appendix is normative in respect of public binary interfaces, i.e. the
calling convention and name mangling of functions which use these types. In other respects it is informative.
• They provide a set of user-level type names that map onto short vector types
• They provide prototypes for intrinsic functions that map onto the Advanced SIMD and M-profile Vector
Extension (MVE) instruction sets respectively.
Note
The intrinsic functions are beyond the scope of this specification. Details of the usage of the user-level types (e.g.
initialization, and automatic conversions) are also beyond the scope of this specification. For further details see
[ACLE].
Note
The user-level types are listed in Advanced SIMD Extension only vector data types using 64-bit containerized
vectors and SIMD vector data types using 128-bit containerized vectors. The types have 64-bit alignment and map
directly onto the containerized vector fundamental data types. The memory format of the containerized vector is
defined as loading the specified registers from an array of the Base Type using the Fill Operation and then storing
that value to memory using a single VSTM of the loaded 64-bit (D) registers.
MVE only allows 128-bit vector types and it uses unsigned integer vectors to represent polynomials.
The tables also list equivalent structure types to be used for name mangling. Whether these types are actually
defined by an implementation is unspecified.
Advanced SIMD Extension only vector data types using 64-bit containerized vectors
User type | Equivalent type name for name mangling | Elements | Base type | Fill operation
User type | Equivalent type name for name mangling | Elements | Base type | Fill operation
void f(int8x8_t)
is mangled as
_Z1f15__simd64_int8_t
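The mangled name in this example is built from length-prefixed identifiers: the prefix _Z, then the function name, then the equivalent type name. The following is a minimal sketch that reproduces only this simple case, a global function with a single parameter of a named structure type. The function name mangle_unary is hypothetical, and the full mangling rules are outside the scope of the sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "_Z<len><fn><len><type>" into out.
 * Returns 1 on success, 0 if the buffer is too small. */
static int mangle_unary(char *out, size_t cap, const char *fn, const char *type)
{
    int n;

    assert(out != NULL && fn != NULL && type != NULL && cap > 0u);
    n = snprintf(out, cap, "_Z%lu%s%lu%s",
                 (unsigned long)strlen(fn), fn,
                 (unsigned long)strlen(type), type);
    return n > 0 && (size_t)n < cap;
}
```

Calling mangle_unary with "f" and "__simd64_int8_t" yields the name shown above; the 15 in the result is the length of the equivalent type name.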