armv8-a

The document outlines a training agenda for the Armv8-A Architecture, covering topics such as the exception model, memory management, DynamIQ architecture updates, virtualization, and synchronization over five days. It highlights the key features of Armv8-A, including the introduction of AArch32 and AArch64 execution states, new registers, and a refined privilege model. Additionally, it provides insights into system control and important system registers relevant to the architecture.

Uploaded by

sanjay

Maven Silicon

Armv8-A Architecture Training


Remote

Agenda
Day 1
1. Armv8-A Overview
2. Exception model

Day 2
3. Memory Management Overview
4. Memory model

Day 3
5. DynamIQ Architecture update (8.1/8.2)
6. Barriers
7. DynamIQ Caches

Day 4
8. DynamIQ Cache Coherency
9. Virtualization
10. Synchronization

Day 5
11. SW GICv3 Programming
12. Booting

Appendix
13. Arm Glossary
14. Services and Support Overview

Copyright © 2023 Arm Limited (or its affiliates). All rights reserved.
Not to be reproduced by any means without prior written consent.

Architecture Overview

Armv8-A

Confidential © 2023 Arm

Development of the Arm Architecture

Armv8-A is one of the most significant architecture changes in Arm’s history
• Positions Arm to continue servicing current markets as their needs grow

[Figure: timeline from 1990 to 2011 showing Armv4, Armv5, Armv6, Armv7, and Armv8. Feature labels in the figure: Thumb®, VFPv2, Jazelle®, TrustZone®, SIMD, Thumb®-2, VFPv3/v4, Adv SIMD, LPAE, Virtualization, Full Armv7 compatibility, Enhanced Crypto, Scalar Floating Point, Improved Virtualization, Vector Extensions, bFloat, Secure EL2, Pointer Authentication, Branch Target Identifier]

0838 rev 35733
Maven Silicon, 2023:04:20

What’s new in Armv8-A?


Armv8-A introduces two execution states: AArch32 and AArch64

AArch32
• Evolution of Armv7-A
• A32 (Arm) and T32 (Thumb) instruction sets

– Armv8-A adds some new instructions
• Traditional Arm exception model
• Virtual addresses stored in 32-bit registers

AArch64
• New 64-bit general purpose registers (X0 to X30)
• New instructions – A64, fixed length 32-bit instruction set
– Includes SIMD, floating point and crypto instructions
• New exception model
• Virtual addresses now stored in 64-bit registers


Agenda

Privilege levels
AArch64 registers
A64 Instruction Set
AArch64 Exception Model
AArch64 Memory Model

AArch64 privilege model


AArch64 has four exception levels and two security states
• EL0 = least privileged, EL3 = most privileged
• Secure state and Non-secure (or “Normal”) state

Level   Non-secure              Secure
EL0     App  App  App  App      Trusted Services
EL1     Rich OS  Rich OS        Trusted OS
EL2     Hypervisor              No EL2 in Secure world
EL3     Firmware / Secure Monitor (spans both states)

EL2 and EL3 are optional


• A processor may not implement EL2/EL3 if the Virtualization and/or Security extensions are not required


AArch32 privilege model


The privilege model in AArch32 is similar to Armv7-A

Level   AArch32 modes           Non-secure            Secure
EL0     User                    App  App  App  App    Trusted Services
EL1     SVC, Abort, IRQ, etc    Rich OS  Rich OS      Trusted OS
EL2     Hyp                     Hypervisor            No Hyp in Secure world
EL3     Mon                     Firmware / Secure Monitor (spans both states)

When EL3 is using AArch32, in the Secure world the EL1 modes are treated as EL3
• No effect on the Normal world


Moving between AArch32 & AArch64


Execution state can only change on exception entry or return
• Moving to a lower EL (less privilege), execution state can stay the same or switch to AArch32
• Moving to a higher EL (more privilege), execution state can stay the same or switch to AArch64

• An AArch64 OS can host a mix of AArch32 and AArch64 applications
• An AArch32 OS cannot host an AArch64 application
• An AArch64 Hypervisor can host AArch64 and AArch32 OSs
• An AArch32 Hypervisor cannot host AArch64 OSs

[Figure: an AArch64 Hypervisor hosting an AArch64 OS (running an AArch32 App and an AArch64 App) and an AArch32 OS (running an AArch32 App, with the AArch64 App not permitted)]


Register banks
AArch64 provides 31 general purpose registers: R0-R30
• Each register has a 64-bit (Xn) and 32-bit (Wn) form

[Figure: R0 and R1, each shown as a 64-bit Xn register containing the 32-bit Wn register]

Separate register file for floating point, SIMD, and crypto operations: V0-V31
• Each register has a 128-bit (Qn), 64-bit (Dn), 32-bit (Sn), 16-bit (Hn), and 8-bit (Bn) form

[Figure: V0 and V1, each shown with nested Bn, Hn, Sn, Dn, and Qn views]
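The relationship between the Xn and Wn forms can be sketched in portable C. This is only a model, not Arm code: `read_w` and `write_w` are hypothetical helper names, and the zero-extension on a Wn write is architectural behaviour that the slide itself does not spell out.

```c
#include <stdint.h>

/* Model of one general purpose register: xn holds the full 64-bit Xn value. */

/* Reading Wn returns the low 32 bits of Xn. */
static inline uint32_t read_w(uint64_t xn)
{
    return (uint32_t)xn;
}

/* Writing Wn sets the low 32 bits and clears bits [63:32] of Xn. */
static inline void write_w(uint64_t *xn, uint32_t value)
{
    *xn = (uint64_t)value;
}
```

So a 32-bit operation never leaves stale data in the upper half of the register.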

Other registers
AArch64 introduces the “zero” register: XZR and WZR
• Reads as 0, writes are ignored

The PC is not a general purpose register and cannot be directly referenced

There are separate link registers for function calls and exceptions
• X30: updated by branch-with-link instructions (BL, BLR)
– Use the RET instruction to return from subroutines
• ELR_ELx: updated on exception entry
– Use the ERET instruction to return from exceptions

Each exception level has its own stack pointer
• SP_EL0, SP_EL1, SP_EL2, and SP_EL3
• The SPs are not general purpose registers
• Stack pointers must be 16-byte aligned (bits [3:0] = 0b0000) whenever they are used to access memory
– Hardware checking of SP alignment can be enabled
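The 16-byte stack pointer rule amounts to a simple bit test. The sketch below uses hypothetical helper names (`sp_is_aligned`, `sp_align_down`) and plain integers in place of a real SP, so it only illustrates the arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when bits [3:0] of the candidate stack pointer are all zero. */
static inline bool sp_is_aligned(uint64_t sp)
{
    return (sp & 0xFu) == 0;
}

/* Rounds a candidate stack top down to the nearest 16-byte boundary,
 * which is the safe direction for a descending stack. */
static inline uint64_t sp_align_down(uint64_t sp)
{
    return sp & ~(uint64_t)0xF;
}
```

Boot code that carves stacks out of a memory region could apply `sp_align_down` to each computed stack top before loading it into SP.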


Processor state
AArch64 does not have a direct equivalent of the AArch32 CPSR
• Settings previously held in the CPSR are referred to as “Processor State” (or PSTATE) fields
– These fields are accessed individually

Field       Description
NZCV        ALU flags
Q           Sticky overflow (AArch32 only)
DAIF        Exception mask bits
SPSel       SP selection (SP_EL0 or SP_ELx, AArch64 only)
CurrentEL   The current exception level
E           Data endianness (AArch32 only)
IL          Illegal flag. When set, all instructions are treated as UNDEFINED
SS          Software stepping bit

AArch64 does include SPSRs, covered later...


Procedure call standard (1)


There is a set of rules known as the Procedure Call Standard (PCS) that specifies how registers should be used:

C caller:

    extern long foo(long, long);

    int main(void)
    {
        ...
        a = foo(b, c);
        ...
    }

Assembly callee:

        .global foo
        .type foo, @function
        .section _foo
        .text
    foo:
        ...
        ...
        RET
        .pool
        .size foo,.-foo

Register rules for foo:
• Must preserve: X19-X29
• Can corrupt: X0-X18
• Return address: X30 (LR)

Some registers are reserved...


Procedure call standard (2)


IP0-IP1 Intra-procedure-call temporary registers (corruptible)
XR Indirect result location parameter (corruptible)
PR Platform registers, reserved for use by platform ABIs
FP Frame pointer

Registers   Name   Role
X0-X7              Parameter and result registers
X8          XR     Indirect result location parameter
X9-X15             Corruptible registers
X16         IP0    Intra-procedure-call temporary register
X17         IP1    Intra-procedure-call temporary register
X18         PR     Platform register
X19-X28            Callee-saved registers
X29         FP     Frame pointer
X30         LR     Link register
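The register roles above can be captured in a small lookup, which is handy when reviewing hand-written assembly. This is an illustrative sketch: `pcs_role` and `pcs_must_preserve` are hypothetical names, and the groupings simply restate the slide.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns the PCS role of general purpose register Xn (n = 0..30),
 * or NULL when n is not a general purpose register number. */
static inline const char *pcs_role(unsigned n)
{
    if (n <= 7)  return "parameter/result";
    if (n == 8)  return "XR (indirect result location)";
    if (n <= 15) return "corruptible";
    if (n <= 17) return "IP0/IP1 (intra-procedure-call temporary)";
    if (n == 18) return "PR (platform register)";
    if (n <= 28) return "callee-saved";
    if (n == 29) return "FP (frame pointer)";
    if (n == 30) return "LR (link register)";
    return NULL;
}

/* True when a callee must restore Xn before returning (X19-X29). */
static inline bool pcs_must_preserve(unsigned n)
{
    return n >= 19 && n <= 29;
}
```

A function that uses X19 as scratch, for example, has to save and restore it, whereas X9 can be overwritten freely.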


Procedure call standard (3)


The PCS also covers the use of the floating-point and SIMD registers

Registers   Role
D0-D7       Parameter and result registers
D8-D15      Callee-saved registers
D16-D31     Corruptible registers


AArch64 ↔ AArch32 register mappings


When moving from AArch32 to AArch64:
• Registers not accessible in AArch32 state retain their values from previous AArch64 execution
• For registers that are accessible in both execution states:
– Top 32 bits: UNKNOWN
– Bottom 32 bits: The value of the mapped AArch32 register

X0-X7   X8-X15    X16-X23   X24-X30
R0      R8_usr    LR_irq    R8_fiq
R1      R9_usr    SP_irq    R9_fiq
R2      R10_usr   LR_svc    R10_fiq
R3      R11_usr   SP_svc    R11_fiq
R4      R12_usr   LR_abt    R12_fiq
R5      SP_usr    SP_abt    SP_fiq
R6      LR_usr    LR_und    LR_fiq
R7      SP_hyp    SP_und


System control
In AArch64, system configuration is controlled through system registers

System registers are suffixed with “_ELx”, for example SCTLR_EL1
• Suffix defines the lowest exception level that can access that system register
• For example:

TTBR0_EL1   Can be accessed from EL1, EL2, and EL3
TTBR0_EL2   Can be accessed from EL2 and EL3
TTBR0_EL3   Can be accessed from EL3

Use the MRS instruction to read a system register, and the MSR instruction to write to a system register

MRS X0, SCTLR_EL1 ; X0 = SCTLR_EL1
MSR SCTLR_EL1, X0 ; SCTLR_EL1 = X0
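The suffix rule can be written as a one-line comparison. The helper below is a hypothetical sketch of the basic rule only; it deliberately ignores trapping controls, security state, and registers with special access rules.

```c
#include <stdbool.h>

/* reg_el:     numeric suffix of the register name (1 for SCTLR_EL1, 2 for TTBR0_EL2, ...)
 * current_el: exception level the code is executing at (0 to 3)
 * Returns true when the basic suffix rule permits the access. */
static inline bool sysreg_accessible(unsigned reg_el, unsigned current_el)
{
    return current_el >= reg_el;
}
```

For example, `sysreg_accessible(2, 1)` is false: code at EL1 cannot touch TTBR0_EL2.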


Some important System Registers


SCTLR_ELx
• System Control Register
• Controls architectural features such as the MMU, caches, and alignment checking

ACTLR_ELx
• Auxiliary Control Register
• Controls processor-specific features

SCR_EL3
• Secure Configuration Register
• Controls Secure state and trapping of exceptions to EL3

HCR_EL2
• Hypervisor Configuration Register
• Controls Virtualization settings and trapping of exceptions to EL2

MIDR_EL1
• Main ID Register
• Specifies the type of processor that the code is running on (e.g. part number and revision)

MPIDR_EL1
• Multiprocessor Affinity Register
• Specifies the core and cluster IDs in a multi-core/multi-cluster system

CTR_EL0
• Cache Type register
• Specifies information about the integrated caches (e.g. the line size)
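As an illustration of how software might use MIDR_EL1, the sketch below splits a raw register value into its fields. The field positions come from the Arm architecture reference rather than from this slide, and the struct and function names are hypothetical. On real hardware the raw value would come from `MRS Xn, MIDR_EL1`; here it is just a parameter.

```c
#include <stdint.h>

/* Decoded MIDR_EL1 fields (names chosen for this sketch). */
struct midr_fields {
    unsigned implementer;   /* bits [31:24], 0x41 is Arm */
    unsigned variant;       /* bits [23:20] */
    unsigned architecture;  /* bits [19:16] */
    unsigned part_num;      /* bits [15:4]  */
    unsigned revision;      /* bits [3:0]   */
};

static inline struct midr_fields midr_decode(uint32_t midr)
{
    struct midr_fields f;
    f.implementer  = (midr >> 24) & 0xFFu;
    f.variant      = (midr >> 20) & 0xFu;
    f.architecture = (midr >> 16) & 0xFu;
    f.part_num     = (midr >> 4)  & 0xFFFu;
    f.revision     =  midr        & 0xFu;
    return f;
}
```

With the illustrative value 0x410FD034, this yields implementer 0x41, part number 0xD03, and revision 4.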


A64 overview
AArch64 introduces the new A64 instruction set
• Similar set of functionality to the traditional A32 (Arm) and T32 (Thumb) ISAs

Fixed-length 32-bit instructions

Syntax similar to A32 and T32

ADD W0, W2, W7 ; 32-bit addition, W0 = (W2 + W7)
ADD X0, X2, X7 ; 64-bit addition, X0 = (X2 + X7)
MOV X0, XZR ; Clear X0 to #0

Most instructions are not conditional

Optional floating-point and Advanced SIMD instructions

Optional cryptographic extensions


AArch64 exceptions
In AArch64, exceptions are split between:
• Synchronous: Data Aborts from the MMU, Permission Faults, Alignment Faults, service call instructions (e.g. SVC), etc
• Asynchronous: IRQs, FIQs, SErrors (System Errors)

On taking an exception, the EL can either stay the same or get higher
• Exceptions are never taken to EL0

Synchronous exceptions are typically taken in the current EL (or in EL1 when they occur at EL0)

Asynchronous exceptions can be routed to a higher EL
• HCR_EL2 controls routing to EL2
• SCR_EL3 controls routing to EL3
• Separate bits to control routing of IRQs, FIQs, and SErrors

[Figure: an IRQ arriving while the Application runs at EL0 may be routed to the Rich OS (EL1), the Hypervisor (EL2), or the Secure Monitor (EL3)]


Taking an exception
[Figure: application code is interrupted, control passes to the top-level handler and then to the handler for the specific source, before returning to the application]

When an exception occurs:
• SPSR_ELx updated
• PSTATE updated (EL stays the same or gets higher)
• Return address stored to ELR_ELx
• PC set to vector address
• ESR_ELx updated with cause of exception
– Only if synchronous or SError exception

Execute an ERET instruction to return from an exception:
• Restores PSTATE from SPSR_ELx
• Restores PC from ELR_ELx


Memory types

Address locations must be described in terms of a type
• The “type” tells the processor how it can access that location
– Access ordering rules
– Speculation

Normal
• Used for code and data
• Processor allowed to re-order, re-size and repeat accesses
• Speculative accesses allowed

Device
• Used for peripherals
• Accesses could have side effects, so there are more restrictions on what optimizations a processor can perform
• Speculative data accesses not allowed

Other attributes can also be specified
• For example whether a region is executable, shareable, and cacheable


Alignment
Unaligned data accesses are allowed to address ranges marked as Normal

Optionally, all unaligned data accesses can be trapped
• Trapped unaligned accesses cause a synchronous data abort
• Trapping can be enabled separately for EL0/EL1, EL2, and EL3
– Controlled by SCTLR_ELx.A bits

Unaligned data accesses to addresses marked as Device will always trigger an exception
• Synchronous data abort

Instruction fetches must always be aligned
• A64 instructions must be 4-byte aligned (bits [1:0] = 0b00)
• A misaligned instruction fetch generates a synchronous exception


Virtual address space



Virtual addresses are 64 bits wide, but not all addresses are accessible
• Virtual memory address space split between two translation tables
– Each covering a configurable size, up to 48 bits of address space (TCR_ELx)
• Addresses not covered by either translation table automatically generate translation faults

[Figure: the virtual address space maps through translation tables onto a physical address space containing Peripherals, RAM, and Flash]

Virtual address range                             Contents      Translation tables
0xFFFF_0000_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF    OS            TTBR1_EL1 (not available in EL2 or EL3)
(addresses between the two regions)               FAULT         none
0x0 to 0x0000_FFFF_FFFF_FFFF                      Application   TTBR0_EL1
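The split shown above can be expressed as a check on the top 16 address bits. The sketch below assumes both regions are configured to the full 48 bits, as in the figure, and ignores features such as tagged addresses; `va_classify` and the enum names are hypothetical.

```c
#include <stdint.h>

enum va_region {
    VA_TTBR0,  /* low region, translated via TTBR0_EL1  */
    VA_TTBR1,  /* high region, translated via TTBR1_EL1 */
    VA_FAULT   /* covered by neither table              */
};

/* Classifies a virtual address for the EL1/EL0 translation regime,
 * assuming 48-bit regions at both ends of the address space. */
static inline enum va_region va_classify(uint64_t va)
{
    uint64_t top = va >> 48;  /* bits [63:48] */

    if (top == 0x0000u)
        return VA_TTBR0;
    if (top == 0xFFFFu)
        return VA_TTBR1;
    return VA_FAULT;
}
```

With smaller region sizes programmed in TCR_EL1, the same idea applies but more top bits must be all zeros or all ones.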


Multiple virtual address spaces


A system may define multiple virtual address spaces

OS / Applications
• TTBR0_EL1
• TTBR1_EL1
• TCR_EL1

Hypervisor
• TTBR0_EL2
• TCR_EL2

Secure Monitor
• TTBR0_EL3
• TCR_EL3

[Figure: the OS / App virtual address space (OS region via TTBR1_EL1, a FAULT region, and an Application region via TTBR0_EL1) and the Hypervisor virtual address space (Peripherals, Data, and Code via TTBR0_EL2) each map through their own translation tables onto one physical address space containing Peripherals, Flash, and RAM]

Physical address spaces



Armv8-A defines two security states
• Secure and Non-secure (“Normal”)

It also defines two physical address spaces
• Secure and Non-secure

These are in theory completely separate
• SP:0x8000 != NP:0x8000
• But most systems instead treat Secure and Non-secure as an attribute for access control

The Normal world can only access the Non-secure physical address space

The Secure world can access both physical address spaces
• Controlled through the translation tables

[Figure: Secure EL1/EL0 virtual addresses (Secure Peripherals, Secure Code, Secure Data, Non-secure Data) map through translation tables to both the Secure and Non-secure physical address spaces; Non-secure EL1/EL0 virtual addresses (Non-secure Peripherals, Non-secure Data) map only to the Non-secure physical address space. Each physical address space contains Peripherals, RAM, and Flash]


MPCore configurations
Many implementations of Arm processors have a multi-core configuration
• Multiple cores contained within the same block

Each core has its own MMU configuration, register bank, internal state, and Program Counter
• Core0 might be executing in Non-secure AArch32 EL0 while Core1 is executing in Secure AArch64 EL1

Cores can be powered and brought in and out of reset independently
• ID registers allow for discovery of core affinity

Each core has separate L1 data and instruction caches
• Hardware will maintain coherency between L1 data caches for certain memory types
• Some cache and TLB instructions are broadcast to other cores
• All cores share a common physical memory map

[Figure: four cores (Core0 to Core3), each with its own L1 data cache (D$) and instruction cache (I$), sharing a unified L2 cache]

Appendix

Armv8 terminology reference


EL3, EL2, EL1, and EL0 are Exception Levels
• The EL denotes the level of privilege; higher number = more privilege

AArch32 and AArch64 are Execution States
• The programmer’s model being used

Secure and Non-secure are Security States
• EL3 is always Secure; EL2 is always Non-secure unless Secure EL2 (added in Armv8.4-A) is implemented
• EL1/EL0 can be Secure or Non-secure (sometimes S.ELx or NS.ELx are used as shorthand)

A64, A32, and T32 are Instruction Sets
• A64 used when in AArch64
• A32 and T32 used when in AArch32
– In previous architecture versions A32 was called Arm, and T32 was called Thumb

Examples:
• Processor currently executing in EL3 as AArch64, executing A64 instructions


REServed bits and the v8 Architecture


System registers often include REServed bit fields
• Indicating fields that are not used by hardware

Some bits have a defined use in one execution state but not the other.

Some bits are defined as REServed (RES0 / RES1) in both AArch32 and AArch64

[Figure: hypothetical architecturally-mapped System Registers, bits 31 to 0. The AArch32 view shows a RES1 field and a Q field; the AArch64 view shows a RES1 field and a RES0 field]

Where a bit can be RES in one Execution State and used in another
• The Architecture defines the bit field as writeable or “stateful”
– Allows the correct value to be written for a context switch

Where bits are unused in both Execution States
• Typically an implementation would make these fields write-ignore
• However the Architecture does permit writing to such fields
– Changing the value will have no functional impact in current implementations
– Future Architecture revisions may use these fields
▪ RES0 / RES1 indicate expected values SW should write to guarantee current behaviour

System Register contents at reset


In AArch64 most System Register fields are defined as UNKNOWN at reset
• If an implementation defines unused RES bits as stateful
– These fields are also defined as UNKNOWN at reset
• Recommended that software always writes the full register field when initializing (rather than read-modify-write)
– Including any RES0/RES1 bits as b0/b1

The Execution State of the highest EL (entered on reset) defines the reset contents of System Registers
• If the highest EL uses AArch64, but lower ELs use AArch32
– You may need to initialize Armv7/AArch32 System Registers with expected Armv7/AArch32 reset values in software before changing EL

Armv8.1-A

The Arm architecture continues to evolve, with the announcement of Armv8.1-A

Instruction set enhancements
• Atomic read-write instructions added to A64 (AArch64 only)
– For example: Compare and swap
• Additional SIMD instructions
– Example use case is colour space conversion
• Loads and stores with ordering limited to a configurable region (AArch64 only)

Virtualization Host Extensions (AArch64 only)
• To improve performance of Type 2 Hypervisors

Other enhancements to the memory system architecture, such as the Privileged Access Never (PAN) state bit


Armv8.2-A
Armv8.2-A further extends Armv8.1-A

Address space extended to 52-bit (AArch64 only)
• 52-bit virtual addresses only supported when 64KB granule is used

New cache clean operation to point of persistency (AArch64 only)
• DC CVAP, Xt

Execute never support in stage 2 translation extended (AArch64 only)
• Can now specify different attributes for EL1 and EL0

Optional half precision floating point
• Supports IEEE754-2008 formatted half-precision floating point data processing
• Base architecture already had support for converting to fp16 format

RAS (Reliability, Availability, Serviceability) extension


Armv8-A software support



Armv8 software support is now widely available in the open source community

Linux Kernel
• AArch64 support has been available in mainline for several releases
• Under arch/arm64/

Filesystems
• AArch64 kernel supports both legacy Armv7-A and AArch64 filesystem components
• Some guidance on building file-systems for AArch64 is available here: https://wiki.linaro.org/HowTo/Armv8/OpenEmbedded
• Both Fedora and openSUSE have AArch64 releases

Arm Foundation Model
• Linaro provides an example kernel and file-system, which can be executed on Arm's free virtual software platform (Foundation Model)
• http://www.arm.com/products/tools/models/fast-models/foundation-model.php
• http://www.linaro.org/engineering/engineering-projects/armv8


Armv8-A software support continued


Open Source Tools Support
• Source for tools depends on your desired build system and environment

Prebuilt versions of gcc are available for download from Arm Developer at:
• https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-a

Linaro provides prebuilt AArch64 GCC toolchain binaries (GCC, GDB, etc.) with Linux and bare-metal library options
• These are available as cross or native toolchains: https://launchpad.net/linaro-toolchain-binaries/

Any version of gcc that cross compiles to AArch64 will work, but:
• Certain processor-specific optimizations and automatic feature detection (using -mcpu=xxx) are only available past certain gcc versions

Arm tools
• The Arm compiler supports AArch64 and is suitable for bare-metal/validation environments
• ArmDS includes debug support for Armv8 hardware and models:
– https://developer.arm.com/tools-and-software/embedded/arm-development-studio
• Fast Models allows the creation of DynamIQ CPU based Arm Virtual Platforms for software development


Thank You
Danke
Gracias
谢谢
ありがとう
Asante
Merci
감사합니다
धन्यवाद
Kiitos
شكراً
ধন্যবাদ
תודה

AArch64 Exception Model

Armv8-A

This module
This module will focus on:
• AArch64 instructions/exception handling
• Taking exceptions from AArch32 to AArch64

What will not be covered:
• Handling exceptions in AArch32
• AArch64 debug exceptions

Agenda
• The AArch64 exception model
• Interrupts
• Synchronous exceptions
• SError exceptions
• Exceptions in EL2 and EL3

0840 rev 33026

Exception Levels
• AArch64 has four exception levels, and two security states
• Each exception level has an execution state (AArch64 or AArch32)
• You can only move between ELs by taking (or returning from) an exception
‐ Moving to a lower EL, execution state can stay the same or switch to AArch32
‐ Moving to a higher EL, execution state can stay the same or switch to AArch64

Level   Non-secure                        Secure
EL0     App  App  App  App                Trusted Services
EL1     AArch64 Kernel  AArch32 Kernel    Trusted OS
EL2     Hypervisor                        No EL2 in Secure world
EL3     Firmware / Secure Monitor (spans both states)

[Figure note: the EL0 applications carry the execution-state labels AArch32, AArch64, AArch32, AArch64 in the original diagram]


AArch64 exceptions
• Synchronous
  • Service Calls: SVC, HVC, and SMC (covered later)
  • Aborts from MMU (e.g. Permission Faults, Alignment Faults)
  • SP and PC alignment checking
  • Unallocated instructions

• Asynchronous (interrupt signals into the Arm core)
  • IRQ
  • FIQ
  • SError (System Error)

Taking an exception
[Figure: application code is interrupted, control passes to the top-level handler and then to the handler for the specific source, before returning to the application]

When an exception occurs:
• SPSR_ELx updated
• PSTATE updated (EL stays the same or gets higher)
• Return address stored to ELR_ELx
• PC set to vector address
• ESR_ELx updated with cause of exception
‐ Only if synchronous or SError exception

Execute an ERET instruction to return from an exception:
• Restores PSTATE from SPSR_ELx
• Restores PC from ELR_ELx
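The entry and return steps listed on this slide can be modelled as plain data movement. The following is a minimal software model, not a description of any hardware implementation: the struct layout, the function names, and the packed `pstate` word are assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified view of the state involved in one exception (hypothetical). */
struct cpu_model {
    uint64_t pc;      /* program counter                              */
    uint32_t pstate;  /* PSTATE fields packed into one word (modelled) */
    uint64_t elr;     /* ELR_ELx of the EL the exception is taken to  */
    uint32_t spsr;    /* SPSR_ELx of the EL the exception is taken to */
    uint32_t esr;     /* ESR_ELx of the EL the exception is taken to  */
};

/* Exception entry: save state, switch PSTATE, jump to the vector. */
static inline void take_exception(struct cpu_model *c, uint64_t return_addr,
                                  uint64_t vector_addr, uint32_t new_pstate,
                                  bool has_syndrome, uint32_t syndrome)
{
    c->spsr   = c->pstate;     /* SPSR_ELx updated            */
    c->pstate = new_pstate;    /* PSTATE updated              */
    c->elr    = return_addr;   /* return address to ELR_ELx   */
    c->pc     = vector_addr;   /* PC set to vector address    */
    if (has_syndrome)          /* synchronous or SError only  */
        c->esr = syndrome;
}

/* ERET: restore PSTATE from SPSR_ELx and PC from ELR_ELx. */
static inline void exception_return(struct cpu_model *c)
{
    c->pstate = c->spsr;
    c->pc     = c->elr;
}
```

Calling `take_exception` followed by `exception_return` leaves `pc` at the saved return address and `pstate` at its pre-exception value, which is the round trip the slide describes.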


Exception routing
• On taking an exception, the EL can either stay the same or get higher
  • Exceptions are never taken to EL0

• Synchronous exceptions are typically taken in the current EL (or in EL1 when they occur at EL0)

• Asynchronous exceptions must be configured to be routed to one of:
  • EL1 – The Rich OS kernel
  • EL2 – The Hypervisor
  • EL3 – The Secure Monitor

[Figure: an IRQ arriving while the Application runs at EL0 may be routed to the Rich OS (EL1), the Hypervisor (EL2), or the Secure Monitor (EL3)]

PSTATE and the SPSR


• In AArch64 the current processor state is held in a set of discrete PSTATE fields
  • Rather than a single CPSR register (used in AArch32)

Field       Description
NZCV        ALU flags
DAIF        Exception mask bits
SPSel       SP selection (SP_EL0 or SP_ELx, AArch64 only)
CurrentEL   The current exception level
IL          Illegal flag

• When taking an exception PSTATE is stored in the relevant Saved Program Status Register (SPSR)
  • SPSR_EL3, SPSR_EL2, SPSR_EL1

• The SPSR also includes a mode field that holds the execution state
  • M[4]=0: AArch64, M[4]=1: AArch32


Changing Execution State (1)


• The execution state may change when taking an exception to a higher EL
  • For example: Timer generates IRQ while executing AArch32 application code

[Figure: AArch32 application code at EL0 is interrupted; on entry the AArch64 kernel IRQ handler runs at EL1, and on exit execution returns to the application]

• The SPSR (populated by the core when taking the exception) includes the execution state and EL to return to

• But how does the core decide which execution state to enter when taking the exception?
  • This is defined by the RW bit of the control register for the EL above the one that the exception is taken to
  • In this example HCR_EL2.RW would configure the execution state for EL1


Changing Execution State (2)


• The Execution state of the highest implemented EL is determined at reset
  • For a cold reset the Execution state is determined by a configuration input signal
  • For a warm reset the Execution state entered is determined by RMR_ELx.AA64

• The Execution state of all other ELs can be dynamically changed by software

• For an exception return, SPSR_ELx.M[4] must match the execution state defined by the corresponding configuration register or input signal
  • A mismatch will result in an illegal exception return

What determines the execution state?

Exception Level   Taking exception to EL                       Returning from exception to EL
EL0               Exceptions cannot be taken to EL0            SPSR_ELx.M[4]
EL1               HCR_EL2.RW                                   SPSR_ELx.M[4]
EL2               SCR_EL3.RW                                   SPSR_ELx.M[4]
EL3               Configuration input signal / RMR_ELx.AA64    SPSR_EL3.M[4]
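The table can be read as a pair of lookups, sketched below in C. The struct and function names are hypothetical, the model assumes EL2 and EL3 are both implemented, and it relies on the architectural encoding in which an RW bit of 1 selects AArch64 for the next lower EL (a detail not stated on the slide).

```c
#include <stdbool.h>
#include <stdint.h>

enum exec_state { AARCH32, AARCH64 };

/* Configuration bits that select the execution state on exception entry. */
struct state_config {
    bool hcr_el2_rw;   /* HCR_EL2.RW: 1 means EL1 is AArch64 */
    bool scr_el3_rw;   /* SCR_EL3.RW: 1 means EL2 is AArch64 */
    bool el3_aarch64;  /* from the reset input signal or RMR_ELx.AA64 */
};

/* Execution state entered when an exception is taken to target_el (1..3). */
static inline enum exec_state entry_state(const struct state_config *cfg,
                                          unsigned target_el)
{
    bool is64;

    switch (target_el) {
    case 1:  is64 = cfg->hcr_el2_rw;  break;
    case 2:  is64 = cfg->scr_el3_rw;  break;
    default: is64 = cfg->el3_aarch64; break;
    }
    return is64 ? AARCH64 : AARCH32;
}

/* Execution state selected by an exception return: SPSR_ELx.M[4],
 * where 0 means AArch64 and 1 means AArch32. */
static inline enum exec_state return_state(uint32_t spsr)
{
    return ((spsr >> 4) & 1u) ? AARCH32 : AARCH64;
}
```

An exception return is only legal when `return_state` agrees with the state that the configuration bits define for the destination EL.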


Exception return address


• There are separate link registers for function calls and exceptions
  • X30: updated by branch-with-link instructions (BL, BLR)
    ‐ Use the RET instruction to return from subroutines
  • ELR_ELx: updated on exception entry
    ‐ Use the ERET instruction to return from exceptions

• After taking an exception, ELR_ELx contains the preferred return address
  • For service calls (e.g. SVC):
    ‐ The address of the next instruction after the system call instruction
  • For other synchronous exceptions:
    ‐ The address of the instruction that generated the exception
  • For asynchronous exceptions:
    ‐ The address of the first instruction that was not executed, or not executed fully, because the interrupt was taken


Exception stacks
• Each exception level has its own dedicated stack pointer
  • SP_EL0, SP_EL1, SP_EL2, and SP_EL3
  • The SPs are not general purpose registers
  • Stack pointers must always be 16-byte aligned (bits 3:0 = b0000)
    ‐ Hardware checking of SP alignment can be enabled

• On exception entry to ELx, SP_ELx is automatically selected
  • Handler code can optionally switch from using SP_ELx to SP_EL0

MSR SPSel, #0 ; Switch to SP_EL0
MSR SPSel, #1 ; Switch to SP_ELx

• One example use case allows a kernel to guarantee a usable stack
  • SP_EL1 → Small stack which the kernel assures will always be valid
  • SP_EL0 → Kernel ‘task’ stack, non-terminal if it overflows

13 0840 rev 33026


Maven Silicon, 2023:04:20

13
Copyright © 2023 Arm Limited (or its affiliates). All rights reserved.
Not to be reproduced by any means without prior written consent.

AArch32 register mapping

• Exceptions can be taken from AArch32 to AArch64, so AArch64 handler code may need to access AArch32 registers
  • The architecture defines mappings to allow handler code to access AArch32 registers
  • For example: AArch32 general purpose registers are directly mapped to AArch64 registers

AArch32             AArch64
R0-R12              X0-X12
Banked SP and LR    X13-X23
Banked FIQ          X24-X30

• When moving from AArch32 to AArch64:
  • Registers not accessible in AArch32 state retain their values from previous AArch64 execution
  • For registers that are accessible in both execution states:
    ‐ Top 32 bits: UNKNOWN
    ‐ Bottom 32 bits: The value of the mapped AArch32 register
    ‐ Typically accessed as Wn registers

AArch64 vector table

• Each exception level has its own vector table
  • Except EL0
  • The virtual address is set by VBAR_EL3, VBAR_EL2, and VBAR_EL1
• Vector table entries are 32 instructions (128 bytes) long
  • Tables contain instructions, not addresses
• Which instruction block the PC is set to depends on:
  • Type of exception
    ‐ Synchronous, IRQ, FIQ, or SError
  • If taken from the same exception level, the stack pointer being used
  • If taken from a lower exception level, the execution state of the level below the level that the exception is taken to
    ‐ Example: Exception taken from EL0 to EL2, instruction block depends on execution state of EL1

AArch64 vector table

• Separate vector tables for EL1, EL2, and EL3
• Location set by VBAR_ELn

Offset from VBAR_ELn   Exception type     Taken from
0x780                  SError / vSError   Lower EL, and all lower ELs are AArch32
0x700                  FIQ / vFIQ
0x680                  IRQ / vIRQ
0x600                  Synchronous
0x580                  SError / vSError   Lower EL, and at least one lower EL is AArch64
0x500                  FIQ / vFIQ
0x480                  IRQ / vIRQ
0x400                  Synchronous
0x380                  SError / vSError   Current EL while using SP_ELx
0x300                  FIQ / vFIQ
0x280                  IRQ / vIRQ
0x200                  Synchronous
0x180                  SError / vSError   Current EL while using SP_EL0
0x100                  FIQ / vFIQ
0x080                  IRQ / vIRQ
0x000                  Synchronous

Virtual exceptions (vIRQ, vFIQ, vSError) are discussed elsewhere
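The offsets in the vector table can be derived from two independent choices: where the exception came from, and what type it is. The following is a minimal sketch of that arithmetic, not production code. The dictionary keys and the function name `vector_address` are illustrative names chosen for this example; only the numeric offsets come from the table.

```python
# Base offset of each group of four entries, keyed by where the exception came from.
# The key names are illustrative only; the offsets follow the vector table layout.
SOURCE_BASE = {
    "current_el_sp0": 0x000,    # current EL while using SP_EL0
    "current_el_spx": 0x200,    # current EL while using SP_ELx
    "lower_el_aarch64": 0x400,  # lower EL, AArch64
    "lower_el_aarch32": 0x600,  # lower EL, AArch32
}

# Offset of each exception type within a group. Each entry is 0x80 bytes,
# which is 32 instructions of 4 bytes each.
TYPE_OFFSET = {"sync": 0x000, "irq": 0x080, "fiq": 0x100, "serror": 0x180}


def vector_address(vbar, source, exc_type):
    """Return the address the PC is set to for the given source and type."""
    if vbar & 0x7FF:
        # The low 11 bits of VBAR_ELn are reserved, so the table is 2KB aligned.
        raise ValueError("VBAR_ELn must be 2KB aligned")
    return vbar + SOURCE_BASE[source] + TYPE_OFFSET[exc_type]
```

For example, an IRQ taken from a lower EL that is using AArch64, with the table at 0x80000, would vector to `vector_address(0x80000, "lower_el_aarch64", "irq")`.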

Agenda

• The AArch64 exception model
• Interrupts
• Synchronous exceptions
• SError exceptions
• Exceptions in EL2 and EL3

Interrupt handling

• The Arm processor has two external interrupt signals
  • IRQ and FIQ
  • The architecture doesn’t mandate how these signals are used, but FIQ is often reserved for secure interrupt sources
• Processor State (PSTATE) contains interrupt masks for the current exception level
  • PSTATE.I – Mask IRQs
  • PSTATE.F – Mask FIQs
• On taking an exception to AArch64, all PSTATE interrupt masks are set
• Interrupts can be re-enabled in software to support nested exceptions

    MSR DAIFClr, #imm4
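The `#imm4` operand of `MSR DAIFClr` and `MSR DAIFSet` selects which of the D, A, I, and F masks to change. The sketch below models that encoding on a host machine. The helper names are made up for this example, and the model only covers the four mask bits as they appear in the DAIF register (bits [9:6]).

```python
# Bit position of each mask within the 4-bit immediate of MSR DAIFSet/DAIFClr.
DAIF_IMM_BITS = {"D": 3, "A": 2, "I": 1, "F": 0}

# Mask covering the D, A, I and F bits (bits 9 down to 6) of the DAIF register.
DAIF_FIELD_MASK = 0x3C0


def daif_imm4(masks):
    """Build the #imm4 value for a collection of mask letters, e.g. "I" or "IF"."""
    imm = 0
    for mask in masks:
        imm |= 1 << DAIF_IMM_BITS[mask]
    return imm


def apply_daifclr(daif, imm4):
    """Model MSR DAIFClr: clear (unmask) the selected bits."""
    return daif & ~(imm4 << 6) & DAIF_FIELD_MASK


def apply_daifset(daif, imm4):
    """Model MSR DAIFSet: set (mask) the selected bits."""
    return (daif | (imm4 << 6)) & DAIF_FIELD_MASK
```

With this model, unmasking only IRQs uses `daif_imm4("I")`, which is the `#0b0010` value used by the nested handler example later in this section.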

The Generic Interrupt Controller (GIC)

• Arm provides a standard interrupt controller for Cortex-A MP core systems
  • The GIC Architecture defines a common interrupt controller programming interface
  • Armv8 implementations will generally include GIC implementations (GIC-400, GIC-500)
• The GIC supports routing software generated, private and shared peripheral interrupts between cores in an MP system

[Figure: external interrupt sources feed the Distributor inside the Interrupt Controller; the Distributor forwards interrupts to one CPU interface per core, and each CPU interface drives the IRQ and FIQ inputs of its core (CPU0, CPU1)]


Simple interrupt example

[Figure: flow of control between the Main Application, the ASM IRQ handler, and the C subroutine, annotated with the steps below]

• Save corruptible registers
• Identify interrupt source
• Clear interrupt source
• Handle interrupt
• Restore corruptible registers

Example of a simple exception handler

ASM_IRQ_Handler:
    ; Save PCS corruptible registers
    ; Stack all corruptible registers
    STP X0, X1, [SP, #-16]!
    STP X2, X3, [SP, #-16]!
    ...

    BL identify_and_clear_source
    BL C_IRQ_Handler

    ; Restore PCS corruptible registers
    ...
    LDP X2, X3, [SP], #16
    LDP X0, X1, [SP], #16

    ; Return from exception
    ERET

Note: This is for non-nested interrupts


Nested exception example

[Figure: Main Application timeline annotated with the handler steps below]

• Save corruptible registers
• Save SPSR_EL1 and ELR_EL1
• Unmask interrupts
• Mask interrupts
• Restore SPSR_EL1 and ELR_EL1
• Restore corruptible registers

Example of a nested exception handler

ASM_IRQ_Handler:
    ; Save PCS corruptible registers, SPSR_EL1, and ELR_EL1
    ; Stack all corruptible registers
    ...
    ; Read SPSR_EL1 and ELR_EL1 into GP registers
    MRS X0, SPSR_EL1
    MRS X1, ELR_EL1
    ; Stack SPSR_EL1 and ELR_EL1
    STP X0, X1, [SP, #-16]!

    BL identify_and_clear_source
    ; Unmask IRQs
    MSR DAIFClr, #0b0010
    BL C_IRQ_Handler
    ; Mask IRQs
    MSR DAIFSet, #0b0010

    ; Restore PCS corruptible registers, SPSR_EL1, and ELR_EL1
    ; Restore SPSR_EL1 and ELR_EL1
    LDP X0, X1, [SP], #16
    MSR SPSR_EL1, X0
    MSR ELR_EL1, X1
    ; Restore corruptible registers
    ...

    ; Return from exception
    ERET


Synchronous exceptions

• Synchronous exceptions can occur for a wide variety of reasons
  • Aborts from the MMU (Permission Faults, Alignment Faults, regions of memory being marked as Faults, etc.)
  • SP and PC alignment checking
  • Unallocated instructions
  • Service Calls: SVC, HVC, and SMC
• These exceptions can be part of the normal operation of the OS
  • For example: A way for a task to request allocation of more memory
    ‐ Handler loads new page of code or data
• Or to indicate a fault
  • For example: Task attempts to access invalid memory location
    ‐ Handler terminates the process

System calls

• Some instructions can only be carried out at a specific exception level
  • Lower exception level code may need to perform a privileged operation
    ‐ Application code requests functionality from the kernel
• Applications can generate an exception via an SVC instruction
• Example:
  • Application code (EL0) requests memory via malloc()

Application code (EL0)    Library code (EL0)    Kernel code (EL1)

...                       malloc()              sys_brk()
...                       {                     {
...                           ...                   ...
malloc();                     MOV X1, #45           /* Allocate memory */
...                           SVC 0x1
...                           ...                   ...
...                       }                     }

Handling synchronous exceptions

• Information about the cause of a synchronous exception can be determined by reading the following registers…
• The Exception Syndrome Register (ESR_ELx)
  • Includes information about the reasons for the exception
  • See next slide…
• The Fault Address Register (FAR_ELx)
  • Holds the faulting virtual address
  • For all synchronous instruction and data aborts and alignment faults
• The Exception Link Register (ELR_ELx)
  • Holds the preferred return address
    ‐ For system calls – the next instruction after the call
    ‐ In most other cases, the instruction that triggered the exception

Exception Syndrome Register

• ESR_ELn contains information characterizing the reason for the exception
  • Updated for synchronous exceptions and SErrors (not updated for IRQ or FIQ)
• ESR bit fields
  • EC
    ‐ Exception Class (bits [31:26]); examples include:
      ‐ Exceptions from system register accesses to CP15, exceptions from FP operations
      ‐ SVC, HVC, SMC executed
      ‐ Data Aborts and Alignment Faults
      ‐ SError
  • ISS
    ‐ Instruction Specific Syndrome (bits [24:0]):
      ‐ Provides the immediate value associated with system calls (SVC, HVC, SMC)
      ‐ Provides register specifier information for the instruction
        • Example: ISS encoding for trapped MCR or MRC access
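A handler typically starts by splitting ESR_ELx into its fields. The sketch below does this on a raw 32-bit value, using the EC and ISS bit ranges given above. Two things are additions not stated in the slide: the IL bit at position 25, and the small table of EC encodings, which is taken from the Arm Architecture Reference Manual and is deliberately incomplete. The function and dictionary names are illustrative.

```python
# A few Exception Class encodings (not exhaustive; taken from the Arm ARM).
EC_NAMES = {
    0x15: "SVC from AArch64",
    0x16: "HVC from AArch64",
    0x17: "SMC from AArch64",
    0x24: "Data Abort from a lower EL",
    0x25: "Data Abort from the current EL",
    0x2F: "SError",
}

SYSTEM_CALL_ECS = (0x15, 0x16, 0x17)


def decode_esr(esr):
    """Split a 32-bit ESR_ELx value into EC, IL and ISS."""
    ec = (esr >> 26) & 0x3F       # Exception Class, bits [31:26]
    il = (esr >> 25) & 0x1        # Instruction Length bit, bit [25]
    iss = esr & 0x1FFFFFF         # Instruction Specific Syndrome, bits [24:0]
    fields = {"ec": ec, "il": il, "iss": iss, "name": EC_NAMES.get(ec, "other")}
    if ec in SYSTEM_CALL_ECS:
        # For system calls the low 16 bits of the ISS hold the immediate.
        fields["imm16"] = iss & 0xFFFF
    return fields
```

A kernel would use the `imm16` value only if its system call convention encodes information in the SVC immediate; many conventions pass the call number in a register instead.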


SError exceptions

• New exception type for Armv8-A processors
  • Asynchronous system errors – data aborts, parity errors, etc.
  • Very implementation dependent (on processor and on SoC)
• On Cortex-A5x processors, SError can be generated by:
  • Faults indicated on the bus (for example no slave at address or slave error)
  • Internal cache errors (for example non-correctable ECC fault)
  • External SErrors signalled to the cluster on the nSEI or nREI inputs
    ‐ For example failure of an L2 cache on another cluster
  • Internal Level 2 cache errors will cause an error output to be asserted
    ‐ SoC may route these signals to nSEI or nREI
• SError source found in Exception Syndrome Registers (ESR_ELx)


System calls to EL2/EL3

• The SVC system call instruction allows EL0 applications to call EL1 kernel code
• In addition, the HVC and SMC system call instructions move the processor to EL2 and EL3
• EL0 – Application
  • EL0 (user) cannot call directly into Hypervisor or Secure Monitor
    ‐ Only possible from EL1 and above
    ‐ Applications must use SVC to call into kernel and allow kernel to call into higher exception levels
• EL1 – Kernel
  • Can call the Hypervisor (EL2) with the HVC instruction
  • Can call the Secure Monitor (EL3) with the SMC instruction
    ‐ EL2 can trap SMC instructions from EL1
• EL2 – Hypervisor
  • Can call the Secure Monitor (EL3) with the SMC instruction

Routing exceptions to EL2 and EL3 (1)

[Figure: an incoming IRQ and the exception levels that could take it]

EL0   Application
EL1   Rich OS
EL2   Hypervisor        Example: IRQs routed to EL2 to be handled by the Hypervisor
EL3   Secure Monitor    Example: Secure interrupts are signalled as FIQs, with FIQs routed to EL3

• For systems that implement EL2 (Hypervisors) or EL3 (Secure Kernels)
  • Asynchronous exceptions can be routed to a higher EL to be dealt with by a Hypervisor or Secure kernel
    ‐ SCR_EL3 specifies exceptions to be routed to EL3
    ‐ HCR_EL2 specifies exceptions to be routed to EL2
    ‐ Separate bits control routing of IRQs, FIQs and SErrors

Routing exceptions to EL2 and EL3 (2)

• An exception may occur which is routed to a lower EL than the current one
  • Exceptions cannot be taken to a lower exception level
  • The exception remains pending until execution returns to an EL where it can be taken
• It is possible to set the bits to indicate an exception should be routed to both EL2 and EL3
  • It will then be routed to EL3
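The routing rules on the last two slides can be summarised as a small decision function. This is a deliberately simplified model written for illustration: it takes one routing bit from SCR_EL3 and one from HCR_EL2 for a single interrupt type, and it ignores PSTATE masking, HCR_EL2.TGE, virtual interrupts, and Secure EL2. The function name and parameters are assumptions made for this sketch, not architectural terms.

```python
def route_interrupt(current_el, scr_route_el3, hcr_route_el2, secure=False):
    """Return the EL an asynchronous exception is taken to, or None if it stays pending.

    scr_route_el3: the SCR_EL3 routing bit for this exception type (IRQ, FIQ or EA).
    hcr_route_el2: the HCR_EL2 routing bit for this exception type (IMO, FMO or AMO).
    secure:        True when executing in Secure state, where this simplified
                   model treats HCR_EL2 as having no effect on routing.
    """
    if scr_route_el3:
        target = 3            # EL3 wins when both routing bits are set
    elif hcr_route_el2 and not secure:
        target = 2
    else:
        target = 1
    if target < current_el:
        return None           # cannot be taken to a lower EL, so it remains pending
    return target
```

For instance, with both bits set the function returns 3 from any exception level, matching the statement that such an exception is routed to EL3.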

Example

• FIQ interrupts can be routed directly to EL3 using SCR_EL3.FIQ
• Other exceptions are routed to EL1 by the combination of the HCR_EL2 and SCR_EL3 values

Non-secure state, SCR_EL3.{EA = 0, FIQ = 1, IRQ = 0}

[Figure: SError, IRQ, and FIQ arriving while executing in Applications (EL0), Kernel (EL1), and Secure Monitor (EL3)]

Example system with EL3 – Non Secure

• The following view shows the system in the Non-Secure state
  • SCR_EL3.NS=1 (Non-secure), SCR_EL3.{EA=0,FIQ=1,IRQ=0}, HCR_EL2.{AMO=0,IMO=0,FMO=0}

      Non-secure                    Notes
EL0   AArch32 App, AArch64 App
EL1   AArch64 Linux                 Non-secure interrupts taken in EL1 (IRQs)
EL3   Firmware / Secure Monitor     Secure interrupts (FIQs) and SMC routed to EL3

• Non-Secure interrupts use VBAR_EL1 (points to Linux vector table)
  ‐ Interrupts handled directly by the Linux Kernel at EL1
• Secure interrupts and SMC calls are routed to EL3 (VBAR_EL3)
  ‐ Handled by the Secure Monitor
  ‐ The monitor may instigate a switch to the Secure state to service the interrupt

Example system with EL3 – Secure

• The following view shows the system in the Secure state
  • SCR_EL3.NS=0 (Secure), SCR_EL3.{EA=0,FIQ=0,IRQ=1}, HCR_EL2.{AMO=X,IMO=X,FMO=X}

      Secure                        Notes
EL0   Trusted Services
EL1   Trusted OS                    Secure interrupts taken in EL1 (FIQs)
EL3   Firmware / Secure Monitor     Non-secure interrupts (IRQs) and SMC routed to EL3

• Secure interrupts use VBAR_EL1 (points to Trusted Kernel vector table)
  ‐ Handled directly by the Trusted Kernel at EL1
• Non-Secure interrupts and SMC calls are routed to EL3 (VBAR_EL3)
  ‐ Handled by the Secure Monitor which may instigate a switch to the Non-Secure state to service the interrupt
• When executing in Secure state, HCR_EL2 does not affect interrupt routing

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
Copyright © 2020 Arm Limited (or its affiliates). All rights reserved.
Not to be reproduced by any means without prior written consent.

Memory Management
Armv8-A

Agenda

• Memory Management theory
• Stage 1 Translations at EL1/0
• Translations at EL2 / EL3
• TLB maintenance

3 0841 rev 32380



Why do we need memory management?

• The memory map of a typical system is partitioned into logical regions
• Each region may require different memory attributes
  • Access permissions
    ‐ Read/Write permissions for User/Privileged modes
  • Memory types
    ‐ Caching/Buffering and access ordering rules for memory accesses

Memory Map

Attribute            Region
Uncached             Peripherals
Privileged Access    OS
User Access          Application Space
Read-only            Vectors

What is virtual addressing?

• A system will have a physical address map defined by the hardware
• The core operates in a virtual address space defined by software
  • The mapping between virtual and physical addresses is defined by translation tables
• Virtual addressing gives an OS greater flexibility over memory management
  • For example: Making non-contiguous blocks of physical memory appear as a single block in the virtual address space

Virtual Memory Map

Attribute            Region
Uncached             Peripherals
Privileged Access    OS
User Access          Application Space
Read-only            Vectors

Physical Memory Map

Peripherals
RAM
Flash

What is a Memory Management Unit?

The MMU handles the translation of virtual addresses to physical addresses
• Provides hardware to read translation tables in memory (“table walking”)
• Translation Table Base Registers (TTBR) hold the physical base address of the tables
• Translation Lookaside Buffers (TLBs) cache recent translations

When the MMU is enabled, all accesses made by the core are passed through it
• MMU will use cached translations from the TLB(s), or perform a table walk
• Translation must occur before a cache lookup can complete

[Figure: the Arm Core connects to the MMU (TLBs and Table Walk Unit), then to the Caches, then to Memory, which holds the Translation Tables]

How a physical address is formed

• The core issues a 64-bit virtual address
  • Top bits identify which block is being accessed
    ‐ Used as an index within the translation table
  • Bottom bits give an offset within the physical block
• The MMU combines the physical address bits from the block table entry with the bottom bits from the original virtual address

[Figure: the virtual address issued by the core is split into VA Base and Offset; VA Base indexes the Translation Table, located by the Translation Table Base (TTBR), whose entries each hold a PA Base and Attributes; the physical address from the MMU is PA Base combined with Offset]

Multiple Levels of Translation Table

First-level tables divide large areas of virtual memory into coarse sections
• The first-level table is indexed using the first bitfield of the VA (L1 Index)
• In this example, table entries contain the physical base address of a 512MB block, or a pointer to a second-level table

Second-level tables subdivide first-level sections into smaller blocks
• The second-level table is indexed using the second bitfield of the VA (L2 Index)
• Table entries contain the physical base address of a 64KB block

The last bitfield of the VA provides the offset of the final physical output address

[Figure: the input virtual address is split into L1 Index, L2 Index, and Offset; the Translation Table Base (TTBR) locates the L1 Table, whose entry gives either a 512MB Block or the Table Base of the L2 Table; the L2 Table entry gives the Output Base of a 64KB Block, which is combined with Offset to form the physical address]


Armv8-A Translation Tables

AArch64 supports a single “Long Descriptor” translation table format

AArch32 supports two formats
• Armv7-A Short Descriptor format
  ‐ Provides backward compatibility for legacy code
  ‐ Cannot be used by the Hypervisor (EL2 or 2nd stage translations)
• Armv7-A (LPAE) Long Descriptor format (as used by Cortex-A15)
  ‐ Very similar to AArch64 format, but only supports 32-bit input address

AArch64 Translation Tables

An evolution of the LPAE Armv7-A long descriptor format
• Same 64-bit long-descriptor format
  ‐ Now supports up to 48-bit input and output addresses
• Input VA now comes from a 64-bit register (bits 63:48 must all be the same)
  ‐ New Level 0 Table Index introduced
• Uses same format as LPAE level 1 table

AArch64 supports 3 different translation granules
• 4KB, 16KB, or 64KB
  ‐ Defines block size at lowest level of translation table and size of tables
• Configurable for each TTBR
• It is IMPLEMENTATION DEFINED which of the three are supported
  ‐ ID_AA64MMFR0_EL1 reports supported sizes

Larger granules reduce the number of levels of table required
• Particularly important in Virtualized systems

AArch64 Table Descriptor Format

Table descriptor types are identified by bits [1:0] and provide either:
• The base address of a next-level Table (further subdividing memory)
• The base address of a (variable-sized) Block of memory

Level 0 descriptors only output Table addresses, Level 3 only output Block addresses
• Note: The same descriptor type code has a different format at Level 3

Descriptor                           Bits [63:2]                                                 Bits [1:0]
Table Descriptor (Levels 0, 1, 2)    Attributes, Next-level Table Address                        1 1
Block Descriptor (Levels 1, 2)       Upper Attributes, Output Block Address, Lower Attributes    0 1
Block Descriptor (Level 3)           Upper Attributes, Output Block Address, Lower Attributes    1 1
Fault Descriptor (Invalid Entry)     Ignored                                                     X 0
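The descriptor type depends on both bits [1:0] and the lookup level, which is easy to get wrong when writing table-walking code. The sketch below classifies a raw 64-bit descriptor following the table above. The function name is illustrative, and one behaviour is an assumption beyond the slide: a 0b01 encoding at Level 0 or Level 3, which the table does not list, is modelled here as a fault.

```python
def classify_descriptor(descriptor, level):
    """Classify a 64-bit translation table descriptor as 'table', 'block' or 'fault'."""
    if level not in (0, 1, 2, 3):
        raise ValueError("level must be 0, 1, 2 or 3")
    type_bits = descriptor & 0b11
    if type_bits & 0b01 == 0:
        return "fault"                      # bit 0 clear: invalid entry, bit 1 ignored
    if type_bits == 0b11:
        # Same encoding, different meaning at the last level of lookup.
        return "block" if level == 3 else "table"
    # type_bits == 0b01
    if level in (1, 2):
        return "block"
    return "fault"                          # assumption: not a listed encoding at this level
```

Whether a Level 1 block is actually permitted also depends on the translation granule, which this sketch does not check.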

AArch64 Tables with 4KB Granules

4KB Granule
• 4-level look up, 48-bit address, 9 address bits per level (512 entries)

Virtual Address bits    Field                  Each entry can
[47:39]                 Level 0 Table Index    Point to an L1 Table (No Block entries)
[38:30]                 Level 1 Table Index    Point to an L2 Table, or point to a 1GB Block
[29:21]                 Level 2 Table Index    Point to an L3 Table, or point to a 2MB Block
[20:12]                 Level 3 Table Index    Point to a 4KB Block (No Table entries)
[11:0]                  Block Offset

AArch64 Tables with 16KB Granules

16KB Granule
• 4-level look up, 48-bit address, 11 address bits per level (2048 entries)
  ‐ Level 0 is a “very small” table (2 entries)

Virtual Address bits    Field                  Each entry can
[47]                    Level 0 Table Index    Point to an L1 Table (No Block entries)
[46:36]                 Level 1 Table Index    Point to an L2 Table (No Block entries)
[35:25]                 Level 2 Table Index    Point to an L3 Table, or point to a 32MB Block
[24:14]                 Level 3 Table Index    Point to a 16KB Block (No Table entries)
[13:0]                  Block Offset

AArch64 Tables with 64KB Granules

64KB Granule
• 3-level look up, 48-bit address, 13 address bits per level (8192 entries)
  ‐ Top Level is a partial table, 6 bits (64 entries)

Virtual Address bits    Field                  Each entry can
[47:42]                 Level 1 Table Index    Point to an L2 Table (No Block entries)
[41:29]                 Level 2 Table Index    Point to an L3 Table, or point to a 512MB Block
[28:16]                 Level 3 Table Index    Point to a 64KB Block (No Table entries)
[15:0]                  Block Offset
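The three granule layouts differ only in where the virtual address is cut. The following sketch encodes the bit ranges from the 4KB, 16KB, and 64KB slides in one table and extracts each index from a 48-bit virtual address. The names `GRANULE_FIELDS`, `bits`, and `split_va` are invented for this example.

```python
# (field name, high bit, low bit) for each granule, as listed on the granule slides.
GRANULE_FIELDS = {
    "4KB": [("L0", 47, 39), ("L1", 38, 30), ("L2", 29, 21), ("L3", 20, 12), ("offset", 11, 0)],
    "16KB": [("L0", 47, 47), ("L1", 46, 36), ("L2", 35, 25), ("L3", 24, 14), ("offset", 13, 0)],
    "64KB": [("L1", 47, 42), ("L2", 41, 29), ("L3", 28, 16), ("offset", 15, 0)],
}


def bits(value, high, low):
    """Extract value[high:low], inclusive at both ends."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def split_va(va, granule):
    """Return the table index used at each level, plus the block offset."""
    return {name: bits(va, high, low) for name, high, low in GRANULE_FIELDS[granule]}
```

A quick sanity check is that the field widths for each granule add up to 48 bits, for example 9 + 9 + 9 + 9 + 12 for the 4KB granule.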

Separate tables for app/kernel space

AArch64 EL1/0 supports two translation tables
• Allowing separate kernel and user virtual address spaces of configurable size at the top and bottom of memory
  ‐ Area between regions automatically faults (no tables required)
• Table base addresses are specified in TTBR0_EL1 and TTBR1_EL1
• Input virtual address can be a full 64-bit VA
  ‐ TTBR1_EL1 selected when upper 16 bits are all 1, TTBR0_EL1 selected when upper 16 bits are all 0
  ‐ Size of each address region controlled by TnSZ fields in the Translation Control Register (TCR_EL1)
• But, both of these regions must map to within a single 48-bit physical address space

Virtual Address Space

Address range                                    Region               Table
0xFFFF_0000_0000_0000 - 0xFFFF_FFFF_FFFF_FFFF    Kernel space         TTBR1
(between the two regions)                        FAULT                none
0x0 - 0x0000_FFFF_FFFF_FFFF                      Application space    TTBR0

Size of regions can be adjusted between set minimum and maximum values

Physical Address Space

0x0 - 0x0000_FFFF_FFFF_FFFF
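The choice between TTBR0_EL1 and TTBR1_EL1 is a range check on the virtual address, with the range sizes set by T0SZ and T1SZ. The sketch below models that check. It assumes address tagging (top-byte ignore) is not in use, and the function name and default arguments are chosen for this example only.

```python
def select_ttbr(va, t0sz=16, t1sz=16):
    """Return which table a 64-bit VA uses: 'TTBR0_EL1', 'TTBR1_EL1' or 'FAULT'."""
    va &= (1 << 64) - 1
    ttbr0_limit = 1 << (64 - t0sz)                # size of the bottom (application) region
    ttbr1_base = (1 << 64) - (1 << (64 - t1sz))   # start of the top (kernel) region
    if va < ttbr0_limit:
        return "TTBR0_EL1"
    if va >= ttbr1_base:
        return "TTBR1_EL1"
    return "FAULT"                                # the gap between regions needs no tables
```

With the default of 16 for both fields, this reproduces the 48-bit regions shown in the address space table; a larger TnSZ shrinks the corresponding region.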

Translation Control Register

Translation Control Register (TCR_EL1)
• Configures many aspects of memory management for EL1/EL0 (kernel/user)
• Settings related to address ranges and granule size are shown

Bits       Field
[34:32]    IPS
[31:30]    TG1
[21:16]    T1SZ
[15:14]    TG0
[5:0]      T0SZ

IPS (Intermediate Physical address Size)
• Controls the maximum output address size; an abort will occur if translations output addresses beyond this range
  ‐ 0b000 = 32-bit, 0b101 = 48-bit

T1SZ, T0SZ
• Kernel- and User-space Virtual Address Space Size
• Address range = 2^(64 - TnSZ)

TG1, TG0
• Kernel- and User-space Translation Granule
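The fields discussed above can be pulled out of a raw TCR_EL1 value with shifts and masks. The sketch below is a host-side model for illustration. The bit positions and the encodings beyond `0b000` and `0b101` for IPS, and all of the TG0 and TG1 encodings, are taken from the Arm Architecture Reference Manual rather than from the slide, and the names used are hypothetical.

```python
# IPS encodings to output address size in bits (from the Arm ARM).
IPS_BITS = {0b000: 32, 0b001: 36, 0b010: 40, 0b011: 42, 0b100: 44, 0b101: 48}

# Note that TG0 and TG1 use different encodings for the same granule sizes.
TG0_GRANULE = {0b00: "4KB", 0b01: "64KB", 0b10: "16KB"}
TG1_GRANULE = {0b01: "16KB", 0b10: "4KB", 0b11: "64KB"}


def decode_tcr_el1(tcr):
    """Decode the address-range and granule fields of a TCR_EL1 value."""
    t0sz = tcr & 0x3F                 # bits [5:0]
    t1sz = (tcr >> 16) & 0x3F         # bits [21:16]
    return {
        "T0SZ": t0sz,
        "T1SZ": t1sz,
        "TG0": TG0_GRANULE.get((tcr >> 14) & 0x3, "reserved"),
        "TG1": TG1_GRANULE.get((tcr >> 30) & 0x3, "reserved"),
        "IPS_bits": IPS_BITS.get((tcr >> 32) & 0x7, "reserved"),
        # Address range is 2^(64 - TnSZ), so the VA width in bits is 64 - TnSZ.
        "ttbr0_va_bits": 64 - t0sz,
        "ttbr1_va_bits": 64 - t1sz,
    }
```

The differing TG0 and TG1 encodings are a common source of bugs, which is why the sketch keeps two separate lookup tables.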

Setting the First Level of Lookup

The level of translation table used for the first lookup is configurable
• No need to implement all levels of table if full 48-bit address range is not required (allows flatter translation table walks)
• The first level of lookup is implicitly controlled by the TCR_ELx.TnSZ fields and the specified granule size
  ‐ Configured separately for TTBR0_EL1 and TTBR1_EL1
• Permitted address range is 2^(64 - TnSZ)
  ‐ Example: TnSZ = 34, address range is 30 bits

Table Address Range

Level    4KB Granule       16KB Granule      64KB Granule
0        40- to 48-bit     48-bit            (none)
1        31- to 39-bit     37- to 47-bit     43- to 48-bit
2        25- to 30-bit     26- to 36-bit     30- to 42-bit
3        (none)            25-bit            25- to 29-bit

Caching Translation Tables


The MMU can be configured to allow translation tables to be stored in cacheable memory

• Typically gives better performance than going directly to external memory


• Controlled by fields in TCR_EL1

The Translation Control Register (TCR_EL1)


• Describes the caching and shareability of translation tables (for TTBR0_EL1 / TTBR1_EL1)
‐ SH0/1 Shareability, IRGN0/1 Inner Cacheability, ORGN0/1 Outer Cacheability

TCR_EL1 fields shown (bit positions):
Bits      Field
[34:32]   IPS
[31:30]   TG1
[29:28]   SH1
[27:26]   ORGN1
[25:24]   IRGN1
[21:16]   T1SZ
[15:14]   TG0
[13:12]   SH0
[11:10]   ORGN0
[9:8]     IRGN0
[5:0]     T0SZ

You must ensure that the attributes specified in the TCR_EL1 match those specified for the virtual memory region
covering the translation tables
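
As a rough illustration of these fields, the following C sketch composes the table-walk attributes for each TTBR. The helper and enum names are hypothetical; the encodings are those of the SHn, ORGNn and IRGNn fields.

```c
#include <assert.h>
#include <stdint.h>

/* Encodings shared by the IRGNn and ORGNn fields. */
enum tcr_rgn {
    RGN_NC       = 0, /* Non-cacheable */
    RGN_WB_WA    = 1, /* Write-Back, Read-Allocate, Write-Allocate */
    RGN_WT       = 2, /* Write-Through, Read-Allocate */
    RGN_WB_NO_WA = 3  /* Write-Back, Read-Allocate, no Write-Allocate */
};

/* Encodings of the SHn fields. */
enum tcr_sh { SH_NON = 0, SH_OUTER = 2, SH_INNER = 3 };

/* Walk attributes for tables reached through TTBR0_EL1. */
static uint64_t tcr_walk_attrs_ttbr0(enum tcr_sh sh, enum tcr_rgn orgn,
                                     enum tcr_rgn irgn)
{
    assert(sh != 1); /* 0b01 is a reserved shareability encoding */
    return ((uint64_t)sh << 12) | ((uint64_t)orgn << 10) | ((uint64_t)irgn << 8);
}

/* Walk attributes for tables reached through TTBR1_EL1. */
static uint64_t tcr_walk_attrs_ttbr1(enum tcr_sh sh, enum tcr_rgn orgn,
                                     enum tcr_rgn irgn)
{
    assert(sh != 1);
    return ((uint64_t)sh << 28) | ((uint64_t)orgn << 26) | ((uint64_t)irgn << 24);
}
```

A common choice is Inner Shareable, Write-Back Write-Allocate for both inner and outer, provided the memory holding the tables is mapped with matching attributes.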


Contiguous Block entries


Translation table block entries contain a Contiguous Bit
• Setting this allows the TLB to cache a single entry covering translations for multiple blocks

To use the contiguous bit

• The blocks must be adjacent (correspond to a contiguous range of VA)
‐ 16 adjacent blocks with 4KB granule
‐ 32 or 128 adjacent blocks with 16KB granule
‐ 32 adjacent blocks with 64KB granule
• Have consistent attributes
• Start on an aligned boundary
• Point to a contiguous output address range at the same level of translation

If these conditions are not met it is considered a programming error which may result in TLB aborts or corrupted
lookups
• For example:
‐ If any of the table entries do not have the contiguous bit set
‐ If the output of one of the entries points outside the aligned range
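
The conditions above can be expressed as a software check over a candidate run of entries. This is only a sketch: `struct entry` is a simplified, hypothetical view of a descriptor, and the alignment rule is interpreted as both the VA and the output address of the run being aligned to the full span.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified view of one block or page entry; field names are illustrative. */
struct entry {
    uint64_t va;         /* input address mapped by this entry */
    uint64_t pa;         /* output address */
    uint64_t attrs;      /* all attribute bits, excluding the address */
    bool     contiguous; /* Contiguous bit */
};

/* Returns true if `count` entries of `block_size` bytes form a valid
 * contiguous run (for example count = 16, block_size = 4096). */
static bool contiguous_run_ok(const struct entry *e, size_t count,
                              uint64_t block_size)
{
    assert(e != NULL && block_size != 0);
    uint64_t span = block_size * count;

    /* The run must start on a boundary aligned to the whole span. */
    if (count == 0 || e[0].va % span != 0 || e[0].pa % span != 0)
        return false;

    for (size_t i = 0; i < count; i++) {
        if (!e[i].contiguous)                     return false;
        if (e[i].attrs != e[0].attrs)             return false;
        if (e[i].va != e[0].va + i * block_size)  return false;
        if (e[i].pa != e[0].pa + i * block_size)  return false;
    }
    return true;
}
```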


Agenda
• Memory Management theory
• Stage 1 Translations at EL1/0
• Translations at EL2 / EL3
• TLB maintenance

Translation Tables Overview


The Virtualization extensions introduce a second stage of translation
• First stage: Virtual Address (VA) → Intermediate Physical Address (IPA)
‐ Operation of MMU appears unchanged for guest OSs; still use TTBRn_EL1 and TCR_EL1
• Second stage: Intermediate Physical Address (IPA) → Physical Address (PA)
‐ Controlled by the Hypervisor

The Hypervisor and Secure Monitor also have a set of Stage 1 translation tables
• Mapping directly from VA to PA

[Figure: translation regimes]
• Guest OS (EL1/EL0): virtual memory map → Stage 1 (Guest OS tables, TTBRn_EL1) → physical memory map seen by Guest OS → Stage 2 (virtualization tables, VTTBR_EL2) → real physical memory map
• Hypervisor (EL2): virtual memory map → Hypervisor tables (TTBR0_EL2) → real physical memory map
• Secure Monitor (EL3): virtual memory map → Secure Monitor tables (TTBR0_EL3) → real physical memory map

Stage 2 Translations (IPA → PA)


Stage 2 translations:
• Use an additional set of tables to allow a Hypervisor to map IPAs to PAs
• Controlled by the Hypervisor Configuration Register (HCR_EL2)
• Only applies to Non-secure EL1/0 accesses
• Table base address specified by VTTBR_EL2
• Allows a single contiguous address space of variable size at the bottom of memory (up to 48-bit)
• Virtualization Translation Control Register (VTCR_EL2)
‐ T0SZ[5:0] configures the size of the address space
‐ TG0 specifies the translation granule size
‐ First level of table lookup is controlled by SL0 (unlike Stage 1)
‐ Any access outside the defined address range causes a Translation Fault

[Figure: Intermediate Physical Address space. The region translated through VTTBR_EL2 starts at 0x0; addresses above it, up to 0x0000_FFFF_FFFF_FFFF, FAULT]

Stage 1 Translation EL2/3


The Hypervisor (EL2) and Secure Monitor (EL3) have their own Stage 1 tables
• Map directly from virtual to physical address space

• Table base is specified in TTBR0_EL2 and TTBR0_EL3

• Allows a single contiguous address space of variable size at the bottom of memory (up to 48-bit)

• Translation Control Registers (TCR_EL2, TCR_EL3)
‐ T0SZ[5:0] configures the size of the address space
‐ TG0 specifies the translation granule size
‐ First level of lookup is implicitly controlled by T0SZ and translation granule size
‐ Any access outside the defined address range causes a Translation Fault

[Figure: Hypervisor or Secure Monitor virtual address space. The region translated through TTBR0_EL2/3 starts at 0x0; addresses above it, up to 0x0000_FFFF_FFFF_FFFF, FAULT]

Secure World Translation Tables


The Secure Monitor (EL3) has dedicated tables
• Table base address specified in TTBR0_EL3 and configured via TCR_EL3
• EL3 translation tables are capable of accessing both Secure and Non-secure physical addresses

However, once the transition to Secure world has completed
• The trusted kernel uses the EL1 translation regime TTBR0/TTBR1_EL1 tables
• These registers are not banked in AArch64
• It is the job of the Secure Monitor to bank these registers and configure new tables for the Secure world

When in the Secure world, the EL1 translation regime has two differences from when it is in the Non-secure state:
• The second stage of translation is disabled
• The EL1 translation regime is capable of pointing to Secure or Non-secure physical addresses

Agenda
• Memory Management theory
• Stage 1 Translations at EL1/0
• Translations at EL2 / EL3
• TLB maintenance

Translation table change example


The TLBs keep cached copies of recently used translations
• Whenever the translation tables are modified the TLBs must be manually invalidated
• TLBs cannot cache table entries which result in a translation fault

In A64 there is a TLBI instruction to invalidate the TLBs

<< Write to Translation Tables >>
DSB ISH      ; Ensure write has completed
TLBI ???
DSB ISH      ; Ensure completion of TLB invalidation
ISB          ; Synchronize context on this processor

AArch64 instructions
AArch64 TLB maintenance operations are performed via instructions
TLBI <type><level>{IS} {, <Xt>}
Type
• ALL - All TLB entries
• VMALL - All TLB entries (stage 1, for current guest OS)
• VMALLS12 - All TLB entries (stage 1 & 2 for current guest OS)
• ASID - Entries that match ASID in Xt
• VA - Entry for virtual address and ASID specified in Xt
• VAA - Entries for virtual address specified in Xt, with any ASID
• There are more…
Level
• En = ELn virtual address space (n can be 3, 2, or 1)
IS
• Inner-shareable operation
Examples:
TLBI VAE1, X0 ; Invalidate address/ASID in X0, for EL1 virtual address space
TLBI ALLE3    ; Invalidate entries for the EL3 virtual address space
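
For the by-VA forms, the Xt operand packs the ASID and the page number of the virtual address. The C sketch below shows that packing; `tlbi_va_operand` is a hypothetical helper name, and bits [47:44] of the operand are simply left at zero.

```c
#include <stdint.h>

/* Build the Xt operand for a by-VA TLBI such as TLBI VAE1, Xt:
 *   bits [63:48] = ASID
 *   bits [43:0]  = VA[55:12]
 * Bits [47:44] are left as zero in this sketch. */
static uint64_t tlbi_va_operand(uint64_t va, uint16_t asid)
{
    uint64_t page = (va >> 12) & ((UINT64_C(1) << 44) - 1);
    return ((uint64_t)asid << 48) | page;
}
```

The resulting value would be placed in X0 before executing an instruction such as `TLBI VAE1, X0`, between the barriers shown in the translation table change example.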

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

Memory Model
Armv8-A

Memory model
• The memory map of a typical system is partitioned into logical regions
• Each region may require different memory attributes
• Access permissions
‐ Read/Write permissions for User/Privileged modes
• Access rules
‐ Caching and ordering rules

Memory Map
• Peripherals: Uncached
• OS: Privileged Access
• Application Data: Unprivileged Access, Writeable
• Application Code: Unprivileged Access, Read-only

How attributes are specified


Block Descriptor (bits 63 to 0): Upper Attributes | Output Block Address | Lower Attributes

Field                        Bits
Reserved for Software use    58:55
UXN                          54
PXN                          53
Contig                       52
nG                           11
AF                           10
SH                           9:8
AP                           7:6
NS                           5
Indx                         4:2

• UXN, PXN   Executable permissions
• AF         Access flag
• nG         Non-global
• SH         Shareability
• AP         Access permissions
• NS         Security (EL3 and Secure-EL1 only)
• Indx       Index into Memory Attribute Indirection Register (MAIR_ELn)
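
These fields can be combined in software when building a descriptor. The sketch below assumes a 2MB block entry (level 2 with the 4KB granule) and a 48-bit output address; the macro and function names are illustrative, while the bit positions are the ones listed above.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_VALID_BLOCK  UINT64_C(0x1)                    /* bits [1:0] = 0b01 */
#define DESC_INDX(i)      ((uint64_t)((i) & 0x7) << 2)     /* MAIR index */
#define DESC_NS           (UINT64_C(1) << 5)
#define DESC_AP(ap)       ((uint64_t)((ap) & 0x3) << 6)
#define DESC_SH(sh)       ((uint64_t)((sh) & 0x3) << 8)
#define DESC_AF           (UINT64_C(1) << 10)
#define DESC_NG           (UINT64_C(1) << 11)
#define DESC_CONTIG       (UINT64_C(1) << 52)
#define DESC_PXN          (UINT64_C(1) << 53)
#define DESC_UXN          (UINT64_C(1) << 54)

/* Hypothetical helper: build a 2MB block descriptor.
 * `flags` is any OR of the single-bit macros above. */
static uint64_t make_block_2mb(uint64_t pa, unsigned attr_indx, unsigned ap,
                               unsigned sh, uint64_t flags)
{
    assert((pa & ((UINT64_C(1) << 21) - 1)) == 0); /* 2MB aligned */
    assert(pa < (UINT64_C(1) << 48));              /* 48-bit output address */
    return pa | flags | DESC_SH(sh) | DESC_AP(ap) | DESC_INDX(attr_indx)
              | DESC_VALID_BLOCK;
}
```

For instance, `make_block_2mb(0x40000000, 1, 0, 3, DESC_AF | DESC_UXN | DESC_PXN)` describes a privileged read/write, never-executable, Inner Shareable block that uses MAIR entry 1.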


Hierarchical attributes
• The descriptor format provides support for hierarchical attributes
• Allows an attribute to be set at one level to be inherited by the next levels
• Use for:
‐ Access permissions (APTable)
‐ Security (NSTable)
‐ Executable permissions (UXNTable, PXNTable)

Table           Entry contents            Attribute       Effect
Level 0 Table   Level 1 Table Address     PXNTable = 0    No effect; use PXN attributes of next-level table
Level 1 Table   Level 2 Table Address     PXNTable = 1    Override L2/L3 entries; all block descriptors treated as PXN = 1
Level 2 Table   Physical Block Address    PXN = X         PXN attribute ignored
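
The inheritance rule in this example amounts to a logical OR along the walk. A minimal C sketch, with a hypothetical helper name and the table bits passed as an array ordered from the first lookup level:

```c
#include <stdbool.h>
#include <stddef.h>

/* Effective PXN for a block: forced to 1 if any table descriptor on the
 * walk had PXNTable = 1, otherwise the block's own PXN bit applies. */
static bool effective_pxn(const bool *pxn_table, size_t table_levels,
                          bool block_pxn)
{
    for (size_t i = 0; i < table_levels; i++) {
        if (pxn_table[i])
            return true; /* the block's PXN bit is ignored */
    }
    return block_pxn;
}
```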


Agenda
• Types
• Attributes
• Alignment and endianness
• Tagged pointers

Memory Types
Each defined memory region has a specified memory type
The memory type affects how the processor can access the region

There are two mutually exclusive memory types
• Normal and Device

Additional attributes are specified to control:
• Access permissions
• Executable permissions
• Shareability
• Cacheability

Memory Types: Normal


The Normal type is used for code and most data regions

Normal memory gives the best performance because it imposes the fewest restrictions
• Allows the processor to re-order, repeat, and merge accesses

For optimal performance, application code and data should be marked as Normal
• Ordering can still be enforced when required using explicit barrier operations

Address regions marked as Normal can be accessed speculatively
• Data or instructions fetched from memory before being explicitly referenced
• Speculative access may, for example, be caused by:
‐ Branch prediction
‐ Out-of-order data loads
‐ Speculative cache line fills

Memory Map
• Peripherals
• OS: Normal
• Application Data: Normal
• Application Code: Normal

Memory Types – Normal (2)


Normal memory implements a “weakly ordered” memory model
• There is no requirement for Normal accesses to complete in order with respect to other Normal and Device accesses
• However, a processor must handle address dependencies

[Figure: memory locations from 0x1000 to 0x1004]

STR  X0, [0x1000]
STRB W1, [0x1003]
LDRH W2, [0x1002]

In the example, the accesses are to overlapping addresses
• Processor must ensure the memory is updated as if the STR and STRB occurred in order
‐ These accesses might be merged into a single access
• The LDRH must return the most up-to-date value

Memory Types: Device


The Device type is used for regions where accesses can have side-effects
• Example: Writing to a peripheral’s control register may trigger an interrupt as a side-effect
• Typically only used for peripherals

Device type imposes more restrictions on the core

Attempting to execute from a region marked as Device is UNPREDICTABLE
• Device regions should always be marked as Execute Never

Speculative data accesses cannot be performed to Device regions
• Speculative instruction fetches are covered later…

Memory Map
• Peripherals: Device
• OS
• Application Data
• Application Code

Memory Types: Device (2)


Four variants of Device are available:
• Device-nGnRnE - most restrictive
• Device-nGnRE
• Device-nGRE
• Device-GRE - least restrictive

Gathering (G, nG)
• Determines whether multiple accesses can be merged into a single bus transaction
• nG: number/size of accesses on the bus = number/size of accesses in code

Re-ordering (R, nR)
• Determines whether accesses to same device can be re-ordered
• nR: accesses to the same IMPLEMENTATION DEFINED block size will appear on the bus in program order

Early Write Acknowledgement (E, nE)
• Indicates to the memory system whether a buffer can send acknowledgements
• nE: The response should come from the end slave, not buffering in interconnect

Ordering of Device accesses


Example instruction sequence:
LDR X0, [A]
LDR X1, [B]
LDR X2, [A+8]
LDR X3, [C]
LDR X4, [B+8]

Memory Map
• Region A: Device-nGnRnE
• Region B: Device-GRE
• Region C: Device-nGnRnE

Effect of ordering rules:
• The two accesses to region A are guaranteed to be in program order with respect to each other
• The two accesses to region B are not guaranteed to be in program order with respect to each other, or with respect to the accesses to regions A and C
• It is IMPLEMENTATION DEFINED whether the accesses to region A will occur in program order with respect to the accesses to region C

Specifying the type


Translation table entries do not directly encode the type
• Each block entry specifies a 3-bit index into a table of types

The table of types is held in the Memory Attribute Indirection Register (MAIR_ELn)
• Eight entry table, each entry is 8 bits

MAIR_ELn (bits 63 to 0): entry 7 | entry 6 | entry 5 | entry 4 | entry 3 | entry 2 | entry 1 | entry 0

Type Encoding, for example:
0b00000000 = Device-nGnRnE
...
0b01000100 = Normal, Non-cacheable
...
0b11111111 = Normal, Inner/Outer Write-Back Cacheable
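
A short C sketch of how such a table might be assembled, using the three example encodings above. The helper names and the choice of which index holds which type are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Example attribute encodings from the slide. */
#define MAIR_DEVICE_nGnRnE  0x00u
#define MAIR_NORMAL_NC      0x44u
#define MAIR_NORMAL_WB      0xFFu

/* Store an 8-bit attribute in entry `index` (0 to 7) of a MAIR value. */
static uint64_t mair_set(uint64_t mair, unsigned index, uint8_t attr)
{
    assert(index < 8);
    mair &= ~(UINT64_C(0xFF) << (index * 8));
    return mair | ((uint64_t)attr << (index * 8));
}

/* Read back the attribute selected by a descriptor's 3-bit Indx field. */
static uint8_t mair_get(uint64_t mair, unsigned index)
{
    assert(index < 8);
    return (uint8_t)(mair >> (index * 8));
}
```

A descriptor with Indx = 2 would then select whatever `mair_get(mair, 2)` returns, so changing the register contents changes the type of every region that uses that index.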


Stronger to weaker device memory


Device-nGnRnE is the strongest memory type
• Defines the rules that memory accesses must obey

As the memory type weakens, those rules are relaxed


• E.g. Allowing Gathering

[Figure: Device types ordered from weakest to strongest: Device-GRE, Device-nGRE, Device-nGnRE, Device-nGnRnE]

Access rules for stronger memory are still legal for weaker memory types
• For example: The Reordering attribute means accesses may be reordered, not that they must be reordered

An implementation may use the same behaviour for different memory types
• Must use the behaviour of the strongest type
• Bus infrastructures may not be able to express all memory types

Software can specify the weakest type necessary for correct operation
• Behaviour will still be correct if the access is upgraded to a stronger memory type

Agenda
• Types
• Attributes
• Alignment and endianness
• Tagged pointers

Cacheability
Regions marked as Normal can be Cacheable or Non-cacheable
• Specified as Inner and Outer attributes
• The divide between Inner and Outer is IMPLEMENTATION DEFINED
‐ Typically Inner attributes are used by the integrated caches and Outer attributes are exported onto the bus

[Figure: Inner Cacheable and Outer Cacheable domains in two example systems built from cores, L1$, L2$, L3$, and the memory system]

Cache policies are covered in a separate module

Shareable
The Shareable attribute is used to define whether a location is shared with multiple processors
• Non-shareable – only used by this observer
• Inner Shareable / Outer Shareable – shared with other observers
‐ The division between inner and outer is IMPLEMENTATION DEFINED

Example:
[Figure: two multi-core processors and a Mali GPU, grouped into Non-shared, Inner Shareable, and Outer Shareable domains]

These attributes can define sets of observers for which the Shareability attributes make the data/unified caches transparent for data accesses
• Requires system to provide coherency management

Normal memory behaviour guarantees


Different types of Normal memory may also become weaker to allow extra behaviours
• Example: Cacheable means the data may be cached, not that it must be cached

The Shareability attribute indicates whether other masters will access the data

• Complex systems may make data visible using extra coherency logic
• PEs without coherent caches can share data by forcing memory to not be cached

Software that specifies the correct memory type for the desired behaviour will then work correctly on any implementation

[Figure: PE0 and PE1, each with a D$, connected through coherency logic to memory; and a single PE with a cache connected to memory]

Access permissions
Access permissions control whether a region is readable and/or writeable
Separate permissions can be set for unprivileged and privileged accesses

AP    Unprivileged (EL0)    Privileged (EL1/2/3)
00    No access             Read/write
01    Read/write            Read/write
10    No access             Read-only
11    Read-only             Read-only

Memory Map
• Peripherals: Privileged
• OS: Privileged
• Application Data: Unprivileged Read/Write
• Application Code: Unprivileged Read-only
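
The AP table decodes into two independent bits, which the following C sketch makes explicit. `struct perms` and `decode_ap` are illustrative names; the input is the 2-bit AP field exactly as tabulated above.

```c
#include <assert.h>
#include <stdbool.h>

struct perms {
    bool el0_read, el0_write;   /* unprivileged (EL0) */
    bool priv_read, priv_write; /* privileged (EL1/2/3) */
};

/* Decode the 2-bit AP field of a block or page descriptor. */
static struct perms decode_ap(unsigned ap)
{
    assert(ap < 4);
    bool read_only  = (ap & 0x2) != 0; /* upper bit: writes not allowed */
    bool el0_access = (ap & 0x1) != 0; /* lower bit: EL0 may access */

    struct perms p;
    p.priv_read  = true;
    p.priv_write = !read_only;
    p.el0_read   = el0_access;
    p.el0_write  = el0_access && !read_only;
    return p;
}
```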


Executable
Blocks can be marked as executable or non-executable (XN)
• UXN – Unprivileged Execute Never
• PXN – Privileged Execute Never
• Setting these attributes prevents speculative instruction fetches

Processor can also be configured to treat writeable regions as Execute Never
• SCTLR_EL1.WXN
‐ Regions writeable at EL0 treated as XN at EL0 and EL1
‐ Regions writeable at EL1 treated as XN at EL1
• SCTLR_EL2/3.WXN
‐ Regions writeable at ELn treated as XN at ELn
• SCTLR.UWXN
‐ AArch32 only
‐ Regions writeable at EL0 treated as XN at EL1

Memory Map
• Peripherals: Not executable
• OS: Executable
• Application Data: Not executable
• Application Code: Executable

Device regions should always be marked as Execute Never

Access flag
The Access Flag (AF) is a bit in each Block descriptor
• Indicates whether the entry has been referenced yet
• AF = 0: This block entry has not yet been used
• AF = 1: This block entry has been used

Entries with AF = 0 trigger an MMU fault
• Abort handler must manually set the AF bit in the table entry
• Translation tables can be written with the AF bits set if this feature is not being used

Why use the access flag?
• Could be used to check whether an application has used allocated memory
‐ Table entries with AF = 0 have not been accessed
‐ Could potentially re-allocate the memory
• Could also be used to influence whether a page is a “good” candidate to be swapped out

Global / Non-Global translations


• Translation table entries can be marked as Global (nG = 0) or Non-global (nG = 1)
• Global entries apply to all tasks (e.g. Kernel space entries)
• Non-Global entries only apply to the current task

• For Non-Global entries the TLBs record the current ASID
• TLB hit only occurs if the ASID in TLB entry matches current ASID
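
The matching rule can be modelled in a few lines of C. This is a conceptual model only: real TLBs are hardware structures, and `struct tlb_entry` is a hypothetical simplification that ignores VMIDs, security state, and page sizes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conceptual model of one cached translation. */
struct tlb_entry {
    uint64_t va_tag;     /* virtual page tag */
    uint16_t asid;       /* ASID recorded when the entry was filled */
    bool     non_global; /* copy of the descriptor's nG bit */
    bool     valid;
};

/* A Global entry matches any ASID; a Non-Global entry must also match
 * the current ASID. */
static bool tlb_hit(const struct tlb_entry *e, uint64_t va_tag,
                    uint16_t current_asid)
{
    if (!e->valid || e->va_tag != va_tag)
        return false;
    return !e->non_global || e->asid == current_asid;
}
```

This is why an OS can switch tasks by changing the current ASID without invalidating the TLB: stale Non-Global entries simply stop matching.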


[Figure: Translation Lookaside Buffer (TLB) filled from the translation table, with current ASID 0x02]
Entry        VA Tag   ASID   Descriptor
Non-Global   0xFFE    0x02   PA and Attributes
Global       0x001    ---    PA and Attributes

Memory Map
• Peripherals: Global
• OS: Global
• Application Data: Non-global
• Application Code: Non-global

ASIDs
ASID stands for Address Space Identifier

AArch64
• ASID is 8 or 16-bit, controlled by the TCR_EL1.AS bit
• Current ASID specified in TTBR0_EL1 or TTBR1_EL1
‐ TCR_EL1 (A1 bit) controls which TTBR holds the ASID
‐ Typically TTBR0_EL1, as this corresponds to application space

AArch32, Long Descriptor Format
• ASID is 8-bit
• Current ASID specified in TTBR0 or TTBR1
‐ TTBCR controls which TTBR holds the ASID
‐ Typically TTBR0, as this corresponds to application space

AArch32, Short Descriptor Format
• ASID is 8-bit
• Specified in the CONTEXTIDR register

Reserved bits
Entries in the Long Descriptor format include bits reserved for use by the OS
• These bits are guaranteed to be ignored by the hardware
• So can be used to record OS specific information in the translation tables

Table Descriptor (bits 63 to 0): Attributes | Next-level Table Address | bits [1:0] = 0b11
• Reserved for Software use: bits 58:55

Block Descriptor (bits 63 to 0): Upper Attributes | Output Block Address | Lower Attributes | bits [1:0] = 0b01
• Reserved for Software use: bits 58:55

Physical address spaces


Armv8-A defines two security states
• Secure and Non-secure (“Normal”)

It also defines two physical address spaces
• Secure and Non-secure

These are in theory completely separate
• SP:0x8000 != NP:0x8000
• But most systems instead treat Secure and Non-secure as an attribute for access control

The Normal world can only access the Non-secure physical address space

The Secure world can access both physical address spaces
• Controlled through the translation tables

[Figure: Secure EL1/EL0 translation tables map Secure code and data into the Secure physical address space and Non-secure data into the Non-secure physical address space; Non-secure EL1/EL0 translation tables map only into the Non-secure physical address space. Each physical address space contains peripherals, RAM, and flash]

Agenda
• Types
• Attributes
• Alignment and endianness
• Tagged pointers

Alignment
An unaligned access is where the address is not aligned to the element size

LDRH W0, [0x8001] ; Unaligned

Armv8-A permits unaligned accesses to address regions marked as Normal
• Unaligned accesses to Device regions are faulted (Synchronous exception)
• All unaligned accesses will be faulted if the SCTLR_ELx.A bit is set for a given EL

Most A64 load/store instructions can perform unaligned accesses
• Exceptions: Load/Store Exclusive and Load-Acquire/Store-Release
‐ These instructions must always be element aligned
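
The rules above can be summarised as a small decision function. The C sketch below is a simplified model with hypothetical names; it only covers the cases listed on this slide and treats the access size as the element size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum mem_type { MEM_NORMAL, MEM_DEVICE };

/* True if addr is a multiple of size; size must be a power of two. */
static bool is_aligned(uint64_t addr, unsigned size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (uint64_t)(size - 1)) == 0;
}

/* Whether an access is expected to take an alignment fault.
 * sctlr_a models the SCTLR_ELx.A bit for the current EL;
 * exclusive_or_acqrel marks Load/Store Exclusive and
 * Load-Acquire/Store-Release instructions. */
static bool alignment_fault(uint64_t addr, unsigned size, enum mem_type type,
                            bool sctlr_a, bool exclusive_or_acqrel)
{
    if (is_aligned(addr, size))
        return false;
    return type == MEM_DEVICE || sctlr_a || exclusive_or_acqrel;
}
```

For the halfword load at 0x8001 shown above, the model reports no fault for Normal memory with the A bit clear, and a fault in every other case.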


Endianness
Data accesses can be little-endian (LE) or big-endian (BE)
• Instruction fetches are always LE

Data endianness is controlled independently for each EL

• SCTLR_ELx.EE controls ELx data endianness
• SCTLR_EL1.E0E controls EL0 data endianness
• CPSR.E and SCTLR.EE in AArch32

It is IMPLEMENTATION DEFINED whether a processor supports both LE and BE
• If only little-endian is supported, the .EE and .E0E bits become RES0
• If only big-endian is supported, the .EE and .E0E bits become RES1

Agenda
• Types
• Attributes
• Alignment and endianness
• Tagged pointers

Tagged pointers (AArch64 only)


Addresses are stored in 64-bit registers (Xn)
• ...But the top 16 bits of an address MUST be all 0xFFFF or 0x0000 (any other bit value will trigger a fault)

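[Figure: virtual address space mapped to physical address space]
• 0xFFFF_FFFF_FFFF_FFFF down to 0xFFFF_0000_0000_0000: OS and peripherals, translated through the TTBR1_EL1 translation tables
• Between the two regions: FAULT
• 0x0000_FFFF_FFFF_FFFF down to 0x0: Application, translated through the TTBR0_EL1 translation tables
• Physical address space contains peripherals, flash, and RAM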

Architecture supports tagged pointers


• Top 8 bits [63:56] of virtual address are ignored by core
‐ Internally core uses bit [55] to sign-extend address to 64-bit format
• Allows bits [63:56] to be used to store other information
• Enabled through TCR_EL1, and controlled separately for each TTBR
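
In software, tagging and untagging a pointer value is plain bit manipulation. The sketch below works on 64-bit integers rather than real pointers, and the function names are hypothetical; `untag_pointer` mirrors the sign-extension from bit [55] described above.

```c
#include <stdint.h>

/* Place an 8-bit tag in bits [63:56] of an address value. */
static uint64_t tag_pointer(uint64_t va, uint8_t tag)
{
    return (va & ~(UINT64_C(0xFF) << 56)) | ((uint64_t)tag << 56);
}

/* Extract the tag from bits [63:56]. */
static uint8_t pointer_tag(uint64_t p)
{
    return (uint8_t)(p >> 56);
}

/* Recover the canonical address by replicating bit [55] into bits [63:56]. */
static uint64_t untag_pointer(uint64_t p)
{
    uint64_t low = p & ((UINT64_C(1) << 56) - 1);
    if (p & (UINT64_C(1) << 55))
        return low | (UINT64_C(0xFF) << 56);
    return low;
}
```

With tagging enabled the core performs the equivalent of `untag_pointer` itself on each access, so software only needs the explicit form when comparing or printing addresses.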

Appendix: Linux access flag example


Linux provides:
#define pte_young(pte) (pte_val(pte) & L_PTE_YOUNG)
...
pte_mkold(pte)     → Marks PTE (page table entry) as not recently accessed
pte_mkyoung(pte)   → Marks PTE as recently accessed
...

From arch/arm64/include/asm/pgtable.h:
...
PTE_BIT_FUNC(mkold, &= ~PTE_AF);    → Clears AF bit
PTE_BIT_FUNC(mkyoung, |= PTE_AF);   → Sets AF bit
...

From arch/arm64/include/asm/pgtable-hwdef.h:
...
#define PTE_AF (_AT(pteval_t, 1) << 10) /* Access Flag */
...

Appendix: ASID with short-descriptor format


With the short-descriptor format, translation table base and ASID are specified in separate registers (TTBRn and CONTEXTIDR)

This is dangerous: there is a race between setting the ASID and changing the TTBRn
• New ASID could be used for walks from old translation tables, or vice versa

There are several possible solutions; for example,
• The TTBCR.PDn bits disable table walks from TTBRn
• Assuming TTBR0 is used for application space and TTBR1 for OS, with OS table only containing global (nG=0) entries:

Set TTBCR.PD0 = 1 (disabling table walks from TTBR0)
ISB
Change ASID to new value
Change Translation Table Base Register to new value
ISB
Set TTBCR.PD0 = 0 (re-enabling table walks from TTBR0)

Copyright © 2020 Arm Ltd. All rights reserved.
Not to be reproduced by any means without prior written consent.

DynamIQ and Neoverse


Architectural Updates

Overview
Architecture versions
▪ Arm DynamIQ CPUs (such as Cortex-A55, Cortex-A75, and Cortex-A76) and Neoverse CPUs (E1 and N1) implement:
▪ Armv8.1 extensions
▪ Armv8.2 extensions
▪ Advanced SIMD and floating point support (Optional in Cortex-A55)
▪ Cryptographic extension (Optional)
▪ RAS extension
▪ Armv8.3 LDAPR instructions

▪ Not all Armv8.1 and Armv8.2 features are supported
▪ Software can use identification registers to determine which features are implemented

▪ DynamIQ and Neoverse cores are fully backwards compatible with Armv8-A software

Agenda

▪ Large System Extensions
▪ Memory system
▪ Other Changes

Atomic instructions AArch64 only


▪ DynamIQ and Neoverse cores introduce new atomic memory access instructions to A64
▪ CAS - Compare and swap
▪ LD<OP> - Load and <operation>
▪ SWP - Swap

▪ Atomics can optionally have an ordering specifier
▪ A=Acquire, L=Release or AL=Acquire & Release

▪ DynamIQ and Neoverse cores support atomic instructions internally when memory is defined as
▪ Inner or Outer Shareable, Inner Write-Back, Outer Write-Back Normal memory with Read allocation hints and Write allocation hints and not transient

▪ Other memory types will require fabric / interconnect support

Atomic instructions – examples

▪ Compare and Swap: CAS x0,x1,[x2]
tmp = *x2
if (*x2 == x0)
    *x2 = x1
x0 = tmp

▪ Load and Add: LDADD x0,x1,[x2]
x1 = *x2
*x2 = x1 + x0

▪ Swap: SWP w0, w1, [x2]
tmp = *x2
*x2 = w0
w1 = tmp

▪ NOTE: These are pseudo code sequences, and the actual execution of the instructions would be atomic
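
The same three operations are available portably through C11 `<stdatomic.h>`. The wrappers below are an illustrative sketch with hypothetical names; whether a compiler lowers them to CAS, LDADD, and SWP or to exclusive load/store loops depends on the target options, so no particular code generation is assumed here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Compare and swap: stores `desired` only if *p equals `expected`.
 * Returns the value that was observed in memory, like CAS does in x0. */
static uint64_t cas64(_Atomic uint64_t *p, uint64_t expected, uint64_t desired)
{
    /* On failure, `expected` is overwritten with the observed value;
     * on success it already equals the observed value. */
    atomic_compare_exchange_strong(p, &expected, desired);
    return expected;
}

/* Load and add: returns the original value, memory gets original + v. */
static uint64_t ldadd64(_Atomic uint64_t *p, uint64_t v)
{
    return atomic_fetch_add(p, v);
}

/* Swap: returns the original value, memory gets v. */
static uint32_t swp32(_Atomic uint32_t *p, uint32_t v)
{
    return atomic_exchange(p, v);
}
```

These default to sequentially consistent ordering; the `_explicit` variants with `memory_order_acquire` or `memory_order_release` correspond loosely to the A and L ordering specifiers.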

How are atomics implemented?

▪ The architecture does not specify how the atomics are implemented
▪ There are a number of possible approaches

[Figure: two example systems of Arm cores, coherent caches, an interconnect, and memory. In each, an LDADD of +1 updates the stored value (6 → 7 in one, 2 → 3 in the other) and the load returns the original value]

Access flag and Dirty bit AArch64 only


▪ Support added for hardware updating of the Access Flag in translation table entries
▪ Controlled by the TCR_ELn.HA bit for stage 1 translations, and VTCR_EL2.HA for stage 2
▪ Support is IMPLEMENTATION DEFINED; if not supported, the bit is RES0
▪ When HA==1:
▪ Accesses to a page/block will cause the Access Flag bit to be set atomically by hardware without generating an exception
▪ Can be triggered by speculative accesses
▪ Supported in AArch64 only

▪ DirtyBitModifier bit added to block and page descriptors (bit 51)
▪ If a page is marked as RO and DirtyBitModifier==1, writes to the page will cause the AP bits to be updated to RW
▪ Allows software to easily check whether a page has been written to

Agenda
▪ Large System Extensions
▪ Memory system
▪ Other Changes

Common not Private


▪ The Armv8-A architecture does not require that VMIDs and ASIDs have the same meaning on all cores
▪ In theory, software could use the same VMID/ASID for different tasks on different cores
▪ In practice, this is rare

▪ DynamIQ and Neoverse cores add the CnP (Common not Private) bit to the translation table base registers
▪ When set, VMIDs and ASIDs must have the same meaning on all PEs in the inner shareable domain
▪ Allows for sharing of TLB entries
▪ Used by Neoverse-E1 for helping to implement SMT (each PE can have different tasks for the same ASID or VM)
▪ Ignored on Cortex-A55 and Cortex-A75

[Figure: two PEs, each with an MMU, sharing a TLB in front of the caches and memory]

Persistent memory (AArch64 only)

▪ DynamIQ and Neoverse cores add support for persistent memory
▪ Only on DSU CHI configs; not supported on DSU ACE configs

▪ Persistent memory behaves like DRAM
▪ Similar read/write random access times, but state persists over power down
▪ Architecture is technology agnostic

▪ New cache maintenance operation: DC CVAP, Xn
▪ Data cache clean by virtual address to Point of Persistence (PoP)
▪ PoP, if it exists, is a point at or beyond the Point of Coherency (PoC)
▪ Pushes dirty data far enough out into the memory system that it will persist over power down

▪ DSU CHI configurations send CleanSharedPersist transactions
▪ If there is no PoP in the memory system, the DSU’s BROADCASTPERSIST input signal should be tied LOW to convert cache maintenance operations by PoP to be by PoC instead

[Figure: example memory system hierarchy. Cortex-A76 I$ and D$ (PoU), private U$, shared U$ (DSU L3), CMN-600 CHI interconnect with System Level Cache (PoC), DRAM, and NVRAM (PoP).]

Agenda

▪ Large System Extensions
▪ Memory system
▪ Other Changes


Privileged Access Never


▪ DynamIQ and Neoverse cores add the PAN (Privileged Access Never) control to PSTATE/CPSR
▪ 0 – Same behaviour as Armv8.0-A
▪ 1 – Privileged data accesses to EL0-accessible virtual addresses cause a fault
▪ Privileged accesses means EL1, or EL2 when HCR_EL2.E2H==1
▪ Apart from LDTR/STTR in AArch64, or LDRT/STRT in AArch32
▪ Does not affect cache maintenance or address translation operations

▪ PAN can be set manually by software
▪ A64: MSR/MRS
▪ A32/T32: SETPAN #imm1

▪ SCTLR_EL1.SPAN controls PAN state on exceptions taken to EL1
▪ SCTLR_EL2.SPAN for exceptions taken to EL2, with HCR_EL2.E2H==1 and HCR_EL2.TGE==1

Other changes
▪ DynamIQ and Neoverse cores improve support for Type 2 hypervisors (discussed later)
▪ Adds support for 16-bit VMID

▪ New SIMD instructions (AArch64 and AArch32)
▪ Vector Saturating Rounding Doubling Multiply Accumulate Returning High Half
▪ Vector Saturating Rounding Doubling Multiply Subtract Returning High Half

▪ Hierarchical attributes can be disabled (AArch64 only)
▪ Controlled by the HPD bits in TCR_ELn
▪ When HPD==1, the APTable, PXNTable and UXNTable bits in table entries are ignored by hardware
▪ Bits could potentially be re-used by software, similar to the Reserved For Software Usage fields
▪ NSTable bit unaffected

▪ PMU extensions (AArch64 and AArch32)
▪ Event number extended, to allow more IMPLEMENTATION DEFINED events
▪ MDCR_EL2.HPMD allows disabling of event counting in EL2


Other changes
▪ The top 4 bits of block/page descriptors can be made IMPLEMENTATION DEFINED (instead of IGNORED)
▪ Controlled by TCR_ELn/VTCR_ELx.HWn bits

▪ Behaviour of LDTR/STTR changes
▪ When PSTATE.UAO==1, LDTR and STTR behave like equivalent LDR and STR instructions.

▪ Additional Address Translation (AT) operations
▪ Added to support PSTATE.PAN bit

▪ Stage 2 translations can now specify separate execute attributes for EL0 and EL1
▪ XN field extended to two bits

Other changes
▪ ACTLR2 becomes mandatory in Armv8.2-A
▪ Was previously optional

▪ Armv7 ordering rules for LDM/STM to Device regions are deprecated
▪ From Armv8.2-A, these accesses can be re-ordered and interrupted
▪ New bits added to SCTLR_EL1 and SCTLR_EL2 switch between new and legacy behaviour:
▪ LSMAOE – Load/Store Multiple Atomicity and Ordering Enable
▪ nTLSMD – no Trap Load/Store Multiple to Device-nG* regions


Armv8.1 features

Armv8.1 feature                                              Supported?
New atomic instructions                                      Yes
SIMD instructions for rounding double multiply operations    Yes
Hierarchical permission disables                             Yes
Hardware updates to Access flag and dirty state bit          Yes
Privileged Access Never                                      Yes
LORegions                                                    Limited
16-bit VMID                                                  Yes
Virtualization Host Extensions                               Yes

Armv8.2 features

Armv8.2 feature                                  Supported?
Common Not Private translations (CnP)            Limited
EL0 vs EL1 execute never control at Stage 2      Yes
Page based hardware attributes                   Yes
PSTATE control to Modify LDTR/STTR               Yes
FP16                                             Yes
Larger VA, IPA and PA support                    No
Persistent memory                                Yes
Statistical profiling extension                  No
VMID aware PIPT instruction cache                No
RAS extension                                    Yes

DynamIQ and Neoverse Architectural Updates

Overview

Appendix: PMU Extensions


▪ Hyp performance monitors disable
▪ A control bit in MDCR_EL2 / HDCR is added to allow the hypervisor or Host OS to filter event counting at EL2

▪ Extended event number space
▪ Allows for many more IMPLEMENTATION DEFINED event types

▪ Additional common events
▪ New architecturally defined event types to address the gap to standard software APIs such as Linux perf-events, and support for more levels of TLB and cache analysis
▪ These changes are made retrospectively to Armv8-A

▪ Architecturally required events
▪ The STALL_FRONTEND and STALL_BACKEND events must be implemented

Copyright © 2023 Arm Limited (or its affiliates). All rights reserved.
Not to be reproduced by any means without prior written consent.

DynamIQ and Neoverse Barriers

Confidential © 2023 Arm

Memory model

The Arm architecture defines a weak ordering model…
• … between accesses to Normal memory regions
• … between Normal memory and Device memory accesses

This means that accesses might not occur in program order

The architecture also allows for speculative accesses
• Data or instructions fetched from memory before being explicitly referenced
• Examples of speculative access include:
– Branch prediction
– Out-of-order data loads
– Speculative cache line fills

Speculative data accesses are only allowed to Normal memory

Speculative instruction fetches are allowed to any region that’s executable at some exception level

3 1383 rev 35733


Maven Silicon, 2023:04:20


Why do I care about access order?


In most cases precise access order does not matter
• But sometimes it is necessary to force access ordering

Examples of when ordering matters:
• Sharing data between different threads/cores
– e.g. mail boxes
• Sharing data with peripherals
– e.g. DMA operations
• Modifying instruction memory
– e.g. loading a program into RAM or scatter loading
• Modifying memory management scheme
– e.g. context switching or demand paging

Where access order is important you may need to use barrier instructions

Compilers/assemblers will not automatically insert barriers for you!


Barriers

The Arm architecture includes barrier instructions to force access order and access completion at a specific point

• DMB Data Memory Barrier
• DSB Data Synchronization Barrier
• ISB Instruction Synchronization Barrier

This module provides an introduction to barriers and their use, but if you are writing code where ordering is important we recommend also reading:

• Arm Architecture Reference Manual Armv8-A
– B2.7 Memory ordering
– D4.4.8 Ordering and completion of data and instruction cache instructions
– Appendix F Barrier Litmus Tests
▪ Includes worked examples

Agenda
• Data barriers
• Instruction barriers
• DynamIQ and Neoverse extensions


DMB vs DSB

A Data Memory Barrier (DMB) is less restrictive than a Data Synchronization Barrier (DSB)

DMB
• The DMB only affects the ordering of explicit data accesses
– Data cache operations are treated as explicit data accesses
• Ensures that all explicit data accesses before the DMB in program order are observed before any explicit access after the DMB

DSB
• No instruction or explicit data access after a DSB can be started until:
– All explicit data accesses before the DSB in program order have completed
– All cache, branch predictor and TLB maintenance operations issued by the local processor have completed

Use a DSB when necessary, but don’t overuse it

DMB
Explicit memory accesses before the DMB are observed before any explicit access after the DMB
• Does not guarantee when the operations happen, just the order

LDR X0, [X1] ; Must be seen by memory system before STR
DMB SY
ADD X2, X2, #1 ; May be executed before or after memory system sees LDR
STR X3, [X4] ; Must be seen by memory system after LDR

The effects of any data/unified cache maintenance operations issued by this core before the DMB are observed by explicit data accesses after the DMB
• No effect on operations broadcast by other cores

DC CVAC, X5
LDR X0, [X1] ; Effect of data cache clean might not be seen by this instruction
DMB SY
LDR X2, [X3] ; Effect of data cache clean will be seen by this instruction


DSB
A DSB is more restrictive than a DMB
• Use a DSB when necessary, but do not overuse it

No instruction after a DSB will execute until:
• All explicit memory accesses before the DSB in program order have completed
• Any outstanding cache/TLB/branch predictor operations complete

DC ISW ; Operation must have completed before DSB can complete
STR X0, [X1] ; Access must have completed before DSB can complete
DSB SY
ADD X2, X2, #3 ; Cannot be executed until DSB completes

In a multi-core system, if a cache/TLB/branch predictor maintenance operation is broadcast, the operation must have completed on all cores that received it
• Operations received by the core via broadcast do not affect DSBs

Different observers
The core’s instruction interface, data interface and MMU table walker are separate observers
• An observer is something that can make memory accesses (e.g. MMU generates reads to walk translation tables)

No ordering enforced between different observers

DC CVAU, X0 ; Operations are executed in any order
IC IVAU, X0 ; despite address dependency. Could lead
; to I cache re-fetching old values!

A DSB instruction is often needed between such operations:

DC CVAU, X0
DSB ISH
IC IVAU, X0 ; I cache now guaranteed to see new values


DMB and DSB qualifiers


Definitions on the previous slides are simplifications

DMB and DSB take a qualifier option which defines the range of applicability for the barrier
• Optional in Armv7-A, mandatory in Armv8-A

Two degrees of freedom for this “limitation argument”

• Shareability domain:
– Full System
– Outer Shareable
– Inner Shareable
– Non-shareable

• Accesses for which the barrier operates (before – after):
– Load – Load/Store (new to Armv8-A)
– Store – Store
– Any – Any

Barrier Qualifiers
Qualifier    Ordered Accesses (Before-After)    Shareability Domain
NSHLD        Load-Load, Load-Store              Non-shareable
NSHST        Store-Store                        Non-shareable
NSH          Any-Any                            Non-shareable
ISHLD        Load-Load, Load-Store              Inner Shareable
ISHST        Store-Store                        Inner Shareable
ISH          Any-Any                            Inner Shareable
OSHLD        Load-Load, Load-Store              Outer Shareable
OSHST        Store-Store                        Outer Shareable
OSH          Any-Any                            Outer Shareable
LD           Load-Load, Load-Store              Full System
ST           Store-Store                        Full System
SY           Any-Any                            Full System
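The two degrees of freedom in the table map onto the 4-bit option field (CRm) of the A64 DMB/DSB encoding: bits [3:2] select the shareability domain and bits [1:0] select the access types. The sketch below decodes that field; the type and function names are hypothetical, and encodings with bits [1:0] == 0b00 (which are not among the twelve qualifiers above) are simply reported as reserved.

```c
#include <assert.h>
#include <stdint.h>

/* Values match CRm[3:2] */
typedef enum { DOMAIN_OSH = 0, DOMAIN_NSH = 1, DOMAIN_ISH = 2, DOMAIN_SY = 3 } BarrierDomain;
/* Values match CRm[1:0]; LD = Load-Load/Load-Store, ST = Store-Store */
typedef enum { ACCESS_RESERVED = 0, ACCESS_LD = 1, ACCESS_ST = 2, ACCESS_ANY = 3 } BarrierAccess;

typedef struct {
    BarrierDomain domain;
    BarrierAccess access;
} BarrierOption;

/* Split a 4-bit barrier option (e.g. 0xB for ISH, 0xF for SY) into its
 * shareability domain and ordered-access fields. */
BarrierOption decode_barrier_option(uint32_t crm)
{
    BarrierOption opt;
    assert(crm <= 0xFu);
    opt.domain = (BarrierDomain)((crm >> 2) & 0x3u);
    opt.access = (BarrierAccess)(crm & 0x3u);
    return opt;
}
```

For example, ISHLD is option 0x9 (Inner Shareable, loads before the barrier) and ST is option 0xE (Full System, Store-Store).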


Mail box example


P0 – DMB needed to ensure mail box is seen BEFORE the flag is updated
P1 – DMB needed to ensure mail box read AFTER flag is seen

P0 – Send Message

LDR X1, =ADDR_MAILBOX_DATA
LDR X2, =ADDR_MAILBOX_FLAG

; Write a new message into mail box
STR X5, [X1]

DMB ISHST

; Set available flag to signal that
; mail box is full
MOV X0, #0
STR X0, [X2]

P1 – Receive Message

LDR X1, =ADDR_MAILBOX_DATA
LDR X2, =ADDR_MAILBOX_FLAG

; Wait for available flag
loop:
LDR X12, [X2]
CBNZ X12, loop

DMB ISHLD

; Read message
LDR X0, [X1]

Assumption: P0 and P1 are in the same Inner Shareable domain
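The same mail box pattern can be sketched in C11, where the fences play the role of the two DMBs. This is an illustration under assumptions: the `Mailbox` type and function names are hypothetical, the flag polarity is inverted relative to the assembly (here `true` means full), and the exact barrier a compiler emits for each fence depends on the toolchain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t    data;  /* message payload (plain, non-atomic) */
    atomic_bool full;  /* true when a message is waiting */
} Mailbox;

/* Sender (P0): write the message, then publish the flag. The release
 * fence keeps the data store ordered before the flag store. */
void mailbox_send(Mailbox *mb, uint64_t msg)
{
    assert(mb != NULL);
    mb->data = msg;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&mb->full, true, memory_order_relaxed);
}

/* Receiver (P1): check the flag, then read the message. The acquire
 * fence keeps the data load ordered after the flag load. */
bool mailbox_try_receive(Mailbox *mb, uint64_t *msg)
{
    assert(mb != NULL && msg != NULL);
    if (!atomic_load_explicit(&mb->full, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *msg = mb->data;
    return true;
}
```

A receiver would normally call `mailbox_try_receive` in a loop, as P1 does with its `loop:` label.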


Speculation across barriers (1)


P0 = Cached processor, E1 = Uncached processor (e.g. Cortex-M3)
*X2 – flag from P0 to tell E1 to start writing data buffer
*X4 – flag from E1 to tell P0 that data are available

P0
DC CIVAC, X1 ; Clean & Invalidate region
DMB SY
STR W0, [X2] ; Send flag to E1
WAIT ([X4] == 1) ; Has E1 completed?
DMB SY
DC IVAC, X1 ; Invalidate region again
LDR W5, [X1] ; Read new data

E1
WAIT ([R2] == 1) ; Is P0 ready?
STR R5, [R1] ; Save data
DMB
STR R0, [R4] ; Send flag to P0

[Figure: P0 and E1 connected through the interconnect to RAM.]


Speculation across barriers (2)


After the first data cache clean/invalidate, P0 could speculatively re-fetch the region into the data cache
• If this happened during the writes by E1 the data cache could be populated with the wrong (old) data
• The barrier forces the data to be read after the completion flag is seen
– BUT P0 would be reading that data from the data cache!

This can be fixed by adding a second data cache invalidate
• Performed after the completion flag is seen
• Ensures any speculatively fetched data is discarded

Do I need the first data cache clean/invalidate?
• The cache may contain dirty lines, which we want to discard before E1’s writes

Memory mapped peripherals


A peripheral does not need to perform a memory access to read or write its own registers
• Observation rules only apply to memory accesses
• If the peripheral interacts with memory, a DSB might be needed instead of a DMB to ensure correct ordering

Example: DMA transfer
• Set up data in memory (uncached buffer)
• Write to register to have DMA controller copy data to another location

STR X5, [X1] ; Store data to source buffer
DSB ST
STR W0, [X4] ; Write to DMA controller to
; begin transfer

[Figure: core, RAM and DMA controller connected through the interconnect.]

DSB needed to ensure that the data is visible (globally observable) at the point in time that the DMA controller receives the command


“One-Way” Barriers (1)


AArch64 adds new load/store instructions with implicit barrier semantics

Load-Acquire (LDAR)
• All accesses after the LDAR are observed after the LDAR
• Accesses before the LDAR are not affected

LDR
STR
LDAR ; Accesses can cross a barrier in one direction but not the other
LDR
STR

“One-Way” Barriers (2)


AArch64 adds new load/store instructions with implicit barrier semantics

Store-Release (STLR)
• All accesses before the STLR are observed before the STLR
• Accesses after the STLR are not affected

LDR
STR
STLR ; Accesses can cross a barrier in one direction but not the other
LDR
STR


“One-Way” Barriers (3)


LDAR and STLR may be used as a pair
• To protect a critical section of code
• May have lower performance impact than a full DMB
• No ordering is enforced within the critical section

Exclusive versions also available
• LDAXR, STLXR
• Remove the need for explicit barrier instructions in synchronization code

LDR
STR
LDAR ; Start of critical code section
LDR
STR
STLR ; End of critical code section
LDR
STR
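A critical section bracketed by acquire and release operations can be sketched in C11 as a minimal spinlock. The `SpinLock` type and function names are hypothetical; the assumption is that an AArch64 compiler implements the acquire test-and-set with a load-acquire exclusive sequence (or an acquire atomic when Armv8.1-A atomics are available) and the release clear with a store-release, so no separate DMB is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_flag locked;
} SpinLock;

/* Acquire: accesses after this call cannot be observed before it. */
bool spin_trylock(SpinLock *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

void spin_lock(SpinLock *l)
{
    while (!spin_trylock(l)) {
        /* busy-wait until the holder releases the lock */
    }
}

/* Release: accesses before this call cannot be observed after it. */
void spin_unlock(SpinLock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Code between `spin_lock` and `spin_unlock` corresponds to the critical code section in the diagram; accesses outside it may still move into the section, but not out of it.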


“One-Way” Barriers (4)


Acquire/Release operations use a “sequentially consistent” model

“One-way” barrier semantics do not apply to sequences of LDAR/STLR pairs
• Subsequent LDAR loads are observed in-order with previous STLR stores

LDR
STR
STLR ; Regular accesses can cross the STLR barrier, but LDAR cannot
LDR
STR
LDAR

Agenda

• Data barriers
• Instruction barriers
• DynamIQ and Neoverse extensions

ISB
The Arm architecture defines context as the system registers

Context-changing operations include:
• Cache, TLB, and branch predictor maintenance operations
• Changes to system control registers (e.g., SCTLR_EL1, TCR_EL1, TTBRn_EL1, CONTEXTIDR_EL1)

The effect of a context-changing operation is only guaranteed to be seen after a context synchronization event
• Taking an exception
• Returning from an exception
• Instruction Synchronization Barrier (ISB)

An ISB flushes the pipeline, and re-fetches the instructions from the cache (or memory)
• Guarantees that effects of any completed context-changing operation before the ISB are visible to any instruction after the barrier
• Also guarantees that context-changing operations after the ISB instruction only take effect after the ISB has been executed


ISB example

FPU/Advanced SIMD are enabled in AArch64 by writing the Architectural Feature Access Control Register (CPACR_EL1)

MRS X1, CPACR_EL1
ORR X1, X1, #(0x3 << 20) ; Write CPACR_EL1.FPEN bits
MSR CPACR_EL1, X1
ISB
FADD S0, S1, S2 ; Without the ISB this might have
; triggered an exception

The ISB is a context synchronization event which ensures the enable is complete before any subsequent FPU/NEON instructions are executed

Translation table change example


In some cases a combination of barriers is needed
• For example, consider an update of the translation tables

STR X11, [X1] ; Update a translation table entry
DSB ISH ; Ensure write has completed
TLBI VAE1IS, X10 ; Invalidate affected VA
DSB ISH ; Ensure completion of the TLB invalidation
ISB ; Synchronize context on this processor

The DSB is needed to ensure that the maintenance operations complete

The ISB is needed to ensure the effects of those operations are seen by the following instructions


Self-modifying code (1)


P0 loads a new program into memory, which then gets executed by P0 and P1

P0
STR X11, [X1] ; Save instruction to program memory
DC CVAU, X1 ; Clean D$ so instruction is visible to I$
; (Note that clean to PoU may be NoP’d)
DSB ISH ; Ensure clean completes on all cores
IC IVAU, X1 ; Discard stale data from I$
DSB ISH ; Ensure I$ invalidates complete for all cores
STR X0, [X2] ; Set flag == 1 to signal completion
ISB ; Synchronize context on this core
BR X1 ; Branch to new code

P1-Pn
WAIT ([X2] == 1) ; Wait for flag signaling completion
; No DSB required here
ISB ; Synchronize context on this core
BR X1 ; Execute newly saved instruction

Assumption: P0 to Pn are in the same Inner Shareable domain

Self-modifying code (2)


On DynamIQ and Neoverse CPUs, cleans to PoU are not necessary
• PoU is the L1 data cache (instruction cache can snoop there)
• Data cache clean to PoU will be NoP’d
• However, to stay architecturally compatible, that clean to PoU should stay in the code

There is no barrier between the writes to program memory and the D cache clean
• Both operations specify the same address and both are data side operations (same observer)
• So guaranteed to be in program order

A DSB is needed between the data cache clean and I cache invalidate
• Although both operations specify the same address, one is a data side operation and the other is an instruction side operation (different observers)
• The DSB forces the data side operation to complete before the instruction side operations start

In a coherent system, the DSB forces the cache operations to complete not just on P0 but also on the other cores


Agenda

• Data barriers
• Instruction barriers
• DynamIQ and Neoverse extensions

Limited Ordering Regions


DynamIQ and Neoverse CPUs support Limited Ordering Regions
• Allows access ordering to be limited to accesses in the same region

Non-secure physical address space divided into LORegions
• Each LORegion is defined by one or more descriptors
– The number of supported regions and descriptors is reported by LORID_EL1

Each descriptor is defined by:
• Start address (LORSA_EL1) and end address (LOREA_EL1)
• Region number (LORN_EL1); all the descriptors with the same region number together form a region

LDLAR – Load Acquire, within LORegion
• Explicit data accesses after the barrier, that access an address within the same LORegion, are observed after the barrier

STLLR – Store Release, within LORegion
• Explicit data accesses before the barrier, that access an address within the same LORegion, are observed before the barrier

[Figure: PEs on Socket0 and Socket1, with LORegion0 on Socket0 and LORegion1 on Socket1. Accesses by PEs on Socket 0 to LORegion0 don’t need to wait for accesses to LORegion1 to complete (and vice versa).]
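How descriptors combine into regions can be sketched with a small software model. Everything here is hypothetical illustration: the `LorDescriptor` struct, the inclusive end address, and the function names are assumptions, not the register formats. The model only answers whether two physical addresses fall in the same LORegion; it does not model what LDLAR/STLLR do when their own address is outside every region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t start;   /* first byte covered by this descriptor */
    uint64_t end;     /* last byte covered (inclusive in this model) */
    unsigned region;  /* region number shared by related descriptors */
} LorDescriptor;

#define LOR_NONE (-1)

/* Return the region number containing `pa`, or LOR_NONE. */
int lor_region_of(const LorDescriptor *descs, size_t count, uint64_t pa)
{
    assert(descs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (pa >= descs[i].start && pa <= descs[i].end)
            return (int)descs[i].region;
    }
    return LOR_NONE;
}

/* True only when both addresses are inside the same LORegion. */
bool same_loregion(const LorDescriptor *descs, size_t count,
                   uint64_t pa_a, uint64_t pa_b)
{
    int r = lor_region_of(descs, count, pa_a);
    return r != LOR_NONE && r == lor_region_of(descs, count, pa_b);
}
```

Two descriptors with the same region number form one region, so addresses in either range are ordered against each other by LDLAR/STLLR, while addresses in other regions are not.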

Limited Ordering Regions - example


Memory Map
• LORegion 0: A (Normal)
• B (Normal), not in a LORegion
• LORegion 1: C (Normal)

STR x0, A
LDR x8, B ; Accessed address is not in a LORegion, so unaffected by LDLAR/STLLR
LDLAR x0, A
LDR x1, A
STLLR x2, A
STR x4, C ; Accessed address is in a different LORegion, so unaffected by LDLAR/STLLR
LDR x3, A
Release Consistency Weakening


“Standard” Acquire/Release operations use a “sequentially consistent” model

DynamIQ and Neoverse CPUs support a new Load-Acquire instruction with a weaker release consistency
• LDAPR / LDAPRB / LDAPRH
• These instructions use a “processor consistent” consistency model

The requirement that load-acquires be observed in-order with preceding store-releases is dropped for these new Load-Acquire instructions

LDR
STR
STLR
LDAPR ; Not required to be observed in-order with the preceding STLR
LDR
STR


Appendix


Appendix: Compiler Barriers


C/C++ have a concept similar to barriers, called “sequence points”

“At certain specified points in the execution sequence called sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluations shall have taken place.”
Section 5.1.2.3 of the C Specification (ISO/IEC 9899:TC3)

Examples of sequence points include function calls and accesses to volatile variables

Sequence points restrict what optimizations a compiler can make
• Typically a compiler cannot re-arrange statements across a sequence point

Some compilers will allow you to explicitly add a sequence point
• For example: __schedule_barrier() for armcc
• This does NOT add a barrier instruction, it only affects compiler optimization

When writing low-level code in C/C++ you may need to consider sequence points as well as the architectural barriers
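A compiler-only barrier can be sketched in standard C11 with `atomic_signal_fence`, which constrains compiler reordering but emits no barrier instruction. This is a sketch under assumptions: `publish_then_flag` is a hypothetical function, and treating the fence as a general compiler barrier for plain stores reflects how common compilers behave in practice rather than a guarantee for every toolchain. It is not a substitute for DMB/DSB when another observer is involved.

```c
#include <assert.h>
#include <stdatomic.h>

/* Store the data, then the flag, asking the compiler not to move either
 * store across the fence. The hardware may still reorder the two stores,
 * because no barrier instruction is generated. */
int publish_then_flag(int *data, int *flag, int value)
{
    assert(data && flag);
    *data = value;
    atomic_signal_fence(memory_order_seq_cst);  /* compiler-only fence */
    *flag = 1;
    return *data;
}
```

Replacing the fence with `atomic_thread_fence` would additionally ask for an architectural barrier where the target needs one.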

Mailboxes with interrupts (1)


Sometimes software might use an interrupt to inform the receiver that a message is in the mailbox

Where the sender and receiver(s) can both see the GIC, a DMB is enough for correct operation
• The DMB ensures that the second store cannot be observed without the first store also being observable
• For P1 to receive the interrupt, the write to [x4] must be globally visible

[Figure: P0 and P1 connected to the GIC and, through the interconnect, to RAM.]

P0:
STR x5, [x1] ; Write to buffer
DMB ST
STR w0, [x4] ; GICD_SGIR to send IPI

P1:
LDR x5, [x1] ; Read from buffer

Mailboxes with interrupts (2)


In this example a custom component is used to generate an interrupt
• This component is only visible to P0, never visible to P1

In this case a DSB is necessary
• The write to [x4] can never be observed by P1
• As the second write is not observable by P1, there is no ordering guarantee
• A DSB guarantees that the first write is globally observable (complete) before the interrupt is sent

[Figure: P0, a GPU and IRQGen on one interconnect; P1 and RAM are reached through a second interconnect.]

P0:
STR w5, [x1] ; Write to buffer
DSB ST
STR w0, [x4] ; Trigger interrupt

P1: (For example, Cortex-M system controller)
LDR r5, [r1] ; Read from buffer

Mailboxes with interrupts (3)


If the interrupt generating peripheral is moved in the system hierarchy to a point where it is visible to both P0 and P1, then software can switch back to a DMB

[Figure: P0 and a GPU on one interconnect; P1, RAM and IRQGen are attached to a second interconnect.]

P0:
STR w5, [x1] ; Write to buffer
DMB ST
STR w0, [x4] ; Trigger interrupt

P1: (For example, Cortex-M system controller)
LDR r5, [r1] ; Read from buffer


The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks

DynamIQ and Neoverse Caches

Agenda

• General Cache Information
• Memory Attributes and Cache Policies
• Cache Maintenance Operations

3 1494 rev 32200


Maven Silicon, 2023:04:20


Caches
• DynamIQ and Neoverse CPUs are implemented with multiple levels of cache
• Small, separate L1 Instruction and Data caches per core
• Similarly sized unified L2 cache per core
• A larger unified L3 cache per cluster with an integrated snoop filter

• MMU uses translation tables and translation registers to control which memory locations are cached

[Figure: CPU0 and CPU1, each with control space, MMU, L1 I-Cache and D-Cache, and a private L2 cache, sharing an L3 cache/snoop filter. A bus interface unit connects over AMBA interconnect to external SRAM/DRAM, with an APB connection also shown.]


How is data stored in my cache?


• Caches handle data in lines
• Cache lines are 16 words (64 bytes) in current Arm CPUs

• Virtual address used to determine the location of data in cache
‐ Bottom bits (offset) identify word/byte in line
‐ Middle bits (index) identify which line
‐ Top bits (tag) identify remainder of address

[Figure: an address split into Tag, Index and Offset fields. The index selects a set, and the tag is compared against the tag stored in each way.]

• Each line in the cache includes:
• Tag bits from the associated physical address
• Valid bit(s): indicate whether line exists in the cache
• Dirty data bit(s): indicate whether the data in the cache line is not coherent with external memory

• To reduce cache contention, caches in Arm CPUs are set associative
• Replacement policy (which cache way is selected by the victim counter) can vary across CPUs
• Options implemented include pseudo-random, round-robin, least recently used (LRU)
How are caches accessed?


• When the MMU is enabled, address translation is done in order to perform the cache access

[Figure: the CPU issues a virtual address to the MMU (TLB / page table walk), which produces the physical address used to access the I-Cache and D-Cache.]

• This is why caches behave as “physically indexed” and “physically tagged”


Example: 64KB 4 way L1 data cache


Address fields:
Field          Bits
Tag            [N:14]
Set (Index)    [13:6]
Word           [5:2]
Byte           [1:0]

Cache line: Tag, v (valid), d (dirty), Data (words 15 to 0)
Each way holds lines 0 to 255; the victim counter selects the way used for a linefill

LDR X0, [0x0000007C]
1) Cache lookup is performed
2) Cache miss; tag matches fail for given index in all cache ways
3) Cache linefill is performed
4) Victim counter specifies which cache way to use (will evict previous data)
5) Cache returns requested word to the CPU
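The address split for this 64KB, 4-way cache with 64-byte lines can be worked through in code: 64KB / 4 ways / 64 bytes gives 256 sets, so the index is bits [13:6]. The struct and function names below are hypothetical, and the sketch covers only the field extraction, not the lookup itself.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SIZE_BYTES (64u * 1024u)
#define CACHE_WAYS       4u
#define LINE_BYTES       64u
#define NUM_SETS         (CACHE_SIZE_BYTES / CACHE_WAYS / LINE_BYTES)

static_assert(NUM_SETS == 256u, "64KB / 4 ways / 64B lines = 256 sets");

typedef struct {
    uint64_t tag;   /* address bits [N:14] */
    uint32_t set;   /* address bits [13:6], selects line 0..255 in each way */
    uint32_t word;  /* address bits [5:2], selects one of 16 words */
    uint32_t byte;  /* address bits [1:0], byte within the word */
} CacheAddr;

/* Split an address into the fields used for the cache lookup. */
CacheAddr split_address(uint64_t addr)
{
    CacheAddr a;
    a.byte = (uint32_t)(addr & 0x3u);
    a.word = (uint32_t)((addr >> 2) & 0xFu);
    a.set  = (uint32_t)((addr >> 6) & (NUM_SETS - 1u));
    a.tag  = addr >> 14;
    return a;
}
```

For the load from 0x0000007C, this gives tag 0, set 1, word 15, byte 0, so the lookup compares tag 0 against the four ways of line 1.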


L1/L2 cache allocation


• Cache behavior (including allocating or evicting data) is not defined in the Arm Architecture
• The main requirement is that behavior doesn’t break the memory model

• CPUs (typically) allocate into L1 caches for a cacheable read (including instruction fetch) or a write
• Allocation into L2 depends on CPU behavior

• For eviction (when a cache line is allocated into a full cache), the cache selects the way indicated by the victim counter
• Victim counter can select by random, round-robin, or least recently used (LRU) – again, depending on the CPU

• There are several options for cache behavior between L1 and L2:
• Strictly inclusive: Any cache line present in an L1 cache will also be present in the L2
• Weakly inclusive: Cache line will be allocated in L1 and L2 on a miss, but can later be evicted from L2
• Fully exclusive: Any cache line present in an L1 cache will not be present in the L2

• For fully exclusive D-caches, data typically only allocates into L2 following an L1 eviction
• There are some exceptions – preloads, write streaming (details in TRM for each CPU)


L1/L2 weakly inclusive I-cache allocation


• CPU instruction fetch (Read A)
• L1 miss
• L2 miss
• L3 miss (and snoop filter miss)
• SCU issues Read on master interface
• Instruction (cache line) is returned on master interface
• Does not allocate into L3
• Instruction is allocated into CPU 0 L2
• Instruction is allocated into CPU 0 L1-I

[Diagram: cluster with CPU[0] and CPU[1], each with L1-I, L1-D, and a private L2 cache, above the SCU-L3 (L3/snoop filter control, snoop filter, L3 cache, buffers) and the master interface; line A is held in CPU 0 L1-I and L2, and the snoop filter records A: 0]


L1/L2 inclusive D-cache allocation


• Core load instruction (Read A)
• L1 miss
• L2 miss
• L3 miss (and snoop filter miss)
• SCU issues Read on master interface
• Data is returned on master interface
• Does not allocate into L3
• Data is allocated into Core 0 L2 and L1-D
• Snoop filter is updated

[Diagram: cluster with Core 0 and Core 1, each with L1-I, L1-D, and a private L2 cache, above the SCU-L3 (L3/snoop filter control, snoop filter, L3 cache, buffers) and the master interface; line A is held in Core 0 L1-D and L2, and the snoop filter records A: 0]


L1/L2 exclusive D-cache allocation


• Core loads data (Read A)
• L1 miss
• L2 miss
• L3 miss (and snoop filter miss)
• SCU issues Read on master interface
• Data is returned on master interface
• Does not allocate into L3
• Does not allocate into CPU 0 L2
• Data is allocated into CPU 0 L1-D

[Diagram: cluster with CPU[0] and CPU[1], each with L1-I, L1-D, and a private L2 cache, above the SCU-L3 (L3/snoop filter control, snoop filter, L3 cache, buffers) and the master interface; line A is held only in CPU 0 L1-D, and the snoop filter records A: 0]


L3 cache allocation
• “Exclusive until shared” caching policy between L3 and cores
• Core 0 issues a read (core miss, L3 miss), resulting in a linefill
• Core 0 allocation only (Exclusive)
• Core 0 evicts the line from its cache
• Core 0 de-allocation, L3 allocation
• Core 1 issues a read, hitting in the L3 cache
• L3 de-allocation, Core 1 allocation (Exclusive)
• Core 0 issues a read, resulting in a snoop to Core 1
• Core 0 allocation, Core 1 allocation, L3 allocation (Shared)
• Cache line will be valid in only one cache until accessed by multiple cores

[Diagram: cluster with CPU[0] and CPU[1], each with L1-I, L1-D, and a private L2 cache, above the SCU-L3 (L3/snoop filter control, snoop filter, L3 cache, buffers) and the master interface; the animation moves line A between the cores and the L3 cache while the snoop filter entry changes between A: 0, A: 1, and A: 0,1]


Agenda

• General Cache Information
• Memory Attributes and Cache Policies
• Cache Maintenance Operations


Cache policies (1)


• MMU translation tables are used to define the cache policy for different regions of the memory map
• Only memory regions marked as Normal can be cached

• Policies include:
• Cacheable / Non-cacheable

• Cache policies for data regions include
‐ Read / Write-allocate
‐ Write-Back Cacheable / Write-Through Cacheable
‐ Shareability (discussed elsewhere)

• Which cache policies are supported is IMPLEMENTATION DEFINED
• For example, write-through is not supported on Arm Cortex-A CPUs

• For cache coherency across multi-core systems, specific cache policies and memory attributes must be used
• Arm Cortex-A processors require write-back and shareable


Cache policies (2)


• Cache allocation policies
• Write allocation: Whether a cache line should be allocated on a write miss
• Read allocation: Whether a cache line should be allocated on a read miss

• Update policies
• Write-back: Write may update the cache only (and cache line is marked as dirty)
• Write-through: Write updates both the cache and the external memory system

• The implementation determines which cache allocation and update policy combinations are available

• The above are hints
• It is not architecturally required for a CPU to follow the hinted behavior


System Control Register (SCTLR)


• SCTLR.M: MMU enable bit
• When SCTLR.M = 1, all memory accesses must pass through the MMU
• When SCTLR.M = 0, all data accesses are treated as Device-nGnRnE (non-bufferable, non-cacheable)

• SCTLR.C: D-cache and unified cache “allocation enable” bit
• Not exactly a “cache enable”, but allows cores to issue cacheable memory accesses
‐ Those accesses can look up and allocate into data/unified caches
• When SCTLR.C = 0, all Normal memory accesses are downgraded to Normal non-cacheable (no lookup, no allocate)

• SCTLR.I: Instruction cache enable bit
• When SCTLR.I = 1:
‐ L1 I-cache allocation is possible
‐ Allocation into downstream caches (e.g. L2/L3) is possible when the effective memory type is write-back and SCTLR.C = 1
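
The combined effect of the three bits on a register value can be sketched in C. The bit positions are those of SCTLR_EL1 (M is bit 0, C is bit 2, I is bit 12); the helper names are hypothetical, and the code only inspects a value that privileged software would already have read from the register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SCTLR_EL1 bit positions for the fields named above. */
#define SCTLR_M  (UINT64_C(1) << 0)   /* MMU enable */
#define SCTLR_C  (UINT64_C(1) << 2)   /* data/unified cacheability enable */
#define SCTLR_I  (UINT64_C(1) << 12)  /* instruction cacheability enable */

/* Data accesses can be cacheable only when the MMU is on (otherwise they
 * are treated as Device-nGnRnE) and SCTLR.C is set (otherwise Normal
 * memory is downgraded to Normal non-cacheable). */
bool data_accesses_may_be_cacheable(uint64_t sctlr)
{
    return (sctlr & SCTLR_M) != 0 && (sctlr & SCTLR_C) != 0;
}

/* L1 I-cache allocation is possible when SCTLR.I is set. */
bool icache_allocation_possible(uint64_t sctlr)
{
    return (sctlr & SCTLR_I) != 0;
}
```

Whether a given access is actually cached still depends on the memory attributes in the translation tables; these helpers only capture the register-level gating described above.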


Write-Back and Write-Through


• Write-Through mode (WT) (not supported on Arm CPUs)
• Write updates both the cache and the external memory system
• Write-Through accesses do not produce dirty data
• Arm CPUs treat Write-Through as Normal non-cacheable

• Write-Back mode (WB)
• Writes are allowed to update only the cached copy of the data
• Write-Back accesses can lead to dirty data
• Eviction of a cache line containing dirty data results in a write to the next level of memory
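
The difference between the two modes can be seen in a toy model: a single cache line in front of a single word of external memory. Everything here (`tiny_cache`, `cache_write`, `cache_evict`) is hypothetical and only counts how many writes reach external memory under each policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum write_policy { WRITE_THROUGH, WRITE_BACK };

/* A one-line cache in front of a one-word "external memory".
 * Hypothetical model for illustration only. */
struct tiny_cache {
    enum write_policy policy;
    bool     valid;
    bool     dirty;
    uint32_t cached;           /* cached copy of the word */
    uint32_t memory;           /* external memory copy */
    unsigned external_writes;  /* writes that reached external memory */
};

void cache_write(struct tiny_cache *c, uint32_t value)
{
    assert(c != NULL);
    c->cached = value;
    c->valid = true;
    if (c->policy == WRITE_THROUGH) {
        c->memory = value;       /* memory is updated on every write */
        c->external_writes++;
    } else {
        c->dirty = true;         /* memory is now stale */
    }
}

/* Evict the line: dirty data is written to the next level first. */
void cache_evict(struct tiny_cache *c)
{
    assert(c != NULL);
    if (c->valid && c->dirty) {
        c->memory = c->cached;
        c->external_writes++;
    }
    c->valid = false;
    c->dirty = false;
}
```

Three stores to the same word cost three external writes in the write-through model, but only one (at eviction) in the write-back model.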

[Diagram: Write-Through cache mode and Write-Back cache mode, each showing the LSU/core, D$, and write buffer, with the paths for reads, writes, linefills, evictions, write misses, and external writes; the Write-Back side includes the L2 cache]

Inner and Outer


• Regions marked as Normal can be cacheable or non-cacheable
• Specified as “inner” and “outer” attributes
• The divide between inner and outer is IMPLEMENTATION DEFINED (not architecturally specified)

• For Neoverse and DynamIQ processors:


• Memory must be marked as inner and outer cacheable to be cached inside the CPUs and DSU
• Outer cache attributes are exported on the bus

[Diagram: the cores with their L1$/L2$ and the L3$ form the Inner Cacheable region; beyond the bus interface, the system level cache and system memory form the Outer Cacheable region]


Speculation and preloading


• Normal memory allows speculative accesses
• Meaning the core can automatically load data it predicts will be used

• Core-specific algorithms determine when speculative read accesses occur
• For instance: a core may start speculatively prefetching data if code performs consecutive loads from memory

• DynamIQ and Neoverse CPUs support write streaming
• If the CPU detects some number of sequential stores, it will disable cache allocation for that stream of stores
• Avoids polluting the cache with data that is meant to be written externally
• Threshold for controlling allocation, or settings to disable allocation, can be set in the CPU control registers

AArch64 PRFM (Prefetch Memory) Instructions


• Prefetching can be done automatically, or by using dedicated prefetch hint instructions
• Remember: allocation behavior for hints is not guaranteed

• PRFM instructions indicate addresses that are likely to be accessed in the near future
• Helps allocate cache lines before they are needed

• Several variants: PLDL1KEEP, PLIL2STRM, PSTL3STRM, …
• Different CPUs support prefetch into different levels of cache

<Prefetch type> <Level of cache> <Single-use or multi-use>

Prefetch type:
PLD - Prefetch for load
PST - Prefetch for store
PLI - Preload instructions

Level of cache:
L1 - Level 1 cache
L2 - Level 2 cache
L3 - Level 3 cache

Single-use or multi-use:
KEEP - Retained or temporal prefetch
• Allocated in the cache normally
STRM - Streaming or non-temporal prefetch
• For data that is used only once

• AArch32 prefetch hint instructions: PLD Rm, PLI Rm
• Preload data or instructions from address in Rm to cache
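
From C, prefetch hints are usually requested through a compiler builtin rather than by writing PRFM directly. The sketch below uses the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)`; which PRFM variant (if any) the compiler emits for AArch64 depends on the toolchain, so treat the mapping to PLDL1KEEP as an assumption. `PREFETCH_AHEAD` and `sum_with_prefetch` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 16  /* hypothetical distance in elements; tune per CPU */

/* Sum an array while hinting that data further ahead will be read soon.
 * rw = 0 requests a prefetch for load and locality = 3 asks for the data
 * to be retained in cache. It is only a hint and never changes the result. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t count)
{
    assert(data != NULL || count == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++) {
        if (i + PREFETCH_AHEAD < count)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        total += data[i];
    }
    return total;
}
```

Because allocation behavior for hints is not guaranteed, code like this should be measured on the target CPU; hardware prefetchers often already cover simple sequential access patterns.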


Cacheable memory allocation hints


• Normal write-back memory can be defined with the following allocation hints:
• Allocate on read miss
• Allocate on write miss
• Transient

• Behavior can be hinted at either by marking memory in the page tables or by using specific instructions
• For example, memory not marked as transient in the page tables will be marked as transient in the cache if allocated using those instructions

• Specific behavior may vary from CPU to CPU
• Allocate to which level of cache, etc.
• Transient memory behavior on eviction

Non-Temporal (Transient) Loads and Stores


• LDNP: Load Non-temporal Pair
• Indicate to the caches that the data is likely to be used for only short periods

• LDNP and PRFM *L1STRM cause allocation into the L1 data cache
‐ But cache lines will be marked as non-temporal/transient

• STNP: Store Non-temporal Pair
‐ May or may not allocate into the CPU's caches


Agenda

• General Cache Information
• Memory Attributes and Cache Policies
• Cache Maintenance Operations


Cache maintenance
• Data or instructions may be fetched from the caches rather than from external memory

• There are some cases where this is not desirable:


• Self-modifying code, memory map changes, external master read/writes

• There are manual maintenance operations to clean and invalidate caches
• Type of operation
‐ Invalidate - makes changes in the outer domains visible to the cache user
‐ Clean - makes changes in the cache visible to outer domain user(s)
‐ Zero - zero a block of memory (only available for data caches)
• Which entries
‐ All - the entire cache (not architecturally supported for data/unified caches)
‐ MVA or VA - a cache line containing a specific virtual address
‐ Set/Way - a specific cache line
• Scope
‐ PoC - Point of Coherency
‐ PoU - Point of Unification
‐ PoP - Point of Persistence
• Shareability
‐ Operations that can be broadcast


PoU and PoC and cache maintenance


• These terms refer to caches within the Arm processor
• External caches (outside the processor) must be handled separately

• PoU: Point at which instruction, data, and TLB accesses see the same copy of memory

• PoC: Point at which all agents see the same copy of memory, generally the memory system

[Diagram: Point of Unification (PoU) shown for a single master with system control registers, I$, D$, and TLB; Point of Coherency (PoC) shown for Master A and Master B, each with system control registers and its own D$]

PoU and PoC compared


• PoC is system dependent and can depend on the interconnect
• It can be an external (outside the DSU) system cache (in CMN-600, for example), or could be external memory

• PoU is effectively the L1 D-cache for Neoverse and DynamIQ CPUs


• Implemented through different mechanisms, but appears the same to programmers

[Diagram: CPU A and CPU B, each with system control registers, I$, D$, TLB, its own Point of Unification, and a private L2 cache; both share the DSU L3 cache, with the system cache/memory system as the Point of Coherency]

Persistent memory (AArch64 only)

• DynamIQ and Neoverse cores add support for persistent memory
• Only on DSU CHI configs; not supported on DSU ACE configs

• Persistent memory behaves like DRAM to the programmer
• Similar read/write random access times, but state persists over power down
• Architecture is technology agnostic

• New cache maintenance operation: DC CVAP, Xn
• Data cache clean by virtual address to Point of Persistence (PoP)
‐ PoP, if it exists, is a point at or beyond the Point of Coherency (PoC)
‐ Pushes dirty data far enough out into the memory system that it will persist over power down

• DSU CHI configurations send CleanSharedPersist transactions

• If there is no PoP in the memory system, the DSU’s BROADCASTPERSIST input signal should be tied LOW to convert cache maintenance operations by PoP to be by PoC instead

[Diagram: example memory system hierarchy. Cortex-A76 with I$ and D$ (PoU), private U$, shared U$ (DSU L3), CMN-600 CHI interconnect with System Level Cache (PoC), DRAM, and NVRAM (PoP)]

AArch64 instructions
• AArch64 cache maintenance operations are initiated by dedicated instructions

<cache> <operation> {, <Xt>}

<cache>: IC - Instruction Cache, DC - Data Cache
<Xt>: passes an address argument where required

<operation> = <function> <type> [<point>] {IS}

<function>: I – Invalidate, C – Clean, CI – Clean & Invalidate, Z – Zero
<type>: VA – By Address, SW – By Set/Way, ALL – Entire cache
<point>: U – Point of Unification, C – Point of Coherency, P – Point of Persistence
IS – Inner Shareable

• IC instruction accepts <operation> = IALLU, IALLUIS, IVAU
• DC instruction accepts <operation> = IVAC, ISW, CVAC, CVAU, CSW, CIVAC, CISW, ZVA

Maintenance broadcasts
• In multi-core systems, we do not know which core may have a specific address in its caches
• For instance, the core issuing a cache clean/invalidate by VA may not be the core that holds the addressed cache line

• Some cache maintenance operations can be broadcast
• This means that the operation is performed by all the cores in a particular shareability domain
• Useful for SMP operating systems

[Diagram: a cache operation broadcast from one core to the other cores, each with its own D$ and I$]


Maintenance broadcast – AArch64

Instruction     Description                                                      Broadcast?
IC IALLUIS      I-Cache Invalidate All to Point of Unification, Inner Shareable  Yes (Inner only)
IC IALLU        I-Cache Invalidate All to Point of Unification                   No
IC IVAU, Xt     I-Cache Invalidate by Address to Point of Unification            Based on VA
DC IVAC, Xt     D-Cache Invalidate by Address to Point of Coherency              Based on VA
DC ISW, Xt      D-Cache Invalidate by Set/Way                                    No
DC CVAC, Xt     D-Cache Clean by Address to Point of Coherency                   Based on VA
DC CSW, Xt      D-Cache Clean by Set/Way                                         No
DC CVAU, Xt     D-Cache Clean by Address to Point of Unification                 Based on VA
DC CIVAC, Xt    D-Cache Clean & Invalidate by Address to Point of Coherency      Based on VA
DC CISW, Xt     D-Cache Clean & Invalidate by Set/Way                            No

For operations by VA, the Shareability attribute of the address determines whether it is broadcast and to which domain
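
For the set/way operations, Xt carries a packed level/set/way value rather than an address. The sketch below builds that value following the layout described in the Arm Architecture Reference Manual (way in the top bits, set above the line-offset bits, level minus one in bits [3:1]); the function names are hypothetical, and the geometry inputs are assumed to come from cache discovery.

```c
#include <assert.h>
#include <stdint.h>

/* ceil(log2(n)) for 1 <= n <= 2^31 */
unsigned ceil_log2(uint32_t n)
{
    assert(n >= 1 && n <= (UINT32_C(1) << 31));
    unsigned bits = 0;
    while ((UINT32_C(1) << bits) < n)
        bits++;
    return bits;
}

/* Build the Xt operand for DC ISW / DC CSW / DC CISW.
 * level is 1-based (1 = L1); way and set are 0-based. */
uint64_t setway_operand(unsigned level, uint32_t ways, uint32_t line_bytes,
                        uint32_t way, uint32_t set)
{
    assert(level >= 1 && level <= 7);
    assert(way < ways);
    unsigned a = ceil_log2(ways);        /* width of the way field */
    unsigned l = ceil_log2(line_bytes);  /* set field starts at bit l */
    uint64_t value = ((uint64_t)set << l) | ((uint64_t)(level - 1) << 1);
    if (a > 0)                           /* a direct-mapped cache has no way field */
        value |= (uint64_t)way << (32 - a);
    return value;
}
```

For the 4-way, 64-byte-line L1 example, `setway_operand(1, 4, 64, 3, 1)` gives 0xC0000040. As the table notes, set/way operations are not broadcast, so they only affect the caches of the core that executes them.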


The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.


Cache Discovery

Appendix

Cache discovery code (1)


• When writing code to clean/invalidate/zero data in the caches, you may need to know a few things:
• How many levels of cache are there?
• How big is a cache line?
• How many sets and ways are in the cache?
• For Zero operations, how much data will be zeroed?

[Diagram: Level 1 and Level 2 caches, each described by its line size, sets, and ways]


Cache discovery code (2)


• The number of cache levels is listed in the Cache Level ID Register (CLIDR, CLIDR_EL1)

• The cache line size is listed in the Cache Type Register (CTR, CTR_EL0)
• Can be made accessible from EL0 by setting the UCT bit of the System Control Register (SCTLR, SCTLR_EL1)

• Two register accesses are needed to determine the number of sets and ways:
• Write to the Cache Size Selection Register (CSSELR, CSSELR_EL1) to select which cache you want information for
• Read the Cache Size ID Register (CCSIDR, CCSIDR_EL1)

• The Data Cache Zero ID Register (DCZID_EL0) contains the block size that will be zeroed for Zero operations
• SCTLR.DZE/SCTLR_EL1.DZE controls whether execution of the DC ZVA instruction is allowed at EL0
• HCR.TDZ/HCR_EL2.TDZ controls trapping of the DC ZVA instruction
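
Once the register values have been read, decoding them is plain bit manipulation. The sketch below assumes the 32-bit CCSIDR_EL1 layout used when FEAT_CCIDX is not implemented, and takes the raw values as parameters; reading the registers themselves (MRS, after selecting the cache through CSSELR_EL1) is not shown. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct cache_geometry {
    uint32_t line_bytes;
    uint32_t ways;
    uint32_t sets;
    uint64_t total_bytes;
};

/* Decode a CCSIDR_EL1 value (layout without FEAT_CCIDX):
 *   [2:0]   LineSize      = log2(line size in bytes) - 4
 *   [12:3]  Associativity = ways - 1
 *   [27:13] NumSets       = sets - 1 */
struct cache_geometry decode_ccsidr(uint32_t ccsidr)
{
    struct cache_geometry g;
    g.line_bytes  = UINT32_C(1) << ((ccsidr & 0x7u) + 4);
    g.ways        = ((ccsidr >> 3) & 0x3FFu) + 1;
    g.sets        = ((ccsidr >> 13) & 0x7FFFu) + 1;
    g.total_bytes = (uint64_t)g.line_bytes * g.ways * g.sets;
    return g;
}

/* DCZID_EL0: BS, bits [3:0], is log2 of the block size in 4-byte words;
 * DZP, bit 4, set means DC ZVA is prohibited. Returns 0 if prohibited. */
uint32_t dc_zva_block_bytes(uint32_t dczid)
{
    if (dczid & (1u << 4))
        return 0;
    return 4u << (dczid & 0xFu);
}
```

A CCSIDR_EL1 value of 0x001FE01A decodes to 64-byte lines, 4 ways, and 256 sets, which is a 64KB cache.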


Non-integrated caches
• CLIDR/CLIDR_EL1 only tells you how many levels of cache are integrated into the core
• The core is not aware of how many cache levels are outside
• For example:
• If only L1 and L2 are integrated, CLIDR/CLIDR_EL1 will identify 2 levels of cache

• May need to take into account non-integrated caches when
• Performing cache maintenance
• Writing cache discovery code
• Maintaining coherency with integrated caches

[Diagram: CPU containing L1 cache, L2 cache, and bus interface unit; outside the CPU, an L3 cache, the AMBA interconnect, and a system cache (L4)]


AArch32 Maintenance Instructions

Appendix

AArch32 mnemonics
• AArch32 cache / branch prediction maintenance operations are CP15 operations
• But the Arm ARM normally refers to them using mnemonics:

<cache> <type of operation> <entries> <scope> <shareability>

<cache>: IC – Instruction cache, DC – Data cache / unified cache, BP – Branch predictor
<type of operation>: I – Invalidate, C – Clean, CI – Clean and invalidate
<entries>: MVA – Virtual address, ALL – All, SW – Set/Way
<scope>: U – PoU, C – PoC
<shareability>: IS – Inner shareable

Examples:
• DCCIMVAC – Data cache clean and invalidate to PoC, by MVA
• BPIMVA – Branch predictor invalidate by MVA


Maintenance broadcast – AArch32


Abbreviation  Function                                         Broadcast?
BPIALL        Branch Predictor Invalidate All                  No**
BPIALLIS      Branch Predictor Invalidate All Inner Shareable  Yes (inner only)
BPIMVA        Branch Predictor Invalidate by MVA               Maybe*
DCCMVAU       D-Cache Clean by MVA to PoU                      Maybe*
DCCIMVAC      D-Cache Clean & Invalidate by MVA to PoC         Maybe*
DCCISW        D-Cache Clean & Invalidate by Set/Way            No
ICIALLUIS     I-Cache Invalidate All to PoU Inner Shareable    Yes (inner only)
ICIMVAU       I-Cache Invalidate by MVA to PoU                 Maybe*
TLBIMVA       TLB Invalidate by MVA                            No**
TLBIMVAIS     TLB Invalidate by MVA Inner Shareable            Yes (inner only)

* Broadcast determined by shareability of memory region
** Broadcast in Non-secure EL1 if HCR/HCR_EL2 FB bit is set

Table shows a representative set of operations; see documentation for the full list
• And details of “broadcast by shareability”

Cache Disabled Behavior

Appendix

Cache disabled behavior


• When the data/unified cache is disabled (SCTLR_ELn.C=0), all data accesses are treated as non-cacheable
• A non-cacheable access is guaranteed not to trigger allocation into the cache
• It is IMPLEMENTATION DEFINED whether you will hit on the address if it is already held in the cache

• Instruction accesses to non-cacheable memory can be held in instruction caches
• This applies even when SCTLR_ELn.I=0
• This means you must ALWAYS invalidate the instruction cache(s) after writing to instruction memory.
‐ For example:

STR <Wt>, [Xn] ; Write to instruction memory
DSB ISH ; Ensure visibility of the data stored
IC IVAU, Xn ; Invalidate instruction cache by VA to PoU
DSB ISH ; Ensure completion of the invalidations
ISB ; Discard any instructions already fetched


Instruction Cache Policies

Appendix

Instruction cache policies


• CTR_EL0.L1Ip will report the L1 instruction cache policy
• Virtual Index, Physical Tag (VIPT)
• Physical Index, Physical Tag (PIPT)
• ASID-tagged Virtual Index, Virtual Tag (AIVIVT)

• CPU module will list the policy each CPU implements

• For VIPT and PIPT caches, maintenance is necessary when data is written to a physical address that contains instructions

• If the same physical address is held in a VIPT cache addressed by different virtual addresses
• Cache operations by VA are not guaranteed to affect all copies of that PA
• IC IALLx operations are necessary to affect all aliases

[Diagram: VIPT I-cache (Line 0 to Line 255) in which VA x and VA y both map to PA 0x8000, so the same data is held in two different lines]


Thank You!
Danke!
Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!

DynamIQ and Neoverse Cache Coherency

Agenda

• Introduction to coherency
• Coherency details – multi-core processors
• Coherency details – multi-processor systems


What is cache coherency?


• Cache coherency relates to the consistency of data stored in local caches

[Diagram: Core 0 and Core 1, each with its own cache, sharing memory with a DMA engine]

• Cache coherency is an issue in any system that
• Contains one or more caches
• Has more than one device sharing data in a single cached memory area

• Example scenarios where coherency must be managed
• DMA using cached memory area (external memory inconsistent with CPU data cache)
• Self-modifying code (data cache contains more recent copy than instruction cache)
• Modifications to the translation tables (TLB contents are no longer valid)

• Coherency can be maintained in software or in hardware

DynamIQ and Neoverse cache coherency


• Data caches and L2 caches for all CPUs, and the DSU L3, can be kept coherent across the system
• Hardware coherency support for SMP operating systems running on all of the CPUs
• Data from pages marked as Shareable and Write-Back cacheable can be cached and kept coherent between caches
• Maintenance instructions on one core may be broadcast to other cores

[Diagram: two DynamIQ clusters, each with four CPUs (a D$ and an L2 per CPU), coherency logic, and a DSU L3, connected through a cache coherent interconnect to memory]


Neoverse-N1 and –V1 cache coherency


• Neoverse-N1 and -V1 have slightly different coherency options
• Can be connected directly to a CHI-based interconnect (with no DSU-L3) in single CPU configuration
• Support I-cache coherency (on later revisions)

[Diagram: four Neoverse-N1 CPUs, each with D$, I$, and L2, connected directly to a CHI-based coherent interconnect and memory]


Shareability in the translation tables


• Which masters are in which domain is controlled by the hardware (BROADCASTOUTER signal) and software (e.g. MMU programming)
• Should be set at system design time

[Diagram: two DynamIQ clusters of CPUs and a Mali GPU, grouped into nested Non-shared, Inner Shareable, and Outer Shareable domains]

• For Cacheable regions, the translation tables specify which domains will access a particular region of memory
• This controls how the processor handles cache coherency
• The translation tables must accurately reflect which domains will access a given location


ACE and CHI


• System coherency requires a cache coherent interconnect
• ACE (AXI Coherency Extensions)
‐ Found in CCI-xxx products
• CHI (Coherent Hub Interface)
‐ Available in CMN-xxx products

• Hardware coherency support for an SMP OS running across multiple DynamIQ or Neoverse clusters
• Data from pages marked as Shareable can be cached

• Instruction cache and TLB maintenance operations are broadcast across the interconnect
• In AMBA these are sent as Distributed Virtual Memory (DVM) messages
• DVM messages are only sent between masters in the same Inner Shareable domain

• Marking an area as Non-cacheable or Device implicitly marks that region of memory as being accessed by masters in the System domain


System coherency with GPUs and DMA


• In a non-coherent system, software must manually ensure that required data is cleaned from the core’s L1/L2 caches and the cluster’s L3 cache to make it visible to other masters (e.g. the Mali GPU)

[Diagram: four CPUs (each with D$, I$, and an L2 cache) sharing an L3 cache, a Mali-T604 GPU (shader cores with an L2 cache), and a DMA engine with a cache; the CPU cluster connects over ACE, and the GPU and DMA over ACE-Lite, to a cache coherent interconnect and memory]

• ACE-Lite port allows uncached masters to snoop processor caches
• For example, the DMA can see the updates from the core without the need for an explicit cache clean


Agenda
• Introduction to coherency
• Coherency details – multi-core processors
• Coherency details – multi-processor systems

MPCore coherency management


• Hardware maintains coherency between data caches
• A core automatically participates in the coherency scheme when:
‐ The core is powered up, with its D-cache and MMU enabled
‐ Can also include I-cache on Neoverse-N1 and Neoverse-V1
‐ The address is marked as coherent (Write-Back, Shareable)

• Coherency is managed by the cache coherency logic
• On DynamIQ and Neoverse clusters, level 1, 2, and 3 caches are all part of the inner cache domain

• Most implementations are based on the MESI protocol
• Additional state stored for the tags in the level 1 data caches
‐ Modified – Cache line is dirty and present in only one L1 cache
‐ Exclusive – Cache line is clean and present in only one L1 cache
‐ Shared – Cache line is clean and may be present in more than one L1 cache
‐ Invalid – Cache line is invalid

• Note: the Arm architecture does not dictate the mechanisms used to manage coherency
• This is a micro-architectural detail that can vary significantly between implementations
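
The four states can be illustrated with a small textbook-style state machine in C. This is a simplified model of one cache's view of a single line, not a description of how any Arm implementation manages coherency; the event names and `mesi_next` are hypothetical.

```c
#include <assert.h>

enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

enum mesi_event {
    LOCAL_READ_NO_SHARERS,  /* this core loads; no other cache holds the line */
    LOCAL_READ_SHARERS,     /* this core loads; another cache holds the line */
    LOCAL_WRITE,            /* this core stores to the line */
    REMOTE_READ,            /* another core's load snoops this cache */
    REMOTE_WRITE,           /* another core's store invalidates this copy */
    LOCAL_CLEAN             /* dirty data is written back, line is kept */
};

/* Next state of the line in one core's L1 data cache. */
enum mesi mesi_next(enum mesi state, enum mesi_event event)
{
    switch (event) {
    case LOCAL_READ_NO_SHARERS:
        return state == INVALID ? EXCLUSIVE : state;
    case LOCAL_READ_SHARERS:
        return state == INVALID ? SHARED : state;
    case LOCAL_WRITE:
        return MODIFIED;
    case REMOTE_READ:
        return state == INVALID ? INVALID : SHARED;
    case REMOTE_WRITE:
        return INVALID;
    case LOCAL_CLEAN:
        return state == MODIFIED ? EXCLUSIVE : state;
    }
    return state;
}
```

Driving two instances of this model (one per core) with a load on core 0, a load on core 1, a store on core 0, a store on core 1, and then a clean on core 1 yields Exclusive, Shared/Shared, Modified/Invalid, Invalid/Modified, and finally Exclusive on core 1.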


Multi-core cluster coherency logic


• Cache coherency logic inside the DSU cluster maintains coherency among L1 data/L2 caches across the CPUs
• Arbitrates accesses to L3 interface(s) for both instructions and data
• Duplicated Tag RAMs and the DSU snoop filter keep track of what data is allocated in each core’s caches

[Diagram: four CPUs, each with D$, I$, L2, and a duplicate TAG RAM, above the cache coherency logic/L3 cache/snoop filter and the bus interfaces]

Cache coherency logic example (1)


Core 0                      Core 1
LDR X0, =Label              LDR X0, =Label
LDR X1, [X0, #0x4]          LDR X1, [X0, #0xC]
ADD X1, X1, #0x1            ADD X1, X1, #0x3
STR X1, [X0, #0x4]          STR X1, [X0, #0xC]
                            ; Clean cache line
                            DC CVAC, X0

Highlighted step: Core 0 executes LDR X1, [X0, #0x4], which will make the line Exclusive

[Diagram: L3 copy of location Label holds 0x0 0x7; no snoop filter entry or L1 cache line is shown yet for Core 0 or Core 1]

Cache coherency logic example (2)


Core 0                      Core 1
LDR X0, =Label              LDR X0, =Label
LDR X1, [X0, #0x4]          LDR X1, [X0, #0xC]
ADD X1, X1, #0x1            ADD X1, X1, #0x3
STR X1, [X0, #0x4]          STR X1, [X0, #0xC]
                            ; Clean cache line
                            DC CVAC, X0

Highlighted step: Core 1 executes LDR X1, [X0, #0xC], which will make the line Shared

[Diagram: L3 copy of location Label holds 0x0 0x7; Core 0 holds the line at 0x801C0090 with data 0x0 0x7, and the snoop filter has an entry for 0x801C0090 on the Core 0 side; nothing is shown yet for Core 1]

Cache coherency logic example (3)


Core 0                      Core 1
LDR X0, =Label              LDR X0, =Label
LDR X1, [X0, #0x4]          LDR X1, [X0, #0xC]
ADD X1, X1, #0x1            ADD X1, X1, #0x3
STR X1, [X0, #0x4]          STR X1, [X0, #0xC]
                            ; Clean cache line
                            DC CVAC, X0

Highlighted step: Core 0 executes STR X1, [X0, #0x4], which will make the line Modified

[Diagram: L3 copy of location Label holds 0x0 0x7; Core 0 and Core 1 each hold the line at 0x801C0090 with data 0x0 0x7, and the snoop filter has an entry for 0x801C0090 on both sides]

Cache coherency logic example (4)


Core 0                      Core 1
LDR X0, =Label              LDR X0, =Label
LDR X1, [X0, #0x4]          LDR X1, [X0, #0xC]
ADD X1, X1, #0x1            ADD X1, X1, #0x3
STR X1, [X0, #0x4]          STR X1, [X0, #0xC]
                            ; Clean cache line
                            DC CVAC, X0

Highlighted step: Core 1 executes STR X1, [X0, #0xC], which will make the line Modified

[Diagram: L3 copy of location Label holds 0x0 0x7; Core 0 holds the line at 0x801C0090 with data 0x0 0x8, and the snoop filter has an entry for 0x801C0090 on the Core 0 side; nothing is shown for Core 1]

Cache coherency logic example (5)

Core 0                                           Core 1
LDR X0, =Label                                   LDR X0, =Label
LDR X1, [X0, #0x4]                               LDR X1, [X0, #0xC]
ADD X1, X1, #0x1                                 ADD X1, X1, #0x3
STR X1, [X0, #0x4]                               STR X1, [X0, #0xC]
                                                 ; Clean cache line
                     Will make line Exclusive →  DC CVAC, X0

[Diagram state]
L3 copy of location Label: 0x0 0x7
Snoop filter: tag 0x801C0090 recorded for Core 1
Core 0 cache: line not present
Core 1 cache: tag 0x801C0090, data 0x3 0x8

Cache coherency logic example (6)

Core 0                                           Core 1
LDR X0, =Label                                   LDR X0, =Label
LDR X1, [X0, #0x4]                               LDR X1, [X0, #0xC]
ADD X1, X1, #0x1                                 ADD X1, X1, #0x3
STR X1, [X0, #0x4]                               STR X1, [X0, #0xC]
                                                 ; Clean cache line
                                                 DC CVAC, X0

[Diagram state]
L3 copy of location Label: 0x3 0x8
Duplicated TAG RAMs: tag 0x801C0090 recorded for Core 1
Core 0 cache: line not present
Core 1 cache: tag 0x801C0090, data 0x3 0x8
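The state changes walked through in examples (1) to (6) can be sketched as a small C model. This is only an illustration under stated assumptions: the names `line_t`, `core_load`, `core_store` and `core_clean` are hypothetical, the model tracks one cache line for two cores, and it ignores the data, the L3 copy and the snoop filter.

```c
/* Toy MESI model for a single cache line shared by two cores.
 * All names are hypothetical; this is not how the hardware is built. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

enum { NUM_CORES = 2 };

typedef struct {
    mesi_t state[NUM_CORES];
} line_t;

/* Load: a miss becomes Exclusive when no other core holds the line,
 * otherwise both copies end up Shared. */
static void core_load(line_t *l, int core)
{
    int other = 1 - core;

    if (l->state[core] != INVALID)
        return;                         /* hit: state is unchanged */
    if (l->state[other] == INVALID) {
        l->state[core] = EXCLUSIVE;
    } else {
        l->state[other] = SHARED;       /* the other copy is downgraded */
        l->state[core]  = SHARED;
    }
}

/* Store: the writer gains a Modified copy, the other copy is invalidated. */
static void core_store(line_t *l, int core)
{
    l->state[1 - core] = INVALID;
    l->state[core]     = MODIFIED;
}

/* Clean (as DC CVAC in the example): dirty data is written back and the
 * line stays in the cache, no longer dirty. */
static void core_clean(line_t *l, int core)
{
    if (l->state[core] == MODIFIED)
        l->state[core] = EXCLUSIVE;
}
```

Replaying the slide sequence (Core 0 load, Core 1 load, Core 0 store, Core 1 store, Core 1 clean) ends with Core 0 Invalid and Core 1 Exclusive, matching the final diagram.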

Agenda

• Introduction to coherency
• Coherency details – multi-core processors
• Coherency details – multi-processor systems

Multi-cluster coherency
• Arm provides a number of interconnect options for maintaining cross-cluster coherency, including:
• CCI-XXX Cache Coherent Interconnect
‐ Implements AMBA 4 Coherency Extensions (ACE)
‐ Supports 1 to 6 ACE Masters (depending on product)
‐ The ACE protocol supports MOESI-based cache line states for cross-cluster coherency
• CMN-600 Coherent Mesh Network
‐ Implements AMBA 5 Coherent Hub Interconnect (CHI)
‐ Supports a larger number of masters, with flexible configuration
‐ Includes integrated System Level Cache

• Coherency operations are extended across clusters
• Shareable data transactions and broadcast cache/TLB maintenance operations
• BROADCASTOUTER signal must be tied high (configurable during system design) to allow coherency operations

• The interconnect can be programmed dynamically to disable snooping and maintenance operation broadcasts
• Allows for clusters to be removed from coherency management
‐ Example: When an entire cluster (including its L3 cache) is powered down

Coherency example: Reads

1. Master 0 sends a coherent read request
2. Interconnect snoops Master 1 and Master 2
3. Hit in Master 2’s cache, data is returned

** If the snoop hit had not occurred, the data would have been fetched from the Bus Slave

[Diagram: Master 0, Master 1 and Master 2 sit above the Cache Coherent Interconnect, which connects to the Bus Slave. The Read goes from Master 0 to the interconnect, Snoops go to Master 1 and Master 2, and Data returns from Master 2 to Master 0.]

Coherency example: Writes

1. Master 0 sends a coherent invalidate request to the interconnect
2. The interconnect sends an invalidate request to Master 1 and Master 2
3. Master 1 and Master 2 invalidate the cache line (if present)
4. Master 0 is notified by the interconnect
5. Master 0 can now write to the address

[Diagram: Master 0, Master 1 and Master 2 sit above the Cache Coherent Interconnect, which connects to the Bus Slave. All three masters start with a copy holding 0xAABBCCDD. After the invalidate (Addr, Snoop, Resp), Master 1 and Master 2 hold no copy, and Master 0 writes 0x12345678.]
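The read and write sequences above can be sketched as a toy C model of the interconnect. Everything here is an assumption made for illustration: `system_t`, `coherent_read` and `coherent_write` are hypothetical names, each master caches a single line for one address, and real ACE/CHI transactions carry far more state.

```c
#include <stdbool.h>
#include <stdint.h>

enum { NUM_MASTERS = 3 };

/* One cached copy per master for a single address (hypothetical model). */
typedef struct {
    bool     valid[NUM_MASTERS];
    uint32_t data[NUM_MASTERS];
    uint32_t bus_slave;             /* memory behind the interconnect */
} system_t;

/* Coherent read: snoop the other masters first; fetch from the bus slave
 * only when no snoop hits. */
static uint32_t coherent_read(system_t *s, int requester)
{
    uint32_t value = s->bus_slave;

    for (int m = 0; m < NUM_MASTERS; m++) {
        if (m != requester && s->valid[m]) {
            value = s->data[m];     /* snoop hit: a peer cache supplies data */
            break;
        }
    }
    s->valid[requester] = true;
    s->data[requester]  = value;
    return value;
}

/* Coherent write: every other copy is invalidated before the requester
 * updates its own line. */
static void coherent_write(system_t *s, int requester, uint32_t value)
{
    for (int m = 0; m < NUM_MASTERS; m++) {
        if (m != requester)
            s->valid[m] = false;    /* steps 2 and 3: snoop and invalidate */
    }
    s->valid[requester] = true;     /* steps 4 and 5: notified, then write */
    s->data[requester]  = value;
}
```

The point of the ordering in `coherent_write` is that no stale copy can remain readable once the new value exists.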

Thank You!
Danke!
Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!
Copyright © 2020 Arm Limited (or its affiliates). All rights reserved.
Not to be reproduced by any means without prior written consent.

DynamIQ and Neoverse
Virtualization

Agenda
• Introduction
• Armv8-A Recap
• Virtualization Host Extensions

What is virtualization?

[Diagram: two Guest OSs, each running two Apps, on top of a Hypervisor]

Virtualization is the ability to create virtual machines that act like real machines
• These virtual machines can run their own OS, often referred to as a Guest OS

A Hypervisor or Virtual Machine Manager controls allocation of physical resources and execution time

The level of virtualization/abstraction can vary based on the use case

Type 1 and 2 hypervisors

• Type 1
‐ Hypervisor runs directly on the hardware
‐ Hypervisor controls all guest OSs
‐ Secure Boot passes execution directly to the Hypervisor
‐ Example: Xen
[Diagram, top to bottom: Apps / Guest OSs / Hypervisor / Hardware]

• Type 2 (or hosted virtualization)
‐ Host OS booted by Secure Boot handover
‐ Hypervisor sits on top of host OS
‐ Hypervisor only controls guests
‐ Host OS responsible for power management
‐ Example: KVM
[Diagram, top to bottom: Apps / Guest OSs / Hypervisor alongside host Apps / Host OS / Hardware]

Neoverse and DynamIQ virtualization features

New architectural features relevant to virtualization introduced in DynamIQ and Neoverse CPUs

Virtualization Host Extensions (VHE)
• Extended support for Type 2 hypervisors

Extended VMID size (8 bits to 16 bits)

Larger VA/IPA/PA size

Extended Stage 2 execute permissions

Agenda
• Introduction
• Armv8-A Recap
• Virtualization Host Extensions

Armv8-A virtualization
Armv8-A provides architectural support for virtualization
• Support for trapping system register accesses and instructions
• Two-stage translation for EL1/0
• Virtual exceptions
• Virtual timer in System Timer architecture

       Normal world                     Secure world
EL0    App  App  |  App  App            Trusted Service  Trusted Service
EL1    OS Kernel |  OS Kernel           Trusted Kernel
EL2    Hypervisor                       No EL2 in Secure world
EL3    Secure Monitor

Support is optional, and there is no support for virtualization in Secure state

Instruction/register trapping
The Hypervisor can be configured to trap certain instructions
• Configured through the Hypervisor Configuration Register (HCR_EL2)
• When trapped, the Exception Syndrome Register (ESR_EL2) provides information about the trapped instruction

Examples of what can be trapped:
• System instructions, e.g. cache and TLB maintenance instructions
• Accesses to the Auxiliary Control Register (ACTLR_EL1)
• Reads of ID registers
• WFE and WFI instructions
‐ Allows the Hypervisor to switch guests whenever the current guest attempts to enter a low power state

Some registers have dedicated controls for virtualization purposes, so no trapping is required
• For example, MIDR_EL1 and MPIDR_EL1 (reads from EL1 return the values programmed in VPIDR_EL2 and VMPIDR_EL2)
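As a minimal sketch of how a hypervisor might use ESR_EL2 after a trap, the C fragment below extracts the Exception Class field (bits [31:26]) and sorts it into the two trap types mentioned on this slide. The EC values 0x01 (trapped WFI/WFE) and 0x18 (trapped AArch64 MSR, MRS or System instruction) follow the Arm architecture; the type and function names are hypothetical, and a real handler would also decode the ISS field.

```c
#include <stdint.h>

#define ESR_EC_SHIFT   26u     /* Exception Class is ESR_EL2[31:26] */
#define ESR_EC_MASK    0x3Fu
#define ESR_EC_WFX     0x01u   /* trapped WFI or WFE */
#define ESR_EC_SYS64   0x18u   /* trapped MSR, MRS or System instruction */

/* Hypothetical classification used only for this sketch. */
typedef enum { TRAP_WFX, TRAP_SYSREG, TRAP_OTHER } trap_kind_t;

static unsigned esr_exception_class(uint64_t esr_el2)
{
    return (unsigned)((esr_el2 >> ESR_EC_SHIFT) & ESR_EC_MASK);
}

static trap_kind_t classify_trap(uint64_t esr_el2)
{
    switch (esr_exception_class(esr_el2)) {
    case ESR_EC_WFX:   return TRAP_WFX;     /* e.g. schedule another guest */
    case ESR_EC_SYS64: return TRAP_SYSREG;  /* e.g. emulate the access */
    default:           return TRAP_OTHER;
    }
}
```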

Second stage translation

• The virtualization extension introduces a second stage of translation

• First stage: Virtual Address (VA) → Intermediate Physical Address (IPA)
‐ Guest OSs still use TTBRn_EL1/TCR_EL1

• Second stage: Intermediate Physical Address (IPA) → Physical Address (PA)
‐ Controlled by the Hypervisor

[Diagram: the Virtual Memory Map (under control of Guest OS: Peripherals, OS Kernel, App) is mapped by translation tables at TTBR0/1_EL1 onto the Physical Memory Map as seen by the Guest OS and controlled by the Hypervisor (Peripherals, Flash, RAM), which is mapped by translation tables at VTTBR_EL2 onto the real Physical Memory Map (Peripherals, RAM, Flash).]

Stage 2 memory management

Allows a Hypervisor to ‘pretend’ to an OS that it has a particular physical memory map
• Location in real Physical Memory is transparent to OS
‐ May be changed dynamically without OS knowledge
• Similar in principle to the OS mapping virtual memory for an application

“Physical” addresses from the OS are now called “Intermediate Physical Addresses”
• Translated to real PA by second stage of translation, under control of a Hypervisor

Hypervisor configures and enables stage 2 translations
• All addresses from EL0 and EL1 will still be translated by MMU hardware
• Hypervisor is not called for each translation
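The VA → IPA → PA chain can be illustrated with a deliberately simplified C sketch. It assumes flat, single-level tables indexed by 4KB page number, which is not how Armv8-A translation tables are laid out; `flat_table_t`, `translate` and `va_to_pa` are hypothetical names.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)
#define NUM_PAGES  4u
#define FAULT      UINT64_MAX

/* Flat table: index = input page number, value = output page number. */
typedef struct {
    uint64_t out_page[NUM_PAGES];
} flat_table_t;

/* One stage of translation; the page offset passes through unchanged. */
static uint64_t translate(const flat_table_t *t, uint64_t in)
{
    uint64_t page = in >> PAGE_SHIFT;

    if (page >= NUM_PAGES || t->out_page[page] == FAULT)
        return FAULT;
    return (t->out_page[page] << PAGE_SHIFT) | (in & PAGE_MASK);
}

/* Stage 1 is owned by the guest OS, stage 2 by the hypervisor. */
static uint64_t va_to_pa(const flat_table_t *stage1,
                         const flat_table_t *stage2, uint64_t va)
{
    uint64_t ipa = translate(stage1, va);

    if (ipa == FAULT)
        return FAULT;               /* stage 1 fault, handled by the guest */
    return translate(stage2, ipa);  /* a fault here goes to the hypervisor */
}
```

Changing an entry in `stage2` moves the guest's "physical" page without the guest's stage 1 tables changing, which is the transparency the slide describes.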

Translation regimes
OS will create translation tables as normal; these are “Stage 1” tables
• The OS programs TTBR0_EL1 and TTBR1_EL1 with IPA of tables
• This address will be translated by second stage tables for a page table walk
• Pointers to further levels in stage 1 are also translated from IPA → PA

Stage 2 regime is configured from EL2, but used for EL0 & 1 accesses

[Diagram: the EL0/1 Virtual Memory Map (OS Kernel, App) is translated by Guest OS translation tables (TTBRn_EL1) to the EL0/1 Intermediate Physical Memory Map (App & OS Kernel), which is translated by translation tables at VTTBR_EL2 to the EL0/1/2 Physical Memory Map (App, OS Kernel, & Hypervisor). The EL2 Virtual Memory Map (Hypervisor) is translated by Hypervisor translation tables (TTBR0_EL2) directly to the same Physical Memory Map.]

Stage 2 translation overhead

Stage 2 tables consume additional memory

A VA → IPA → PA translation can also require many memory accesses
• Up to 24 descriptor fetches!

Once fetched, the entire VA → PA translation may be stored in the TLB
• Intermediate steps of the translation may be cached as well
• This is a decision for the implementation – not architecturally specified

A TLB hit may have the same performance as without stage 2 translations

Likely to be more pressure on TLB

2nd stage tables will typically map far more memory than 1st stage tables
• Can map in larger blocks to mitigate this
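One way to arrive at the figure of 24 is sketched below, assuming four levels of lookup at each stage (an assumption; the slide does not state the configuration). Every stage 1 table address is an IPA, so each stage 1 descriptor fetch is preceded by a full stage 2 walk, and the final output IPA needs one more stage 2 walk. The function name is hypothetical.

```c
/* Worst-case descriptor fetches for a nested (two-stage) table walk,
 * with no TLB or walk-cache hits. */
static unsigned nested_walk_fetches(unsigned s1_levels, unsigned s2_levels)
{
    /* Per stage 1 level: a stage 2 walk of the table's IPA, then the
     * stage 1 descriptor fetch itself. */
    unsigned per_s1_level = s2_levels + 1u;

    /* The IPA produced by stage 1 needs one final stage 2 walk. */
    unsigned final_walk = s2_levels;

    return s1_levels * per_s1_level + final_walk;
}
```

With four levels at each stage this gives 4 × 5 + 4 = 24, and with stage 2 disabled (zero stage 2 levels) it falls back to the four fetches of a plain stage 1 walk.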

VMID
Each virtual machine is assigned a VMID (Virtual Machine ID)
• 16-bit value stored in VTTBR or VTTBR_EL2

Legacy support for 8-bit VMID (the maximum VMID size on CPUs that predate DynamIQ and Neoverse)
• VTCR_EL2.VS selects VMID size:
‐ 0 – 8-bit
‐ 1 – 16-bit
• ID_AA64MMFR1_EL1 indicates whether 16-bit VMIDs are supported

Similar to ASIDs, VMIDs are used to tag address translations as belonging to a particular virtual machine
• VMID is significant, even when virtualization is disabled, so it should always be configured

For guest accesses, TLBs store complete VA→IPA→PA translation in one entry
• VMID ensures that only the correct virtual machine can hit on TLB entry
• May remove the need to invalidate TLBs on switching between guests
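A minimal C sketch of the VMID handling described above. The field positions (ID_AA64MMFR1_EL1.VMIDBits in bits [7:4], with 0b0010 meaning 16 bits; VMID in VTTBR_EL2[63:48], narrowed to [55:48] for 8-bit VMIDs) follow the Arm architecture as I understand it; the function names and the base-address mask covering bits [47:1] are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MMFR1_VMIDBITS_SHIFT 4u
#define MMFR1_VMIDBITS_MASK  0xFu
#define MMFR1_VMIDBITS_16    0x2u
#define VTTBR_VMID_SHIFT     48u
#define VTTBR_BADDR_MASK     0x0000FFFFFFFFFFFEull  /* bits [47:1], assumed */

/* True when the ID register value reports 16-bit VMID support. */
static bool vmid16_supported(uint64_t id_aa64mmfr1_el1)
{
    uint64_t field = (id_aa64mmfr1_el1 >> MMFR1_VMIDBITS_SHIFT)
                     & MMFR1_VMIDBITS_MASK;
    return field == MMFR1_VMIDBITS_16;
}

/* Build a VTTBR_EL2 value from a stage 2 table address and a VMID.
 * With 8-bit VMIDs (VTCR_EL2.VS == 0) the upper VMID bits are kept zero. */
static uint64_t make_vttbr(uint64_t table_pa, uint16_t vmid, bool use_vmid16)
{
    uint64_t id = use_vmid16 ? vmid : (vmid & 0xFFu);

    return (id << VTTBR_VMID_SHIFT) | (table_pa & VTTBR_BADDR_MASK);
}
```

A hypervisor would read the ID register once at boot, set VTCR_EL2.VS accordingly, and then program a distinct VMID for each guest when switching VTTBR_EL2.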

Device interrupt routing

Device drivers
• May need to service interrupts generated by a device
• The interrupt may not be for the currently running guest OS
• Hypervisor intercepts the interrupt and redirects it to the appropriate guest OS

[Diagram: Guest OS (A), which contains the UART Driver, and Guest OS (B) run above the Hypervisor. The UART Device generates an IRQ while OS (B) is running.]

Interrupt routing is controlled by HCR_EL2

Virtualizing exceptions

[Diagram: an IRQ arrives with an open question of where it is taken: App (EL0), Guest OS (EL1) or Hypervisor (EL2)]

Asynchronous exceptions (e.g. IRQs) can be routed to a specific EL
• Typically we want the hypervisor to deal with exceptions from real peripherals

Armv8-A includes support for virtual exceptions
• Virtual exceptions can be registered by the hypervisor
• Virtual exceptions are only seen by the guest (Non-secure EL0/1)
‐ Behave like physical/real exceptions from the perspective of the guest

Interrupt routing to EL2 / Hypervisor

Non-secure state
SCR_EL3.[ EA = 0, FIQ = 0, IRQ = 0 ]
HCR_EL2.[ AMO = 0, FMO = 0, IMO = 1 ]

[Diagram: while the App (EL0) or Guest OS (EL1) is running, FIQ and SError are taken at EL1 and IRQ is routed to the Hypervisor (EL2). While the Hypervisor (EL2) is running, IRQ is taken at EL2, and FIQ and SError are pended.]
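The routing shown on this slide can be sketched as a C function that picks the target Exception level for an asynchronous exception arriving while the core runs at Non-secure EL0 or EL1. The bit positions (SCR_EL3.IRQ/FIQ/EA at bits 1/2/3, HCR_EL2.FMO/IMO/AMO at bits 3/4/5) are architectural; the function and enum names are hypothetical, and the sketch deliberately ignores HCR_EL2.TGE and Secure state.

```c
#include <stdint.h>

#define SCR_EL3_IRQ (1ull << 1)
#define SCR_EL3_FIQ (1ull << 2)
#define SCR_EL3_EA  (1ull << 3)
#define HCR_EL2_FMO (1ull << 3)
#define HCR_EL2_IMO (1ull << 4)
#define HCR_EL2_AMO (1ull << 5)

typedef enum { ASYNC_FIQ, ASYNC_IRQ, ASYNC_SERROR } async_t;

/* Target EL for a physical asynchronous exception taken from
 * Non-secure EL0/EL1 (simplified: HCR_EL2.TGE is not modelled). */
static int async_target_el(uint64_t scr_el3, uint64_t hcr_el2, async_t kind)
{
    static const uint64_t scr_bit[] = { SCR_EL3_FIQ, SCR_EL3_IRQ, SCR_EL3_EA };
    static const uint64_t hcr_bit[] = { HCR_EL2_FMO, HCR_EL2_IMO, HCR_EL2_AMO };

    if (scr_el3 & scr_bit[kind])
        return 3;               /* routed to the Secure Monitor */
    if (hcr_el2 & hcr_bit[kind])
        return 2;               /* routed to the Hypervisor */
    return 1;                   /* taken by the guest OS */
}
```

With the slide's settings (all three SCR_EL3 bits clear, only HCR_EL2.IMO set), IRQs go to EL2 while FIQs and SErrors go to EL1.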

Virtual exceptions
There are three virtual exceptions: Virtual SError, Virtual IRQ, Virtual FIQ
• Can only be signalled if the corresponding physical exceptions are configured to be routed to EL2

There are two mechanisms to signal a virtual interrupt to a guest OS
• Using a GIC (Generic Interrupt Controller)
• Using system register HCR_EL2, for example, HCR_EL2.VI

Virtual exceptions are always masked when executing in EL2 and EL3
• When enabled and pending, they will be taken when the core returns to EL0/EL1
• In EL0/EL1, each virtual exception is masked by the corresponding PSTATE mask bits
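The three conditions above (routing enabled, pending, not masked) can be combined into one predicate. This is a sketch for the virtual IRQ only, using the register-based mechanism: HCR_EL2.IMO is bit 4 and HCR_EL2.VI is bit 7 in the architecture, while the function name and the way the current EL and PSTATE.I are passed in are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define HCR_EL2_IMO (1ull << 4)   /* physical IRQs routed to EL2 */
#define HCR_EL2_VI  (1ull << 7)   /* virtual IRQ pending */

/* True when a virtual IRQ signalled through HCR_EL2.VI would be taken. */
static bool virq_taken(uint64_t hcr_el2, unsigned current_el, bool pstate_i)
{
    bool enabled = (hcr_el2 & HCR_EL2_IMO) != 0;  /* routing precondition */
    bool pending = (hcr_el2 & HCR_EL2_VI) != 0;
    bool in_guest = current_el <= 1u;             /* masked at EL2 and EL3 */

    return enabled && pending && in_guest && !pstate_i;
}
```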

Virtual Interrupts (GICv2 & GICv3)

The interrupt controller can signal “physical” interrupts and “virtual” interrupts
• Virtual interrupts from the interrupt controller can only be taken by a virtual machine

Interrupt controller provides multiple CPU interfaces
• Physical CPU interface – Used by Hypervisor and Secure world for physical interrupts
• Virtual CPU interface – Used by virtual machine to handle virtual interrupts
• Virtual interface control – Used by Hypervisor to configure the Virtual CPU interface

[Diagram: inside the Interrupt Controller, the Distributor feeds the Physical CPU interface (IRQ, FIQ) and the Virtual CPU interface (VIRQ, VFIQ), both connected to CPU 0]

Virtual interrupt signalling (GIC)

1. External IRQ arrives at the GIC
2. GIC signals a PHYSICAL IRQ to the CPU
3. CPU moves to EL2, Hypervisor reads the interrupt status from the Physical CPU Interface
4. Hypervisor writes to the GIC List Register to register a VIRTUAL IRQ → pending state
5. GIC asserts vIRQ signal to CPU
6. EL2 returns to EL0 or EL1, back to virtual machine
7. CPU takes an IRQ exception, and Guest OS running on the virtual machine reads the interrupt status from the Virtual CPU Interface

[Diagram: External Interrupt Source → Distributor → Physical CPU interface and Virtual CPU interface → CPU. The physical IRQ is handled by the Hypervisor; the vIRQ from the Virtual CPU interface is handled by the Guest OS.]
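Step 4 can be sketched as composing a List Register value. The layout below follows my reading of the GICv3 ICH_LR<n>_EL2 format (State in bits [63:62] with 0b01 meaning pending, Group in bit 60, Priority in bits [55:48], virtual INTID in bits [31:0]); treat it as an assumption and check the GIC specification for your implementation. The function name is hypothetical, and the sketch covers only a software-injected Group 1 interrupt with no link to a physical INTID.

```c
#include <stdint.h>

#define LR_STATE_SHIFT    62u
#define LR_STATE_PENDING  1ull
#define LR_GROUP1         (1ull << 60)
#define LR_PRIORITY_SHIFT 48u

/* Build a List Register value that presents vintid to the guest as a
 * pending Group 1 virtual interrupt (HW bit left clear). */
static uint64_t make_pending_lr(uint32_t vintid, uint8_t priority)
{
    return (LR_STATE_PENDING << LR_STATE_SHIFT)
         | LR_GROUP1
         | ((uint64_t)priority << LR_PRIORITY_SHIFT)
         | (uint64_t)vintid;
}
```

Once such a value is written to a free List Register, the virtual CPU interface asserts vIRQ (step 5), and the guest acknowledges the interrupt through its own CPU interface registers after the return to EL1.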

Generic Timer
Counter Module generates a fixed-frequency incrementing count, which is distributed to all cores

Each core implements a number of comparators
• Secure physical EL1
• Non-secure physical EL2
• Non-secure EL1 physical and virtual comparators
• EL2 virtual timer (added in DynamIQ and Neoverse CPUs)
• Accessed via system registers
• Interrupt can be generated when count >= comparator

[Diagram: the Counter Module distributes the count to the comparators of Core 0 and Core 1 in each of two multi-core processors]

Virtual count and timer

The virtual timer operates in the same way as the physical timers
• BUT compares against the virtual count, not the physical count
• Where: Virtual count = physical count – offset

Virtual Count (CNTVCT_EL0) = Physical Count (CNTPCT_EL0) - Virtual Offset (CNTVOFF_EL2)

Guest access to physical registers controlled by EL2
• Access to Physical Count register controlled by CNTHCTL_EL2.EL1PCTEN
• Access to NS EL1 Physical Timer controlled by CNTHCTL_EL2.EL1PCEN
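The relationship between the counts, and the comparator condition from the Generic Timer slide, can be written as a short C sketch. The function names are hypothetical; the values stand in for CNTPCT_EL0, CNTVOFF_EL2 and a timer compare value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Virtual count as seen by a guest: physical count minus the offset
 * programmed by the hypervisor. */
static uint64_t virtual_count(uint64_t cntpct, uint64_t cntvoff)
{
    return cntpct - cntvoff;
}

/* Timer condition from the slide: fires when count >= comparator. */
static bool timer_condition_met(uint64_t count, uint64_t compare_value)
{
    return count >= compare_value;
}
```

A hypervisor can, for example, set the offset to the physical count at the moment a guest is created so that the guest's virtual count starts from zero.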

Agenda
• Introduction
• Armv8-A Recap
• Virtualization Host Extensions

Virtualization Host Extensions (AArch64 only)

Armv8-A virtualization support is best suited to Type 1 hypervisors
• Hypervisor runs in EL2, with greater privilege than guest OSs running at EL1

Not as well suited to Type 2 hypervisors (where the hypervisor runs “over” a host OS)
• EL2 not suited for running host OS; requires lots of system register manipulation

Virtualization Host Extensions (VHE) make it easier to run common kernel code at EL2
• Only available when EL2 is AArch64

When the EL2 Host is enabled (HCR_EL2.E2H==1):
• EL2 virtual address space gains an upper translation region, described by TTBR1_EL2
• Meaning of HCR_EL2.TGE==1 is redefined
‐ HCR_EL2.TGE controls whether EL0 uses the EL1&0 or the EL2 translation regime
• EL2 gains ASID support
• Accesses to _EL1 system registers at EL2 are redirected to the equivalent _EL2 registers
‐ _EL1 registers accessible using _EL12 alias
• Additional virtual counter for EL2

EL2 system register access when E2H==1

To allow the same OS binary to run either in EL1 or EL2, when HCR_EL2.E2H==1, accesses to _EL1 registers are redirected to _EL2 equivalents

EL2: MSR TTBR0_EL1, x0     → TTBR0_EL2 when E2H==1 (TTBR0_EL1 otherwise)

_EL1 registers accessible via _EL12 alias

EL2: MSR TTBR0_EL12, x0    → TTBR0_EL1 (E2H==1)
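The redirection can be modelled as a small lookup, shown here for TTBR0 only. The enum and function names are hypothetical, and the `REG_UNDEFINED` result for the _EL12 alias when E2H==0 reflects my understanding that the alias exists only with E2H==1; treat that as an assumption.

```c
#include <stdbool.h>

/* Register name written in the instruction executed at EL2. */
typedef enum { ENC_TTBR0_EL1, ENC_TTBR0_EL12, ENC_TTBR0_EL2 } encoding_t;

/* Register that is actually accessed. */
typedef enum { REG_TTBR0_EL1, REG_TTBR0_EL2, REG_UNDEFINED } sysreg_t;

/* Resolve an EL2 access for a given HCR_EL2.E2H setting. */
static sysreg_t resolve_at_el2(encoding_t enc, bool e2h)
{
    switch (enc) {
    case ENC_TTBR0_EL1:
        return e2h ? REG_TTBR0_EL2 : REG_TTBR0_EL1;  /* redirected by VHE */
    case ENC_TTBR0_EL12:
        return e2h ? REG_TTBR0_EL1 : REG_UNDEFINED;  /* alias to guest reg */
    case ENC_TTBR0_EL2:
        return REG_TTBR0_EL2;                        /* always direct */
    }
    return REG_UNDEFINED;
}
```

This is why an unmodified kernel that writes TTBR0_EL1 still works when it runs at EL2 as the host, while the hypervisor part uses the _EL12 names to reach a guest's EL1 state.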

Virtual address spaces when E2H==1

Context                    Stage 1 tables (base register)                  Stage 2 tables (base register)
Guest OS (EL1)             Guest OS translation tables (TTBR1_EL1)         Translation tables (VTTBR_EL2)
Guest Application (EL0)    Application translation tables (TTBR0_EL1)      Translation tables (VTTBR_EL2)
Host OS (EL2)              Host OS translation tables (TTBR1_EL2)          none
Host Application (EL0)     Host Application translation tables (TTBR0_EL2) none
Secure Mon (EL3)           Monitor translation tables (TTBR0_EL3)          none

• Guest: virtual memory map, under control of Guest OS(s) → physical memory map seen by guest (IPA: Peripherals, FLASH, RAM) → real physical memory map (Peripherals, RAM, FLASH)
• Host: virtual memory space seen by Host OS and Hypervisor → real physical memory map
• Monitor: virtual memory space seen by Hypervisor & Secure Monitor → real physical memory map

Exception routing
When HCR_EL2.E2H==1 and HCR_EL2.TGE==1, all exceptions in EL0 are routed to EL2
• Unless routed to EL3

[Diagram: FIQ, IRQ and SError arriving while the Application runs at EL0 bypass the Guest OS (EL1) and are delivered to the Hypervisor (EL2)]

Overhead without VHE

Host OS must run in both EL1 and EL2
• Requires kernel code to be aware where it is running

Calls to EL2 functions must be via HVC trap
• Required for any access to EL2 registers
• Configuring Stage 2 translations

Moving from Guest to Host OS requires a full context switch of EL1 system registers

Pre-DynamIQ and Neoverse Type 2 virtualization example (diagram):
• EL1, Highvisor (Host OS Kernel): function call to run the VM, issued as an HVC
• EL2, Lowvisor (Hypervisor), on the trap towards the VM: perform context switch between Host and VM execution context; configure VGIC and virtual timer; set up stage 2 translation registers; enable stage 2 translation; enable traps
• EL2, Lowvisor (Hypervisor), on the trap back from the VM: perform context switch between VM and Host execution context; disable virtual interrupt; disable stage 2 translation; disable traps

DynamIQ and Neoverse Type 2 virtualization

When HCR_EL2.E2H==1:
• Meaning of HCR_EL2.TGE==1 is redefined
‐ HCR_EL2.TGE controls whether EL0 uses the EL1&0 or the EL2 translation regime

Host OS runs at EL2
• Host configuration stored in EL2 system registers
• No need to context switch when moving to a guest

Guest OS(s) run at EL1
• Guest configurations stored in EL1 system registers
• Context switch only required when moving between guests

Non-secure state    HCR_EL2.TGE==0                   HCR_EL2.TGE==1
EL0                 Guest App(s) | Guest App(s)      Host App(s)
EL1                 Guest OS     | Guest OS
EL2                 Hypervisor / Host OS
EL3                 Secure Monitor

Figure: DynamIQ and Neoverse Type 2 Virtualization

Running Host OS

Host OS runs at EL2

Translation regime has split tables and ASIDs
• HCR_EL2.E2H==1

“EL1” AT commands and TLB invalidates operate on the EL2 regime and use ASIDs
• HCR_EL2.TGE==1

Configuration stored in EL2 system registers
• Accessed using EL1 opcodes

[Diagram: Non-secure state. EL0: Guest App(s), Guest App(s), Host App(s); EL1: Guest OS, Guest OS; EL2: Hypervisor / Host OS]

            HCR_EL2.E2H   HCR_EL2.TGE
Host OS     1             1

Running Host OS application

Host OS apps run at EL0

HCR_EL2.TGE==1
• EL0 uses the EL2 translation regime
• Exceptions at EL0 routed to EL2
• Stage 2 translations are disabled

HCR_EL2.E2H==1
• EL2 translation regime has TTBR1
• Uses ASIDs

Note: behaviour of TGE is changed to allow EL0 exceptions to be masked by PSTATE.{I,F,A}

[Diagram: Non-secure state. EL0: Guest App(s), Guest App(s), Host App(s); EL1: Guest OS, Guest OS; EL2: Hypervisor / Host OS]

            HCR_EL2.E2H   HCR_EL2.TGE
Host OS     1             1
Host app    1             1

Running Guest OS

Guest OS runs at EL1

No Host OS context switch required
• Host OS configuration still in EL2 registers
• Guest OS configuration in EL1 registers

HCR_EL2.TGE==0
• Allows execution at EL1
• Allows exception routing to EL1

HCR_EL2.E2H only matters for EL2
• Would probably still be set

[Diagram: Non-secure state. EL0: Guest App(s), Guest App(s), Host App(s); EL1: Guest OS, Guest OS; EL2: Hypervisor / Host OS]

            HCR_EL2.E2H   HCR_EL2.TGE
Host OS     1             1
Host app    1             1
Guest OS    x             0

Running Guest OS app

Guest OS application runs at EL0

[Diagram: Non-secure state. EL0: Guest App(s), Guest App(s), Host App(s); EL1: Guest OS, Guest OS; EL2: Hypervisor / Host OS]

            HCR_EL2.E2H   HCR_EL2.TGE
Host OS     1             1
Host app    1             1
Guest OS    x             0
Guest app   x             0
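The E2H/TGE table built up over the last four slides can be condensed into two small helpers. The names are hypothetical, and the exception target is simplified: it covers only exceptions from EL0 that are not routed to EL3 and ignores individual trap controls.

```c
#include <stdbool.h>

typedef enum { REGIME_EL1_AND_0, REGIME_EL2_AND_0 } regime_t;

/* Translation regime used by code running at EL0. Only the combination
 * E2H==1, TGE==1 (a host application) selects the EL2 regime. */
static regime_t el0_translation_regime(bool e2h, bool tge)
{
    return (e2h && tge) ? REGIME_EL2_AND_0 : REGIME_EL1_AND_0;
}

/* Exception level that receives exceptions taken from EL0: the host OS
 * at EL2 when TGE==1, the guest OS at EL1 when TGE==0. */
static int el0_exception_target(bool tge)
{
    return tge ? 2 : 1;
}
```

Switching between a host application and a guest application is therefore mostly a matter of changing HCR_EL2.TGE, with E2H left set.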

Virtual count and timer when E2H==1

When HCR_EL2.E2H==1
• EL2 accesses to the EL1 virtual timer registers access the new EL2 virtual timer
• EL2 accesses to the EL1 physical timer registers access the EL2 physical timer
• Virtual offset (CNTVOFF_EL2) is treated as zero when CNTVCT_EL0 is read from EL2

When HCR_EL2.E2H==1 and HCR_EL2.TGE==1
• Accesses to the EL1 virtual and physical timers access the EL2 timers
• Virtual offset (CNTVOFF_EL2) is treated as zero when CNTVCT_EL0 is read from EL2 and EL0

DynamIQ and Neoverse
Synchronization

Agenda

• Synchronization background
• Enforced atomicity
• Measured atomicity
• Local and global exclusive monitors

The race for atomicity

With single-threaded data access, atomicity is implicit

In a multi-threaded situation, data shared between threads is vulnerable
• Once a thread has read the data, there is a race to write back the modified data before any other thread accesses it
• Losing this race violates atomicity
• This is a Bad Thing

Thread 1 / Core 1            Thread 2 / Core 2
Increment:
LDR X0, =shared              Increment:
LDR X1, [X0]                 LDR X0, =shared
ADD X1, X1, #1               LDR X1, [X0]
STR X1, [X0]                 ADD X1, X1, #1
                             STR X1, [X0]
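The lost update can be reproduced deterministically by writing out the interleaving from the slide in a single thread. This C sketch is only an illustration: the function name is hypothetical, and the local variables stand in for register X1 of each thread.

```c
/* Replays the interleaving from the slide: both threads load the shared
 * value before either one stores its result. */
static int racy_interleaving(int shared)
{
    int t1 = shared;    /* Thread 1: LDR X1, [X0] */
    int t2 = shared;    /* Thread 2: LDR X1, [X0], before Thread 1 stores */

    t1 += 1;            /* Thread 1: ADD X1, X1, #1 */
    shared = t1;        /* Thread 1: STR X1, [X0] */

    t2 += 1;            /* Thread 2: ADD X1, X1, #1 */
    shared = t2;        /* Thread 2: STR overwrites Thread 1's update */

    return shared;
}
```

Two increments were executed, yet the value only grows by one, which is the atomicity violation the slide warns about.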

Critical sections

Any set of compound operations needing to be atomic can be considered a critical section of code
• This becomes important if multiple threads have access to the same data

Some form of synchronization is needed to protect these sections
• Common terms: semaphore, flag, lock
• Many different algorithms can be used to implement synchronization

Increment:
    BL  lock
    LDR X0, =shared
    LDR X1, [X0]          ; critical section read
    ADD X1, X1, #1        ; critical section modify
    STR X1, [X0]          ; critical section write
    BL  unlock

The lock() function

The simplest lock function can be expressed as shown
• But there is still a problem with this function

lock:
    read lock_variable                  ; critical section read
    if (lock_variable is UNLOCKED)
        set lock_variable = LOCKED      ; critical section write
    else
        goto lock

This is still an improvement over not using lock()
• Critical section read–write race has been isolated to this function
• But some way is needed to make the read & write of the lock variable atomic

Atomicity in Arm DynamIQ and Neoverse Processors

DynamIQ and Neoverse CPUs provide two mechanisms for atomically modifying data

Enforced Atomicity operations
• Introduced in Armv8.1-A
• Hardware guarantees that memory is modified atomically
– Can sometimes avoid critical code sections entirely, and thus remove the need for a lock() function
• Simpler to use in software
• Only work for an IMPLEMENTATION DEFINED subset of memory types

Load/Store Exclusive operations
• Atomicity is “measured”, not enforced: memory is only updated when doing so is effectively atomic
• Will interwork with legacy processors which do not support the new enforced atomicity operations
– For example DynamIQ cluster(s) + Cortex-M3 power controller

Atomic Memory Accesses

Armv8.1-A introduced three sets of atomic memory access instructions to AArch64
• Compare and Swap: CAS and its variants
• Atomic memory operations: LD<OP> and ST<OP>
– The set of available operations for <OP> will be explained shortly
• Swap: SWP

Processors are required to support atomic accesses for memory which is
• Inner Shareable, Inner Write-Back, Outer Write-Back Normal with Read allocation hints and Write allocation hints and not transient
• Outer Shareable, Inner Write-Back, Outer Write-Back Normal with Read allocation hints and Write allocation hints and not transient

Support for atomic accesses on other memory types will require fabric / interconnect support
• Currently only supported on CHI-based interconnects (such as CMN-600)

Compare and swap

Compare and swap single register: CAS <Rs>, <Rt>, [<Xn|SP>]
• R = X or W

tmp = *Xn;
if (*Xn == Rs)
    *Xn = Rt;
Rs = tmp;

Subword variants (CASB, CASH) are also provided
• W registers only
• Results are always zero-extended, not sign-extended

Compare and swap register pair: CASP <Rs>, <Rs+1>, <Rt>, <Rt+1>, [<Xn|SP>]
• Same behaviour as for CAS
• <Rs>:<Rs+1> and <Rt>:<Rt+1> are each treated as a single value (128-bit for X registers, 64-bit for W registers)
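
The CAS pseudocode above can be exercised with a small behavioural model. This is a sketch in Python under stated assumptions: memory is a plain dictionary holding one value per address at the access width, and the `cas` function and its `bits` parameter are hypothetical names, not part of any Arm tool or library.

```python
def cas(memory, addr, rs, rt, bits=64):
    """Model of CAS: returns the value written back to Rs (the old memory contents)."""
    mask = (1 << bits) - 1
    tmp = memory[addr] & mask
    if tmp == (rs & mask):       # compare the expected value with memory
        memory[addr] = rt & mask # swap in the new value only on a match
    return tmp                   # zero-extended, never sign-extended

mem = {0x1000: 5}
print(cas(mem, 0x1000, rs=5, rt=9), mem[0x1000])   # match: memory updated
print(cas(mem, 0x1000, rs=5, rt=7), mem[0x1000])   # mismatch: memory unchanged
```

Software detects success by comparing the returned value with the expected value it passed in: they are equal only when the swap took place.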

Atomic memory operations (1): Mnemonics

All instructions both read & write memory
• Difference is that load operations return a result, store operations do not

Atomic loads: LD<OP> <Rs>, <Rt>, [<Xn|SP>]
• R = X or W

tmp = *Xn;
*Xn = *Xn <OP> Rs;
Rt = tmp;

Atomic stores: ST<OP> <Rs>, [<Xn|SP>]
• R = X or W

*Xn = *Xn <OP> Rs

Subword variants are LD<OP>H, LD<OP>B, ST<OP>H and ST<OP>B
• W registers only
• Results (for LD variants) are always zero extended

Atomic memory operations (2): the operations

Atomic memory operations are

<OP>   Description               <OP>   Description
ADD    Atomic add                CLR    Atomic bit clear
EOR    Atomic Exclusive OR       SET    Atomic bit set
SMAX   Atomic signed maximum     SMIN   Atomic signed minimum
UMAX   Atomic unsigned maximum   UMIN   Atomic unsigned minimum

Example usages

LDEOR X0, X0, [X14]
LDCLR X1, X8, [SP]
STSMAXH W7, [X2]
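
The LD<OP> pseudocode and the operation table can be combined into one behavioural model. The sketch below is Python, not Arm code: `ld_op`, `OPS` and `_signed` are hypothetical names, memory is a dictionary holding one value per address at the access width, and hardware atomicity is simply assumed because the model is single-threaded.

```python
def _signed(value, bits):
    """Interpret an unsigned value of the given width as two's complement."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

OPS = {
    "ADD":  lambda m, r, bits: m + r,
    "CLR":  lambda m, r, bits: m & ~r,
    "EOR":  lambda m, r, bits: m ^ r,
    "SET":  lambda m, r, bits: m | r,
    "SMAX": lambda m, r, bits: max(m, r, key=lambda v: _signed(v, bits)),
    "SMIN": lambda m, r, bits: min(m, r, key=lambda v: _signed(v, bits)),
    "UMAX": lambda m, r, bits: max(m, r),
    "UMIN": lambda m, r, bits: min(m, r),
}

def ld_op(op, memory, addr, rs, bits=64):
    """Model of LD<OP>: update memory and return the original contents (Rt)."""
    mask = (1 << bits) - 1
    tmp = memory[addr] & mask
    memory[addr] = OPS[op](tmp, rs & mask, bits) & mask
    return tmp

mem = {0x2000: 6}
print(ld_op("ADD", mem, 0x2000, 1), mem[0x2000])   # load returns the original value
```

An ST<OP> behaves the same way on memory but discards the returned value.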

Swap

Swap memory contents: SWP <Rs>, <Rt>, [<Xn|SP>]
• R = X or W

tmp = *Xn;
*Xn = Rs;
Rt = tmp;

Subword variants (SWPB, SWPH) are also provided
• W registers only
• Results are always zero-extended, not sign-extended

Ordering Requirements

The atomic memory accesses are treated as performing both a load and a store
• For permission checking
• For data watchpoints

Atomic memory access instructions are provided with ordering options, which map to the architectural acquire and release definitions
• Acquire, A; Release, L; Acquire and Release, AL
– ST<OP> instructions only support the release variant

Example usages

CASPAL X0, X1, X2, X3, [X14]
SWPA X1, X8, [SP]
STUMAXLH W7, [X2]

How are atomics implemented?

The architecture does not specify how the atomics are implemented
• There are a number of possible approaches

[Figure: two example systems, each with three Arm cores, executing LDADD +1. Labels: Coherent Cache, Interconnect, Cache Coherent Interconnect, Memory. In one system the value is updated 6 → 7; in the other it is updated 2 → 3 and the load returns the original value (2).]

Simple lock() implementation using atomics

A lock variable is stored in memory
• Address passed in X0
• Variable takes a value of LOCKED or UNLOCKED

#define LOCKED 1
#define UNLOCKED 0

; void lock(lock_t *ptr)
lock:
    MOV W1, #LOCKED
    ; The unsigned maximum of (UNLOCKED, LOCKED) should
    ; return UNLOCKED if we’ve just set LOCKED
5:  LDUMAXA W1, W1, [X0]
    CBNZ W1, 5b
    RET

; void unlock(lock_t *ptr)
unlock:
    MOV W1, #UNLOCKED
    STLR W1, [X0]
    RET


Load/Store Exclusive operations

Measured atomicity is provided by the exclusive operations (R = X or W):
• Load Exclusive: LDXR <Rt>, [<Xn|SP>]
• Store Exclusive: STXR <Ws>, <Rt>, [<Xn|SP>]
– Ws indicates whether the store completed successfully (0 = success, 1 = failure; these are the only possible values)
• Load / Store Exclusive Pair also supported
– LDXP <Rt1>, <Rt2>, [<Xn|SP>]
– STXP <Ws>, <Rt1>, <Rt2>, [<Xn|SP>]
• Acquire/release variants also available, for all access sizes
– LDAXR / STLXR
• Subword variants are also provided
– W registers only

How does measured atomicity work?

An Exclusive Monitor is used to monitor a location between the exclusive load and store
• The LDXR/LDREX instruction causes the Exclusive Monitor to flag the accessed address as “exclusive”
• The STXR/STREX instruction checks whether the Exclusive Monitor is still in the exclusive state
– Store will only happen if exclusivity check passes

The Exclusive Monitor does not prevent another core or thread from reading or writing the monitored location
• It only monitors for whether the location has been written to since the LDXR/LDREX

The Exclusive Monitor is not a lock

[Figure: Exclusive Monitor state machine with two states, Open and Exclusive]
• Open → Exclusive: LDXR
• Exclusive → Exclusive: LDXR
• Exclusive → Open: STXR PASS or CLREX, or STXR to monitored location by another core
• Open → Open: STXR FAIL
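
The Open/Exclusive behaviour can be modelled in a few lines. This is a simplified Python sketch, not a description of any real implementation: the `LocalMonitor` class and its `others` argument (standing in for the coherency logic that tells other monitors about a store) are assumptions made for illustration.

```python
class LocalMonitor:
    """Toy model of one core's Exclusive Monitor tracking a single address."""

    def __init__(self):
        self.state, self.addr = "Open", None

    def ldxr(self, memory, addr):
        self.state, self.addr = "Exclusive", addr   # flag the address as exclusive
        return memory[addr]

    def observe_store(self, addr):
        if self.state == "Exclusive" and self.addr == addr:
            self.state, self.addr = "Open", None     # another core wrote the location

    def clrex(self):
        self.state, self.addr = "Open", None

    def stxr(self, memory, addr, value, others=()):
        """Return 0 if the store happened, 1 if it failed (like Ws)."""
        ok = self.state == "Exclusive" and self.addr == addr
        self.state, self.addr = "Open", None
        if not ok:
            return 1
        memory[addr] = value
        for monitor in others:                       # hypothetical coherency hook
            monitor.observe_store(addr)
        return 0

UNLOCKED, LOCKED = 0, 1
mem = {0x28: UNLOCKED}
t0, t1 = LocalMonitor(), LocalMonitor()
t0.ldxr(mem, 0x28)
t1.ldxr(mem, 0x28)
print(t0.stxr(mem, 0x28, LOCKED, others=[t1]))   # first store passes
print(t1.stxr(mem, 0x28, LOCKED, others=[t0]))   # second store fails
```

Neither load is blocked; only the second store is rejected, which is why the monitor is described as measuring atomicity rather than enforcing it.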

Simple lock() implementation using Exclusives

; void lock(lock_t *ptr)
lock:
    ; Is it locked?
    LDAXR W1, [X0]        ; Load current value of lock with “acquire” option
    CMP W1, #LOCKED       ; Compare with “LOCKED”
    B.EQ lock             ; If LOCKED, try again

    ; Attempt to lock
    MOV W1, #LOCKED
    STXR W2, W1, [X0]     ; Attempt to lock
    CBNZ W2, lock         ; If STXR failed, try again
    RET

; void unlock(lock_t *ptr)
unlock:
    MOV W1, #UNLOCKED     ; Release the lock
    STLR W1, [X0]
    RET

Multi thread lock example

Thread 0 and Thread 1 both call lock( 0x00800028 ) and run the same code:

lock:
    LDXR W1, [X0]
    CMP W1, #LOCKED
    B.EQ lock
    ; Attempt to lock
    MOV W1, #LOCKED
    STXR W2, W1, [X0]
    CBNZ W2, lock
    DMB SY
    RET

The slides step through the following states (Memory at virtual address 0x00080020):

Step 1: both threads enter lock()
• Memory: UNLOCKED
• Monitors: Thread 0 Open, Thread 1 Open
• Lock owner = ??

Step 2: Thread 0 executes LDXR
• Memory: UNLOCKED
• Thread 0: W1 = UNLOCKED, monitor Exclusive
• Thread 1: monitor Open
• Lock owner = ??

Step 3: Thread 1 executes LDXR
• Memory: UNLOCKED
• Thread 0: W1 = UNLOCKED, monitor Exclusive
• Thread 1: W1 = UNLOCKED, monitor Exclusive
• Lock owner = ??

Step 4: Thread 0 executes MOV W1, #LOCKED
• Memory: UNLOCKED
• Thread 0: W1 = LOCKED, monitor Exclusive
• Thread 1: W1 = UNLOCKED, monitor Exclusive
• Lock owner = ??

Step 5: Thread 0 executes STXR (PASSED)
• Memory: LOCKED
• Thread 0: W1 = LOCKED, monitor Open
• Thread 1: W1 = LOCKED, monitor Open
• Lock owner = Thread 0

Step 6: Thread 1 executes STXR (FAILED)
• Memory: LOCKED
• Thread 0: W1 = LOCKED, monitor Open
• Thread 1: W1 = LOCKED, monitor Open
• Lock owner = Thread 0

Step 7: Thread 1 branches back to lock and executes LDXR again
• Memory: LOCKED
• Thread 0: monitor Open
• Thread 1: W1 = LOCKED, monitor Exclusive
• Lock owner = Thread 0
• Thread 1 keeps looping (CMP, B.EQ lock) while the lock remains LOCKED


Where is the Exclusive Monitor?

A typical system will include multiple Exclusive Monitors
• One Local Monitor per core, capable of monitoring a single location
• One or more Global Monitors, capable of monitoring one location per core
• The Shareable and Cacheable attributes control which Exclusive Monitors are used

Non-shared
• Threads running on the same core only
• Uses Local Monitor only

Cacheable + Inner/Outer Shareable
• Threads running on any core within the domain
• Typically Local Monitors + coherency logic

Non-cacheable
• Threads running on different non-coherent cores
• Relies on Global Monitor in memory system

[Figure: a Multi-core Processor containing Core 0 and Core 1, each with a Local Monitor, plus Coherency Logic; a separate Processor with its own Local Monitor; both connect through the Interconnect to Memory, with a Global Monitor in the memory system.]

Context switching

The CLREX instruction can be used to clear the Local Exclusive Monitor’s state
• Available in A64, A32, and T32

Local Exclusive Monitor is also automatically cleared on returning from an exception
• Execution of ERET instruction in A64
• Any valid exception return instruction in A32 or T32

It is IMPLEMENTATION DEFINED whether the clearing of the Local Exclusive Monitor also clears the Global Exclusive Monitor

Granularity of Exclusive Monitor

The Exclusives Reservation Granule (ERG) is the granularity of the Exclusive Monitor
• Minimum spacing between addresses for the Monitor to distinguish between them
• The ERG of a core’s Exclusive Monitor is reported in the Cache Type Register
– CTR_EL0 in AArch64
– CP15 CTR in AArch32

Placing two locks within one ERG can lead to false negatives
• An STREX or STXR to either lock clears the exclusivity of both
• Architecturally-correct software will still function correctly – but may be less efficient

Typically, the ERG is one cache line
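
Whether two locks share a granule can be checked once the ERG is known. The sketch below is Python and assumes the CTR_EL0 layout in which the ERG field occupies bits [23:20] and holds log2 of the granule size in 4-byte words, with 0 meaning the size is not reported; `erg_bytes` and `same_granule` are hypothetical helper names.

```python
def erg_bytes(ctr_el0):
    """Decode the ERG field of a CTR_EL0 value into a granule size in bytes."""
    erg = (ctr_el0 >> 20) & 0xF
    if erg == 0:
        return None      # not reported: software must assume the architectural maximum
    return 4 << erg      # 2**erg words of 4 bytes each

def same_granule(addr_a, addr_b, granule):
    """True if both addresses fall inside the same reservation granule."""
    return addr_a // granule == addr_b // granule

granule = erg_bytes(0x4 << 20)   # example register value with ERG = 4
print(granule, same_granule(0x00800028, 0x00800030, granule))
```

With a 64-byte granule, locks at 0x00800028 and 0x00800030 share a granule, so a store to one can clear a reservation held on the other. Padding each lock to the granule size avoids this.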

Coherent lock example

P0 calls lock( 0x00800028 ) and P1 calls lock( 0x00800030 ). Both run the same code:

lock:
    LDXR W1, [X0]
    CMP W1, #LOCKED
    B.EQ lock
    ; Attempt to lock
    MOV W1, #LOCKED
    STXR W2, W1, [X0]     ; FAIL on P1
    CBNZ W2, lock
    DMB SY
    RET

Memory at virtual address 0x00080020
• UNLOCKED LOCKED

Local Monitors
• P0: W1 = LOCKED, monitor Open
• P1: W1 = LOCKED, monitor Open

WFE

If the lock is already taken software can wait for it to become available or yield to the scheduler

If it waits for the lock to become available, the WFE instruction can be used to enter standby mode to save power
• Core will wake on the next unmasked interrupt or on receipt of an Event

Events can be generated by:
• Executing SEV (send event) on any core
• Executing SEVL (send event local) on this core
• The Global Monitor being cleared
• Event stream from Generic Timer

Programs still have to be smart

A lock is really just a flag
• Exclusive access allows atomic access to the flag
• Any thread or program that accesses the flag can know that it was set correctly

The actual resource that the lock is protecting can still be accessed
• The lock, and any exclusive accesses, are only related to the resource by what the code says
• If a program ignores the lock, there’s nothing in the architecture stopping it from accessing that resource

Or, if a program does not use the exclusive access instructions...
• There is nothing special about the memory used for locks
• Once the exclusive access is over, it is just another piece of data in memory

These slides show a simple spinlock implementation with a busy-loop
• Algorithms can be optimized based on your use case

Thank You

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks

Copyright © 2020 Arm Limited (or its affiliates). All rights reserved.
Not to be reproduced by any means without prior written consent.

Programming the GIC
GICv3 and GICv4

Agenda

• Development of the GIC architecture (a brief history)
• Overview of the operation of a GIC
• Programming the GIC
• Initializing the GIC
• Configuring interrupts
• Interrupt handling

GIC versions

GICv1
Features:
• Up to 8 cores
• Up to 1020 interrupt IDs
• 8-bit priority
• Software Generated Interrupts
• TrustZone support
Implemented by:
• Cortex-A9 MPCore
• Cortex-A5 MPCore
• Cortex-R7 MPCore
• CoreLink™ GIC-390

GICv2
Adds:
• Support for virtualization
• Improved handling of Group 1 interrupts by secure software
Implemented by:
• Cortex-A15 MPCore*
• Cortex-A7 MPCore*
• CoreLink GIC-400

GICv3
Adds:
• Support for more than 8 cores
• Support for message signalled interrupts
• System Register access to some registers
• Vastly expanded the Interrupt ID space
Implemented by:
• Cortex-R52
• CoreLink GIC-500
• CoreLink GIC-600

GICv4
Adds:
• Direct injection of virtual interrupts

* Inclusion of GIC is optional


Interrupt types

SPI – Shared Peripheral Interrupt
• Peripheral interrupt, available to all the cores using the interrupt controller
• INTIDs 32 to 1019

PPI – Private Peripheral Interrupt
• Peripheral interrupt which is private to an individual core
• INTIDs 16 to 31

LPI – Locality-specific Peripheral Interrupt (new in GICv3)
• Peripheral interrupt, typically routed by an ITS
• INTIDs 8192+

SGI – Software Generated Interrupt
• Triggered by writing to a register within the interrupt controller
• INTIDs 0 to 15

INTIDs 1020 to 1023 are reserved
• Discussed later in the module
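
The INTID ranges above map directly onto a small classifier. This is an illustrative Python sketch; `classify_intid` is a hypothetical helper name, and INTIDs between 1024 and 8191, which this slide does not describe, are simply reported as "other".

```python
def classify_intid(intid):
    """Return the interrupt type for an INTID, using the ranges listed above."""
    if intid < 0:
        raise ValueError("INTID must be non-negative")
    if intid <= 15:
        return "SGI"
    if intid <= 31:
        return "PPI"
    if intid <= 1019:
        return "SPI"
    if intid <= 1023:
        return "reserved"
    if intid >= 8192:
        return "LPI"
    return "other"   # 1024 to 8191: not covered by the ranges on this slide

for example in (3, 27, 40, 1022, 8725):
    print(example, classify_intid(example))
```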

Interrupt states

Inactive
• interrupt is not active and not pending

Pending
• interrupt is asserted but not yet being serviced

Active (a)
• interrupt is being serviced but not yet complete

Active & Pending (a)
• interrupt is both active and pending

(a) Not applicable for LPIs.

[Figure: state machine linking the Inactive, Pending, Active, and Active and pending states]

Interrupt goes:
• Inactive → Pending when the interrupt is asserted
• Pending → Active when a CPU acknowledges the interrupt by reading the Interrupt Acknowledge Register (IAR)
• Active → Inactive when the same CPU deactivates the interrupt by writing the End of Interrupt Register (EOIR)
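
The transitions listed above can be written as a lookup table. The Python sketch below is a simplified model: the event names (`assert`, `acknowledge`, `deactivate`) are hypothetical labels, and the two transitions through the Active and pending state are added here as an assumption about how a re-asserted interrupt behaves, since the slide text only lists the three basic transitions.

```python
TRANSITIONS = {
    ("Inactive", "assert"): "Pending",
    ("Pending", "acknowledge"): "Active",             # read of the IAR
    ("Active", "deactivate"): "Inactive",             # write of the EOIR
    # Assumed extra transitions for an interrupt asserted again while active:
    ("Active", "assert"): "Active and pending",
    ("Active and pending", "deactivate"): "Pending",
}

def next_state(state, event):
    """Return the new state, or the unchanged state if the event has no effect."""
    return TRANSITIONS.get((state, event), state)

state = "Inactive"
for event in ("assert", "acknowledge", "deactivate"):
    state = next_state(state, event)
    print(event, "->", state)
```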

Interrupt security

GICv3 supports three group/security settings
• Set individually for each interrupt

Group 0
• Group 0 interrupts are always Secure
• Signalled as FIQ, regardless of current Security state
• Typically used for interrupts for the firmware running at EL3

Secure Group 1
• Signalled as FIQ if core is in Non-secure state
• Signalled as IRQ if core is in Secure state
• Typically used for interrupts for the trusted OS

Non-secure Group 1
• Signalled as FIQ if core is in Secure state
• Signalled as IRQ if core is in Non-secure state
• Typically used for interrupts for the rich OS or Hypervisor
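
The signalling rules above reduce to a short decision function. This is an illustrative Python sketch that encodes only what the bullets state; `signalled_as` and its string arguments are hypothetical names.

```python
def signalled_as(group, core_is_secure):
    """Return "FIQ" or "IRQ" for an interrupt group and the core's current Security state."""
    if group == "Group 0":
        return "FIQ"                                   # regardless of Security state
    if group == "Secure Group 1":
        return "IRQ" if core_is_secure else "FIQ"
    if group == "Non-secure Group 1":
        return "FIQ" if core_is_secure else "IRQ"
    raise ValueError("unknown group: " + group)

for group in ("Group 0", "Secure Group 1", "Non-secure Group 1"):
    print(group, signalled_as(group, True), signalled_as(group, False))
```

An interrupt is signalled as IRQ only when it belongs to the Group 1 of the Security state the core is currently in; everything else arrives as FIQ.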

Interrupt security - example

[Figure: the Normal world runs an App at EL0 and a Rich OS at EL1; the Secure world runs a Trusted Service at EL0 and a Trusted OS at EL1; the Secure Monitor runs at EL3. With SCR_EL3.FIQ=1 and SCR_EL3.IRQ=0, Non-secure Group 1 interrupts reach the Rich OS IRQ vector while in the Normal world, Secure Group 1 interrupts reach the Trusted OS IRQ vector while in the Secure world, and Group 0 interrupts, together with Group 1 interrupts belonging to the other world, reach the Secure Monitor FIQ vector.]


Interfaces

[Figure: a Distributor and an ITS connect to one Redistributor per core. Each Redistributor connects to a CPU interface, which drives the IRQ and FIQ inputs of its core. The example shows two Multi-Core Processors with two cores each.]

Distributor, Redistributor and ITS interfaces are memory mapped
• GICD_, GICR_ and GITS_ registers

CPU interface accessed as system registers
• ICC_, ICH_ and ICV_ registers

Configuration of the Distributor

The GICD_CTLR provides the top level controls for the Distributor

There are individual enable bits for each Group
• EnableGrp0, EnableGrp1NS and EnableGrp1S
‐ Only Secure accesses can access EnableGrp0 and EnableGrp1S
• Interrupts belonging to a disabled group cannot be forwarded to a core

There are also bits to select between GICv3 mode and legacy mode
• Legacy mode gives backwards compatibility with GICv2
• Controlled separately for Secure state (ARE_S) and Non-secure state (ARE_NS)
• These bits need to be set to 1 to enable GICv3 operation

Power management

The GICR_WAKER register is used to record whether a core is awake or asleep
• Awake cores can receive interrupts
• Asleep cores cannot, but the GIC will generate wake requests instead

Marking a core as asleep
• Software writes ProcessorSleep=1
• Software polls ChildrenAsleep until it reads 1
‐ GIC now considers the core to be asleep

Marking a core as awake
• Software writes ProcessorSleep=0
• Software polls ChildrenAsleep until it reads 0
• At reset, all cores are considered to be asleep

Core must be marked as awake before CPU interface is configured

[Figure: the Redistributor holds the ProcessorSleep and ChildrenAsleep bits and sends a Wake Request to the Power Controller, which controls the core's Reset and Power. The CPU interface drives the core's IRQ and FIQ inputs.]

CPU interface

Like the Distributor, the CPU interface can be set in GICv3 or legacy mode
• Controlled by the ICC_SRE_ELx.SRE bits
• Bits need to be set to enable GICv3 mode

Group enables
• Controlled by ICC_IGRPEN<n>_EL1 registers
• Interrupts belonging to a disabled Group cannot be delivered by the CPU interface

Each core has a priority mask
• ICC_PMR_EL1
• Only interrupts with a higher priority than the mask can be signalled as IRQs or FIQs
• Set to 0xFF to allow signalling of all interrupts

The binary point registers control pre-emption
• ICC_BPR<n>_EL1
• Sets how much higher priority an interrupt has to be in order to pre-empt an interrupt already being handled
• Set to 7 to disable pre-emption
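
The priority mask and binary point checks can be sketched numerically. The Python below is a simplified model under stated assumptions: lower numbers mean higher priority, a binary point value b places the group priority in bits [7:b+1] (the encoding used by ICC_BPR0_EL1; the Non-secure view of ICC_BPR1_EL1 is offset by one), and all 8 priority bits are implemented. The function names are hypothetical.

```python
def passes_mask(priority, pmr):
    """An interrupt can be signalled only if it is higher priority (numerically lower) than the mask."""
    return priority < pmr

def group_priority(priority, binary_point):
    """Keep bits [7:binary_point+1]; the remaining low bits are the subpriority."""
    return priority & (0xFF << (binary_point + 1)) & 0xFF

def can_preempt(new_priority, running_priority, binary_point):
    """A pending interrupt pre-empts only if its group priority is strictly higher."""
    return group_priority(new_priority, binary_point) < group_priority(running_priority, binary_point)

print(passes_mask(0x80, 0xFF))        # signalled
print(can_preempt(0x40, 0x80, 0))     # different group priority
print(can_preempt(0x40, 0x41, 0))     # same group priority, differs only in subpriority
```

In this model a binary point of 7 leaves no group priority bits, so no interrupt can pre-empt another. An interrupt must also pass the priority mask before pre-emption is considered.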


SPI, SGI and PPI configuration

Enable
• Only enabled interrupts can be forwarded to a core
• Disabled interrupts can still become pending

Priority configuration
• Each interrupt has an 8-bit priority associated with it
‐ Non-secure state can only access bottom half of priority range
• 0x00 is the highest priority, 0xFF is the lowest priority
• A GIC can implement fewer than 8 bits of priority

Configuration
• Whether the interrupt is level-sensitive or edge-triggered
• SGIs are always edge-triggered

Routing (SPIs only)
• Interrupt can target one named core, or any core

Security/Group

SPI configuration registers (Distributor)
• GICD_ISENABLER<n>
• GICD_ICFGR<n>
• GICD_IPRIORITYR<n>
• GICD_IROUTER<n>
• GICD_IGROUPR<n>
• GICD_IGRPMODR<n>

SGI & PPI configuration registers (Redistributor)
• GICR_ISENABLER0
• GICR_ICFGR0
• GICR_IPRIORITYR<n>
• GICR_IGROUPR0
• GICR_IGRPMODR0

LPI configuration

LPI configuration and status are held in tables in memory, not registers

GICR_PROPBASER specifies base address and size of configuration table
• LPI configuration is global
• Typically, one table shared by all Redistributors

GICR_PENDBASER specifies base address and size of pending status table
• One table per Redistributor

Software must:
• Allocate memory for tables and set GICR_PROPBASER and GICR_PENDBASER registers to point at allocated memory
• Initialize contents of the Configuration table
• Initialize contents of the Pending table (zero memory)
• Enable Redistributor(s) by setting GICR_CTLR.EnableLPIs

[Figure: two Redistributors. The GICR_PROPBASER of each points at one shared Configuration Table; the GICR_PENDBASER of each points at that Redistributor's own Pending Table.]

ITS

LPIs are message signalled and usually sent via an Interrupt Translation Service (or ITS)

How is an interrupt translated?
• Peripheral sends interrupt as a message to the ITS
‐ The message specifies the DeviceID (which peripheral) and an EventID (which interrupt from that peripheral)
• ITS uses the DeviceID to index into the Device Table
‐ Returns pointer to a peripheral specific Interrupt Translation Table
• ITS uses the EventID to index into the Interrupt Translation Table
‐ Returns the INTID and Collection ID
• ITS uses the Collection ID to index into the Collection Table
‐ Returns the target Redistributor
• ITS forwards interrupt to Redistributor

[Figure: a peripheral sends an interrupt as a message to the ITS. The ITS consults the Device Table, the Interrupt Translation Tables and the Collection Table, then forwards the interrupt to one of several Redistributors.]

ITS configuration
The ITS is controlled by a command queue (circular buffer) in memory
• Software maps/remaps interrupts by adding commands to the queue
• Software does not modify the ITS tables directly

Example:
• A timer has DeviceID 5 and sends EventID 0
• We decide to map the interrupt to INTID 8725 and deliver it to Redistributor 6
• The ITT allocated for the timer is at address 0x84500000
• We decide to use collection number 3

The command sequence we need is:
    MAPD  5, 0x84500000, 2    Map DeviceID 5 to an Interrupt Translation Table, specifying 2-bit EventID width
    MAPTI 5, 0, 8725, 3       Map EventID 0 to INTID 8725 and collection 3
    MAPC  3, 6                Map collection 3 to Redistributor 6
    SYNC  0x78400000          Synchronize changes on the Redistributor
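
To make the effect of these commands concrete, the following C sketch models the three ITS tables as small arrays and replays the mapping. It is a software model only: the type and function names are hypothetical, table sizes are arbitrary, and a real ITS is driven by writing encoded commands to its command queue rather than by touching tables directly.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_DEVICES     8
#define MAX_EVENTS      4
#define MAX_COLLECTIONS 8

typedef struct { int valid; uint32_t intid; uint32_t collection; } itt_entry_t;
typedef struct { int valid; itt_entry_t itt[MAX_EVENTS]; } device_entry_t;
typedef struct { int valid; uint32_t redistributor; } collection_entry_t;

typedef struct {
    device_entry_t     devices[MAX_DEVICES];         /* Device Table */
    collection_entry_t collections[MAX_COLLECTIONS]; /* Collection Table */
} its_model_t;

/* Model of MAPD: make the DeviceID's translation table valid. */
static void its_mapd(its_model_t *its, uint32_t dev)
{
    if (dev < MAX_DEVICES)
        its->devices[dev].valid = 1;
}

/* Model of MAPTI: EventID -> (INTID, collection). */
static void its_mapti(its_model_t *its, uint32_t dev, uint32_t ev,
                      uint32_t intid, uint32_t col)
{
    if (dev < MAX_DEVICES && ev < MAX_EVENTS) {
        itt_entry_t e = { 1, intid, col };
        its->devices[dev].itt[ev] = e;
    }
}

/* Model of MAPC: collection -> target Redistributor. */
static void its_mapc(its_model_t *its, uint32_t col, uint32_t rd)
{
    if (col < MAX_COLLECTIONS) {
        its->collections[col].valid = 1;
        its->collections[col].redistributor = rd;
    }
}

/* Walk Device Table -> ITT -> Collection Table; returns 0 on success. */
static int its_translate(const its_model_t *its, uint32_t dev, uint32_t ev,
                         uint32_t *intid, uint32_t *rd)
{
    if (dev >= MAX_DEVICES || ev >= MAX_EVENTS || !its->devices[dev].valid)
        return -1;
    const itt_entry_t *e = &its->devices[dev].itt[ev];
    if (!e->valid || e->collection >= MAX_COLLECTIONS ||
        !its->collections[e->collection].valid)
        return -1;
    *intid = e->intid;
    *rd = its->collections[e->collection].redistributor;
    return 0;
}
```

Replaying the example (DeviceID 5, EventID 0, INTID 8725, collection 3, Redistributor 6) and then translating (5, 0) yields INTID 8725 routed to Redistributor 6.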


Agenda
• Development of the GIC architecture (a brief history)

• Overview of the operation of a GIC

• Programming the GIC

• Initializing the GIC

• Configuring interrupts

• Interrupt handling


Acknowledging interrupts

On taking an interrupt, software must read one of the Interrupt Acknowledge registers
• This returns the INTID of the interrupt and updates the state machine

There are different registers for Group 0 and Group 1 interrupts
• ICC_IAR0_EL1 – for Group 0 interrupts
• ICC_IAR1_EL1 – for Group 1 interrupts

When an interrupt is acknowledged, the Running Priority of the CPU interface takes on the priority of the interrupt
• Current value reported by ICC_RPR_EL1


Reserved INTIDs
A read of the Acknowledge register might return one of the reserved INTIDs:

1020
• When in EL3, the highest priority interrupt targets Secure Group 1
‐ Exception taken to EL3, signalled interrupt targets Secure OS (S.EL1)

1021
• When in EL3, the highest priority interrupt targets Non-secure Group 1
‐ Exception taken to EL3, signalled interrupt targets Non-secure OS or Hypervisor (NS.EL1 or EL2)

1022
• Used in legacy operation only

1023
• There is no pending interrupt (after priority masking) that is targeting this core
• Or when in Non-secure state, the highest priority pending interrupt is Secure
• Or when in Secure EL1, the highest priority pending interrupt is Group 0
‐ Currently executing in the secure kernel (S.EL1), interrupt targets the Monitor (EL3)
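
A handler typically checks for these reserved values before dispatching. The C sketch below classifies a value that has already been read from an Acknowledge register; the enum and function names are hypothetical and the actual register read (an MRS of ICC_IAR0_EL1 or ICC_IAR1_EL1) is deliberately left out so the logic can be exercised on any host.

```c
#include <stdint.h>

typedef enum {
    INTID_NORMAL,            /* a real interrupt: dispatch to its handler */
    INTID_SECURE_G1_AT_EL3,  /* 1020: seen at EL3, targets Secure Group 1 */
    INTID_NS_G1_AT_EL3,      /* 1021: seen at EL3, targets Non-secure Group 1 */
    INTID_LEGACY_ONLY,       /* 1022: legacy operation only */
    INTID_SPURIOUS           /* 1023: nothing to handle at this level */
} intid_class_t;

/* Classify the INTID returned by an Interrupt Acknowledge register read. */
static intid_class_t classify_intid(uint32_t intid)
{
    switch (intid) {
    case 1020: return INTID_SECURE_G1_AT_EL3;
    case 1021: return INTID_NS_G1_AT_EL3;
    case 1022: return INTID_LEGACY_ONLY;
    case 1023: return INTID_SPURIOUS;
    default:   return INTID_NORMAL;
    }
}
```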


At the end of your interrupt handler

When the interrupt has been handled, software must perform:

• Priority drop
‐ The Running Priority of the CPU Interface returns to the value it had before acknowledging the interrupt
• Deactivation
‐ Moves the state machine of the INTID, typically from Active to Inactive

Whether these tasks are performed as a single operation or separately depends on the EOIMode bits
• EOIMode == 0
‐ Writing to ICC_EOIRn_EL1 performs both priority drop and deactivation
• EOIMode == 1
‐ Writing to ICC_EOIRn_EL1 only performs priority drop
‐ Writing to ICC_DIR_EL1 performs deactivation
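
The difference between the two EOIMode settings can be shown with a tiny state model in C. This is an illustration only: the struct and function names are invented for the sketch, a single interrupt is tracked, and a write to ICC_DIR_EL1 while EOIMode == 0 is simply ignored here, which is a simplification of the architected behavior.

```c
#include <stdbool.h>

typedef struct {
    bool eoimode;           /* false: EOIMode == 0, true: EOIMode == 1 */
    bool priority_dropped;  /* running priority restored? */
    bool active;            /* INTID still in the Active state? */
} irq_model_t;

/* Reading the IAR: interrupt becomes Active, running priority is raised. */
static void model_acknowledge(irq_model_t *m)
{
    m->active = true;
    m->priority_dropped = false;
}

/* Writing ICC_EOIRn_EL1: always drops priority; deactivates only if EOIMode == 0. */
static void model_write_eoir(irq_model_t *m)
{
    m->priority_dropped = true;
    if (!m->eoimode)
        m->active = false;
}

/* Writing ICC_DIR_EL1: deactivates when EOIMode == 1 (ignored otherwise in this model). */
static void model_write_dir(irq_model_t *m)
{
    if (m->eoimode)
        m->active = false;
}
```

Splitting the two steps is mainly useful when the code that drops priority is not the code that finishes handling the interrupt, for example a hypervisor forwarding an interrupt to a guest.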

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

Appendix

SBSA usage of PPIs
GICv3 and GICv4

PPI INTIDs
INTID   Usage
20      PMBIRQ (Statistical Profiling interrupt)
21      PMU
22      Debug Communication Channel
23      PMU
24      CTI (Cross Trigger Interface) interrupt
25      GIC Maintenance interrupt
26      EL2 physical timer
27      EL1 virtual timer
28      EL2 virtual timer (Armv8.1)
29      Secure physical timer
30      Non-secure physical timer


Appendix

SGI changes
GICv3 and GICv4

Differences between GICv2 & GICv3


GICv3 changes the way SGIs are handled
• GICv2: interrupts are banked by source AND destination
• GICv3: interrupts are banked by destination only

Example: CPUs 0 and 1 simultaneously send SGI ID 5 to CPU 2

[Figure: CPU0 and CPU1 each send SGI ID5 to CPU2]

GICv2: CPU 2 will see TWO interrupts, both with INTID 5
• Value returned from the IARs will have the sender’s ID prefixed to the INTID

GICv3: CPU 2 will see ONE interrupt
• Value returned from the IAR will just have the INTID
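
For the GICv2 case, separating the sender's ID from the INTID is a matter of masking. The C sketch below assumes the GICv2 GICC_IAR layout of interrupt ID in bits [9:0] and source CPU ID in bits [12:10]; the function names are hypothetical and the value is passed in rather than read from hardware.

```c
#include <stdint.h>

/* Interrupt ID field of a GICv2 GICC_IAR value (assumed bits [9:0]). */
static uint32_t gicv2_iar_intid(uint32_t iar)
{
    return iar & 0x3FFu;
}

/* For SGIs, the requesting CPU interface number (assumed bits [12:10]). */
static uint32_t gicv2_iar_source_cpu(uint32_t iar)
{
    return (iar >> 10) & 0x7u;
}
```

In the example above, CPU 2 would read two values that decode to INTID 5 with source CPU 0 and INTID 5 with source CPU 1. Under GICv3 there is no source field to decode.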


Appendix

Legacy Mode
GICv3 and GICv4

Legacy operation
GICv3 supports legacy operation
• Backwards compatible with GICv2
• Support for legacy operation is OPTIONAL
‐ Not supported by Cortex-R52, Cortex-A75 or Cortex-A55, or by GIC-600

Which programmers’ model is used (GICv3 or legacy) is controlled by:
• GICD_CTLR.ARE_NS/S bits enable the affinity routing of interrupts
‐ 0 = GICv2 style routing of interrupts (reset value, if legacy operation supported)
‐ 1 = New GICv3 affinity routing of interrupts
• ICC_SRE_ELn.SRE bits enable the system register interface
‐ 0 = Memory mapped interface (reset value, if legacy operation supported)
‐ 1 = System register interface


Legacy operation: Supported combinations


If the GIC supports legacy operation, only specific combinations are supported:

Configuration              World        GICD_CTLR   ICC_SRE_EL1.SRE   ICC_SRE_EL2.SRE   ICC_SRE_EL3.SRE
All GICv3                  Non-secure   ARE_NS=1    X                 1                 1
                           Secure       ARE_S=1     1
Legacy S.EL1 (Secure OS)   Non-secure   ARE_NS=1    X                 1                 1
                           Secure       ARE_S=0     0
All legacy                 Non-secure   ARE_NS=0    0                 0                 0
                           Secure       ARE_S=0     0

Appendix

Virtualization
GICv3 and GICv4

Virtual Interrupts (GICv3)


The interrupt controller can signal physical interrupts and virtual interrupts
• Virtual interrupts can only be taken by the core in Non-secure EL1 and EL0

Interrupt controller provides multiple CPU interfaces
• Physical CPU interface – Used by Hypervisor and Secure world for handling physical interrupts
• Virtual CPU interface – Used by virtual machine for handling virtual interrupts
• Virtual interface control – Used by Hypervisor to generate virtual interrupts and for task switching

[Figure: GIC containing Distributor, Redistributor, Physical CPU interface (IRQ, FIQ) and Virtual CPU interface (VIRQ, VFIQ), all signalling the CPU]

Virtual interrupt signalling (GIC)

1. External IRQ arrives at the GIC
2. GIC signals a PHYSICAL IRQ to the CPU
3. CPU moves to EL2, Hypervisor reads the interrupt status from the Physical CPU Interface
‐ Routing shown for this step: SCR_EL3.IRQ = 0, HCR_EL2.IMO = 1
4. Hypervisor writes to the GIC List Register to register a VIRTUAL IRQ → pending state
5. GIC asserts vIRQ signal to CPU
6. EL2 returns to EL0 or EL1, back to virtual machine
7. CPU takes a virtual IRQ exception, and Guest OS running on the virtual machine interacts with the Virtual CPU interface

[Figure: External Interrupt Source → Distributor → Physical CPU interface (Physical IRQ, handled by the Hypervisor) and Virtual CPU interface (vIRQ, handled by the Guest OS)]
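
Step 4, registering a pending virtual interrupt in a List Register, comes down to composing a 64-bit value. The C sketch below builds such a value for a purely software-generated virtual interrupt (no link to a physical INTID). The field positions used (State in bits [63:62], Group in bit [60], Priority in bits [55:48], vINTID in bits [31:0]) are stated as assumptions about ICH_LR<n>_EL2 and should be checked against the GIC specification; the function name is hypothetical.

```c
#include <stdint.h>

#define LR_STATE_PENDING 0x1ull  /* assumed encoding: 0b01 = Pending */

/* Compose a List Register value for a pending virtual interrupt.
 * The HW bit and pINTID field are left as zero in this sketch. */
static uint64_t ich_lr_encode(uint32_t vintid, uint8_t priority, int group1)
{
    return (LR_STATE_PENDING << 62) |
           ((uint64_t)(group1 ? 1 : 0) << 60) |
           ((uint64_t)priority << 48) |
           (uint64_t)vintid;
}
```

A hypervisor would write the result to a free List Register at EL2 before returning to the guest, at which point the virtual CPU interface can assert vIRQ.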


Virtualization in GICv4
GICv4 adds support for direct injection of virtual interrupts
• In some instances, removes the need to enter the Hypervisor
• Requires an ITS, and only supported for LPIs

Hypervisor tells the ITS in advance about mappings between virtual and physical interrupts
• Mapping includes:
‐ EventID/DeviceID of physical interrupt
‐ Virtual INTID
‐ Which virtual CPU the virtual interrupt belongs to
‐ Which physical CPU the virtual CPU is expected to be running on

If the virtual CPU is running when the interrupt occurs, the hardware generates a virtual interrupt
• If not, a physical doorbell interrupt is sent instead

List registers still available
• Needed to virtualize interrupts that do not come through the ITS (e.g. SPIs)


GICv4 example (1)


Hypervisor issues ITS commands to map interrupts
• VMAPI and VMAPTI are used to map EventID/DeviceID to virtual INTID and virtual CPU
‐ Can optionally specify a physical doorbell interrupt
• VMAPP is used to map a virtual core to a physical core

[Figure: Hypervisor issues VMAPP and VMAPTI to the ITS (Translation Tables, Virtual PE table); Redistributor holds GICR_VPENDBASER and GICR_VPROPBASER; Virtual CPU Interface serves the Guest OS, Physical CPU Interface serves the Hypervisor]


GICv4 example (2)


When an interrupt occurs, the ITS uses EventID/DeviceID to retrieve the translation
• Returns virtual INTID and virtual PE
• Physical INTID of doorbell interrupt (if applicable)
• Redistributor where the virtual PE might be scheduled

[Figure: message from peripheral arrives at the ITS (Translation Tables, Virtual PE table), which forwards to the Redistributor (GICR_VPENDBASER, GICR_VPROPBASER)]


GICv4 example (3)


Redistributor checks whether the target virtual PE is currently scheduled
• Checks the value of GICR_VPENDBASER, which identifies the per-virtual PE pending table

If check passes, virtual interrupt forwarded to CPU interface
• Virtual CPU Interface raises a virtual interrupt (subject to enables/priority checks)

[Figure: Redistributor forwards the interrupt to the Virtual CPU Interface, which raises a Virtual IRQ to the Guest OS]


GICv4 example (4)


If Virtual PE not currently scheduled:
• Virtual interrupt recorded as pending in Virtual Pending Table
• Optionally, a physical doorbell interrupt forwarded, to trigger the Hypervisor to re-schedule

[Figure: Redistributor forwards a Physical IRQ (doorbell) through the Physical CPU Interface to the Hypervisor]

Booting
Neoverse and DynamIQ CPUs

What is booting?
The processor environment must be initialized before an OS kernel or bare metal app can be run

In simple systems this includes:
• Configuring system registers to define the processor context
• Configuring peripherals

One must consider that resources can be:
• Per-core (e.g. MMU)
• Per-processor (e.g. L3 cache)
• Global (e.g. Interconnect)

Must also consider power management
• Architecture defines a single reset vector used for both power-on-reset (cold boot) and low-power recovery (warm boot)
• Boot code must determine the type of reset and can skip some steps if recovering from a low-power mode
‐ For example: interrupt controller setup


Agenda
• Booting an Arm Neoverse or DynamIQ processor in AArch64

• Booting multi-core and multi-processor systems

• Real-world booting

Processor state at cold reset


Core comes out of reset in EL3 (the highest implemented exception level)

Start address and execution state are IMPLEMENTATION DEFINED
• AA64nAA32[y] signals determine EL3 execution state
• RVBARADDR[y][] signals determine AArch64 start address
• VINITHI[y] signals determine AArch32 vector base address

[Figure: the signals above are inputs to the Arm DSU, one set per core [y]]

Architecture guarantees bare minimum known reset state
• All asynchronous interrupts are masked
• Most system registers, special purpose registers, and general purpose registers have UNKNOWN reset values
• The MMU and caches are guaranteed to be disabled for the highest implemented exception level
• The caches are in an architecturally UNKNOWN state


Processor state at warm reset


The Reset Management Register (RMR_ELx) can be used to request a warm reset
• Only implemented for EL3 (the highest exception level)
• Set the AA64 bit to determine the processor’s execution state after the warm reset
• Set the RR bit to request the warm reset

Bits      Field
[31:2]    RES0
[1]       RR
[0]       AA64

The start address is IMPLEMENTATION DEFINED
• The DSU cluster’s RVBARADDR signals determine the AArch64 start address for both cold and warm resets

RMR_ELx is architecturally mapped to the AArch32 RMR register
• Allows for use of an existing AArch32 bootloader that warm resets into AArch64 EL3
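
The value written to RMR_ELx follows directly from the bit layout on this slide. The C sketch below only computes that value; the macro and function names are hypothetical, and the actual write (an MSR to RMR_EL3 followed by the implementation's required synchronization and wait sequence) is not shown.

```c
#include <stdint.h>

#define RMR_AA64 (1u << 0)  /* bit [0]: execution state after the warm reset */
#define RMR_RR   (1u << 1)  /* bit [1]: request the warm reset */

/* Value to write to RMR_ELx to request a warm reset into the chosen state. */
static uint32_t rmr_warm_reset_request(int to_aarch64)
{
    return RMR_RR | (to_aarch64 ? RMR_AA64 : 0u);
}
```

An AArch32 bootloader handing over to AArch64 EL3 would use the `to_aarch64 = 1` form.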


Moving to lower exception levels


Execution state of lower exception levels is UNKNOWN at reset
• Must be configured before use

System registers for lower exception levels have UNKNOWN reset values
• Must initialize to safe values before leaving higher exception level

Can drop to any lower exception level without incrementally moving down
• Target exception level and all levels above must have their execution state configured before moving down

[Figure: Reset → EL3 (MMU disabled, SCTLR_EL3.M) → ERET → EL2 (MMU enabled? SCTLR_EL2.M) → ERET → EL1 (MMU enabled? SCTLR_EL1.M); SCR_EL3.RW answers "EL2 AArch64?", HCR_EL2.RW answers "EL1 AArch64?"]


Moving to lower exception levels


To move down from ELx to ELy for the first time:

1. Initialize lower level’s System Control Register (SCTLR_ELy) to a known safe value
‐ Clear M, C, and I bits, ensure any RES1 bits are set to 1, ensure any RES0 bits are cleared to 0

2. Configure lower level execution state
‐ SCR_EL3.NS controls security state of EL1
‐ SCR_EL3.RW controls execution state of EL2 and Secure EL1
‐ HCR_EL2.RW controls execution state of Non-secure EL1

3. Configure the Saved Program Status Register (SPSR_ELx) for the current exception level
‐ See next slide

4. Set the current exception level’s Exception Link Register (ELR_ELx) to desired entry point

5. Perform an exception return (ERET) to move to the lower exception level


Saved Program Status Register


SPSR_ELx holds the saved process state when an exception is taken to ELx

Bits      Field
[31:28]   NZCV
[27:22]   RES0
[21]      SS
[20]      IL
[19:10]   RES0
[9:6]     DAIF
[5]       RES0
[4:0]     M

Example: “AArch64 EL2 with SP_EL2, all asynchronous exceptions masked”

1. Clear NZCV flags
2. Clear SS (Software Step) and IL (Illegal Execution State) bits
3. Set DAIF to 0b1111 to mask all asynchronous exceptions
4. Clear M[4] for “AArch64”
5. Set M[3:1] to 0b100 for “EL2”
6. Set M[0] to 0b1 for “SP_ELx (SP_EL2)”
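
The six steps above can be checked by composing the value in C. This sketch uses a hypothetical helper, `spsr_build`, that covers only the AArch64 case (M[4] = 0) and leaves NZCV, SS and IL clear; it exists to show where each field lands, not to replace the assembly.

```c
#include <stdint.h>

/* Build an SPSR_ELx value for a return to AArch64.
 * el:          target exception level (0..3), placed in M[3:2]; M[1] stays 0
 * use_sp_elx:  1 selects SP_ELx (M[0] = 1), 0 selects SP_EL0
 * daif:        4-bit mask written to bits [9:6] */
static uint32_t spsr_build(unsigned el, int use_sp_elx, unsigned daif)
{
    uint32_t m = ((el & 0x3u) << 2) | (use_sp_elx ? 1u : 0u);
    return ((daif & 0xFu) << 6) | m;
}
```

For the slide's example, `spsr_build(2, 1, 0xF)` gives 0x3C9: DAIF contributes 0x3C0 and M contributes 0b01001.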


Example: EL3 to EL2

    LDR   X0, =0x30C50830
    MSR   SCTLR_EL2, X0               ; Initialize SCTLR_EL2 to safe value

    LDR   X0, =0x431                  ; Non-secure state with EL2 as AArch64
    MSR   SCR_EL3, X0                 ; SCR_EL3.{RW,NS} = {1,1}

    LDR   X0, =0x3C9                  ; “AArch64 EL2 with SP_EL2, mask all DAIF”
    MSR   SPSR_EL3, X0

    ADRP  X0, el2_entry               ; Set EL2 entry point address
    ADD   X0, X0, :lo12:el2_entry     ; (Can be skipped if el2_entry is 4KB aligned)
    MSR   ELR_EL3, X0

    ERET                              ; Perform exception return to drop to EL2
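
The constant 0x431 written to SCR_EL3 can be decomposed the same way. In the C sketch below the macro names are hypothetical; the bit positions assumed are NS at bit [0], RES1 at bits [5:4] and RW at bit [10], which is what the value 0x431 and the comment `SCR_EL3.{RW,NS} = {1,1}` imply.

```c
#include <stdint.h>

#define SCR_NS   (1u << 0)   /* lower levels are Non-secure */
#define SCR_RES1 (3u << 4)   /* bits [5:4], written as 1 */
#define SCR_RW   (1u << 10)  /* next lower level is AArch64 */

/* Compose an SCR_EL3 value with all other controls left clear. */
static uint32_t scr_el3_value(int non_secure, int lower_aarch64)
{
    return SCR_RES1 |
           (non_secure ? SCR_NS : 0u) |
           (lower_aarch64 ? SCR_RW : 0u);
}
```

`scr_el3_value(1, 1)` reproduces the 0x431 used in the example.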


What does boot code need to handle?


A bare metal application will typically need to do the following:
• Install vector table
• Unmask SError interrupts
• Set stack pointer
• Perform platform-specific initialization
‐ Initialize DDR controller
‐ Initialize PLLs
‐ Configure UARTs
‐ Configure other peripherals
• CPU-specific power-up sequences
• Invalidate TLB and data/unified caches
• Enable MMU
• Configure interrupt controller
• Enable floating point / SIMD registers
• Initialize C library, which will branch to main()
• Enable desired interrupt IDs
• Unmask IRQs and FIQs


CPU-specific power-up sequences


Different implementations can require additional steps to be taken when powering up

These sequences may involve configuring IMPLEMENTATION DEFINED registers
• Take care with legacy code: DynamIQ and Neoverse processors are the first which may contain a heterogeneous set of cores on the same cluster

Details can be found in the relevant implementation’s documentation:
• Technical Reference Manual (TRM)
• Errata publications

Enabling floating point / SIMD


Access to floating point / SIMD registers is implicitly enabled by disabling appropriate trap bits

    MSR   CPTR_EL3, XZR        ; Disable FP/SIMD access traps to EL3

    LDR   X0, =0x33FF
    MSR   CPTR_EL2, X0         ; Disable FP/SIMD access traps to EL2

    LDR   X0, =(0b11 << 20)
    MSR   CPACR_EL1, X0        ; Disable FP/SIMD access traps to EL1
    ISB                        ; Barrier to synchronize processor context

In legacy AArch32 systems, floating point / SIMD register access was explicitly enabled via FPEXC
• FPEXC is architecturally mapped to FPEXC32_EL2 to allow a hypervisor to configure access

    LDR   X0, =(0b1 << 30)
    MSR   FPEXC32_EL2, X0
    ISB                        ; Barrier to synchronize processor context


Enabling the MMU and caches (1)


The caches are in an architecturally UNKNOWN state at reset
• Possible for garbage data in the caches to be treated as valid

Architecturally we need to invalidate the caches before enabling them

All Neoverse or DynamIQ implementations will invalidate:
• All levels of implemented cache (L1 / L2 / L3)
• Translation Lookaside Buffers
• Micro-architectural features (table walk buffers, snoop filters etc.)

Enabling the MMU and caches (2)


    MSR   TTBR0_EL3, X0        ; Setup translation regime
    MSR   MAIR_EL3, X1
    MSR   TCR_EL3, X2
    ISB

    LDR   X0, =0x30C51835      ; Enable MMU and caches
    MSR   SCTLR_EL3, X0        ; SCTLR_EL3.{M,C,I} = {1,1,1}
    ISB                        ; Barrier to synchronize processor context

What is the problem with this code?


Enabling the MMU and caches (3)


There is a race condition between the ISB instruction and MMU actually being enabled

Normally an ISB is enough to synchronize processor context, but not when enabling the MMU
• The ISB only guarantees that the write to SCTLR_ELn has completed by that point
• Architecturally the write to SCTLR_ELn could complete immediately – If so, where will the next instruction come from?
‐ Original flat-mapped VA-PA mapping of PC, or new VA-PA mapping of PC?

The code enabling the MMU must exist in a valid address space before and after the MMU is enabled
• Arm recommends simply flat-mapping the page containing the write to SCTLR_ELn and the following ISB

Not flat-mapped:
VA       PA         Original Mapping     New Mapping
0x8014   0x008014   MSR SCTLR_EL3, X0    ???
0x8018   0xFEC018   ISB                  ???
0x801C   0xFEC01C   ???                  B __main

If the write to SCTLR_EL3 completes before reaching the ISB, the processor may execute the instruction at the new VA-PA mapping of PC, which might not be an ISB

Flat-mapped:
VA       PA         Original Mapping     New Mapping
0x8014   0x8014     MSR SCTLR_EL3, X0    MSR SCTLR_EL3, X0
0x8018   0x8018     ISB                  ISB
0x801C   0x801C     B __main             B __main

Flat-mapping the region of memory containing the instruction sequence guarantees that the ISB is the next instruction to be executed
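
Two facts from these slides can be checked on a host with a few lines of C: the SCTLR_EL3 value used to enable the MMU and caches, and whether a set of VA/PA pairs is flat-mapped. The names below are hypothetical, the bit positions assumed are M at bit [0], C at bit [2] and I at bit [12], and 0x30C50830 is taken from the earlier example as the "safe" baseline with only RES1 bits set.

```c
#include <stdint.h>
#include <stddef.h>

#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data/unified cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Add the MMU and cache enables to a baseline SCTLR value. */
static uint32_t sctlr_enable_mmu_caches(uint32_t base)
{
    return base | SCTLR_M | SCTLR_C | SCTLR_I;
}

typedef struct { uint64_t va, pa; } mapping_t;

/* True if every listed instruction address maps to itself (VA == PA). */
static int all_flat_mapped(const mapping_t *m, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (m[i].va != m[i].pa)
            return 0;
    return 1;
}
```

Applied to the two tables above, the first set of mappings fails the check at VA 0x8018 and the second passes. A boot-time assertion of this kind on the page holding the MSR and ISB is one way to catch a bad translation table before the MMU is turned on.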

Additional considerations
Some parts of EL2 must be configured even if you do not have a hypervisor
• VMPIDR_EL2 controls the value seen by Non-secure EL1 reads of MPIDR_EL1
• VPIDR_EL2 controls the value seen by Non-secure EL1 reads of MIDR_EL1
• VTTBR_EL2 holds the Virtual Machine ID (VMID) for Non-secure EL1 memory accesses and maintenance operations

Some peripherals may need to be at least partially configured while still in the secure world
• Example: All GIC interrupts are secure on reset; secure world must configure appropriate interrupts as non-secure

Appendix slide lists other key registers that one must be aware of


Agenda
• Booting an Arm Neoverse or DynamIQ processor in AArch64

• Booting multi-core and multi-processor systems

• Real-world booting

Multi-core processors
Boot code must handle both per-core and processor-wide resources

In SMP systems, one core is designated the “primary” core and is responsible for initializing global resources, such as system peripherals, and booting an OS kernel

Secondary cores are held in reset or placed in a holding pen until the primary core wakes them

[Figure: Primary CPU runs Primary Boot, then Wake Secondaries; each Secondary CPU runs Secondary Boot; all cores then enter the SMP OS Kernel]


Multi-processor systems
Many systems now include multiple clusters of multi-core processors

Resources can now be classified as:
• Per-core (e.g. MMU)
• Per-cluster (e.g. L3 cache)
• Global (e.g. Interconnect)

Typically only a single core in a single cluster is released from reset on power-up
• Usually core0 as reported by MPIDR_EL1

Additional setup to consider:
• Configure cache coherent interconnect
• Configure system memory controllers
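
Boot code usually decides between the primary and secondary paths by inspecting the affinity fields of MPIDR_EL1. The C sketch below works on a value already read from the register. It assumes the architectural layout of Aff0 in bits [7:0], Aff1 in bits [15:8], Aff2 in bits [23:16] and Aff3 in bits [39:32], and it assumes that "primary" means all affinity fields are zero, which is a common convention rather than a requirement; the function names are hypothetical.

```c
#include <stdint.h>

/* Extract affinity level 0..3 from an MPIDR_EL1 value. */
static unsigned mpidr_aff(uint64_t mpidr, unsigned level)
{
    unsigned shift = (level == 3) ? 32u : level * 8u;
    return (unsigned)((mpidr >> shift) & 0xFFu);
}

/* Assumed convention: the primary core has every affinity field equal to 0. */
static int is_primary_core(uint64_t mpidr)
{
    return (mpidr & 0xFF00FFFFFFull) == 0;
}
```

A secondary core for which `is_primary_core` returns 0 would branch to its holding pen and wait to be woken. Which affinity level carries the core number depends on the implementation, so the TRM should be consulted before hard-coding a field.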


Agenda
• Booting an Arm Neoverse or DynamIQ processor in AArch64

• Booting multi-core and multi-processor systems

• Real-world booting

Simple boot sequences


Bare metal applications typically have a simple built-in boot sequence
• Can be built into a single standalone binary image

[Figure: Reset Handler → C Library Initialization → main()]

Systems running an OS will typically have multiple standalone boot stages

[Figure: Bootstrap (ROM or Flash) → Bootloader (UEFI / U-Boot) → OS Kernel (Linux) → Apps]

Booting complex systems (1)


Complex system power management can require an entire separate microcontroller in the SoC
• System Control Processor (SCP)
• Requires its own boot and runtime firmware images
• Example: The Juno’s on-SoC Arm Cortex-M3

Boot process can involve both generic and heavily platform-specific code

Security considerations
• Trusted boot?
• Trusted services?

Most real-world systems must also support runtime firmware services
• Core hotplug
• Subsystem deep idle


Booting complex systems (2)


The open source community has slowly been pushing platform-specific code out of the Linux kernel
• Responsibility is being shifted back up to the bootstrap and bootloader stages

[Figure: Bootstrap (ROM or Flash) at EL3 → Bootloader (UEFI / U-Boot) at EL2 → OS Kernel (Linux) at EL2 / EL1 → Apps at EL0]

The bootloader stage runs in the normal world
• No trusted boot
• No trusted services
• No runtime firmware services for systems that implement EL3 (all Armv8-A Cortex-A processors)
‐ Runtime firmware services need to run at the highest implemented exception level

Need a solution that replaces the traditionally simple bootstrap stage


Arm Trusted Firmware-A


A reference implementation of Secure world software for booting Armv8-A systems
• Designed for reuse and porting to other Armv8-A models and hardware platforms

Implements a number of standards defined by Arm:
• Power State Coordination Interface (PSCI)
• Trusted Board Booting Requirements (TBBR)
• Secure Monitor Interface (SMC)
• Document references provided in Appendix slide

BSD-licensed open source project hosted on GitHub:
• https://github.com/ARM-software/arm-trusted-firmware

Currently targets:
• Foundation/Base Fixed Virtual Platform (FVP) models
• N1-SDP


Bootloader stages

[Figure: boot flow BL1 → BL2 → BL3-1 (Trusted Firmware-A) → OS Loader (e.g. UEFI, U-Boot; ‘BL3-3’ but not part of TF) → Rich OS Kernel (e.g. Linux). Memory labels: Trusted ROM, Trusted SRAM, DDR. Stage labels: Cold / Warm Boot Detect, Trusted Bootstrap, Platform Initialization, Trusted Bootloader, Resident Runtime]

Trusted Firmware defines a number of bootloader stages with different responsibilities

BL1 establishes a chain of trust for Trusted boot

BL2:
• Uses the chain of trust to authenticate other bootloader stage images
• Performs critical platform-specific initialization

BL3-1:
• Acts as runtime Secure Monitor (SMC) interface for runtime firmware services running at EL3
• Handles world switching to a Trusted OS running at Secure-EL1, which in turn provides Trusted services


Trusted Firmware-A architecture

[Figure legend: Trusted Firmware-A, SoC specific firmware]

Level   Normal World                     Secure World
EL0     Applications                     Trusted applications / services
EL1     Rich OS Kernel, Rich OS Kernel   Trusted OS Kernel
EL2     Hypervisor                       No EL2 in Secure world
EL3     SMC Dispatcher, Secure EL1 Payload Dispatch, PSCI Core Interface, Arm System IP Library, World Switch Library, PSCI Platform, SoC SMC Calls


Using Trusted Firmware-A


It is expected that a number of partners will port the Trusted Firmware to their own Armv8-A platforms

The project documentation includes:
• A User Guide, which details how to build and use the project
• A Firmware Design Document, which outlines the architecture, boot flow, and functionality of the project
• A Porting Guide, which describes all steps required for porting the project to a new platform

Arm will continue to develop the Trusted Firmware-A project in collaboration with interested parties in order to provide a full reference implementation of PSCI, TBBR, and Secure Monitor code

This will benefit all developers working with Armv8-A TrustZone technology


The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM
Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured
may be trademarks of their respective owners.


Appendix


Key registers
This is not an exhaustive list!
• Consult the Armv8-A Architecture Reference Manual
• Consider using or porting the Trusted Firmware-A
  ‐ Trusted Firmware-A has been specifically designed to be ported to other Armv8-A systems

Register          Description                              Example value                             When to set
SCTLR_ELx         System controls                          Set RES1, clear rest                      Before entering ELx
VBAR_ELx          Vector table base address                Vector table base address                 At entry to ELx
GICD_IGROUPR<n>   Set each IRQ’s Group (0/1)               Make NS-EL1 IRQs Group 1                  In the Secure world
GICC_PMR          Priority mask                            Value in range [0x80 .. 0xFF]             In the Secure world
SCR_EL3           Secure Monitor controls                  Set RW, NS, RES1, clear rest              Before leaving EL3
HCR_EL2           Hypervisor controls                      Set RW, clear rest                        Before entering EL1
VMPIDR_EL2        Value returned by EL1 reads of MPIDR_EL1 MPIDR_EL1 value as read at EL2/EL3        Before entering EL1
VPIDR_EL2         Value returned by EL1 reads of MIDR_EL1  MIDR_EL1 value as read at EL2/EL3         Before entering EL1
VTTBR_EL2         Holds EL1 VMID                           Clear all                                 Before entering EL1
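
As a worked example of the "Example value" column, the sketch below builds the SCR_EL3 and HCR_EL2 values in portable C. The macro and function names are illustrative assumptions, and the bit positions (SCR_EL3.NS at bit 0, RES1 at bits [5:4], SCR_EL3.RW at bit 10, HCR_EL2.RW at bit 31) should be checked against the Architecture Reference Manual for your implementation. On a real target the results would be written with `msr` from EL3 or EL2; that step is omitted here so the arithmetic can be built and checked on any host.

```c
#include <stdint.h>

/* SCR_EL3 fields used by the table row "Set RW, NS, RES1, clear rest". */
#define SCR_EL3_NS    (UINT64_C(1) << 0)   /* lower ELs are Non-secure */
#define SCR_EL3_RES1  (UINT64_C(3) << 4)   /* bits [5:4] are RES1 */
#define SCR_EL3_RW    (UINT64_C(1) << 10)  /* next lower EL runs in AArch64 */

/* HCR_EL2 field used by the table row "Set RW, clear rest". */
#define HCR_EL2_RW    (UINT64_C(1) << 31)  /* EL1 runs in AArch64 */

/* Value to program before leaving EL3 for a Non-secure AArch64 lower EL. */
static uint64_t scr_el3_boot_value(void)
{
    return SCR_EL3_RW | SCR_EL3_NS | SCR_EL3_RES1;
}

/* Value to program before entering an AArch64 EL1 with no hypervisor traps. */
static uint64_t hcr_el2_boot_value(void)
{
    return HCR_EL2_RW;
}
```

With these definitions `scr_el3_boot_value()` evaluates to 0x431 and `hcr_el2_boot_value()` to 0x80000000.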


Document references
Power State Coordination Interface (PSCI)
• Standardised SMC interface for CPU hotplug, subsystem deep idle, trusted kernel migration, etc.
• Document number: Arm DEN 0022
• http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0022c/index.html

SMC Calling Convention (SMCCC)
• Defines a common calling mechanism for use with the Secure Monitor Call (SMC) instruction
• Document number: Arm DEN 0028
• http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0028a/index.html

Trusted Board Booting Requirements (TBBR)
• Defines the steps required for booting a trusted system, and aims to aid in both system design and certification efforts
• Available by request under NDA, contact your Arm Partner Manager
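
To show how the SMC calling convention and PSCI fit together, here is a minimal C sketch that composes a fast-call function identifier. The macro and function names are hypothetical; the field layout (bit 31 fast call, bit 30 SMC64, bits [29:24] owning entity, bits [15:0] function number) and the Standard Secure Service owner value 4 reflect my reading of DEN 0028 and DEN 0022 and should be confirmed against those documents.

```c
#include <stdint.h>

#define SMCCC_FAST_CALL       (UINT32_C(1) << 31) /* fast (atomic) call */
#define SMCCC_SMC64           (UINT32_C(1) << 30) /* 64-bit calling convention */
#define SMCCC_OWNER_SHIFT     24
#define SMCCC_OWNER_MASK      UINT32_C(0x3F)
#define SMCCC_OWNER_STANDARD  UINT32_C(4)         /* Standard Secure Services, e.g. PSCI */
#define SMCCC_FUNC_MASK       UINT32_C(0xFFFF)

/* Compose the 32-bit function identifier for a fast call.
 * smc64 is non-zero for the SMC64 convention, zero for SMC32. */
static uint32_t smccc_fast_call_id(int smc64, uint32_t owner, uint32_t func)
{
    uint32_t id = SMCCC_FAST_CALL;
    if (smc64)
        id |= SMCCC_SMC64;
    id |= (owner & SMCCC_OWNER_MASK) << SMCCC_OWNER_SHIFT;
    id |= func & SMCCC_FUNC_MASK;
    return id;
}
```

For example, `smccc_fast_call_id(0, SMCCC_OWNER_STANDARD, 0)` yields 0x84000000 (PSCI_VERSION) and `smccc_fast_call_id(1, SMCCC_OWNER_STANDARD, 3)` yields 0xC4000003 (the SMC64 form of CPU_ON). The caller would place this value in the first argument register before executing `SMC`.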

Arm Glossary

The Arm Glossary is a constantly growing and evolving source of Arm
reference. It is only available online.

To access it, please visit:

https://developer.arm.com/glossary
Copyright © Arm Ltd 2023. All rights reserved
Not to be reproduced by any means without prior written consent

Arm Customer Services & Support

Arm Support Services


Providing the help you need, when you need it

We have a range of services available to help reduce risk and shorten time-to-market

Project stages: IP Selection, Architecture, RTL Implementation, Physical Implementation, Silicon Bring-up, Software Development

Services:
• Design Reviews
• Technical Support
• Documentation and FAQs
• Training

More info on arm.com/support

Arm Support Services Overview

Technical Support: Satisfaction rates are consistently high, 90%-95%
Expert Staff: Over 160 highly trained Applications Engineers
Design Reviews: >75% of reviews identified at least 1 critical issue
Arm Developer: Over 1,900 technical documents available
Training: Over 4,400 partners trained so far in 2020
Arm Community: Developer resources and discussion forums

More info on arm.com/support

Global team providing local support

Support locations: Manchester, Warwick, Munich, Cambridge*, Beijing, Tokyo, Budapest, Richardson,
Sophia Antipolis, Shanghai*, Austin*, Seoul, Shenzhen, Bangalore, Taipei, Hsinchu

*Regional Support Headquarters

Technical support
Resolve technical issues quickly and efficiently
• Keep your project on track
Experienced Arm engineers answer your questions
• Access to vast amount of Arm knowledge
Consistently high customer satisfaction (90%-95%)
• Constantly monitored through customer surveys
Worldwide support team located in the US, Europe and Asia
• 160+ Applications Engineers covering all product areas
• Global teams providing local support
Online portal to track your support enquiries
• developer.arm.com/support

Enhanced Support option includes:
• Regular contact with named Applications Engineer
• Quicker turnaround time on support cases
• Service Tokens to be redeemed on onsite support, training, and/or Design Reviews

Technical Support Overview


Resolve technical issues quickly and efficiently

Self-service
• Documents
• Downloads
• Articles
• Forums
• Evaluation products
• Online Training

Standard Support
• Unlimited number of technical support cases
• Access to the case tracking system
• Product maintenance
• Access to online documentation
• Intro to Arm training

Enhanced Support
Includes all standard support benefits plus:
• Regular contact with a named Applications Engineer
• Quicker turnaround time on support cases
• Service tokens

Arm Training Overview

Arm Training gets partner teams up to speed quickly
• Detailed training across all Arm Products
• Covers advanced technologies such as Arm DynamIQ, CryptoCell and TrustZone
Delivered by senior Arm engineers in a way that works best for you

Training areas: Processors, AMBA Protocols, Multimedia, System IP, Security, Tools

Face-to-face courses: Experienced trainers deliver on-site training at a location of your choice. Courses can be customized to meet your needs.
Virtual Classroom: Experienced trainers deliver live online training, customized to meet your needs, at a date and time that best suits your team.
Online Training: On-demand short videos with bite-sized topics. Accessible wherever you are and whenever you need them.

Arm Design Reviews


Design Reviews of the Arm subsystem:
• Catch critical design errors while they can still be fixed
• Reduce the risk of expensive re-spins or a compromised product
• More than 100 engagements so far, 75% of which identified serious issues
Implementation services to accelerate tape out while meeting PPA targets

Design Reviews reduce errors

Out of 46 RTL and Design Review reports, 30% uncovered 1-4 issues and 24% uncovered 5-9. Only
17% of the reports had no issues highlighted.

[Bar chart: reports grouped by number of issues found; categories 0, 1-4, 5-9, 10-14, 15-19, 20+, N/A]
Service tokens
A flexible way to get the help you need, when you need it

Overview
• A flexible way to plan your services budget
• Valid for 12 months
• Quarterly report includes:
  • Number of outstanding tokens
  • Expiration dates
  • History of token usage

Training
• Example: 20 tokens can cover a 4-day on-site training for up to 15 trainees

Design Reviews
• Example: 25 tokens cover an RTL Design Review

On-site support
• Example: 9 tokens give you 4 days of on-site support

Arm Developer
One location for all technical content, including:
• Technical product information
• Product documentation
• Developer resources
• Developer downloads
• Training course information and online booking

Arm Community

▪ Forum
▪ Blogs
▪ Communities:
  ▪ Android
  ▪ Arm Development Platforms
  ▪ Graphics & Multimedia
  ▪ Internet of Things
  ▪ Processors
  ▪ SoC Design
  ▪ Software Tools

Empowering the SoC Architect with Arm IP Explorer

DISCOVER: Guides non-advanced IP users towards best-fit IP for their project.
SEARCH: Displays IP efficiently; intuitively save and start working with IP.
COMPARE: Juxtaposes intra-IP details for fast and accurate analysis.
CONFIGURE: Simplifies IP parameter analysis, guides configs, and renders RTL.
SIMULATE: Reveals custom inter/intra-IP software workload performance.

PROJECTS
Saves IP & project information to organize SoC investigation activities and provide expert SoC creation guidance.

COLLABORATE
Invite colleagues, Arm FAEs and Arm Approved Design Partners to collaborate on your projects, providing input and suggestions.

Arm Approved Design Partners


Network of Trusted Design Houses

Membership Tiers
• Design Services (Non-commercial Usage)
• SoC Services (Commercial Usage)

Entitlements
• IP in the Arm Flexible Access (AFA) Mainstream Package
• Software Development Tools
• Marketing Rights to the Program

Leverages the Flexible Access Program
• Membership Agreement
• Purchase Package Order Form

Arm Approved Training Partners


Learn Arm Technologies from Accredited Training

Global network of companies endorsed and supported by Arm that offer:
• Arm Software training courses under license
• Local language versions of our software-based training courses
• Complementary training
• Public schedules
The training partners have demonstrated that they can deliver Arm architecture and technology training to a
high technical level of understanding

Need help?

Tech Support: Go to arm.com/support
Training: Go to arm.com/training
Design Reviews: Go to arm.com/design-reviews
Design Services: Go to arm.com/arm-approved

Thank You!
Danke!

Merci!
谢谢!
ありがとう!
Gracias!
Kiitos!