Chapter 3
ARM Cortex-M4 Architecture and ASM Programming
Introduction
This chapter introduces programming the Cortex-M4 in assembly and C. Preference is given to explaining code development for the STM32F4 Discovery and LPC4088 Quick Start boards. The material presented in this chapter is based on the course notes from the ARM LiB program [1].
Overview
- Cortex-M4 Memory Map
  - Cortex-M4 Memory Map
  - Bit-band Operations
  - Cortex-M4 Program Image and Endianness
- ARM Cortex-M4 Processor Instruction Set
  - ARM and Thumb Instruction Set
  - Cortex-M4 Instruction Set
1. LiB Low-level Embedded NXP LPC4088 Quick Start
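Cortex-M4 memory map (4GB address space):

Address range            Size   Region
0xE0000000-0xFFFFFFFF    512MB  Private peripherals (e.g. NVIC, SCS) at 0xE0000000-0xE00FFFFF; vendor-specific memory at 0xE0100000-0xFFFFFFFF
0xA0000000-0xDFFFFFFF    1GB    External device
0x60000000-0x9FFFFFFF    1GB    External RAM
0x40000000-0x5FFFFFFF    512MB  Peripherals
0x20000000-0x3FFFFFFF    512MB  SRAM
0x00000000-0x1FFFFFFF    512MB  Code

The private peripheral bus (PPB) region is divided into an Internal PPB (which includes the fetch patch and breakpoint unit) and an External PPB (which includes the ROM table), separated by reserved areas.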
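As a quick illustration of the map, the C sketch below classifies a 32-bit address into one of the regions listed in the table. It is a host-side helper written for this explanation; the enum and function names are hypothetical, and the boundaries are taken directly from the table.

```c
#include <stdint.h>

/* Regions of the Cortex-M4 4GB address map (hypothetical names). */
typedef enum {
    REGION_CODE,        /* 0x00000000-0x1FFFFFFF */
    REGION_SRAM,        /* 0x20000000-0x3FFFFFFF */
    REGION_PERIPHERAL,  /* 0x40000000-0x5FFFFFFF */
    REGION_EXT_RAM,     /* 0x60000000-0x9FFFFFFF */
    REGION_EXT_DEVICE,  /* 0xA0000000-0xDFFFFFFF */
    REGION_PPB,         /* 0xE0000000-0xE00FFFFF, e.g. NVIC, SCS */
    REGION_VENDOR       /* 0xE0100000-0xFFFFFFFF */
} cm4_region;

/* Return the region that contains addr, checking boundaries in order. */
cm4_region cm4_region_of(uint32_t addr)
{
    if (addr < 0x20000000u) return REGION_CODE;
    if (addr < 0x40000000u) return REGION_SRAM;
    if (addr < 0x60000000u) return REGION_PERIPHERAL;
    if (addr < 0xA0000000u) return REGION_EXT_RAM;
    if (addr < 0xE0000000u) return REGION_EXT_DEVICE;
    if (addr < 0xE0100000u) return REGION_PPB;
    return REGION_VENDOR;
}
```

For example, cm4_region_of(0x08000000) returns REGION_CODE (the STM32F4 maps its on-chip flash there), and cm4_region_of(0xE000E100), an NVIC register address, returns REGION_PPB.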
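Figure: example system connections. The PPB serves the SCS, NVIC, and debug control; the AHB bus connects the on-chip FLASH (Code region), the on-chip SRAM (SRAM region), timers, UART, and GPIO (Peripheral region), and external components such as external SRAM/FLASH, an external LCD, and an SD card.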
Bit-band Operations
Bit-band operations allow a single load or store instruction to access a single bit in memory, for example to change one bit of a 32-bit word:
- Normal operation without bit-band (read-modify-write):
  - Read the 32-bit value
  - Modify a single bit of the 32-bit value (keep the other bits unchanged)
  - Write the value back to the address
- Bit-band operation:
  - Directly write a single bit (0 or 1) to the bit-band alias address of the data
- Bit-band alias address:
  - Each bit-band alias address is mapped to one bit of a real data address
  - When writing to the bit-band alias address, only that single bit of the data is changed
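The mapping between a bit and its alias word is linear: each byte of the 1MB bit-band region expands to eight 4-byte alias words, so alias = alias_base + byte_offset * 32 + bit_number * 4. The host-side C sketch below (function names are hypothetical) computes this; for example, bit 3 of the word at 0x20000000 gives 0x2200000C. It also covers the Cortex-M4's second bit-band region, the first 1MB of the Peripheral region, whose alias starts at 0x42000000.

```c
#include <stdint.h>

/* Base addresses of the two Cortex-M4 bit-band regions and their aliases. */
#define SRAM_BB_BASE     0x20000000u  /* 1MB SRAM bit-band region          */
#define SRAM_ALIAS_BASE  0x22000000u  /* 32MB alias of the SRAM region     */
#define PERI_BB_BASE     0x40000000u  /* 1MB peripheral bit-band region    */
#define PERI_ALIAS_BASE  0x42000000u  /* 32MB alias of the peripheral region */

/* alias = alias_base + byte_offset * 32 + bit_number * 4
   (no range check: addr must lie in the 1MB region, bit in 0..7 per byte
   or 0..31 when addr is word aligned). */
static uint32_t bitband_alias(uint32_t alias_base, uint32_t region_base,
                              uint32_t addr, uint32_t bit)
{
    return alias_base + (addr - region_base) * 32u + bit * 4u;
}

uint32_t sram_bit_alias(uint32_t addr, uint32_t bit)
{
    return bitband_alias(SRAM_ALIAS_BASE, SRAM_BB_BASE, addr, bit);
}

uint32_t peri_bit_alias(uint32_t addr, uint32_t bit)
{
    return bitband_alias(PERI_ALIAS_BASE, PERI_BB_BASE, addr, bit);
}
```

Writing 1 or 0 to the returned address on the target sets or clears that single bit; on a host PC the function only computes the number.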
;Read-Modify-Write operation
        LDR     R1, =0x20000000   ;Setup address
        LDR     R0, [R1]          ;Read
        ORR.W   R0, #0x8          ;Modify bit
        STR     R0, [R1]          ;Write back

;Bit-band Operation
        LDR     R1, =0x2200000C   ;Setup address
        MOV     R0, #1            ;Load data
        STR     R0, [R1]          ;Write
Read-Modify-Write operation
- Read the data at the real data address (0x20000000)
- Modify the desired bit (keep the other bits unchanged)
- Write the modified data back
Bit-band operation
- Directly set the bit by writing 1 to address 0x2200000C, which is the alias address of the fourth bit (bit [3]) of the 32-bit data at 0x20000000
- In effect, this single store instruction is mapped to two bus transfers: read the data from 0x20000000 into a buffer, then write it back to 0x20000000 from the buffer with bit [3] set
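In C, the two instruction sequences correspond roughly to the sketch below. The function and macro names are hypothetical; the read-modify-write version works on any 32-bit variable, while the bit-band version is only meaningful on a Cortex-M4 target where 0x2200000C is the alias of bit 3 of the word at 0x20000000. The compiler chooses its own instruction sequence for either function.

```c
#include <stdint.h>

/* Read-modify-write: C equivalent of LDR / ORR.W / STR.
   Sets bit 3 of *word and leaves the other 31 bits unchanged. */
void set_bit3_rmw(volatile uint32_t *word)
{
    uint32_t value = *word;   /* read       */
    value |= 0x8u;            /* modify     */
    *word = value;            /* write back */
}

/* Bit-band: C equivalent of LDR / MOV / STR.
   Target only: 0x2200000C is the alias word for bit 3 of 0x20000000. */
#define ALIAS_BIT3_OF_SRAM_WORD0 (*(volatile uint32_t *)0x2200000Cu)

void set_bit3_bitband(void)
{
    ALIAS_BIT3_OF_SRAM_WORD0 = 1u;  /* single store; the bus does the read-modify-write */
}
```

Because the bit-band write is a single store that the hardware turns into an atomic read-modify-write, it cannot be split by an interrupt between the read and the write, which makes it convenient for flags shared with interrupt handlers.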
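Figure: bit-band address mapping. Each bit of the words at 0x20000000, 0x20000004, and 0x20000008 maps to its own 32-bit word in the alias region, starting at 0x22000000, 0x22000080, and 0x22000100 respectively; for example, bit 3 of 0x20000000 maps to 0x2200000C and bit 6 maps to 0x22000018.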
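Figure: bit-band regions in the memory map. Within the 512MB SRAM region (0x20000000-0x3FFFFFFF), the first 1MB (0x20000000-0x200FFFFF) is the bit-band region; it is followed by a 31MB non-bit-band region (0x20100000-0x21FFFFFF) and the 32MB bit-band alias region, which ends at 0x23FFFFFF. The 512MB Code region (0x00000000-0x1FFFFFFF) lies below, and the Peripherals region (from 0x40000000) and External RAM (from 0x60000000) lie above.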
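Figure: interrupt handling. When an interrupt occurs, the main program is suspended and the processor reads data from the vector table starting at address 0x00; when the interrupt returns, the main program resumes.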
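Program image
The program image starts with the vector table at address 0x00000000:

Offset       Entry
0x00         Initial MSP value
0x04         Reset vector
0x08         NMI vector
0x0C         Hard fault vector
0x10         MemManage fault
0x14         Bus fault
0x18         Usage fault
0x1C-0x28    Reserved
0x2C         SVCall
0x30         Debug monitor
0x34         Reserved
0x38         PendSV
0x3C         SysTick
0x40 onward  External interrupts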
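Each vector table entry is a 32-bit word, so the entry for exception number n sits at byte offset 4n from the start of the table. The C sketch below (hypothetical names) encodes the exception numbers implied by the table and computes the offsets; external interrupt k is exception number 16 + k.

```c
#include <stdint.h>

/* Cortex-M4 exception numbers. Entry 0 is not an exception: it holds the
   initial MSP value. The vector for exception n is at offset 4 * n. */
enum cm4_exception {
    EXC_RESET = 1, EXC_NMI = 2, EXC_HARDFAULT = 3, EXC_MEMMANAGE = 4,
    EXC_BUSFAULT = 5, EXC_USAGEFAULT = 6, EXC_SVCALL = 11,
    EXC_DEBUGMON = 12, EXC_PENDSV = 14, EXC_SYSTICK = 15,
    EXC_IRQ0 = 16           /* external interrupt k is exception 16 + k */
};

/* Byte offset of an exception's vector from the vector table base. */
uint32_t vector_offset(uint32_t exception_number)
{
    return 4u * exception_number;
}

/* Byte offset of external interrupt irq (IRQ0, IRQ1, ...). */
uint32_t irq_vector_offset(uint32_t irq)
{
    return vector_offset(EXC_IRQ0 + irq);
}
```

The Cortex-M4 also has a Vector Table Offset Register (VTOR) that lets software relocate the table away from address 0x00000000; the offsets relative to the table base stay the same.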
Cortex-M4 Endianness
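Endianness refers to the order in which the bytes of a multi-byte value are stored in memory:
- Little endian: the byte at the lowest address of a word (Byte0) holds bits 0 to 7 of the word
- Big endian: the byte at the lowest address of a word (Byte0) holds bits 24 to 31 of the word
- The Cortex-M4 supports both little endian and big endian
- However, endianness exists only at the hardware level: it determines how the bytes of a word are laid out in memory, while a 32-bit value held in a register is the same either way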
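The two layouts reduce to a shift: in little endian the byte at offset i within a word is bits [8i+7:8i], while in big endian it is bits [31-8i:24-8i]. The C sketch below (hypothetical function names) computes the byte stored at each offset for either layout, and checks the byte order of whatever machine runs it by viewing a known word through a byte pointer.

```c
#include <stdint.h>

/* Byte that a 32-bit word places at address (base + i), i = 0..3,
   for little-endian (big_endian == 0) or big-endian memory. */
uint8_t byte_at_offset(uint32_t word, unsigned i, int big_endian)
{
    unsigned shift = big_endian ? 8u * (3u - i) : 8u * i;
    return (uint8_t)(word >> shift);
}

/* Returns 1 if the machine running this code is little endian. */
int host_is_little_endian(void)
{
    const uint32_t probe = 0x11223344u;
    const uint8_t *first = (const uint8_t *)&probe;  /* lowest address */
    return first[0] == 0x44u;
}
```

For 0x11223344, a little-endian memory holds 0x44 at the lowest address and a big-endian memory holds 0x11 there; the register value is 0x11223344 in both cases.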
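Little endian:
Address      [31:24]  [23:16]  [15:8]   [7:0]    Word
0x00000008   Byte3    Byte2    Byte1    Byte0    Word 3
0x00000004   Byte3    Byte2    Byte1    Byte0    Word 2
0x00000000   Byte3    Byte2    Byte1    Byte0    Word 1

Big endian:
Address      [31:24]  [23:16]  [15:8]   [7:0]    Word
0x00000008   Byte0    Byte1    Byte2    Byte3    Word 3
0x00000004   Byte0    Byte1    Byte2    Byte3    Word 2
0x00000000   Byte0    Byte1    Byte2    Byte3    Word 1

In both layouts, Byte0 is the byte at the lowest address of each word.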
Figure: Thumb instructions are remapped to ARM instructions, which pass through the ARM instruction decoder and are then executed.
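Cortex-M4 instruction set summary:

Mnemonic            Operands                Flags    Brief description
ADC, ADCS           {Rd,} Rn, Op2           N,Z,C,V  Add with Carry
ADD, ADDS           {Rd,} Rn, Op2           N,Z,C,V  Add
ADD, ADDW           {Rd,} Rn, #imm12        N,Z,C,V  Add
ADR                 Rd, label               -        Load PC-relative address
AND, ANDS           {Rd,} Rn, Op2           N,Z,C    Logical AND
ASR, ASRS           Rd, Rm, <Rs|#n>         N,Z,C    Arithmetic Shift Right
B                   label                   -        Branch
BFC                 Rd, #lsb, #width        -        Bit Field Clear
BFI                 Rd, Rn, #lsb, #width    -        Bit Field Insert
BIC, BICS           {Rd,} Rn, Op2           N,Z,C    Bit Clear
BKPT                #imm                    -        Breakpoint
BL                  label                   -        Branch with Link
BLX                 Rm                      -        Branch indirect with Link
BX                  Rm                      -        Branch indirect
CBNZ                Rn, label               -        Compare and Branch if Non Zero
CBZ                 Rn, label               -        Compare and Branch if Zero
CLREX               -                       -        Clear Exclusive
CLZ                 Rd, Rm                  -        Count Leading Zeros
CMN                 Rn, Op2                 N,Z,C,V  Compare Negative
CMP                 Rn, Op2                 N,Z,C,V  Compare
CPSID               i                       -        Change Processor State, Disable Interrupts
CPSIE               i                       -        Change Processor State, Enable Interrupts
DMB                 -                       -        Data Memory Barrier
DSB                 -                       -        Data Synchronization Barrier
EOR, EORS           {Rd,} Rn, Op2           N,Z,C    Exclusive OR
ISB                 -                       -        Instruction Synchronization Barrier
IT                  -                       -        If-Then condition block
LDM                 Rn{!}, reglist          -        Load Multiple registers, increment after
LDMDB, LDMEA        Rn{!}, reglist          -        Load Multiple registers, decrement before
LDMFD, LDMIA        Rn{!}, reglist          -        Load Multiple registers, increment after
LDR                 Rt, [Rn, #offset]       -        Load Register with word
LDRB, LDRBT         Rt, [Rn, #offset]       -        Load Register with byte
LDRD                Rt, Rt2, [Rn, #offset]  -        Load Register with two words
LDREX               Rt, [Rn, #offset]       -        Load Register Exclusive
LDREXB              Rt, [Rn]                -        Load Register Exclusive with byte
LDREXH              Rt, [Rn]                -        Load Register Exclusive with halfword
LDRH, LDRHT         Rt, [Rn, #offset]       -        Load Register with halfword
LDRSB, LDRSBT       Rt, [Rn, #offset]       -        Load Register with signed byte
LDRSH, LDRSHT       Rt, [Rn, #offset]       -        Load Register with signed halfword
LDRT                Rt, [Rn, #offset]       -        Load Register with word
LSL, LSLS           Rd, Rm, <Rs|#n>         N,Z,C    Logical Shift Left
LSR, LSRS           Rd, Rm, <Rs|#n>         N,Z,C    Logical Shift Right
MLA                 Rd, Rn, Rm, Ra          -        Multiply with Accumulate, 32-bit result
MLS                 Rd, Rn, Rm, Ra          -        Multiply and Subtract, 32-bit result
MOV, MOVS           Rd, Op2                 N,Z,C    Move
MOVT                Rd, #imm16              -        Move Top
MOVW, MOV           Rd, #imm16              N,Z,C    Move 16-bit constant
MRS                 Rd, spec_reg            -        Move from special register to general register
MSR                 spec_reg, Rm            N,Z,C,V  Move from general register to special register
MUL, MULS           {Rd,} Rn, Rm            N,Z      Multiply, 32-bit result
MVN, MVNS           Rd, Op2                 N,Z,C    Move NOT
NOP                 -                       -        No Operation
ORN, ORNS           {Rd,} Rn, Op2           N,Z,C    Logical OR NOT
ORR, ORRS           {Rd,} Rn, Op2           N,Z,C    Logical OR
PKHTB, PKHBT        {Rd,} Rn, Rm, Op2       -        Pack Halfword
POP                 reglist                 -        Pop registers from stack
PUSH                reglist                 -        Push registers onto stack
QADD                {Rd,} Rn, Rm            Q        Saturating Add
QADD16              {Rd,} Rn, Rm            -        Saturating Add 16
QADD8               {Rd,} Rn, Rm            -        Saturating Add 8
QASX                {Rd,} Rn, Rm            -        Saturating Add and Subtract with Exchange
QDADD               {Rd,} Rn, Rm            Q        Saturating Double and Add
QDSUB               {Rd,} Rn, Rm            Q        Saturating Double and Subtract
QSAX                {Rd,} Rn, Rm            -        Saturating Subtract and Add with Exchange
QSUB                {Rd,} Rn, Rm            Q        Saturating Subtract
QSUB16              {Rd,} Rn, Rm            -        Saturating Subtract 16
QSUB8               {Rd,} Rn, Rm            -        Saturating Subtract 8
RBIT                Rd, Rn                  -        Reverse Bits
REV                 Rd, Rn                  -        Reverse byte order in a word
REV16               Rd, Rn                  -        Reverse byte order in each halfword
REVSH               Rd, Rn                  -        Reverse byte order in bottom halfword and sign extend
ROR, RORS           Rd, Rm, <Rs|#n>         N,Z,C    Rotate Right
RRX, RRXS           Rd, Rm                  N,Z,C    Rotate Right with Extend
RSB, RSBS           {Rd,} Rn, Op2           N,Z,C,V  Reverse Subtract
SADD16              {Rd,} Rn, Rm            GE       Signed Add 16
SADD8               {Rd,} Rn, Rm            GE       Signed Add 8
SASX                {Rd,} Rn, Rm            GE       Signed Add and Subtract with Exchange
SBC, SBCS           {Rd,} Rn, Op2           N,Z,C,V  Subtract with Carry
SBFX                Rd, Rn, #lsb, #width    -        Signed Bit Field Extract
SDIV                {Rd,} Rn, Rm            -        Signed Divide
SEV                 -                       -        Send Event
SHADD16             {Rd,} Rn, Rm            -        Signed Halving Add 16
SHADD8              {Rd,} Rn, Rm            -        Signed Halving Add 8
SHASX               {Rd,} Rn, Rm            -        Signed Halving Add and Subtract with Exchange
SHSAX               {Rd,} Rn, Rm            -        Signed Halving Subtract and Add with Exchange
SHSUB16             {Rd,} Rn, Rm            -        Signed Halving Subtract 16
SHSUB8              {Rd,} Rn, Rm            -        Signed Halving Subtract 8
SMLAxy              Rd, Rn, Rm, Ra          Q        Signed Multiply Accumulate, halfwords (xy = BB, BT, TB, TT)
SMLAD, SMLADX       Rd, Rn, Rm, Ra          Q        Signed Multiply Accumulate Dual
SMLAL               RdLo, RdHi, Rn, Rm      -        Signed Multiply with Accumulate (32 x 32 + 64), 64-bit result
SMLALxy             RdLo, RdHi, Rn, Rm      -        Signed Multiply Accumulate Long, halfwords (xy = BB, BT, TB, TT)
SMLALD, SMLALDX     RdLo, RdHi, Rn, Rm      -        Signed Multiply Accumulate Long Dual
SMLAWB, SMLAWT      Rd, Rn, Rm, Ra          Q        Signed Multiply Accumulate, word by halfword
SMLSD               Rd, Rn, Rm, Ra          Q        Signed Multiply Subtract Dual
SMLSLD              RdLo, RdHi, Rn, Rm      -        Signed Multiply Subtract Long Dual
SMMLA               Rd, Rn, Rm, Ra          -        Signed Most significant word Multiply Accumulate
SMMLS, SMMLSR       Rd, Rn, Rm, Ra          -        Signed Most significant word Multiply Subtract
SMMUL, SMMULR       {Rd,} Rn, Rm            -        Signed Most significant word Multiply
SMUAD               {Rd,} Rn, Rm            Q        Signed dual Multiply Add
SMULxy              {Rd,} Rn, Rm            -        Signed Multiply, halfwords (xy = BB, BT, TB, TT)
SMULL               RdLo, RdHi, Rn, Rm      -        Signed Multiply (32 x 32), 64-bit result
SMULWB, SMULWT      {Rd,} Rn, Rm            -        Signed Multiply, word by halfword
SMUSD, SMUSDX       {Rd,} Rn, Rm            -        Signed dual Multiply Subtract
SSAT                Rd, #n, Rm {,shift #s}  Q        Signed Saturate
SSAT16              Rd, #n, Rm              Q        Signed Saturate 16
SSAX                {Rd,} Rn, Rm            GE       Signed Subtract and Add with Exchange
SSUB16              {Rd,} Rn, Rm            GE       Signed Subtract 16
SSUB8               {Rd,} Rn, Rm            GE       Signed Subtract 8
STM                 Rn{!}, reglist          -        Store Multiple registers, increment after
STMDB, STMFD        Rn{!}, reglist          -        Store Multiple registers, decrement before
STMIA, STMEA        Rn{!}, reglist          -        Store Multiple registers, increment after
STR                 Rt, [Rn, #offset]       -        Store Register word
STRB, STRBT         Rt, [Rn, #offset]       -        Store Register byte
STRD                Rt, Rt2, [Rn, #offset]  -        Store Register two words
STREX               Rd, Rt, [Rn, #offset]   -        Store Register Exclusive
STREXB              Rd, Rt, [Rn]            -        Store Register Exclusive byte
STREXH              Rd, Rt, [Rn]            -        Store Register Exclusive halfword
STRH, STRHT         Rt, [Rn, #offset]       -        Store Register halfword
STRT                Rt, [Rn, #offset]       -        Store Register word
SUB, SUBS           {Rd,} Rn, Op2           N,Z,C,V  Subtract
SUB, SUBW           {Rd,} Rn, #imm12        N,Z,C,V  Subtract
SVC                 #imm                    -        Supervisor Call
SXTAB               {Rd,} Rn, Rm {,ROR #n}  -        Sign extend a byte to 32 bits and add
SXTAB16             {Rd,} Rn, Rm {,ROR #n}  -        Dual sign extend bytes to 16 bits and add
SXTAH               {Rd,} Rn, Rm {,ROR #n}  -        Sign extend a halfword to 32 bits and add
SXTB16              {Rd,} Rm {,ROR #n}      -        Dual sign extend bytes to 16 bits
SXTB                {Rd,} Rm {,ROR #n}      -        Sign extend a byte
SXTH                {Rd,} Rm {,ROR #n}      -        Sign extend a halfword
TBB                 [Rn, Rm]                -        Table Branch Byte
TBH                 [Rn, Rm, LSL #1]        -        Table Branch Halfword
TEQ                 Rn, Op2                 N,Z,C    Test Equivalence
TST                 Rn, Op2                 N,Z,C    Test
UADD16              {Rd,} Rn, Rm            GE       Unsigned Add 16
UADD8               {Rd,} Rn, Rm            GE       Unsigned Add 8
USAX                {Rd,} Rn, Rm            GE       Unsigned Subtract and Add with Exchange
UHADD16             {Rd,} Rn, Rm            -        Unsigned Halving Add 16
UHADD8              {Rd,} Rn, Rm            -        Unsigned Halving Add 8
UHASX               {Rd,} Rn, Rm            -        Unsigned Halving Add and Subtract with Exchange
UHSAX               {Rd,} Rn, Rm            -        Unsigned Halving Subtract and Add with Exchange
UHSUB16             {Rd,} Rn, Rm            -        Unsigned Halving Subtract 16
UHSUB8              {Rd,} Rn, Rm            -        Unsigned Halving Subtract 8
UBFX                Rd, Rn, #lsb, #width    -        Unsigned Bit Field Extract
UDIV                {Rd,} Rn, Rm            -        Unsigned Divide
UMAAL               RdLo, RdHi, Rn, Rm      -        Unsigned Multiply Accumulate Accumulate Long (32 x 32 + 32 + 32), 64-bit result
UMLAL               RdLo, RdHi, Rn, Rm      -        Unsigned Multiply with Accumulate (32 x 32 + 64), 64-bit result
UMULL               RdLo, RdHi, Rn, Rm      -        Unsigned Multiply (32 x 32), 64-bit result
UQADD16             {Rd,} Rn, Rm            -        Unsigned Saturating Add 16
UQADD8              {Rd,} Rn, Rm            -        Unsigned Saturating Add 8
UQASX               {Rd,} Rn, Rm            -        Unsigned Saturating Add and Subtract with Exchange
UQSAX               {Rd,} Rn, Rm            -        Unsigned Saturating Subtract and Add with Exchange
UQSUB16             {Rd,} Rn, Rm            -        Unsigned Saturating Subtract 16
UQSUB8              {Rd,} Rn, Rm            -        Unsigned Saturating Subtract 8
USAD8               {Rd,} Rn, Rm            -        Unsigned Sum of Absolute Differences
USADA8              {Rd,} Rn, Rm, Ra        -        Unsigned Sum of Absolute Differences and Accumulate
USAT                Rd, #n, Rm {,shift #s}  Q        Unsigned Saturate
USAT16              Rd, #n, Rm              Q        Unsigned Saturate 16
UASX                {Rd,} Rn, Rm            GE       Unsigned Add and Subtract with Exchange
USUB16              {Rd,} Rn, Rm            GE       Unsigned Subtract 16
USUB8               {Rd,} Rn, Rm            GE       Unsigned Subtract 8
UXTAB               {Rd,} Rn, Rm {,ROR #n}  -        Zero extend a byte to 32 bits and add
UXTAB16             {Rd,} Rn, Rm {,ROR #n}  -        Dual zero extend bytes to 16 bits and add
UXTAH               {Rd,} Rn, Rm {,ROR #n}  -        Zero extend a halfword to 32 bits and add
UXTB                {Rd,} Rm {,ROR #n}      -        Zero extend a byte
UXTB16              {Rd,} Rm {,ROR #n}      -        Dual zero extend bytes to 16 bits
UXTH                {Rd,} Rm {,ROR #n}      -        Zero extend a halfword
VABS.F32            Sd, Sm                  -        Floating-point Absolute
VADD.F32            {Sd,} Sn, Sm            -        Floating-point Add
VCMP.F32            Sd, <Sm | #0.0>         FPSCR    Compare two floating-point registers, or one floating-point register and zero
VCMPE.F32           Sd, <Sm | #0.0>         FPSCR    Compare two floating-point registers, or one floating-point register and zero, with Invalid Operation check
VCVT.S32.F32        Sd, Sm                  -        Convert between floating-point and integer
VCVT.S16.F32        Sd, Sd, #fbits          -        Convert between floating-point and fixed point
VCVTR.S32.F32       Sd, Sm                  -        Convert between floating-point and integer with rounding
VCVTB.F32.F16       Sd, Sm                  -        Convert half-precision value (bottom half of Sm) to single-precision
VCVTT.F32.F16       Sd, Sm                  -        Convert half-precision value (top half of Sm) to single-precision
VDIV.F32            {Sd,} Sn, Sm            -        Floating-point Divide
VFMA.F32            {Sd,} Sn, Sm            -        Floating-point Fused Multiply Accumulate
VFNMA.F32           {Sd,} Sn, Sm            -        Floating-point Fused Negate Multiply Accumulate
VFMS.F32            {Sd,} Sn, Sm            -        Floating-point Fused Multiply Subtract
VFNMS.F32           {Sd,} Sn, Sm            -        Floating-point Fused Negate Multiply Subtract
VLDM.F<32|64>       Rn{!}, list             -        Load Multiple extension registers
VLDR.F<32|64>       <Dd|Sd>, [Rn]           -        Load an extension register from memory
VMLA.F32            {Sd,} Sn, Sm            -        Floating-point Multiply Accumulate
VMLS.F32            {Sd,} Sn, Sm            -        Floating-point Multiply Subtract
VMOV.F32            Sd, #imm                -        Floating-point Move immediate
VMOV                Sd, Sm                  -        Floating-point Move register
VMOV                Sn, Rt                  -        Copy ARM core register to single-precision register
VMOV                Sm, Sm1, Rt, Rt2        -        Copy two ARM core registers to two single-precision registers
VMOV                Dd[x], Rt               -        Copy ARM core register to scalar
VMOV                Rt, Dn[x]               -        Copy scalar to ARM core register
VMRS                Rt, FPSCR               N,Z,C,V  Move FPSCR to ARM core register or APSR
VMSR                FPSCR, Rt               FPSCR    Move ARM core register to FPSCR
VMUL.F32            {Sd,} Sn, Sm            -        Floating-point Multiply
VNEG.F32            Sd, Sm                  -        Floating-point Negate
VNMLA.F32           Sd, Sn, Sm              -        Floating-point Negated Multiply Accumulate
VNMLS.F32           Sd, Sn, Sm              -        Floating-point Negated Multiply Subtract
VNMUL               {Sd,} Sn, Sm            -        Floating-point Negated Multiply
VPOP                list                    -        Pop extension registers
VPUSH               list                    -        Push extension registers
VSQRT.F32           Sd, Sm                  -        Floating-point Square Root
VSTM                Rn{!}, list             -        Store Multiple extension registers
VSTR.F<32|64>       Sd, [Rn]                -        Store an extension register to memory
VSUB.F<32|64>       {Sd,} Sn, Sm            -        Floating-point Subtract
WFE                 -                       -        Wait For Event
WFI                 -                       -        Wait For Interrupt

Flags lists the status flags an instruction can update (- means none; Q is the saturation flag, GE the SIMD greater-than-or-equal flags, FPSCR the floating-point status flags). Op2 is the flexible second operand (a register with an optional shift, or an immediate constant); braces mark optional operands. A T suffix (LDRT, LDRBT, STRT, ...) denotes the unprivileged variant.

Note: a full explanation of each instruction can be found in the Cortex-M4 Devices Generic User Guide (Ref-4).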
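Several of the bit-manipulation and saturating instructions in the table have simple C equivalents. The portable models below (hypothetical function names) are meant to make the table entries concrete and to serve as reference values when testing hand-written assembly; they do not model flag updates such as setting Q on saturation.

```c
#include <stdint.h>

/* REV: reverse the byte order of a word. */
uint32_t rev(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* REV16: reverse the byte order within each halfword. */
uint32_t rev16(uint32_t x)
{
    return ((x >> 8) & 0x00FF00FFu) | ((x << 8) & 0xFF00FF00u);
}

/* RBIT: reverse the bit order of a word. */
uint32_t rbit(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);  /* move the LSB of x into r */
        x >>= 1;
    }
    return r;
}

/* CLZ: count leading zeros (32 when x == 0). */
unsigned clz(uint32_t x)
{
    unsigned n = 0;
    while (n < 32u && !(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* UBFX Rd, Rn, #lsb, #width: unsigned bit field extract (width <= 32 - lsb). */
uint32_t ubfx(uint32_t rn, unsigned lsb, unsigned width)
{
    uint32_t mask = (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (rn >> lsb) & mask;
}

/* BFI Rd, Rn, #lsb, #width: insert the low width bits of Rn into Rd at lsb. */
uint32_t bfi(uint32_t rd, uint32_t rn, unsigned lsb, unsigned width)
{
    uint32_t mask = ((width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u)) << lsb;
    return (rd & ~mask) | ((rn << lsb) & mask);
}

/* USAT Rd, #n, Rm: saturate to the unsigned range 0..2^n - 1 (n <= 31). */
uint32_t usat(int32_t x, unsigned n)
{
    int64_t max = ((int64_t)1 << n) - 1;
    if (x < 0) return 0u;
    if ((int64_t)x > max) return (uint32_t)max;
    return (uint32_t)x;
}

/* SSAT Rd, #n, Rm: saturate to the signed range -2^(n-1)..2^(n-1) - 1 (1 <= n <= 32). */
int32_t ssat(int32_t x, unsigned n)
{
    int64_t max = ((int64_t)1 << (n - 1)) - 1;
    int64_t min = -((int64_t)1 << (n - 1));
    if ((int64_t)x > max) return (int32_t)max;
    if ((int64_t)x < min) return (int32_t)min;
    return x;
}
```

Optimizing compilers may map idioms like these onto the single instructions (for example REV or CLZ), but whether they do depends on the compiler and its options.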
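Conditional execution
Suffix: EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE (e.g. EQ = equal, NE = not equal, LT = less than)
Example: BNE label
Example explanation: branch to label if the Z flag is clear, that is, if the preceding comparison found its operands not equal.
On the Cortex-M4 (Thumb instruction set), a conditional branch takes the suffix directly; other instructions become conditional by placing them in an IT (If-Then) block.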
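The suffix is evaluated against the APSR flags left by an earlier flag-setting instruction such as CMP. The C sketch below (hypothetical names) models the flags produced by CMP and the pass/fail test for each suffix in the list above, following the standard ARM condition definitions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Model of the APSR condition flags. */
typedef struct { bool n, z, c, v; } apsr_flags;

/* CMP Rn, Rm computes Rn - Rm, discards the result, and sets N, Z, C, V. */
apsr_flags cmp_flags(uint32_t rn, uint32_t rm)
{
    uint32_t r = rn - rm;
    apsr_flags f;
    f.n = (r >> 31) & 1u;                        /* result negative       */
    f.z = (r == 0u);                             /* result zero           */
    f.c = (rn >= rm);                            /* no borrow (unsigned)  */
    f.v = (((rn ^ rm) & (rn ^ r)) >> 31) & 1u;   /* signed overflow       */
    return f;
}

typedef enum {
    COND_EQ, COND_NE, COND_CS, COND_CC, COND_MI, COND_PL, COND_VS,
    COND_VC, COND_HI, COND_LS, COND_GE, COND_LT, COND_GT, COND_LE
} cond_code;

/* True if an instruction with this condition suffix would execute. */
bool cond_passed(cond_code cc, apsr_flags f)
{
    switch (cc) {
    case COND_EQ: return f.z;
    case COND_NE: return !f.z;
    case COND_CS: return f.c;                    /* unsigned >= */
    case COND_CC: return !f.c;                   /* unsigned <  */
    case COND_MI: return f.n;
    case COND_PL: return !f.n;
    case COND_VS: return f.v;
    case COND_VC: return !f.v;
    case COND_HI: return f.c && !f.z;            /* unsigned >  */
    case COND_LS: return !f.c || f.z;            /* unsigned <= */
    case COND_GE: return f.n == f.v;             /* signed >=   */
    case COND_LT: return f.n != f.v;             /* signed <    */
    case COND_GT: return !f.z && (f.n == f.v);   /* signed >    */
    case COND_LE: return f.z || (f.n != f.v);    /* signed <=   */
    }
    return false;
}
```

For instance, a BNE following CMP R1, #0 is taken exactly when cond_passed(COND_NE, cmp_flags(r1, 0)) is true. Comparing 0xFFFFFFFF with 1 shows why separate signed and unsigned conditions exist: LT passes (signed -1 < 1) and HI also passes (unsigned 0xFFFFFFFF > 1).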
C Calling Assembly
For real-time DSP applications, the most common scenario that involves writing assembly code, if it is needed at all, is C calling assembly. In simple terms, the rules are:
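Register  APCS name  Usage
r0        a1         Argument 1 and return value
r1        a2         Argument 2
r2        a3         Argument 3
r3        a4         Argument 4
r4        v1         Variable register (preserved by the callee)
r5        v2         Variable register (preserved by the callee)
r6        v3         Variable register (preserved by the callee)
r7        v4         Variable register (preserved by the callee)
r8        v5         Variable register (preserved by the callee)
r9        (v6)       Variable register (platform-specific use)
r10       v7         Variable register (preserved by the callee)
r11       v8         Variable register (preserved by the callee)
r12       IP         Intra-procedure-call scratch register
r13       SP         Stack pointer
r14       LR         Link register (return address)
r15       PC         Program counter

- The first four 32-bit arguments are passed in r0-r3 (a1-a4); any further arguments are passed on the stack.
- A 32-bit result is returned in r0.
- The called routine may freely change r0-r3 and r12, but must preserve r4-r11 (v1-v8) and SP; it returns with BX LR.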
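Consider computing the norm-squared of a vector A with N elements:

$$\|A\|^2 = \sum_{n=1}^{N} A_n^2 \qquad (3.1)$$

where

$$A = [A_1, A_2, \ldots, A_N] \qquad (3.2)$$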
C Version
We implement this simple routine in C, with the vector contents in the array v and the vector length N passed as the argument n. The C source, which includes the called function norm_sq_c, is given below:
/******************************************************
Vector norm-squared routine in C
******************************************************/
#include <stdint.h>
#include <stdio.h>

int16_t norm_sq_c(int16_t* v, int16_t n); // prototype; defined below main()

int main(void){
    int16_t x = 0;
    int16_t v[5] = {1,2,3,6,7};
    ...
    x = norm_sq_c(v, 5); // call the C function
    sprintf(my_debug, "Norm: The answer is %d\n", x); // my_debug: char buffer, assumed declared in the elided code
    TM_USART_Puts(USART6, my_debug);
    ...
}

int16_t norm_sq_c(int16_t* v, int16_t n)
{
    int16_t i;
    int16_t out = 0;
    for(i=0; i<n; i++)
    {
        out += v[i]*v[i]; // square and accumulate
    }
    return out;
}
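One detail of norm_sq_c worth noting: out is an int16_t, so once the sum of squares exceeds 32767 the result no longer fits (for v = {200, 200} the true answer is 80000). The sketch below is a hypothetical variant, norm_sq_c32, that accumulates in int32_t instead, which also matches the 32-bit R0 accumulator used by MLA in the assembly version. A 32-bit sum can still overflow if several elements are close to +/-32768, since each square can reach 2^30.

```c
#include <stdint.h>

/* Hypothetical variant of norm_sq_c with a 32-bit accumulator.
   Each product is formed in 32 bits before being added. */
int32_t norm_sq_c32(const int16_t *v, int16_t n)
{
    int32_t out = 0;
    for (int16_t i = 0; i < n; i++) {
        out += (int32_t)v[i] * v[i];  /* square and accumulate in 32 bits */
    }
    return out;
}
```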
; Number of elements: R1
        MOVS    R2, R0          ; move the vector address in R0 to R2
        MOVS    R0, #0          ; initialize the result
sum_loop
        LDRSH   R3, [R2], #0x2  ; load the int16_t value pointed to by R2
                                ; into R3, then increment R2 by 2
        MLA     R0, R3, R3, R0  ; square and accumulate in one step (faster)
        SUBS    R1, R1, #1      ; R1 = R1 - 1, decrement the count
        CMP     R1, #0          ; compare to 0 and set the Z flag
                                ; (SUBS already sets Z, so this CMP is optional)
        BNE     sum_loop        ; branch if the count is not zero
        BX      LR              ; return, with the result in R0
        ENDFUNC
        END
; End of file
From the C source alone, it is not obvious that the function declared by the norm_sq_asm prototype is actually an assembly routine.
The answer is again 99, as shown in the CoolTerm serial terminal output.
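From the caller's point of view, the assembly routine is used through an ordinary C prototype (something like int16_t norm_sq_asm(int16_t *v, int16_t n), mirroring norm_sq_c; the exact prototype used in the project is not reproduced here). To see what the assembly loop computes, the hypothetical C model below follows it instruction by instruction. Note one behavioral difference from norm_sq_c: the assembly tests the count only after the first pass, so it assumes n >= 1.

```c
#include <stdint.h>

/* Hypothetical C model of the norm_sq_asm loop, register by register.
   As in the assembly, the loop body runs at least once (n must be >= 1). */
int32_t norm_sq_asm_model(const int16_t *v, int16_t n)
{
    const int16_t *r2 = v;          /* MOVS R2, R0                       */
    uint32_t r1 = (uint32_t)n;      /* R1 holds the element count        */
    uint32_t r0 = 0u;               /* MOVS R0, #0                       */
    do {
        int32_t r3 = *r2++;         /* LDRSH R3, [R2], #2 (sign-extend)  */
        r0 = (uint32_t)r3 * (uint32_t)r3 + r0;  /* MLA R0, R3, R3, R0, modulo 2^32 */
        r1 = r1 - 1u;               /* SUBS R1, R1, #1                   */
    } while (r1 != 0u);             /* BNE sum_loop                      */
    return (int32_t)r0;             /* BX LR: result in R0               */
}
```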
Performance Comparison
In the Keil IDE debugger, we set breakpoints around the function to be timed:
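Then note the States (cycle count) and Sec (elapsed time) values in the Registers window.

Figure: timing results for norm_sq_asm, norm_sq_c with O3, and norm_sq_c with O0. The values shown are 86 cycles (0.47 us), 86 cycles (0.51 us), and 49 cycles (0.29 us), with the last marked BEST!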
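The States and Sec values are related by the core clock: time = cycles / f_clk. The C sketch below assumes a 168 MHz core clock (a common STM32F4 Discovery configuration, not stated in these notes); at that rate 86 cycles is about 0.51 us and 49 cycles about 0.29 us. As an alternative to the debugger counter, it also sketches reading the DWT cycle counter on the target; the register addresses come from the ARMv7-M debug architecture, and the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Convert a cycle count to microseconds for a given core clock (Hz). */
double cycles_to_us(uint32_t cycles, double f_clk_hz)
{
    return (double)cycles / f_clk_hz * 1e6;
}

/* Target only: DWT cycle counter registers (ARMv7-M debug architecture). */
#define DEMCR_REG       (*(volatile uint32_t *)0xE000EDFCu)  /* Debug Exception and Monitor Control */
#define DWT_CTRL_REG    (*(volatile uint32_t *)0xE0001000u)  /* DWT control            */
#define DWT_CYCCNT_REG  (*(volatile uint32_t *)0xE0001004u)  /* DWT cycle count        */

/* Enable the DWT block and start the cycle counter from zero. */
void dwt_start(void)
{
    DEMCR_REG |= (1u << 24);   /* TRCENA: enable trace/DWT      */
    DWT_CYCCNT_REG = 0u;       /* reset the cycle counter       */
    DWT_CTRL_REG |= 1u;        /* CYCCNTENA: start counting     */
}

/* Current value of the free-running cycle counter. */
uint32_t dwt_cycles(void)
{
    return DWT_CYCCNT_REG;
}
```

On the target, after calling dwt_start() once, a measurement looks like uint32_t t0 = dwt_cycles(); x = norm_sq_c(v, 5); uint32_t dt = dwt_cycles() - t0; where dt includes a few cycles of measurement overhead.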
Sample Results
For an input of 64, the output is 8, as expected. Measured: 125 cycles (0.74 us).
Useful Resources
ARMv7-M Architecture Reference Manual (ARM DDI 0403C):
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403c/index.html