ARM Book 1
Programming
Techniques
Document Number: ARM DUI 0021A
Issued: June 1995
Beta Draft
Copyright Advanced RISC Machines Ltd (ARM) 1995
Neither the whole nor any part of the information contained in, or the product described in, this
datasheet may be adapted or reproduced in any material form except with the prior written permission
of the copyright holder.
The product described in this datasheet is subject to continuous developments and improvements.
All particulars of the product and its use contained in this datasheet are given by ARM in good faith.
However, all warranties implied or expressed, including but not limited to implied warranties
of merchantability, or fitness for purpose, are excluded.
This datasheet is intended only to assist the reader in the use of the product. ARM Ltd shall not be liable
for any loss or damage arising from the use of any information in this datasheet, or any error or omission
in such information, or any incorrect use of the product.
Change Log
Issue Date By Change
A June 95 PB/BH/EH/AP Created
Contents
1 Introduction 1-1
1.1 About this manual 1-2
1.2 Feedback 1-3
4 ARM Assembly Language Basics 4-1
4.1 Introduction 4-2
4.2 Structure of an Assembler Module 4-4
4.3 Conditional Execution 4-6
4.4 The ARM’s Barrel Shifter 4-10
4.5 Loading Constants Into Registers 4-14
4.6 Loading Addresses Into Registers 4-17
4.7 Jump Tables 4-21
4.8 Using the Load and Store Multiple Instructions 4-23
11 Exceptions 11-1
11.1 Overview 11-2
11.2 Entering and Leaving an Exception 11-5
11.3 The Return Address and Return Instruction 11-6
11.4 Writing an Exception Handler 11-8
11.5 Installing an Exception Handler 11-12
11.6 Exception Handling on Thumb-Aware Processors 11-14
Typographical conventions
The following typographical conventions are used in this manual:
Filenames
Unless otherwise stated, filenames are quoted in Unix format—for example:
examples/basicasm/gcd1.s
If you are using the PC platform, you must translate them into their DOS equivalent:
EXAMPLES\BASICASM\GCD1.S
1.2 Feedback
1.2.1 Feedback on the Software Development Toolkit
If you have feedback on the Software Development Toolkit, please contact either your supplier
or ARM Ltd. You can send feedback via e-mail to: [email protected].
In order to help us give a rapid and useful response, please give:
• details of which hosting and release of the ARM software tools you are using
• a small sample code fragment which reproduces the problem
• a clear explanation of what you expected to happen, and what actually happened
This chapter introduces the components of the ARM software development toolkit, and
takes you through compiling, linking and running a simple ARM program.
2.1 Introducing the Toolkit 2-2
2.2 The Hello World Example 2-4
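[Figure: components of the ARM Software Development Toolkit, including re-targetable libraries, utilities and full documentation]
[Figure: armcc compiles the C source module(s) (.c) into object files (.o), which are linked with the C library to produce an executable]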
#include <stdio.h>

int main(void)
{
    printf("Hello World\n");
    return 0;
}
Use the following command to compile and link the code:
armcc hello.c -o hello
The argument to the -o flag gives the name of the file which will hold the final output of the link
step. The linker is automatically called after compilation (because in this instance the -c flag has
not been specified). Note that flags are case-sensitive.
To execute the code under software emulation, enter:
armsd hello
at the system prompt. armsd will start, load in the file, and display the armsd: prompt to indicate
that it is waiting for a command. Type
go
and press Return. The debugger should respond with “Hello World”, followed by a message
indicating that the program terminated normally.
2.2.3 Debugging
Next, re-compile the program to include high-level debugging information, and use the debugger
to examine the code. Compile the program using:
armcc -g hello.c -o hello2
where the -g option instructs the compiler to add debug information.
Load hello2 into armsd:
armsd hello2
and set a breakpoint on the first statement in main by entering:
break main
at the armsd: prompt.
To execute the program up to the breakpoint, enter:
go
The debugger reports that it has stopped at breakpoint #1, and displays the source line. To view
the ARM registers, enter:
reg
To list the C source, enter:
type
This displays the whole source file. type can also display sections of code: for example if you
enter:
type 1,6
lines 1 to 6 of the source will be displayed.
main
MOV ip,sp
STMDB sp!,{fp,ip,lr,pc}
SUB fp,ip,#4
CMP sp,sl
BLMI __rt_stkovf_split_small
ADD a1,pc,#L000024-.-8
BL _printf
MOV a1,#0
LDMDB fp,{fp,sp,pc}
L000024
DCB 0x48,0x65,0x6c,0x6c
DCB 0x6f,0x20,0x77,0x6f
DCB 0x72,0x6c,0x64,0x0a
DCB 00,00,00,00
AREA |C$$data|,DATA
|x$dataseg|
EXPORT main
IMPORT _printf
IMPORT __rt_stkovf_split_small
END
Note Your code may differ slightly from the above, depending on the version of armcc in use.
This chapter describes the features of the ARM Processor which are of special interest to
the programmer.
3.1 Introduction 3-2
3.2 Memory Formats 3-3
3.3 Instruction Length 3-4
3.4 Data Types 3-4
3.5 Processor Modes 3-4
3.6 Processor States 3-5
3.7 The ARM Register Set 3-6
3.8 The Thumb Register Set 3-8
3.9 Program Status Registers 3-10
3.10 Exceptions 3-12
Architectures 1 and 2
The original architecture—Version 1—was implemented only by ARM1, and was never used in
a commercial product.
ARM Architecture Version 2 was the first to be used commercially. It extended Version 1 by
adding:
• the multiply and multiply accumulate instructions (MUL and MLA)
• support for coprocessors
• a further two banked registers for FIQ mode
Version 2a introduced an Atomic Load and Store instruction (SWP) and the use of Coprocessor
15 as a system control coprocessor. Versions 1, 2 and 2a all supported a 26-bit address bus and
combined in register 15 a 24-bit Program Counter (PC) and 8 bits of processor status.
Architecture 3
Version 3 of the architecture extended the addressing range to 32 bits, defining a 30-bit Program
Counter value in register 15. The status information was moved from register 15 to a new 11-bit
status register (the Current Program Status Register or CPSR). Version 3 also added two new
privileged processing modes (Version 2 has just three: Supervisor, IRQ and FIQ). The new
modes, Undefined and Abort, allowed coprocessor emulation and virtual memory support in
Supervisor mode. In addition, a further five status registers (the Saved Program Status Registers
or SPSRs) were defined, one for each privileged processor mode, in which the CPSR contents
are preserved when the corresponding exception is taken.
A variant of the Version 3 architecture, Version 3M, added multiply and multiply accumulate
instructions that produce a 64-bit result (SMULL, UMULL, SMLAL, UMLAL).
Architecture 4
Version 4 added halfword load and store instructions and sign extended byte and halfword load
instructions. It also reserved some SWI instruction space for architecturally defined operations,
added a new privileged processor mode called System (that uses the User mode registers) and
defined several new undefined instructions.
A variant of Version 4 called 4T incorporates an instruction decoder for a 16-bit subset of the ARM
instruction set (known as Thumb). Processors which have this decoder are referred to as being
Thumb-aware.
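Big-endian format (byte addresses held in each word):
Bits 31-24   Bits 23-16   Bits 15-8   Bits 7-0   Word address
8            9            10          11         8
4            5            6           7          4
0            1            2           3          0
Little-endian format (byte addresses held in each word):
Bits 31-24   Bits 23-16   Bits 15-8   Bits 7-0   Word address
11           10           9           8          8
7            6            5           4          4
3            2            1           0          0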
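The difference between the two memory formats can be sketched in C. This is only an illustration: the function name byte_in_word and its parameters are hypothetical, and the code simply models which byte of a 32-bit word a given byte address selects in each format.

```c
#include <stdint.h>

/* Return the byte found at byte address `addr`, given the value `word`
 * of the 32-bit word that contains it. `big_endian` selects the memory
 * format (non-zero = big-endian, zero = little-endian).
 * Hypothetical helper, for illustration only. */
uint8_t byte_in_word(uint32_t word, uint32_t addr, int big_endian)
{
    unsigned lane = addr & 3u;   /* byte position within the word */
    /* Little-endian: byte 0 is bits 7-0. Big-endian: byte 0 is bits 31-24. */
    unsigned shift = big_endian ? (3u - lane) * 8u : lane * 8u;
    return (uint8_t)(word >> shift);
}
```

For the word 0x11223344 stored at address 0, a byte load from address 0 gives 0x44 in little-endian format and 0x11 in big-endian format.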
Byte 8 bits
Halfword 16 bits
halfwords must be aligned to 2-byte boundaries (Architecture 4 only)
Word 32 bits
words must be aligned to four-byte boundaries
Load and store operations can transfer bytes, halfwords and words to and from memory.
Signed operands are in two’s complement format.
Mode changes may be made under software control or may be caused by external interrupts or
exception processing. Most application programs will execute in User mode. The other modes,
known as privileged modes, will be entered to service interrupts or exceptions or to access
protected resources: see ➲3.10 Exceptions on page 3-12.
1 On execution of the BX instruction with the state bit clear in the operand register.
2 On the processor taking an exception (IRQ, FIQ, RESET, UNDEF, ABORT, SWI etc.).
In this case, the PC is placed in the exception mode’s link register, and execution
commences at the exception’s vector address. See ➲3.10 Exceptions on page 3-12
and ➲Chapter 11, Exceptions.
Register 13 (also known as the Stack Pointer or SP) is banked across all modes to provide
a private Stack Pointer for each mode (except System mode which shares the
user mode R13).
Register 14 (also known as the Link Register or LR) is used as the subroutine return
address link register. R14 is also banked across all modes (except System
mode which shares the user mode R14).
When a Subroutine call (Branch and Link instruction) is executed, R14 is set
to the subroutine return address. The banked registers R14_SVC, R14_IRQ,
R14_FIQ, R14_ABORT and R14_UNDEF are used similarly to hold the return
address when exceptions occur (or a subroutine return address if subroutine
calls are executed within interrupt or exception routines). R14 may be treated
as a general-purpose register at all other times.
Register 15 is used specifically to hold the Program Counter (PC). When R15 is read, bits
[1:0] are zero and bits [31:2] contain the PC. When R15 is written bits[1:0] are
ignored and bits[31:2] are written to the PC. Depending on how it is used, the
value of the PC is either the address of the instruction plus n (where n is 8 for
ARM state and 4 for Thumb state) or is unpredictable.
CPSR is the Current Program Status Register. This is accessible in all processor
modes, and contains the condition code flags, interrupt enable flags, and
current processor mode. In Architecture 4T, the CPSR also holds the
processor state. See ➲3.9 Program Status Registers on page 3-10 for more
information.
[Table: ARM state registers by processor mode. R0-R7 and the PC are common to every mode; R8 and R9 are banked in FIQ mode (R8_FIQ, R9_FIQ)]
[Table: Thumb state registers by processor mode. R0-R7 and the PC are common to every mode]
The Thumb state registers relate to the ARM state registers in the following way:
• Thumb state R0-R7 and ARM state R0-R7 are identical
• Thumb state CPSR and SPSRs and ARM state CPSR and SPSRs are identical
• Thumb state SP maps onto ARM state R13
• Thumb state LR maps onto ARM state R14
• The Thumb state Program Counter maps onto the ARM state Program Counter (R15)
Thumb state              ARM state
R0-R7                    R0-R7                    (Lo registers)
                         R8-R12                   (Hi registers)
Stack Pointer (SP)       Stack Pointer (R13)
Link Register (LR)       Link Register (R14)
Program Counter (PC)     Program Counter (R15)
CPSR                     CPSR
SPSR                     SPSR
Figure 3-3: Mapping of Thumb state registers onto ARM state registers
Bit:    31   30   29   28   27-8   7    6    5    4    3    2    1    0
Field:  N    Z    C    V    -      I    F    T    M4   M3   M2   M1   M0
Interrupt disable bits The I and F bits are the interrupt disable bits. When set, these
disable the IRQ and FIQ interrupts respectively.
The state bit Bit T is the processor state bit. When the state bit is set to 0, this
indicates that the processor is in ARM state (i.e. executing 32-bit
ARM instructions). When it is set to 1, this indicates that the
processor is in Thumb state (executing 16-bit Thumb instructions).
The state bit is only implemented on Thumb-aware processors
(Architecture 4T). On non Thumb-aware processors the state bit will
always be zero.
The mode bits The M4, M3, M2, M1 and M0 bits (M[4:0]) are the mode bits. These
determine the mode in which the processor operates, as shown in
➲Table 3-4: The mode bits, below. Not all combinations of the mode
bits define a valid processor mode. Only those explicitly described
can be used.
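As a rough illustration of the fields described above, the following C sketch decodes a CPSR value. The macro and function names are hypothetical. The bit positions follow the layout given in this section; the mode encodings in the switch statement are the standard ARM values, quoted here as an assumption since they are not listed in this part of the text.

```c
#include <stdint.h>

/* Bit positions of the CPSR fields (names are illustrative). */
#define CPSR_N         (1u << 31)
#define CPSR_Z         (1u << 30)
#define CPSR_C         (1u << 29)
#define CPSR_V         (1u << 28)
#define CPSR_I         (1u << 7)   /* IRQ disable */
#define CPSR_F         (1u << 6)   /* FIQ disable */
#define CPSR_T         (1u << 5)   /* state bit (Architecture 4T) */
#define CPSR_MODE_MASK 0x1Fu       /* M[4:0] */

/* Name the processor mode selected by M[4:0].
 * Encodings assumed from the ARM architecture definition. */
const char *cpsr_mode_name(uint32_t cpsr)
{
    switch (cpsr & CPSR_MODE_MASK) {
    case 0x10: return "User";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "Supervisor";
    case 0x17: return "Abort";
    case 0x1B: return "Undefined";
    case 0x1F: return "System";
    default:   return "invalid";
    }
}

/* Non-zero if the processor is in Thumb state. */
int cpsr_in_thumb_state(uint32_t cpsr) { return (cpsr & CPSR_T) != 0; }

/* Non-zero if IRQ interrupts are disabled. */
int cpsr_irq_disabled(uint32_t cpsr) { return (cpsr & CPSR_I) != 0; }
```

For example, the value 0xD3 decodes as Supervisor mode, ARM state, with both IRQ and FIQ disabled.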
User mode and System mode do not have an SPSR, since they are not entered on any
exception and therefore do not need a register in which to preserve the CPSR. In User mode or
System mode, reads from the SPSR return an unpredictable value, and writes to the SPSR are
ignored.
Address 0x14 (omitted from the above table) holds the Address Exception vector, which is only
used when the processor is configured for a 26-bit address space.
1 copying the address of the next instruction into the appropriate Link Register
2 copying the CPSR into the appropriate SPSR
3 forcing the CPSR mode bits to a value corresponding to the exception
4 forcing the PC to fetch the next instruction from the relevant vector
It may also set the interrupt disable flags to prevent otherwise unmanageable nestings of
exceptions from taking place.
If the processor is Thumb-aware (Architecture 4T) and is operating in Thumb state, it will
automatically switch into ARM state.
1 moves the Link Register, minus an offset where appropriate, to the PC. The offset will
vary depending on the exception type.
2 copies the SPSR back to the CPSR
3 clears the interrupt disable flags, if they were set on entry
If the processor is Thumb-aware (Architecture 4T), it will restore the operating state (ARM or
Thumb) which was in force at the time the exception occurred.
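The entry and exit sequences listed above can be modelled in a few lines of C. This is a simplified sketch, not a description of any real implementation: the cpu_model structure and both function names are hypothetical, only one exception mode's banked registers are modelled, and the caller supplies the mode value, vector address and return offset.

```c
#include <stdint.h>

#define CPSR_I         0x80u   /* IRQ disable */
#define CPSR_F         0x40u   /* FIQ disable */
#define CPSR_T         0x20u   /* state bit */
#define CPSR_MODE_MASK 0x1Fu

/* Minimal processor model (hypothetical, for illustration). */
typedef struct {
    uint32_t pc;
    uint32_t cpsr;
    uint32_t lr_exc;    /* banked Link Register of the exception mode */
    uint32_t spsr_exc;  /* SPSR of the exception mode */
} cpu_model;

/* Exception entry: the four numbered steps, plus the optional interrupt
 * disabling and the switch to ARM state. */
void exception_enter(cpu_model *cpu, uint32_t return_addr, uint32_t mode,
                     uint32_t vector, uint32_t disable_mask)
{
    cpu->lr_exc = return_addr;                          /* step 1 */
    cpu->spsr_exc = cpu->cpsr;                          /* step 2 */
    cpu->cpsr = (cpu->cpsr & ~CPSR_MODE_MASK) | mode;   /* step 3 */
    cpu->cpsr |= disable_mask;        /* set I and/or F if requested */
    cpu->cpsr &= ~CPSR_T;             /* handlers start in ARM state */
    cpu->pc = vector;                                   /* step 4 */
}

/* Exception exit: restore the PC from the Link Register (minus an
 * exception-dependent offset) and the CPSR from the SPSR. Restoring the
 * CPSR also restores the interrupt flags and the ARM/Thumb state. */
void exception_return(cpu_model *cpu, uint32_t offset)
{
    cpu->pc = cpu->lr_exc - offset;
    cpu->cpsr = cpu->spsr_exc;
}
```

In this model, an IRQ taken in Thumb state from User mode leaves the handler running in ARM state with IRQs disabled, and exception_return puts the original CPSR back unchanged.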
This chapter explains the concepts behind, and the basics of programming in, ARM
assembly language.
4.1 Introduction 4-2
4.2 Structure of an Assembler Module 4-4
4.3 Conditional Execution 4-6
4.4 The ARM’s Barrel Shifter 4-10
4.5 Loading Constants Into Registers 4-14
4.6 Loading Addresses Into Registers 4-17
4.7 Jump Tables 4-21
4.8 Using the Load and Store Multiple Instructions 4-23
Load/store architecture
Only load and store instructions can access memory. This means that data processing
operations have to use intermediate registers, loading the data from memory beforehand and
storing it back again afterwards. However, this is not as inefficient as one might think. Most
operations actually require several instructions to carry out the required calculation, and each
instruction will run as fast as possible instead of being slowed down by external memory
accesses.
32-bit instructions
All instructions are of the same length, so the processor can fetch every instruction from memory
in one cycle. In addition, all instructions are stored word-aligned in memory, which means that
the bottom two bits of the program counter (r15) are always zero.
32-bit addresses
Processors implementing Versions 1 and 2 of the ARM Architecture only had a 26-bit addressing
range. All later ARM processors have a 32-bit addressing range. Those implementing ARM
Architectures 3 and 4 (but not 4T) have retained the ability to perform 26-bit addressing
for backwards compatibility.
37 registers
These comprise:
• 30 general purpose registers, 15 of which are accessible at any one time
• 6 status registers, of which either one or two are accessible at any one time
• a program counter
The banking of registers gives rapid context switching for dealing with exceptions and privileged
operations: see ➲Chapter 3, Programmer’s Model for a summary of the ARM register set.
Conditional execution
All instructions are executed conditionally on the state of the Current Program Status Register
(CPSR). Only data processing operations with the S bit set change the state of the current
program status register.
Co-processor instructions
These support a general way to extend the ARM’s architecture in a customer-specific manner.
Examples
EQ Z set (equal)
MI N set (negative)
VS V set (overflow)
1 2 CMP r0, r1 1
1 2 BLT less 3
1 1 BAL gcd 3
1 1 CMP r0, r1 1
1 1 BEQ end 3
Total = 13
1 2 CMP r0, r1 1
1 1 BNE gcd 3
1 1 CMP r0, r1 1
Total = 10
[Figure: LSL: CF <- Destination <- 0]
[Figure: LSR: 0 -> Destination -> CF]
ASR arithmetic shift right by n bits. The bits fed into the top end of the
operand are copies of the original top—or sign—bit.
(signed division by 2^n)
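The shift types can be modelled in C as follows. This is a sketch under stated assumptions: the function names are illustrative, shift amounts are assumed to be in the range 1 to 31, and the carry flag (CF) is not modelled. The arithmetic shift is written out explicitly because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* Logical shift left: zeros enter at the bottom. */
uint32_t lsl(uint32_t x, unsigned n) { return x << n; }

/* Logical shift right: zeros enter at the top. */
uint32_t lsr(uint32_t x, unsigned n) { return x >> n; }

/* Arithmetic shift right: copies of the original top (sign) bit
 * enter at the top. */
uint32_t asr(uint32_t x, unsigned n)
{
    uint32_t r = x >> n;
    if (x & 0x80000000u)
        r |= ~(0xFFFFFFFFu >> n);   /* fill the top n bits with ones */
    return r;
}

/* Rotate right: bits leaving the bottom re-enter at the top. */
uint32_t ror(uint32_t x, unsigned n) { return (x >> n) | (x << (32u - n)); }
```

For instance, asr(0xFFFFFFF0, 2) yields 0xFFFFFFFC, which is -16 divided by 4 in two's complement.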
The barrel shifter can be used in several of the ARM’s instruction classes. The options available
in each case are described below.
Literal pools
A literal pool is a portion of memory set aside for constants. By default, a literal pool is placed at
every END directive. However, for large programs, this may not be accessible throughout the
program (due to the LDR offset being a 12-bit value, giving a 4Kbyte range), so further literal
pools can be placed using the LTORG directive.
When an LDR Rd,= instruction needs to access a constant in a literal pool, the assembler
first checks previously encountered literal pools to see whether the desired constant is already
available and addressable. If so, it addresses the existing constant, otherwise it will attempt to
place the constant in the next available literal pool. If this is not addressable—because it does
not exist or is further than 4Kbytes away—an error will result, and an additional LTORG should
be placed close to (but after) the failed LDR Rd,= instruction.
To see how this works in practice, consider the following example. The instructions listed as
comments are the ARM instructions which are generated by the assembler:
END
Strings
The following program contains a function, strcopy, which copies a string from one memory
location to another. Two arguments are passed to the function: the address of the source string
and the address of the destination. The last character in the string is a zero, and will be copied.
AREA StrCopy, CODE
ENTRY ; mark the first instruction
main ADR r1, srcstr ; pointer to first string
ADR r0, dststr ; pointer to second string
BL strcopy ; copy the first into second
SWI 0x11 ; and exit
strcopy
LDRB r2, [r1], #1 ; load byte, then update address
STRB r2, [r0], #1 ; store byte, then update address
CMP r2, #0 ; check for zero terminator
BNE strcopy ; keep going if not
MOV pc, lr ; return
END
ADR is used to load the addresses of the two strings into registers r0 and r1, for passing to
strcopy. These two strings have been stored in memory using the assembler directive DCB
(Define Constant Byte). The first string is 33 bytes long, so the offset to the second is not
word-aligned; an ADR offset of this kind is limited to 255 bytes, and the second string is well within that range.
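For comparison, the copy loop in strcopy corresponds closely to the following C sketch. The C function is an illustrative equivalent rather than part of the example program; each statement in the loop is annotated with the instruction it mirrors.

```c
/* Copy a zero-terminated string from src to dst, including the
 * terminating zero (illustrative C equivalent of strcopy). */
void strcopy(char *dst, const char *src)
{
    char c;
    do {
        c = *src++;      /* LDRB r2, [r1], #1 */
        *dst++ = c;      /* STRB r2, [r0], #1 */
    } while (c != 0);    /* CMP r2, #0 / BNE strcopy */
}
```

As in the assembler version, the test for the terminator comes after the store, so the zero byte itself is copied.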
0 result = argument1
1 result = argument2
2 result = argument1 + argument2
3 result = argument1 – argument2
4 result = argument2 – argument1
Values outside this range will have the same effect as value 0.
AREA ArithGate, CODE ; name this block of code
ENTRY ; mark the first instruction to call
main MOV r0, #2 ; set up three parameters
MOV r1, #5
MOV r2, #15
BL arithfunc ; call the function
SWI 0x11 ; terminate
arithfunc ; label the function
CMP r0, #4 ; Treat code as unsigned integer
BHI ReturnA1 ; If code > 4 then return first argument
ADR r3, JumpTable ; Load address of the jump table
LDR pc,[r3,r0,LSL #2] ; Jump to appropriate routine
JumpTable
DCD ReturnA1
DCD ReturnA2
DCD DoAdd
DCD DoSub
DCD DoRsb
ReturnA1
MOV r0, r1 ; Operation 0, >4
MOV pc,lr
ReturnA2
MOV r0, r2 ; Operation 1
MOV pc,lr
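The same dispatch scheme can be sketched in C with an array of function pointers. The helper names below are hypothetical; the table order and the out-of-range behaviour follow the list of operation codes given above.

```c
typedef int (*arith_fn)(int a1, int a2);

static int return_a1(int a1, int a2) { (void)a2; return a1; }   /* code 0 */
static int return_a2(int a1, int a2) { (void)a1; return a2; }   /* code 1 */
static int do_add(int a1, int a2)    { return a1 + a2; }        /* code 2 */
static int do_sub(int a1, int a2)    { return a1 - a2; }        /* code 3 */
static int do_rsb(int a1, int a2)    { return a2 - a1; }        /* code 4 */

/* Select an operation by code; codes above 4 behave like code 0. */
int arithfunc(unsigned code, int a1, int a2)
{
    static const arith_fn jump_table[] = {
        return_a1, return_a2, do_add, do_sub, do_rsb
    };
    if (code > 4)                       /* CMP r0, #4 / BHI ReturnA1 */
        return return_a1(a1, a2);
    return jump_table[code](a1, a2);    /* LDR pc, [r3, r0, LSL #2] */
}
```

With the parameters used in main (code 2, arguments 5 and 15), arithfunc returns 20.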
This chapter presents some useful strategies for optimising the performance of your ARM
assembly language programs.
5.1 Introduction 5-2
5.2 Integer to String Conversion 5-3
5.3 Multiplication by a Constant 5-8
5.4 Division by a Constant 5-12
5.5 Using 16-bit Data on the ARM 5-17
5.6 Pseudo Random Number Generation 5-25
5.7 Loading a Word from an Unknown Alignment 5-27
5.8 Byte Order Reversal 5-28
5.9 ARM Assembly Programming Performance Issues 5-29
5.2.1 Algorithm
To convert a signed integer to a decimal string, generate a '-' and negate the number if it is
negative; then convert the remaining unsigned value.
To convert a given unsigned integer to a decimal string, divide it by 10, yielding a quotient and
a remainder. The remainder is in the range 0-9 and is used to create the last digit of the decimal
representation. If the quotient is non-zero it is dealt with in the same way as the original number,
creating the leading digits of the decimal representation; otherwise the process has finished.
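The algorithm can be sketched in C before looking at the assembler version. This is a minimal illustration: it uses the C division operator where the assembler code calls udiv10, and the name itoa_dec for the signed wrapper is hypothetical. Like the assembler routine, it returns a pointer just past the last character written and does not add a terminating zero.

```c
/* Write the decimal digits of n at buf; return a pointer to the
 * location immediately after the last digit written. */
char *utoa(char *buf, unsigned n)
{
    unsigned q = n / 10u;        /* quotient */
    unsigned r = n - q * 10u;    /* remainder, in the range 0-9 */
    if (q != 0)
        buf = utoa(buf, q);      /* leading digits first */
    *buf++ = (char)('0' + r);    /* then the last digit */
    return buf;
}

/* Signed variant: emit '-' and negate if negative, then convert the
 * remaining unsigned value. Negating in unsigned arithmetic also
 * handles the most negative int. */
char *itoa_dec(char *buf, int n)
{
    unsigned u = (unsigned)n;
    if (n < 0) {
        *buf++ = '-';
        u = 0u - u;
    }
    return utoa(buf, u);
}
```

The recursion depth equals the number of digits, so it never exceeds ten for a 32-bit value.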
5.2.2 Implementation
utoa
STMFD sp!, {v1, v2, lr} ; function entry - save some v-registers
; and the return address.
MOV v1, a1 ; preserve arguments over following
MOV v2, a2 ; function calls
MOV a1, a2
BL udiv10 ; a1 = a1 / 10
LDMFD sp!, {v1, v2, pc} ; function exit - restore and return
Explanation
On entry, a2 contains the unsigned integer to be converted and a1 addresses a buffer to hold the
character representation of it.
On exit, a1 points immediately after the last digit written.
Both the buffer pointer and the original number have to be saved across the call to udiv10. This
could be done by saving the values to memory. However, it turns out to be more efficient to use
two 'variable' registers, v1 and v2 (which, in turn, have to be saved to memory).
Because utoa calls other functions, it must save its return link address passed in lr. The
function therefore begins by stacking v1, v2 and lr using STMFD sp!, {v1,v2,lr}.
In the next block of code, a1 and a2 are saved (across the call to udiv10) in v1 and v2
respectively and the given number (a2) is moved to the first argument register (a1) before calling
udiv10 with a BL (Branch with Link) instruction.
On return from udiv10, 10 times the quotient is subtracted from the original number (preserved
in v2) by two SUB instructions. The remainder (in v2) is ready to be converted to character form
(by adding ASCII '0') and to be stored into the output buffer.
But first, utoa has to be called to convert the quotient, unless that is zero. The next four
instructions do this, comparing the quotient (in a1) with 0, moving the quotient to the second
argument register (a2) if not zero, moving the buffer pointer to the first argument/result register
(a1), and calling utoa if the quotient is not zero.
Note that the buffer pointer is moved to a1 unconditionally: if utoa is called recursively, a1 will
be updated but will still identify the next free buffer location; if utoa is not called recursively, the
next free buffer location is still needed in a1 by the following code which plants the remainder
digit and returns the updated buffer location (via a1).
The remainder (in a2) is converted to character form by adding '0' and is then stored in the
location addressed by a1. A post-incrementing STRB is used which stores the character and
increments the buffer pointer in a single instruction, leaving the result value in a1.
Finally, the function is exited by restoring the saved values of v1 and v2 from the stack, loading
the stacked link address into pc and popping the stack using a single multiple-load instruction:
LDMFD sp!, {v1,v2,pc}
1 Load the registers named in {...} in ascending register number order from memory at
[sp], [sp,4], [sp,8] ...
2 Add 4 * number-of-registers to sp.
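The two steps above, together with the matching store performed by STMFD on entry, can be modelled in C roughly as follows. The function names and the array-based register representation are assumptions made for the sketch; the stack is treated as an array of 32-bit words, so adding n to the pointer corresponds to adding 4 * n to sp.

```c
#include <stdint.h>

/* STMFD sp!, {...}: make room for n words below sp, then store the
 * registers in ascending order at [sp], [sp,4], [sp,8] ...
 * Returns the updated stack pointer. */
uint32_t *stmfd(uint32_t *sp, const uint32_t *regs, unsigned n)
{
    sp -= n;
    for (unsigned i = 0; i < n; i++)
        sp[i] = regs[i];
    return sp;
}

/* LDMFD sp!, {...}: load the registers in ascending order from
 * [sp], [sp,4], [sp,8] ..., then add 4 * n to sp.
 * Returns the updated stack pointer. */
uint32_t *ldmfd(uint32_t *sp, uint32_t *regs, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        regs[i] = sp[i];
    return sp + n;
}
```

A push followed by a pop of the same number of registers leaves the stack pointer where it started, which is what allows utoa to call itself safely.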
Many, if not most, register-save requirements in simple assembly language programs can be met
using this approach to stacks.
A more complete treatment of run-time stacks requires a discussion of:
• stack-limit checking (and extension)
• local variables and stack frames
In the utoa program, you must assume the stack is big enough to deal with the maximum depth
of recursion, and in practice this assumption will be valid. The biggest 32-bit unsigned integer is
about four billion, or ten decimal digits. This means that at most 10 x 3 registers x 4 bytes = 120 bytes have
to be stacked. Because the ARM Procedure Call Standard guarantees that there are at least 256
bytes of stack available when a function is called, and because we can guess (or know) that
udiv10 uses no stack space, we can be confident that utoa is quite safe if called by an
APCS-conforming caller such as a compiled C test harness.
The stacking technique illustrated here conforms to the ARM Procedure Call Standard only if the
function using it makes no function calls. Since utoa calls both udiv10 and itself, it really ought
to establish a proper stack frame—see ➲The ARM Software Development Toolkit Reference
Manual: Chapter 19, ARM Procedure Call Standard. If you really want to write functions that can
'plug and play together' you will have to follow the APCS exactly.
5.3.1 Introduction
The MUL instruction has the following syntax:
MUL Rd, Rm, Rs
The timing of this instruction depends on the value in Rs. The ARM6 datasheet specifies that for
Rs between 2^(2m-3) and 2^(2m-1)-1 inclusive, the multiply takes 1S + mI cycles.
Note ARM 7M family processors have a different implementation of MUL. This leads to a different
relationship of cycle counts to values of Rs.
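To make the ARM6 timing rule concrete, the following C sketch finds m for a given value of Rs by searching for the smallest m whose upper bound 2^(2m-1)-1 is not exceeded. The function name is hypothetical, and the upper limit of 16 on m is an assumption made for the sketch (the rule as quoted does not state a maximum).

```c
#include <stdint.h>

/* Number of I-cycles (m) taken by an ARM6 MUL for a given Rs, using
 * the rule: Rs between 2^(2m-3) and 2^(2m-1)-1 inclusive gives m.
 * Assumption: m is capped at 16. */
unsigned mul_i_cycles(uint32_t rs)
{
    unsigned m = 1;
    /* advance m while rs is above the upper bound for this m */
    while (m < 16 && rs > ((UINT32_C(1) << (2 * m - 1)) - 1))
        m++;
    return m;
}
```

Under this rule a multiplier of 5 gives m = 2, while any multiplier between 8192 and 32767 gives m = 8.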
When multiplying by a constant value, it is possible to replace the general multiply with a fixed
sequence of adds and subtracts that have the same effect. For instance, multiply by 5 could be
achieved using a single instruction:
ADD Rd, Rm, Rm, LSL #2 ; Rd = Rm + (Rm * 4) = Rm * 5
This is obviously better than the MUL version:
MOV Rs, #5
MUL Rd, Rm, Rs
The cost of the general multiply includes the instructions needed to load the constant into a
register (up to four may be needed, or an LDR from a literal pool) as well as the multiply itself.
EXPORT mulby105
mulby105
RSB a1,a1,a1,LSL #4
RSB a1,a1,a1,LSL #3
MOV pc,lr
AREA |C$$data|,DATA
|x$dataseg|
END
Notice that the compiler has found the short multiply-by-constant sequence.
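The two shift-and-add sequences shown in this section can be written out in C to check the arithmetic. The function mul5 is a hypothetical name for the single ADD instruction; mulby105 mirrors the two RSB instructions in the compiler output above.

```c
#include <stdint.h>

/* ADD Rd, Rm, Rm, LSL #2 : x + 4x = 5x */
uint32_t mul5(uint32_t x)
{
    return x + (x << 2);
}

/* Two reverse-subtracts, each computing (x << n) - x */
uint32_t mulby105(uint32_t x)
{
    x = (x << 4) - x;   /* RSB a1,a1,a1,LSL #4 : 16x - x = 15x        */
    x = (x << 3) - x;   /* RSB a1,a1,a1,LSL #3 : 8(15x) - 15x = 105x  */
    return x;
}
```

The factorisation 105 = 15 x 7 = (16 - 1) x (8 - 1) is what allows the constant to be handled in two instructions.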
Method Cycles
On ARM6 processors, the 2-bit Booth's Multiplier used by MUL takes a number of I-cycles
depending on the value in Rs (in this case m=8, as Rs lies between 8192 and 32767). In this
case, multiply-by-constant performs better. On the ARM60, an instruction fetch is an external
memory S-cycle, or a cache F-cycle (if there is a cache hit) on cached processors.
With slow memory systems and non-cached processors, I-cycles can be much faster than other
cycles because they are internal to the ARM core. So, the general multiply can sometimes be
the fastest option (for large constants where an efficient solution cannot be found). It should also
use less memory. If the load-a-constant stage could be moved outside a loop, this favours the
general multiply, as there is only the MUL to execute.
Multiply by constant 4S 4F
consider the underlined portion as a 0.32 fixed-point number (truncating any bits past the most
significant 32). 0.32 means 0 bits before the binary point and 32 after it.
== (x * (2^32/y)) / 2^32
2 10000000000000000000000000000000 #
3 01010101010101010101010101010101 *
4 01000000000000000000000000000000 #
5 00110011001100110011001100110011 *
6 00101010101010101010101010101010 *
7 00100100100100100100100100100100 *
8 00100000000000000000000000000000 #
9 00011100011100011100011100011100 *
10 00011001100110011001100110011001 *
11 00010111010001011101000101110100
12 00010101010101010101010101010101 *
13 00010011101100010011101100010011
14 00010010010010010010010010010010 *
15 00010001000100010001000100010001 *
16 00010000000000000000000000000000 #
17 00001111000011110000111100001111 *
18 00001110001110001110001110001110 *
19 00001101011110010100001101011110
20 00001100110011001100110011001100 *
21 00001100001100001100001100001100 *
22 00001011101000101110100010111010
23 00001011001000010110010000101100
24 00001010101010101010101010101010 *
25 00001010001111010111000010100011
The lines marked with a ’#’ are the special cases 2^n, which have already been dealt with.
The lines marked with a ’*’ have a simple repeating pattern.
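The binary patterns in the table are the 0.32 fixed-point values of 2^32/y, truncated to 32 bits. The following C sketch computes them with 64-bit arithmetic and uses them to approximate a division. Both function names are hypothetical and y is assumed to be at least 2. Because the reciprocal is truncated, the approximate quotient can be one less than the true quotient, so a real routine needs a final correction step.

```c
#include <stdint.h>

/* 0.32 fixed-point reciprocal of y: floor(2^32 / y), for y >= 2. */
uint32_t recip_0_32(uint32_t y)
{
    return (uint32_t)((UINT64_C(1) << 32) / y);
}

/* Approximate x / y as the top 32 bits of x * (2^32 / y).
 * The result is either the true quotient or one less. */
uint32_t approx_div(uint32_t x, uint32_t y)
{
    return (uint32_t)(((uint64_t)x * recip_0_32(y)) >> 32);
}
```

For y = 10 the reciprocal is 0x19999999, matching the table entry, and approx_div(100, 10) returns 9 rather than 10, which shows why the correction is needed.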
n   m   2^n + 2^m      n   m   2^n - 2^m
1   0   3              1   0   1
2   0   5              2   1   2
2   1   6              2   0   3
3   0   9              3   2   4
3   1   10             3   1   6
3   2   12             3   0   7
4   0   17             4   3   8
4   1   18             4   2   12
4   2   20             4   1   14
4   3   24             4   0   15
5   0   33             5   4   16
5   1   34             5   3   24
5   2   36             5   2   28
5   3   40             5   1   30
5   4   48             5   0   31
For the repeating patterns, it is a relatively easy matter to calculate the product by using a
multiply-by-constant method.
The result can be calculated in a small number of instructions by taking advantage of the
repetition in the pattern. This corresponds to the optimal solution in the multiply-by-constant
problem (see ➲5.3 Multiplication by a Constant on page 5-8).
The actual multiply is slightly unusual due to the need to return the top 32 bits of the 64-bit result.
It is more efficient to calculate just the top 32 bits. This can be achieved by modifying the
multiply-by-constant sequence so that the input value is shifted right rather than left.
Consider this fragment of the divide-by-ten code (x is the input dividend as used in the above
equations):
SUB a1, x, x, lsr #2 ; a1 = x*%0.11000000000000000000000000000000
ADD a1, a1, a1, lsr #4 ; a1 = x*%0.11001100000000000000000000000000
ADD a1, a1, a1, lsr #8 ; a1 = x*%0.11001100110011000000000000000000
ADD a1, a1, a1, lsr #16 ; a1 = x*%0.11001100110011001100110011001100
MOV a1, a1, lsr #3 ; a1 = x*%0.00011001100110011001100110011001
The SUB calculates (for example):
a1 = x - x/4
= x - x*%0.01
= x*%0.11
Therefore, just five instructions are needed to perform the multiply.
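The five-instruction fragment translates directly into C. The sketch below adds a final correction step, which is an assumption about how the complete routine finishes: because each right shift discards bits, the estimate can be one less than the true quotient, so the remainder is checked and the quotient adjusted if necessary. The optional rem parameter is also an addition for illustration.

```c
#include <stdint.h>

/* Divide x by 10 using shifts and adds. If rem is non-null, the
 * remainder is stored through it. */
uint32_t udiv10(uint32_t x, uint32_t *rem)
{
    uint32_t q = x - (x >> 2);   /* SUB a1, x, x, lsr #2    : x * 0.11 (binary) */
    q += q >> 4;                 /* ADD a1, a1, a1, lsr #4  */
    q += q >> 8;                 /* ADD a1, a1, a1, lsr #8  */
    q += q >> 16;                /* ADD a1, a1, a1, lsr #16 : about x * 0.8 */
    q >>= 3;                     /* MOV a1, a1, lsr #3      : about x / 10  */

    uint32_t r = x - q * 10u;    /* remainder for the estimated quotient */
    if (r >= 10u) {              /* estimate was one too small */
        q += 1u;
        r -= 10u;
    }
    if (rem)
        *rem = r;
    return q;
}
```

For example, an input of 20 gives an estimate of 1 with a remainder of 10, which the correction step turns into a quotient of 2 and a remainder of 0.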
Note This program uses ANSI control codes, so should work on most terminal types under Unix and
also on the PC. It will not work on HP-UX if the terminal emulator used is an HPTERM.
An XTERM should be used to run this program on the HP-UX.
5.9.5 Multiplication
Be aware of the time taken by the ARM multiply and multiply accumulate instructions.
When multiplying by a constant value note that using the multiply instruction is often not the
optimal solution. The issues involved are discussed in ➲5.3 Multiplication by a Constant on
page 5-8.
2 98 100
3 66 134
4 48 150
5 38 160
10 18 180
100 0 198
This chapter explains some techniques for optimising the output of the ARM C compiler, and
gives details of how to build code for deeply embedded applications.
6.1 Introduction 6-2
6.2 Writing Efficient C for the ARM 6-3
6.3 Improving Code Size and Performance 6-11
6.4 Choosing a Division Implementation 6-14
6.5 Using the C Library in Deeply Embedded Applications 6-17
Leaf functions
In 'typical' programs, about half of all function calls made are to leaf functions (a leaf function is
one that makes no calls from within its body).
Often, a leaf function is rather simple. On the ARM, if it is simple enough to compile using just
five registers (a1-a4 and ip), it will carry no function entry or exit overhead. A surprising
proportion of useful leaf functions can be compiled within this constraint.
Extern variables are fundamentally more expensive: each has its own base pointer. Thus each
access to an extern is likely to cost two LDR instructions or an LDR and an STR. It is much less
likely that a pointer to an extern will become a global CSE—and almost certain that there cannot
be several such CSEs—so if a function accesses lots of extern variables, it is bound to incur
significant access costs.
A further cost occurs when a function is called: the compiler has to assume (in the absence of
inter-procedural data flow analysis) that any non-const static or extern variable could be
side-effected by the call. This severely limits the scope across which the value of a static or extern
variable can be held in a register.
Sometimes a programmer can do better than a compiler could do, even a compiler that did
interprocedural data flow analysis. An example in C is given by the standard streams: stdin,
stdout and stderr. These are not pointers to const objects (the underlying FILE structs are
modified by I/O operations), nor are they necessarily const pointers (they may be assignable in
some implementations). Nonetheless, a function can almost always safely slave a reference to
a stream in a local FILE * variable.
It is a common practice to mimic the standard streams in applications. Consider, for example,
the shape of a typical non-leaf printing function:
extern FILE *out; extern FILE *out;
/* the output stream */ /* the output stream */
The compiler handles this code fragment well, generating 276 bytes of code and string literals.
But we could do better. If performance is not critical (and in disassembly it never is) we could
look up the code in a table, using something like:
char *cond_of_instr(unsigned instr)
{
static struct {char name[3]; unsigned code;}
conds[] = {
"EQ", 0x00000000,
"NE", 0x10000000,
....
"NV", 0xf0000000,
};
int j;
for (j = 0; j < sizeof(conds)/sizeof(conds[0]); ++j)
if ((instr & 0xf0000000) == conds[j].code)
return conds[j].name;
return "";
}
This fragment compiles to 68 bytes of code and 128 bytes of table data. Already this is a 30%
improvement on the switch() case, but this schema has other advantages: it copes well with
a random code-to-string mapping and, if the mapping is not random, admits further optimisation.
For example, if the code is stored in a byte (char) instead of an unsigned and the comparison is
with (instr >> 28) rather than (instr & 0xF0000000) then only 60 bytes of code and 64
bytes of data are generated for a total of 124 bytes.
Another advantage of table lookup is that it is possible to share the same table between a
disassembler and an assembler—the assembler looks up the mnemonic to obtain the code
value, rather than the code value to obtain the mnemonic. Where performance is not critical, the
symmetric property of lookup tables can sometimes be exploited to yield significant space
savings.
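The sharing described above can be sketched as follows. This is an illustration only: code_of_cond is a hypothetical name, the code is stored in a byte and compared against (instr >> 28) as suggested earlier, and the table is written out in full using the standard ARM condition field values.

```c
#include <stddef.h>
#include <string.h>

/* One table serves both directions; 'code' is the 4-bit condition field. */
static const struct { char name[3]; unsigned char code; } conds[] = {
    {"EQ", 0x0}, {"NE", 0x1}, {"CS", 0x2}, {"CC", 0x3},
    {"MI", 0x4}, {"PL", 0x5}, {"VS", 0x6}, {"VC", 0x7},
    {"HI", 0x8}, {"LS", 0x9}, {"GE", 0xa}, {"LT", 0xb},
    {"GT", 0xc}, {"LE", 0xd}, {"AL", 0xe}, {"NV", 0xf},
};
#define NCONDS (sizeof(conds) / sizeof(conds[0]))

/* Disassembler direction: instruction word -> mnemonic. */
const char *cond_of_instr(unsigned long instr)
{
    size_t j;
    for (j = 0; j < NCONDS; ++j)
        if (((instr >> 28) & 0xf) == conds[j].code)
            return conds[j].name;
    return "";
}

/* Assembler direction (hypothetical helper): mnemonic -> condition field,
 * or -1 if the mnemonic is not in the table. */
int code_of_cond(const char *name)
{
    size_t j;
    for (j = 0; j < NCONDS; ++j)
        if (strcmp(name, conds[j].name) == 0)
            return conds[j].code;
    return -1;
}
```

Both lookups are linear searches over the same 16 entries, so the second direction costs only its own small loop and no extra data.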
Finally, by exploiting the denseness of the indexing and the uniformity of the returned value it is
possible to do better again, both in size and performance, by direct indexing:
char *cond_of_instr(unsigned instr)
{
return "\
EQ\0\0NE\0\0CS\0\0CC\0\0MI\0\0PL\0\0VS\0\0VC\0\0\
HI\0\0LS\0\0GE\0\0LT\0\0GT\0\0LE\0\0AL\0\0NV" + (instr >> 28)*4;
}
-g
-g severely impacts the size and performance of generated code, since it turns off all compiler
optimisations. You should use it only when actually debugging your code, and it should never
be enabled for a release build.
-Ospace -Otime
These options are complementary:
-zpj0
This disables crossjump optimisation. Crossjump optimisation is a space-saving optimisation
whereby common sections of code at the end of each element in a switch() statement are
identified and commoned together, each occurrence of the code section being replaced with a
branch to the commoned code section.
However, this optimisation can lead to extra branches being executed which may decrease
performance, especially in interpreter-like applications which typically have large switch()
statements.
Use the -zpj0 option to disable this optimisation if you have a time-critical switch()
statement.
Alternatively, you can use:
#pragma nooptimise_crossjump
-apcs /nofp
By default, armcc generates code which uses a dedicated frame pointer register. This register
holds a pointer to the stack frame and is used by the generated code to access a function’s
arguments.
A dedicated frame pointer can make the code slightly larger. By specifying -apcs/nofp on the
command line, you can force armcc to generate code which does not use a frame pointer, but
which accesses the function’s arguments via offsets from the stack pointer instead.
Note tcc never uses a frame pointer, so this option does not apply when compiling Thumb code.
-apcs /noswst
By default, armcc generates code which checks that the stack has not overflowed at the head of
each function. This code can contribute several percent to the code size, so it may be worthwhile
disabling this option with -apcs /noswst.
Be careful, however, to ensure that your program’s stack is not going to overflow, or that you have
an alternative stack checking mechanism such as an MMU-based check.
Note tcc has stack checking disabled by default.
-ARM7T
This option applies to armcc only.
By default, armcc generates code which is suitable for running on processors that implement
ARM Architecture 3 (e.g. ARM6, ARM7). If you know that the code is going to be run on a
processor with halfword support, you can use the -ARM7T option to instruct the compiler to use
the ARM Architecture 4 halfword and signed byte instructions. This can result in significantly
improved code density and performance when accessing 16-bit data.
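As an illustration of the kind of code that benefits, consider a loop over 16-bit data. The function below is a hypothetical example, not taken from the Toolkit. With halfword support each element can typically be fetched with a single signed halfword load; without it, the compiler has to synthesise the access from other loads and shifts (the exact sequence depends on the compiler and is not shown here).

```c
#include <stddef.h>

/* Sums an array of 16-bit samples into a wider total. */
long sum_samples(const short *samples, size_t count)
{
    long total = 0;
    size_t i;

    for (i = 0; i < count; ++i)
        total += samples[i];    /* one signed 16-bit load per element */
    return total;
}
```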
-pcc
The code generated by the compiler can be slightly larger when compiling with the -pcc switch.
This is because of extra restrictions on the C language in the ANSI standard which the compiler
can take advantage of when compiling in ANSI mode.
Note If your code will compile in ANSI mode, do not use the -pcc switch.
Note If armlib.32l is not in the current directory, you will need to specify the full pathname.
In this instance, the linker will produce the following output:
ARM Linker: Unreferenced AREA unused.o(C$$code) (file unused.o)
omitted from output.
Unrolled divide
The default divide implementation 'unrolled' is fast, but occupies a total of 416 bytes
(55 instructions for the signed version plus 49 instructions for the unsigned version). This is
an appropriate default for most Toolkit users who are interested in obtaining maximum
performance.
Small divide
Alternatively you can change this file to select 'small' divide which is more compact at
136 bytes (20 instructions for signed plus 14 instructions for unsigned) but somewhat slower,
as there is considerable looping overhead.
For a comparison of the speed difference between these two routines, see Table 6-1: Signed
division example timings on page 6-15 (the speed of divide is data-dependent).
Calculation   Unrolled   Small
0/100         22         19
9/7           22         19
4096/2        70         136
1000000/3     99         240
If you have a specific requirement, you can modify the supplied routines to suit your application.
For instance, you could write an unrolled-2-times version or you could combine the signed and
unsigned versions to save more space.
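To make the size/speed trade-off concrete, here is a minimal C sketch of a looped shift-and-subtract divide. It is not the library routine (which is written in assembly language); udiv_loop is a hypothetical name and the divisor is assumed to be non-zero. A 'small' style routine keeps the compare/subtract step in a loop like this one, whereas an unrolled routine repeats the step inline and so avoids the loop overhead at the cost of code size.

```c
#include <stdint.h>

/* Unsigned 32-bit divide by shift-and-subtract.
 * The divisor must be non-zero (a zero divisor would loop forever). */
uint32_t udiv_loop(uint32_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint32_t quotient = 0;
    uint32_t bit = 1;

    /* Shift the divisor up until it is no longer below the dividend,
     * stopping before its top bit would be shifted out. */
    while (divisor < dividend && !(divisor & 0x80000000u)) {
        divisor <<= 1;
        bit <<= 1;
    }

    /* One compare/subtract step per candidate quotient bit. */
    while (bit != 0) {
        if (dividend >= divisor) {
            dividend -= divisor;
            quotient |= bit;
        }
        divisor >>= 1;
        bit >>= 1;
    }

    *remainder = dividend;
    return quotient;
}
```

The number of loop iterations depends on the operands, which is one reason the timing of a divide is data-dependent.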
6.4.3 Summary
The standard division routine used by the C library can be selected by using the options file in
the C library build area. If the supplied variants are not suitable, you can write your own.
For real-time applications, the maximum division time must be as short as possible to ensure that
the calculation can complete in time. In this case, the functions __rt_sdiv64by32 and
__rt_sdiv32by16 are useful.
caller's pc = 0XE92DE000
returning...
C$$data 4
C$$code$$__rt_fpavailable 8 __rt_fpavailable
C$$code$$__rt_alloc 68 __rt_alloc
C$$code$$__rt_stkovf 76 __rt_stkovf_split_*
TOTAL 736
If floating point support is definitely not required, then the EnsureNoFPSupport variable can be
set to {TRUE}, and some extra space will be saved. After making any modifications to
rtstand.s, the size of the various areas can be found using one of the following commands:
decaof -b rtstand.o
decaof -q rtstand.o
From the above table it is clear that for many applications the standalone runtime library will be
roughly 0.5KB.
This chapter explains how to write programs that contain routines written both in C and
assembly language, and how to use the ARM Procedure Call Standard to pass arguments
and results between them.
7.1 Introduction
7.2 Using the ARM Procedure Call Standard
7.3 Passing and Returning Structures
ASM source module (.s) -> armasm -> .o
C source module(s) (.c) -> armcc -c -> .o
4 v1 register variable
5 v2 register variable
6 v3 register variable
7 v4 register variable
8 v5 register variable
Simplistically:
a1-a4, f0-f3 are used to pass arguments to functions. a1 is also used to return
integer results, and f0 to return FP results. These registers can be
corrupted by a called function.
v1-v5, f4-f7 are used as register variables. They must be preserved by called
functions.
union polymorphic_ptr
{ struct A *a;
struct B *b;
int *i;
}
whereas the structure used in the previous example is not:
struct { char ch1, ch2; }
An integer-like structure has its contents returned in a1. This means that a1 is not needed to pass
a pointer to a result struct in memory, and is instead used to pass the first argument.
mul64
MOV ip, a1, LSR #16 ; ip = a_hi
MOV a4, a2, LSR #16 ; a4 = b_hi
BIC a1, a1, ip, LSL #16 ; a1 = a_lo
BIC a2, a2, a4, LSL #16 ; a2 = b_lo
MUL a3, a1, a2 ; a3 = a_lo * b_lo (m_lo)
MUL a2, ip, a2 ; a2 = a_hi * b_lo (m_mid1)
MUL a1, a4, a1 ; a1 = a_lo * b_hi (m_mid2)
MUL a4, ip, a4 ; a4 = a_hi * b_hi (m_hi)
ADDS ip, a2, a1 ; ip = m_mid1 + m_mid2 (m_mid)
ADDCS a4, a4, #&10000 ; a4 = m_hi + carry (m_hi')
ADDS a1, a3, ip, LSL #16 ; a1 = m_lo + (m_mid<<16)
ADC a2, a4, ip, LSR #16 ; a2 = m_hi' + (m_mid>>16) + carry
MOV pc, lr
This code is fine for use with assembly language modules, but in order to use it from C we need
to tell the compiler that this routine returns its 64-bit result in registers. This can be done by
making the following declarations in a header file:
typedef struct int64_struct
{ unsigned int lo;
unsigned int hi;
} int64;
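For checking the assembly routine against known values, the same algorithm can be modelled in portable C. This is a sketch only: mul64_model and u64_pair are hypothetical names, and the model returns its result in an ordinary struct rather than in registers. Each statement is annotated with the instruction it corresponds to.

```c
#include <stdint.h>

typedef struct { uint32_t lo; uint32_t hi; } u64_pair;  /* hypothetical */

/* C model of mul64: 32 x 32 -> 64 bit multiply from four 16 x 16 products. */
u64_pair mul64_model(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xffffu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xffffu;

    uint32_t m_lo   = a_lo * b_lo;
    uint32_t m_mid1 = a_hi * b_lo;
    uint32_t m_mid2 = a_lo * b_hi;
    uint32_t m_hi   = a_hi * b_hi;
    uint32_t m_mid, carry;
    u64_pair r;

    m_mid = m_mid1 + m_mid2;        /* ADDS: the sum may wrap */
    if (m_mid < m_mid1)             /* ADDCS: a carry out of the middle sum */
        m_hi += 0x10000u;           /* is worth bit 16 of the high word */

    r.lo = m_lo + (m_mid << 16);    /* ADDS: low word */
    carry = (r.lo < m_lo);          /* carry out of the low word */
    r.hi = m_hi + (m_mid >> 16) + carry;    /* ADC: high word */
    return r;
}
```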
This chapter explains how to generate programs which use overlays, and the linker’s scatter
loading facility, and describes the use of the ARM shared libraries.
8.1 Using Overlays
8.2 ARM Shared Libraries
-OVERLAY causes the linker to compute the size of the overlay segments automatically,
and to abut distinct memory partitions.
The linker generates a set of files in a directory specified by the -OUTPUT
option. Overlay segments can be forced to specific memory addresses in a
simple form of scatter loading. However, PCIT entries will be generated even
for non-clashing overlays, producing extra overheads in terms of code size and
execution speed. For this reason the -OVERLAY option is not recommended
for generating scatter loaded images, and -SCATTER should be used instead.
-SCATTER instructs the linker to create either an extended AIF file or a directory of files.
The overlays will be placed into load regions and the linker will add information
to the executable to allow the overlay manager to copy the overlay segments
from the correct load region. The directory of output files will be suitable for use
in a ROM-based system.
All the overlay segments must have an execution address specified in the
scatter load description file. The linker will not place overlay segments
automatically. The scatter loading scheme does not support dynamic overlays.
With scatter loading, PCIT information is not generated for execution regions
not marked as overlays, so these regions do not have any overlay overhead
associated with them.
LDMIA r0!,{r2,r3,r4}
CMP r8,r3
SUBNE r1,r1,#1
BNE search_loop
TEQ r0,#0
MOVEQ r0,#2
BEQ SevereErrorHandler
initLoop
LDR r1,[r0],#4
CMP r1,#0
MOVEQ pc,lr
LDMIA r0!,{r2,r3}
CMP r1,#16
BLT copyWords
copy4Words
LDMIA r3!,{r4,r5,r6,r7}
STMIA r2!,{r4,r5,r6,r7}
SUBS r1,r1,#16
BGT copy4Words
BEQ initLoop
copyWords
SUBS r1,r1,#8
LDMIAGE r3!,{r4,r5}
STMIAGE r2!,{r4,r5}
BEQ initLoop
LDR r4,[r3]
STR r4,[r2]
B initLoop
;
; A couple of MACROS to make the table entries easier to add.
; The execname parameter is the name of the execution region to initialise or copy.
;
MACRO
InitEntry $execname
LCLS lensym
LCLS basesym
LCLS loadsym
LCLS namecp
namecp SETS "$execname"
lensym SETS "|Image$$":CC:namecp:CC:"$$Length|"
basesym SETS "|Image$$":CC:namecp:CC:"$$Base|"
loadsym SETS "|Load$$":CC:namecp:CC:"$$Base|"
IMPORT $lensym
IMPORT $basesym
IMPORT $loadsym
DCD $lensym
DCD $basesym
DCD $loadsym
MEND
ziTable
ZIEntry root ; Zero initialised data from the root read/write
; region
DCD 0
InitTable
InitEntry root ; Initialised data from the root read/write region
DCD 0
END
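The copy performed by initLoop over InitTable can be modelled in C as follows. This is a sketch under stated assumptions: init_entry and copy_regions are hypothetical names, the table is terminated here by a whole entry with a zero length (the assembly table ends with a single zero word), and memcpy stands in for the word-by-word LDM/STM copy.

```c
#include <stddef.h>
#include <string.h>

/* One table entry, mirroring the three DCDs emitted by InitEntry. */
typedef struct {
    size_t      length;   /* bytes to copy; 0 terminates the table */
    void       *base;     /* execution address (destination)       */
    const void *load;     /* load address (source)                 */
} init_entry;

/* Copies each region from its load address to its execution address. */
void copy_regions(const init_entry *table)
{
    for (; table->length != 0; ++table)
        memcpy(table->base, table->load, table->length);
}
```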
Prerequisites
You can make a shared library from any number of object files, including reentrant stubs of other
shared libraries, provided that:
• each object file conforms to a reentrant version of the ARM Procedure Call Standard
and each code area has the REENTRANT attribute
• there are no unresolved references resulting from the linking together of the component
objects
An immediate consequence of the second rule is that it is impossible to make two shared libraries
which refer to one another: to make the second library and its stub would require the stub of the
first, but to make the first and its stub would require the stub of the second.
The first rule is not 100% necessary, and is difficult to enforce. The linker warns you if it finds a
non-reentrant code area in the list of objects to be linked into a shared library, but will build the
library and its matching stub anyway. You must decide whether the warning is real, or merely a
formality.
This chapter describes how to construct simple ROM images which contain C code.
9.1 Introduction
9.2 Application Startup
9.3 Using the C Library in ROM
9.4 Troubleshooting Hints and Tips
; Now fall into the LDR PC, Reset_Addr instruction which will continue
; execution at 'Reset_Handler'
Vector_Init_Block
LDR PC, Reset_Addr
LDR PC, Undefined_Addr
LDR PC, SWI_Addr
LDR PC, Prefetch_Addr
LDR PC, Abort_Addr
NOP
LDR PC, IRQ_Addr
LDR PC, FIQ_Addr
; Set up the SVC stack pointer last and return to SVC mode
MOV R0, #Mode_SVC:OR:I_Bit:OR:F_Bit ; No interrupts
MSR CPSR, R0
LDR R13, =SVC_Stack
IMPORT C_Entry
[ :DEF:THUMB
ORR lr, pc, #1
BX lr
CODE16 ; Next instruction will be Thumb
]
BL C_Entry
; In a real application we wouldn't normally expect to return, however
; this example does so the debug monitor swi SWI_Exit is used to halt the
; application.
SWI SWI_Exit
END
--- ex.c -----------------------------------------------------------
/* We use the following Debug Monitor SWIs to write things out
* in this example
*/
extern __swi(0) WriteC(char c); /* Write a character */
extern __swi(2) Write0(char *s); /* Write a string */
void C_Entry(void)
{
if (rom_data_base == ram_data_base) {
Write0("Warning: Image has been linked as an application. "
"To link as a ROM image\r\n");
Write0(" link with the options -RO <rom-base> -RW <ram-base>\r\n");
}
1 Compile the C file ex.c with the following command. The compiler will generate one
warning which may be ignored.
armcc -c -fc -apcs 3/noswst/nofp ex.c
-apcs 3/noswst Tells the assembler that this code is only suitable for
use with other code which does not have software stack
checking. Code which uses software stack checking
cannot generally be mixed with code which does not.
The assembler will mark the object file as containing
code which does not perform software stack checking
so that the linker can give an error if it is mixed with
code which does.
3 Build the ROM image using armlink.
armlink -o ex1_rom -Bin -RO 0xf0000000 -RW 0x10000000 -First
init.o(Init) -Remove -NoZeroPad -Map -Info Sizes init.o ex.o
void C_Entry(void)
{
char s[80];
if (rom_data_base == ram_data_base) {
Write0("Warning: Image has been linked as an application. "
"To link as a ROM image\r\n");
Write0(" link with the options -RO <rom-base> -RW <ram-base>\r\n");
}
If armlib.16l is not in the current directory, you will need to specify the directory on the
command line.
This will produce the following output:
object file     code   inline   inline   'const'      RW   0-Init    debug
                size     data  strings      data    data     data     data
init.o           236        0        0         0       0        0        0
sprintf.o         40       12      184         0       0        0        0
<ctype.h>
You must call the _ctype_init() function in your initialisation if you wish to use any of the
ctype.h functions.
isalnum isalpha iscntrl isdigit isgraph islower isprint ispunct
isspace tolower toupper isxdigit
<math.h>
acos asin atan atan2 cos sin tan cosh
sinh tanh exp frexp ldexp log log10 modf
pow sqrt ceil fabs floor fmod
<setjmp.h>
setjmp longjmp
<stdlib.h>
atof atoi atol strtod strtol strtoul rand srand
bsearch qsort abs div labs ldiv mblen mbtowc
wctomb mbstowcs wcstombs
<locale.h>
You must call the _locale_init() function in your initialisation if you wish to use any of the
locale.h functions.
setlocale localeconv
<time.h>
mktime asctime ctime gmtime difftime localtime strftime
Cause
You have compiled your C code with stack checking enabled. The C compiler generates code
which calls one of the above functions when stack overflow is detected.
Solution
This problem may be fixed in one of the following ways:
• Recompile your C code with the -apcs 3/noswst option to disable stack checking.
• Link with a C library which provides support for stack limit checking (of the pre-built C
libraries provided with the release, only armlib_n.32x does not support stack limit
checking).
Note: This is usually only possible in an application environment, as the C library's stack
overflow handling code relies heavily on the application environment.
• Write a pair of functions __rt_stkovf_split_big and
__rt_stkovf_split_small. This will usually just generate an error for debugging
purposes.
The code might look similar to the following:
EXPORT __rt_stkovf_split_big
EXPORT __rt_stkovf_split_small
__rt_stkovf_split_big
__rt_stkovf_split_small
ADR R0, stack_overflow_message
SWI Debug_Message ; System dependent SWI to
; write a debugging message
B . ; and loop forever.
stack_overflow_message
DCB "Stack overflow", 0
Cause
Parts of your code have been compiled or assembled with software stack checking enabled and
parts without. Alternatively, you have linked with a library which has software stack checking
enabled whereas your code has it disabled or vice versa.
Solution
Make sure all your code is compiled/assembled with either -apcs 3/noswst or -apcs 3/swst.
Link with the correct library. Of the pre-built libraries provided with the release, the libraries
armlib.16x and armlib_i.32x have stack checking disabled; all others have stack
checking enabled.
Problem
The linker reports __main as being undefined.
Cause
When the compiler compiles the function main it generates a reference to the symbol __main
to force the linker to include the basic C run time system from the C library. If you are not linking
with a C library and have a function main you may get this error.
Solution
This problem may be fixed in one of the following ways:
• If the main function is only used when building an application version of your ROM
image for debugging purposes, you should comment it out with a #ifdef when
building a ROM image.
Usually when building a ROM image you will call the C entry point something other than
main such as C_Entry or ROM_Entry to avoid confusion.
Problem
The linker reports a number of undefined symbols of the form:
__rt_... or __16__rt_...
Cause
These are run time support functions which are called by code generated by the compiler to
perform tasks which cannot be performed simply in ARM or Thumb code such as integer division
or floating point operations.
For example, the following code will generate a call to the run time support function __rt_sdiv
to perform a division.
int test(int a, int b)
{
return a/b;
}
Solution
You should assemble file examples/clstand/rtstand.s and link this in. A Thumb version
of this file is available in the thumb subdirectory.
Note The divide routines in rtstand.s use Demon SWIs to report division by zero. You may need to
edit rtstand.s to change these SWIs if your system does not support them.
Problem
The linker produces the error message:
ARM Linker: (Fatal) No entry point for image.
ARM Linker: garbage output file aif removed
Cause
You have not defined an entry point. You must define the entry point even if the entry point is the
start of the ROM image.
Problem
The compiler produces errors of the form:
Serious error: illegal character (0x24 = '$') in source
Cause
The $ character is not allowed in variable names as standard by ANSI although many compilers
allow this.
Solution
Use the -fc option on the C compiler to tell it to allow $ in variable names.
Problem
When loading an image into the ARMulator and trying to run it, the following error occurs:
*** Error: Can't go
Cause
armsd does not know the location at which it should begin executing your image.
Solution
Tell armsd where to start executing using the command:
pc = <address in hex>
Re-enter the go command.
If your image is to be executed from its base address, the address you specify above should be
the same address as that used in the getfile command with which you loaded the image.
Problem
The image is bigger than expected (bigger than the size given by -info sizes).
This problem may also be caused by the image having a large section of zeros on the end of it.
Cause
By default, when generating a plain binary image, the linker expands zero initialised areas with
zero bytes in the image.
The area will then be zero initialised when the image is loaded directly into memory.
Solution
Use the -NoZeroPad option to tell the linker not to expand the zero init area.
Causes
There are a number of possible causes:
• If the hex words look as though they are reversed instruction words, armsd may be
using the wrong endianness.
Solution
Reconfigure your copy of armsd to the opposite endianness and try again.
• You may have linked it as an application image instead of a plain binary image. If the
disassembly looks something like the following, then this is the case.
0x10000000: 0xe1a00000 .... : nop
0x10000004: 0xe1a00000 .... : nop
0x10000008: 0xeb00000c .... : bl 0x10000040
0x1000000c: 0xeb00001b .... : bl 0x10000080
0x10000010: 0xef000011 .... : swi 0x11
Solution
Relink with the -bin flag and without any -aif flag.
• The initialisation code may not be at the start of the image because you have omitted
the -First option.
Solution
Try relinking with the -First option to see if this resolves the problem.
Problem
The image loads without problem but when trying to run, it crashes/hangs immediately.
Causes
Any of the causes in the previous problem may also apply here.
Another possibility is that it has been linked or loaded at the wrong address.
Solution
Check that the address is the same on each of the following:
• The linker’s -RO option
• The GetFile command in armsd
• The PC= command in armsd
If all this is correct, try setting the PC to the start and using the Step In command to step through
all the initialisation code to see if it is going wrong in the initialisation.
bit 31 bit 30
0 0 is a memory access
1 1 generates an abort
00040000
Paged RAM
00020000
Read-only RAM
00000000
typedef union {
char byte[PAGESIZE];
ARMword word[PAGESIZE/4];
} page;
typedef struct {
page *p[8]; /* eight pages of memory */
int mapped_in;
} ModelState;
The example does not consider different endian modes. It assumes that the ARM is configured
to be the same endianness as the host architecture.
#define OFFSET(addr) ((addr) & 0x7fff)
#define WORDOFF(addr) (OFFSET(addr)>>2)
unsigned ARMul_MemoryInit(ARMul_State *state,
unsigned long initmemsize)
s=(ModelState *)malloc(sizeof(ModelState));
if (s==NULL) return FALSE;
for (i=0;i<8;i++) {
s->p[i]=(page *)malloc(sizeof(page));
if (s->p[i]==NULL) return FALSE;
memset(s->p[i], 0, sizeof(page));
}
return TRUE;
}
if (highpage)
mem=s->p[s->mapped_in];
else
mem=s->p[0];
The memory models must track the numbers of N, S, I and C cycles that occur in the ARMulator.
These counts are used to provide the $statistics and $statistics_inc variables in
armsd.
if (account) { /* an ARMulator request */
if (seq==LOW) state->NumNcycles++;
else state->NumScycles++;
}
switch ((address>>30)&0x3) {
case 0: /* 00 - memory access */
if (Nrw==LOW)
return mem->word[WORDOFF(address)];
You do not need to extract the relevant byte or halfword presented on the data bus for byte or
half-word loads, as the ARM will do this for you. Note that this is not true of the high speed
memory interface.
else /* write - need to do right width access */
/* Ignore writes out of supervisor mode to the "low" page */
if (highpage || account==FALSE || state->NtransSig==LOW) {
Note The trans value supplied is not correct. Use the NtransSig in the ARMul_State instead.
if (mas0==LOW) { /* byte or word */
if (mas1==LOW) /* byte */
mem->byte[OFFSET(address)]=dataOut;
else
mem->word[WORDOFF(address)]=dataOut;
} else { /* half-word */
ARMword offset=OFFSET(address) & ~1;
mem->byte[offset]=dataOut>>8;
mem->byte[offset+1]=dataOut;
}
}
break;
return 0;
}
...
}
MemAccess is similar to the following:
...
{
int highpage=(address & (1<<17));
page *mem;
SunOS
HP/UX
Macintosh
Windows 95 & Windows NT
DOS
It is NOT possible to rebuild the RDP drivers for:
Windows 3.1 the Windows 3.1 serial/parallel drivers are complicated by extra
'thunking' between the 32-bit application and 16-bit Windows. Users
wishing to rebuild the RDP drivers should upgrade to Windows 95 or
Windows NT.
serdrive.h This is a header file for the RDP I/O interface. It defines a
DriverDesc struct which contains function pointers for
routines to open, close, read, write the comms link. This is the
official interface between the debugger and the RDP comms
link.
name This should be a unique name for the driver, which is used by
armsd as a command-line argument. For example the standard
serial driver defines the name "SERIAL".
OpenProc This function opens a connection, and returns a handle onto it.
This handle is passed into the other driver functions. On the
standard Unix serial port drivers it merely opens the appropriate
serial port and returns the (Unix) file handle.
ConfigProc The configuration function is used to set the linespeed on the
connection and to initialise it.
ReadProc WriteProc These functions read and write data across the link. The RDP
assumes that the link is error free, and that characters are not
lost (must have flow-control or adequate buffering).
CloseProc The CloseProc is called when the RDP wishes to close the
connection.
LoggingProc This is called when the level of RDP logging is changed. (e.g.
when the $rdi_log variable is changed in armsd). For a
description of the meaning of the logging values, see The ARM
Software Development Toolkit Reference Manual: Chapter 7,
Symbolic Debugger.
armul
New sources, e.g. memory models, serial drivers, etc., should be placed in the source
directory.
Any new memory models can be added to the ARMulator Makefile by adding a rule to it for
the model. For example:
example.o: $(SRC)example.c $(HFILES)
$(CC) $(CFLAGS) -c $(SRC)example.c
Any new serial drivers also need rules adding (similar to the above), but also need to be added
to the OFILES list of object files at the top of the Makefile to be linked into the resulting armsd.
You also need to declare the driver in drivers.c.
An ARMulator can then be built using make:
make MODEL=example
This compiles to an ARMulator based armsd which uses the example memory model. By
default, armsd will be rebuilt with the armvirt memory model.
MSVC20
WATCOM10
CLX
MSVC20
WATCOM10
There is a choice of compilers for rebuilding the ARMulator DLL: Microsoft Visual C++ (produces
the fastest code, but only runs under 32-bit Windows, that is Windows NT or Windows 95) and Watcom
C/C++ V10.0a. The rebuild kit provides makefiles for both these compilers.
As described above, new rules may be added to the ARMulator DLL makefile ARMULATE.MAK.
When complete, the make procedure will produce a file called ARMULATE.DLL, which should be
placed in the BIN sub-directory of the ARM Tools200 installation directory, typically
C:\ARM200\BIN.
This chapter explains how the ARM deals with exceptions, and discusses the issues
involved in writing exception handlers.
11.1 Overview
11.2 Entering and Leaving an Exception
11.3 The Return Address and Return Instruction
11.4 Writing an Exception Handler
11.5 Installing an Exception Handler
11.6 Exception Handling on Thumb-Aware Processors
Exception Description
Reset Occurs when the CPU reset pin is asserted. Only expected to
occur for signalling power-up, or for resetting as if the CPU has
just powered up. It can therefore be useful for producing soft
resets.
Undefined Instruction Occurs if neither the CPU nor any attached coprocessor
recognises the currently executing instruction.
Prefetch Abort Occurs when the CPU attempts to execute an instruction which
has been prefetched from an illegal address, i.e. an address that the
memory management subsystem has determined as inaccessible
to the CPU in its current mode.
Data Abort Occurs when a data transfer instruction attempts to load or store
data at an illegal address.
IRQ Occurs when the CPU’s external interrupt request pin is asserted
(low) and the I bit in the CPSR is clear.
FIQ Occurs when the CPU’s external fast interrupt request pin is
asserted (low) and the F bit in the CPSR is clear.
1 Copies the Current Program Status Register (CPSR) into the Saved Program Status
Register (SPSR) for the mode in which the exception will be handled.
This saves the current mode, interrupt mask and condition flags.
2 Sets the appropriate CPSR mode bits:
a) to change to the appropriate mode, also mapping in the appropriate banked
registers for that mode.
b) to disable interrupts.
IRQs are disabled once any other exception occurs, and FIQs are also disabled
when a FIQ occurs.
3 Stores the return address (PC – 4) in LR_<mode>.
4 Sets the PC to the appropriate vector address.
This forces the branch to the appropriate exception handler.
1 Attach itself to the undefined instruction vector, storing the old contents.
2 Examine the undefined instruction to see if it should be emulated.
This is similar to the way a SWI handler extracts the number of a SWI, but rather
than extracting the bottom 24 bits, the emulator must extract bits 24 to 27, which
determine if the instruction is a coprocessor operation:
• If bits 27-24 = 1110 or 110x, the instruction is a coprocessor instruction.
• If bits 8-11 show that this coprocessor emulator should handle the instruction,
the emulator should process the instruction and return to the user program.
• Otherwise the emulator should pass the exception onto the original handler (or
the next emulator in the chain) using the vector stored when the emulator was
installed.
Once any chain of emulators is exhausted, no further processing of the instruction can take
place, so the undefined instruction handler should report an error and quit.
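The decoding tests described above can be sketched in C as follows. The function names are hypothetical, and the code only classifies an instruction word; fetching the faulting instruction and chaining to the previous handler are not shown.

```c
#include <stdint.h>

/* Non-zero if bits 27-24 are 1110 or 110x, i.e. a coprocessor instruction. */
int is_coproc_instr(uint32_t instr)
{
    uint32_t op = (instr >> 24) & 0xfu;
    return op == 0xeu || (op & 0xeu) == 0xcu;
}

/* Coprocessor number held in bits 11-8 of the instruction. */
unsigned coproc_number(uint32_t instr)
{
    return (unsigned)((instr >> 8) & 0xfu);
}

/* Non-zero if an emulator for coprocessor 'my_cp' should handle 'instr';
 * otherwise the exception should be passed on down the chain. */
int should_emulate(uint32_t instr, unsigned my_cp)
{
    return is_coproc_instr(instr) && coproc_number(instr) == my_cp;
}
```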
In each case, the MMU can load the required virtual memory into physical memory (the address
which caused the abort being stored in the MMU’s Fault Address Register (FAR)). Once this is
done, the handler can return and retry executing the instruction.
oldvec = *vector;
*vector = vec;
return (oldvec);
}
Code to call this to install an IRQ handler might be:
unsigned *irqvec = (unsigned *)0x18;
unsigned *irqaddr = (unsigned *)0x38; /* For example */
*irqaddr = (unsigned)IRQHandler;
Install_Handler (irqaddr,irqvec);
Again in this case the returned, original contents of the IRQ vector are discarded.
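The value written into the vector in this scheme is an LDR pc, [pc, #offset] instruction that loads the handler address from the nearby storage word. The complete body of the installation routine is not shown here, so the following is only a sketch of how such an instruction word can be computed. encode_ldr_pc is a hypothetical helper that handles forward offsets only and returns 0 when the storage word is out of range.

```c
#include <stdint.h>

/* Encodes LDR pc, [pc, #offset] for placement at address 'vector', loading
 * the word stored at address 'location'. Returns 0 if 'location' cannot be
 * reached with a positive 12-bit offset. */
uint32_t encode_ldr_pc(uint32_t vector, uint32_t location)
{
    uint32_t offset = location - vector - 8u;   /* pc reads as vector + 8 */

    if (location < vector + 8u || offset > 0xfffu)
        return 0;
    return 0xe59ff000u | offset;
}
```

For the addresses used above, encode_ldr_pc(0x18, 0x38) gives 0xe59ff018, that is LDR pc, [pc, #0x18].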
[Figure: handling an exception on a Thumb-aware processor. Labels: Entry veneer; Switch to ARM state; Save CPU and register state; Handle the exception; Switch to Thumb state]
This chapter explains how to implement SWIs and how to call them from your programs.
12.1 Introduction
12.2 Implementing a SWI Handler
12.3 Loading the Vector Table
12.4 Calling SWIs from your Application
12.5 Development Issues: SWI Handlers and Demon
12.6 Example SWI Handler
1 copies the Current Program Status Register (CPSR) into the Supervisor mode Saved
Program Status Register (SPSR_SVC)
This saves the current mode, interrupt mask and condition flags.
2 sets the CPSR mode bits to cause a change to Supervisor mode
This maps in the banked Stack Pointer (SP_SVC) and Link Register (LR_SVC).
3 sets the CPSR IRQ disable bit
This means that the SWI handler will execute without ordinary interrupts being taken.
The FIQ disable bit is not set, so fast interrupts can still be taken. You can choose to
turn IRQs back on, or disable FIQs, within the handler itself.
4 stores the value (PC – 4) into LR_SVC
This means that the link register now points to the next instruction to be executed when
the SWI has been handled. The SWI itself is located at PC – 8.
5 forces the PC to 0x8
Address 0x8 is the SWI entry in the vector table. Typically, this will contain a branch
instruction to the handler.
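The five steps can be modelled in C as a check on one's understanding. This is a software sketch, not a description of the hardware: cpu_model and take_swi are hypothetical names, and the model assumes the usual CPSR layout (mode in bits 4-0, IRQ disable in bit 7) with a pc value that reads as the SWI address plus 8.

```c
#include <stdint.h>

#define MODE_MASK  0x1fu
#define MODE_SVC   0x13u
#define I_BIT      0x80u    /* IRQ disable bit */
#define SWI_VECTOR 0x8u

typedef struct {            /* hypothetical model of the relevant state */
    uint32_t cpsr;
    uint32_t pc;            /* as seen by the SWI: its own address + 8 */
    uint32_t spsr_svc;
    uint32_t lr_svc;
} cpu_model;

void take_swi(cpu_model *cpu)
{
    cpu->spsr_svc = cpu->cpsr;                          /* 1: save CPSR      */
    cpu->cpsr = (cpu->cpsr & ~MODE_MASK) | MODE_SVC;    /* 2: enter SVC mode */
    cpu->cpsr |= I_BIT;                                 /* 3: IRQs off; the  */
                                                        /*    F bit is kept  */
    cpu->lr_svc = cpu->pc - 4u;                         /* 4: return address */
    cpu->pc = SWI_VECTOR;                               /* 5: go to vector   */
}
```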
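[Figure: SWI instruction format, with fields at bits 31-28, 27-24 and 23-0]
[Figure: SWI handler flow. Labels: Store registers; Extract SWI Number; Handle a SWI; Return; Restore registers; Return]
[Figure: Supervisor stack contents. Labels: Previous sp_svc; spsr_svc; lr_svc; r12 (reg[12]); r0 (reg[0]); sp_svc; *reg]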
SWIJumpTable
DCD SWInum0
DCD SWInum1
;
; DCD for each of other SWI routines
;
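The same dispatch can be sketched in C with a table of function pointers. Everything here is illustrative: the routine names, their signature and the bounds check are assumptions, and only the extraction of the bottom 24 bits as the SWI number comes from the text.

```c
#include <stdint.h>
#include <stddef.h>

typedef int (*swi_routine)(int arg);                /* hypothetical signature */

static int swi_num0(int arg) { return arg + 1; }    /* placeholder routines */
static int swi_num1(int arg) { return arg * 2; }

/* C counterpart of SWIJumpTable: one entry per SWI number. */
static const swi_routine swi_jump_table[] = { swi_num0, swi_num1 };
#define MAX_SWI (sizeof(swi_jump_table) / sizeof(swi_jump_table[0]))

/* The SWI number is the bottom 24 bits of the SWI instruction. */
uint32_t swi_number(uint32_t instr)
{
    return instr & 0x00ffffffu;
}

/* Returns 1 and stores the routine's result if the SWI is known, else 0. */
int dispatch_swi(uint32_t instr, int arg, int *result)
{
    uint32_t n = swi_number(instr);

    if (n >= MAX_SWI)
        return 0;
    *result = swi_jump_table[n](arg);
    return 1;
}
```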
Notice that the contents of the vector are updated by the routine itself; the return value is the
previous contents of the vector. The reason for returning this value will be examined shortly. For
now, as no use is made of the previous contents, this could be called from your C program with:
Install_Handler((unsigned)SWIHandler,swivec);
where
unsigned *swivec = (unsigned *)0x8;
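Since the body of Install_Handler is not reproduced here, the following sketch shows one way the branch instruction for a vector can be computed, using the standard encoding of an unconditional B instruction. encode_branch is a hypothetical helper that assumes word-aligned addresses, handles forward branches only, and returns 0 when the routine is out of range.

```c
#include <stdint.h>

/* Encodes "B routine" for placement at address 'vector'.
 * Returns 0 if 'routine' is not a reachable forward target. */
uint32_t encode_branch(uint32_t vector, uint32_t routine)
{
    /* The offset is counted in words from pc, which reads as vector + 8. */
    uint32_t offset = (routine - vector - 8u) >> 2;

    if (routine < vector + 8u || offset > 0x007fffffu)
        return 0;
    return 0xea000000u | offset;
}
```

For a handler at 0x1000 and the SWI vector at 0x8, the offset is (0x1000 - 0x8 - 0x8) >> 2 = 0x3fc, giving the instruction word 0xea0003fc.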
This again returns the original contents of the vector. Temporarily ignoring this returned value,
this routine could be called from the user’s C program by:
Install_LDR_Handler(swivec,(unsigned)swiaddr);
where
unsigned *swivec = (unsigned *) 0x8;
unsigned *swiaddr = (unsigned *)0x38; /* An address within 4KB of the vector */
*swiaddr = (unsigned)SWIHandler;
void __swi(0) SWI_WriteC(int ch);

void output_newline(void)
{ SWI_WriteC(13);
  SWI_WriteC(10);
}
In the declaration of SWI_WriteC, notice how __swi(0) declares the SWI_WriteC 'function'
to be in-line SWI number 0.
Compile this to ARM assembly language source using:
armcc -S -li -apcs 3/32bit newline.c -o newline.s
SWI_InstallHandler_block __value_in_regs
__swi(0x70) SWI_InstallHandler(unsigned r0, unsigned r1,
                               unsigned r2);
Note The BEQ Dswivec instruction would not actually branch to the required stored vector, but would
instead jump to the address where the location of that pointer is stored in the data area. It is cited
here to illustrate the location to which the handler is attempting to branch. The easiest way to
write it is as a branch to a known address, which in this case would be:
BEQ 0x20
However, this can be done more flexibly by importing the Dswivec label into the assembly
language module. You can then store the address where the Demon vector is stored within the
module and force the PC to that address.
MakeChain
        LDR     r0, =swichain   ; Load address of swichain into r0.
        LDR     r1, =Dswivec    ; Load address of Dswivec into r1.
        LDR     r2, [r1]        ; Load contents of Dswivec, i.e. the
                                ; location of the stored Demon vector.
        STR     r2, [r0]        ; Store vector location within range
                                ; of PC relative load.
        MOV     pc,lr           ; Return from routine.
Note that while developing under Demon, you do not need to set up a stack in Supervisor mode,
because Demon creates a 512-byte stack for you.
Once development is finished, you will need to do two further things before producing the code
for your final system:
• set up the Supervisor mode stack (as described earlier)
• remove the additions that patch your handler in front of Demon’s
12.6.1 install.c
/***************************/
/* File: install.c */
/* Author: Andy Beeson */
/* Date: 7th February 1994 */
/***************************/
#include <stdio.h>
#include <stdlib.h>
struct four_results
{ unsigned a;
unsigned b;
unsigned c;
unsigned d;
};
int main ()
{
  struct four_results r_259;            /* Results from SWI 259 */
  unsigned *swivec = (unsigned *)0x8;   /* Pointer to SWI vector */

  *Dswivec = Install_Handler ((unsigned)SWIHandler, swivec);
  Update_Demon_Vec (swivec, Dswivec);
  MakeChain ();

  printf("Hello 256\n");
  my_swi_256 ();
  printf("Hello 257\n");
  my_swi_257 (257);
  printf("Hello 258\n");
  printf(" Result = %u\n",my_swi_258 (1,2,3,4));
  printf ("Hello 259\n");
  r_259 = my_swi_259 (10,20,30,40);
  printf (" Results are: %u %u %u %u\n",
          r_259.a,r_259.b,r_259.c,r_259.d);
  printf("The end\n");
  return (0);
}
        EXPORT  SWIHandler
        EXPORT  MakeChain
        IMPORT  C_SWI_Handler
        IMPORT  Dswivec
SWIHandler
        SUB     r13, r13, #4            ; leave space to store spsr
        STMFD   r13!,{r0-r12,r14}       ; store registers
        MOV     r1, r13                 ; second parameter to C routine
                                        ; is register values.
        LDR     r0,[r14,#-4]            ; Calculate address of SWI instruction
                                        ; and load it into r0
        BIC     r0,r0,#0xff000000       ; mask off top 8 bits of instruction
        MRS     r2, spsr
        STR     r2,[r13,#14*4]          ; store spsr on stack at original r13
        BL      C_SWI_Handler           ; Call C routine to handle SWI
        CMP     r0, #0                  ; Has C routine handled SWI ?
                                        ; 0 = no, 1 = yes
        LDR     r2, [r13,#14*4]         ; extract spsr from stack
        MSR     spsr,r2                 ; and restore it
        LDMFD   r13!, {r0-r12,lr}       ; Restore original registers
        ADD     r13,r13,#4
        ; Now need to decide whether to return from handler or to call
        ; the next handler in the chain (the debugger's).
        MOVNES  pc,lr                   ; return from handler if SWI handled
        LDR     pc, swichain            ; else jump to address containing
                                        ; instruction to branch to address of
                                        ; debugger's SWI handler.
swichain
        DCD     0
MakeChain
        LDR     r0, =swichain   ; Load address of swichain into r0.
        LDR     r1, =Dswivec    ; Load address of Dswivec into r1.
        LDR     r2, [r1]        ; Load contents of Dswivec, i.e. the
                                ; location of the stored Demon vector.
        STR     r2, [r0]        ; Store vector location within range
                                ; of PC relative load.
        MOV     pc,lr           ; Return from routine.
This chapter explains how to run benchmarks on the ARM processor, and how to use the
profiling facilities to help improve the size and performance of your code.
13.1 Introduction
13.2 Measuring Code and Data size
13.3 Timing Program Execution Using the ARMulator
13.4 Profiling Programs using the ARMulator
-info sizes gives a breakdown of the code and data sizes of each object file or
library member making up an image.
-info totals gives a summary of the total code and data sizes of all object files
and all library members making up an image.
Note If armlib.32l is not in the current directory, you need to specify its full path name.
The -info totals option causes armlink to produce output similar to the following—the exact
figures may vary since these are dependent on the version of the compiler and library being
used:
                code    inline  inline   'const'  RW      0-Init  debug
                size    data    strings  data     data    data    data
Object totals   2272    28      1540     0        48      10200   0
Library totals  34408   400     764      128      700     1176    0
Grand totals    36680   428     2304     128      748     11376   0
The columns in the table have the following meanings:
code size gives the code size, excluding any data which has been placed in
the code segment (see inline data, below).
inline data reports the size of the data included in the code segment by the
compiler. Typically, this data will contain the addresses of variables
which are accessed by the code, plus any floating point immediate
values or immediate values that are too big to load directly into a
register. It does not include inlined strings, which are listed
separately (see inline strings, below).
The ROM and RAM requirements for the Dhrystone program would be:
ROM = code size + inline data + inline strings + const data + RW data
= 36680 + 428 + 2304 + 128 + 748
= 40288
RAM = RW data + 0-Init data
= 748 + 11376
= 12124
To repeat this experiment with the Thumb compiler, issue the command:
tcc -c -Ospace -DMSC_CLOCK dhry_1.c dhry_2.c
This time use armlink’s -info sizes option to give a complete breakdown of the code and
data sizes:
armlink -o dhry -info sizes dhry_1.o dhry_2.o armlib.16l
Name Displays the function names. The current function in a section starts
at the column’s left-hand edge: parent and child functions are shown
indented.
cum% Shows the total percentage time spent in the current function plus the
time spent in any functions which it called. In the case of main, the
program spent 96.04% of its time in main and its children.
self% Shows the percentage time spent in the current function on each
parent function’s behalf.
desc% Shows the percentage time spent in children of the current function
on the current function’s behalf. For example, in the case of main
only 0.16% of the time is spent in main itself, whereas 95.88% of the
time is spent in functions called by main.
calls Reports the number of times a function is called from the current
function. The call count for main is 0 because main is the top-level
function, and is not called by any other functions.
The section for insert_sort shows that it made 243432 calls to strcmp, and that these calls
accounted for 59.44% of the total run time, spent in strcmp on insert_sort’s behalf (the desc%
column shows 0 in this case because strcmp does not call any functions).
In the case of strcmp, qs_string_compare (which is called by qsort), shell_sort and
insert_sort made respectively 13021, 14059 and 243432 calls to strcmp and the time spent
in strcmp is shared out between the functions in the ratio 3.17% to 3.43% to 59.44%.
Run the program using armsd or the Windows debugger. You should obtain the following output:
umull (0xA0000000*0x10101010) = 0x0a0a0a0a00000000
umlal (+0x00500000*0x10101010) = 0x0a0f0f0f05000000
smull (0xA0000000*0x10101010) = 0xf9f9f9fa00000000
smlal (+0x00500000*0x10101010) = 0xf9fefeff05000000
Recompile main.c with armcc and re-run it to check that you get the same results:
armcc -c main.c
armlink -o mul main.o mul.o armlib.32l