
UNIT – III

Embedded C code: Overview of C compiler and optimization, Basic C data types, Local variable types, C looping structures, Register allocation, Function calls, Mixing C and Assembly Programming, Instruction Scheduling.



Overview of C compiler and optimization:
• This section gives an idea of the problems the C compiler faces when optimizing code written by the user.
• Understanding these problems helps the programmer write source code that compiles more efficiently, in terms of both increased speed and reduced code size.
• Optimizing code takes time and reduces source code readability. Usually it is only worth optimizing functions that are frequently executed and important for performance.
• A profiling tool, found in most ARM simulators, helps to find the frequently executed functions.



• C compilers have to translate your C function literally into assembler
so that it works for all possible inputs.
• In practice, many of the input combinations are not possible or won’t
occur.
Example code: The memclr function clears N bytes of memory at
address data.
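A minimal sketch of such a memclr routine, assuming the signature void memclr(char *data, int N):

void memclr(char *data, int N)
{
    for (; N > 0; N--)
    {
        *data = 0;      /* clear one byte per iteration */
        data++;
    }
}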



Problems that the compiler faces:
• It does not know whether N can be 0 on input. Therefore the compiler must test for this case explicitly before the first iteration of the loop.
• It does not know whether the data array pointer is four-byte aligned. If it is four-byte aligned, the compiler can clear four bytes at a time using an int store rather than a char store.
• Nor does it know whether N is a multiple of four. If it is, the compiler can unroll the loop four times or store four bytes at a time using an int store.
• The compiler must be conservative and assume all possible values for N and all possible alignments for data.
• To write efficient C code, the programmer must be aware of the areas where the C compiler has to be conservative, the limits of the processor architecture the compiler is mapping to, and the limits of the specific C compiler.



Basic C Data Types:
This section looks at how ARM compilers handle the basic C data types, and at which of these types are more efficient to use for local variables than others.

• Loads that act on 8- or 16-bit values extend the value to 32 bits
before writing to an ARM register.
• Unsigned values are zero-extended, and signed values sign-extended.
This means that the cast of a loaded value to an int type does not cost
extra instructions.
• Similarly, a store of an 8- or 16-bit value selects the lowest 8 or 16 bits
of the register.
• The cast of an int to a smaller type does not cost extra instructions on a store.
• Prior to ARMv4, ARM processors were not good at handling signed
8-bit or any 16-bit values.
• Therefore ARM C compilers define char to be an unsigned 8-bit
value, rather than a signed 8-bit value as is typical in many other
compilers.



Compilers armcc and gcc use the above datatype mappings for an
ARM target.
The exceptional case for type char is worth noting as it can cause
problems when you are porting code from another processor
architecture.
A common example is using a char type variable i as a loop counter,
with loop continuation condition i ≥ 0. As i is unsigned for the ARM
compilers, the loop will never terminate. Fortunately armcc
produces a warning in this situation: unsigned comparison with 0.
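A hypothetical sketch of this pitfall (the function name and body are invented for illustration): because char is unsigned with the ARM compilers, the condition i >= 0 is always true and the loop never terminates.

void clear_buffer(char *buf, int n)
{
    char i;

    for (i = (char)(n - 1); i >= 0; i--)   /* always true: armcc warns here */
    {
        buf[i] = 0;
    }
}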
Compilers also provide an override switch to make char signed.
For example, the command line option -fsigned-char will make char
signed on gcc.
The command line option -zc will have the same effect with armcc.
Local Variable Types:
ARMv4-based processors can efficiently load and store 8-, 16-, and 32-
bit data. However, most ARM data processing operations are 32-bit
only.
So, use a 32-bit datatype, int or long, for local variables wherever
possible.
Avoid using char and short as local variable types, even if you are
manipulating an 8- or 16-bit value.
The one exception is when you want wrap-around to occur. If you
require modulo arithmetic of the form 255 + 1 = 0, then use the char
type.



The following code checksums a data packet containing 64 words. It
shows why you should avoid using char for local variables.
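A sketch of the two versions being compared, consistent with the description; checksum_v2 is the name used later in this unit, and checksum_v1 is an assumed name for the char-counter version.

int checksum_v1(int *data)
{
    char i;             /* loop counter declared as char */
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}

int checksum_v2(int *data)
{
    unsigned int i;     /* loop counter declared as a 32-bit type */
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}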

• It looks as though declaring i as a char is efficient, because a char appears to use less register space or less space on the stack than an int.
• On the ARM, both these assumptions are wrong. All ARM registers are 32-bit and all stack entries are at least 32-bit.
• Furthermore, to implement i++ exactly, the compiler must account for the case when i = 255. Any attempt to increment 255 should produce the answer 0.
In the first compiler output, i is declared as a char; in the second, i is declared as an unsigned int. (BCC: branch if carry clear.)



• In the first case, the compiler inserts an extra AND instruction to reduce i to the range 0 to 255 before the comparison with 64.
• This instruction is not needed in the second case.
• The ARM data processing operations always operate on 32-bit quantities. You should therefore:
• Use a 32-bit data type (e.g. int) for local variables.
• Avoid char and short for local variables, even if you are manipulating a char or short value. The exception is when you require wrap-around or modulo arithmetic (e.g. 255 + 1 → 0).
• The compiler emits an AND r1, r1, #0xff instruction even though it should know that i never exceeds 64.
• If we change i from char to unsigned int, the AND disappears: it is no longer necessary to account for wrap-around.
• Remember that this is not just a saving of one instruction or cycle. The loop runs 64 times, so it saves 64 instructions: one for each iteration.
• This is an inner loop, and optimizations to inner loops are highly beneficial.
• Now suppose the data packet contains 16-bit values and we hold the sum in a short.
The expression sum + data[i] is an integer, so it can only be assigned to a short using an (implicit or explicit) narrowing cast. The compiler must insert extra instructions to implement the narrowing cast:
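A sketch of this 16-bit version, consistent with the description (the name checksum_v3 matches the reference further below):

short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum = (short)(sum + data[i]);   /* narrowing cast back to short */
    }
    return sum;
}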

The loop is now three instructions longer than the loop in checksum_v2 earlier.



There are two reasons for the extra instructions:
• The LDRH instruction does not allow for a shifted address offset as
the LDR instruction did in checksum_v2. Therefore the first ADD in
the loop calculates the address of item i in the array. The LDRH loads
from an address with no offset.
• The cast reducing sum + data[i] to a short requires two MOV instructions. The compiler shifts left by 16 and then right by 16 to implement a 16-bit sign extend. The shift right is a sign-extending shift, so it replicates the sign bit to fill the upper 16 bits.
• We can avoid the second problem by using an int type variable to
hold the partial sum. We only reduce the sum to a short type at the
function exit.



• The first problem can be solved by accessing the array by incrementing the pointer data rather than using an index as in data[i]. This is efficient regardless of array type size or element size. All ARM load and store instructions have a postincrement addressing mode.
• The next version uses int type local variables to avoid unnecessary casts, and increments the pointer data instead of using the index offset data[i].
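A sketch of this improved version (checksum_v4 is an assumed name, continuing the earlier numbering):

short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0;                 /* int accumulator: no casts inside the loop */

    for (i = 0; i < 64; i++)
    {
        sum += *(data++);        /* load via the pointer, then postincrement */
    }
    return (short)sum;           /* narrow to short only at function exit */
}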



• The *(data++) operation translates to a single ARM instruction that
loads the data and increments the data pointer.
• You could write sum += *data; data++; or even *data++ instead.
• The compiler produces the following output. Three instructions have
been removed from the inside loop, saving three cycles per loop
compared to checksum_v3.



• Function Argument Types
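As an illustration, here is a sketch of the kind of function discussed below, where both the arguments and the return value are short (the name add_v1 and the body are assumed for illustration):

short add_v1(short a, short b)
{
    /* the arguments and result are short, but they are passed
       and returned in 32-bit registers */
    return a + (b >> 1);
}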

• In this example the input and output values are both short.
• Yet the values will be passed in (and out) in 32-bit-wide registers.
• Should the compiler assume the values are already in the range of a short?
• Or should the compiler force the values into this range by sign-extending the low 16 bits to fill the 32-bit register?
• The compiler must make a compatible decision for both the caller and the callee.



armcc: assumes the input values are already in the correct range for the type.
gcc: makes no assumptions about the range of argument values, so it sign-extends the values on entry.
Signed versus Unsigned Types:
This section compares the efficiency of signed int and unsigned int. If your code uses addition, subtraction, and multiplication, there is no performance difference between signed and unsigned operations. However, there is a difference when it comes to division. Consider the following short example that averages two integers:
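A sketch of such an averaging routine (the name average_v1 is assumed):

int average_v1(int a, int b)
{
    return (a + b) / 2;      /* signed divide by two */
}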



The compiler adds one to the sum before shifting right if the sum is negative. In other words, it replaces x/2 by the statement:
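Expressed in C, the transformation looks roughly like this, wrapped here in a hypothetical helper function (on ARM compilers, >> on a signed int is an arithmetic shift):

int signed_div_by_2(int x)
{
    return (x < 0) ? ((x + 1) >> 1) : (x >> 1);   /* rounds towards zero */
}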



• As x is signed, in C on an ARM target a divide by two is not simply a right shift if x is negative. For example, −3 ≫ 1 = −2 but −3/2 = −1. Division rounds towards zero, but arithmetic right shift rounds towards −∞.

C looping structures:
This section looks at the most efficient ways to code for and while loops on the ARM. We start with loops with a fixed number of iterations and then move on to loops with a variable number of iterations.
Loops with a Fixed Number of Iterations:
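A sketch of a fixed-count checksum loop of the kind discussed below (checksum_v5 is an assumed name, continuing the earlier numbering):

int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)     /* incrementing loop counter */
    {
        sum += *(data++);
    }
    return sum;
}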



The compiler-generated code takes three instructions per iteration to implement the for loop:
• An ADD to increment i
• A compare to check whether i is less than 64
• A conditional branch to continue the loop if i < 64
On the ARM, a loop should only use two instructions:
• A subtract to decrement the loop counter, which also sets the condition code flags on the result
• A conditional branch instruction



• The example below shows the improvement if we switch to a decrementing loop rather than an incrementing loop.
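A sketch of the decrementing version (checksum_v6 is an assumed name, continuing the numbering):

int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--)    /* count down to zero */
    {
        sum += *(data++);
    }
    return sum;
}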

The SUBS and BNE instructions implement the loop; the decrementing version takes four instructions per iteration, whereas the incrementing version takes five instructions per iteration.



• For an unsigned loop counter i we can use either of the loop continuation conditions i != 0 or i > 0. As i can't be negative, they are the same condition.
• For a signed loop counter, it is tempting to use the condition i > 0 to continue the loop. The compiler may implement this either as a single flag-setting SUBS followed by a conditional branch, or as a SUB followed by a separate compare with zero and a conditional branch.
• When i = −0x80000000, these two code sequences generate different answers. For the first sequence, the SUBS instruction compares i with 1 and then decrements i. Since −0x80000000 < 1, the loop terminates.



• For the second sequence, we decrement i and then compare with 0. Modulo arithmetic means that i now has the value +0x7fffffff, which is greater than zero. Thus the loop continues for many iterations.
• Therefore you should use the continuation condition i != 0 for signed or unsigned loop counters. It saves one instruction over the condition i > 0 for signed i.
Loops Using a Variable Number of Iterations:
Now suppose we want the checksum routine to handle packets of arbitrary size. We pass in a variable N giving the number of words in the data packet. We count down until N = 0 and so do not require an extra loop counter i.
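A sketch of this variable-length version (the name checksum_v7 matches the reference further below):

int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;

    for (; N != 0; N--)          /* N itself acts as the loop counter */
    {
        sum += *(data++);
    }
    return sum;
}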



Notice that the compiler checks that N is nonzero on entry to the function. Often this check is unnecessary, since you know the array won't be empty. In this case a do-while loop gives better performance and code density than a for loop.
This example shows how to use a do-while loop to remove the test for N
being zero that occurs in a for loop.
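A sketch of the do-while version (checksum_v8 is an assumed name); it assumes the caller guarantees N > 0:

int checksum_v8(int *data, unsigned int N)
{
    int sum = 0;

    do
    {
        sum += *(data++);
    } while (--N != 0);          /* no initial test for N == 0 */
    return sum;
}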

Compare this with the output for checksum_v7, which uses a for loop: the do-while version saves two cycles by removing the initial test for N being zero.

Loop Unrolling:
Each loop iteration costs two instructions in addition to the body of the loop: a subtract to decrement the loop count and a conditional branch. This is called the loop overhead.
On ARM7 or ARM9 processors the subtract takes one cycle and the
branch three cycles, giving an overhead of four cycles per loop.
This loop overhead can be avoided by unrolling a loop—repeating the
loop body several times, and reducing the number of loop iterations by
the same proportion.

• The following code unrolls our packet checksum loop by four times.
We assume that the number of words in the packet N is a multiple of
four.
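A sketch of the unrolled loop, consistent with the description (the name checksum_v9 matches the reference below; N is assumed to be a nonzero multiple of four):

int checksum_v9(int *data, unsigned int N)
{
    int sum = 0;

    do
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;                  /* one subtract and one branch per four words */
    } while (N != 0);
    return sum;
}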

This reduces the loop overhead from 4N cycles to (4N)/4 = N cycles.



There are two questions to consider when unrolling a loop:
■ How many times should I unroll the loop?
■ What if the number of loop iterations is not a multiple of the unroll amount? For example, what if N is not a multiple of four in checksum_v9?
For the first question, only unroll loops that are important for the overall performance of the application. Otherwise unrolling will increase the code size with little performance benefit. Unrolling may even reduce performance by evicting more important code from the cache. Suppose the loop is important, accounting for, say, 30% of the entire application, and suppose you unroll it until the loop body is around 128 instructions. The loop overhead is then at most 4 cycles out of 128, roughly 3%. Recalling that the loop is 30% of the entire application, overall the loop overhead is only about 1%. Unrolling the code further gains little extra performance but has a significant impact on the cache contents. It is usually not worth unrolling further when the gain is less than 1%.
For the second question, try to arrange it so that array sizes are multiples of your unroll amount. If this isn't possible, then you must add extra code to take care of the leftover cases. This increases the code size a little but keeps the performance high.
This example handles the checksum of any size of data packet using a
loop that has been unrolled four times.
The second for loop handles the remaining cases when N is not a
multiple of four. Note that both N/4 and N&3 can be zero, so we can’t
use do-while loops.
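A sketch of such a routine (checksum_v10 is an assumed name, continuing the numbering):

int checksum_v10(int *data, unsigned int N)
{
    unsigned int i;
    int sum = 0;

    for (i = N / 4; i != 0; i--)     /* unrolled by four; N/4 may be zero */
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    for (i = N & 3; i != 0; i--)     /* remaining 0 to 3 words */
    {
        sum += *(data++);
    }
    return sum;
}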

Register Allocation:
The compiler attempts to allocate a processor register to each local
variable used in a C function.
It will try to use the same register for different local variables if the uses of the variables do not overlap.
When there are more local variables than available registers, the
compiler stores the excess variables on the processor stack.
These variables are called spilled or swapped out variables since they
are written out to memory.
Spilled variables are slow to access compared to variables allocated to
registers.



To implement a function efficiently, you need to
■ Minimize the number of spilled variables
■ Ensure that the most important and frequently accessed variables
are stored in registers
The number of processor registers available to the ARM C compilers for allocating variables is given in the accompanying table, which shows the standard register names and usage when following the ARM-Thumb Procedure Call Standard (ATPCS) used in code generated by C compilers.

Efficient Register Allocation
■ Try to limit the number of local variables in the internal loop of
functions to 12. The compiler should be able to allocate these to ARM
registers.
■ Guide the compiler as to which variables are important by ensuring
these variables are used within the innermost loop.

Function Calls:
The ARM Procedure Call Standard (APCS) defines how to pass function
arguments and return values in ARM registers.
The first four integer arguments are passed in the first four ARM
registers: r0, r1, r2, and r3.
Subsequent integer arguments are placed on the full descending stack, ascending in memory, as shown in the figure.
Integer return values are passed back in r0.
Two-word arguments such as long long or double are passed in a pair
of consecutive argument registers and returned in r0, r1.
The compiler may pass structures in registers or by reference according
to command line compiler options.

• The first point to note about the procedure call standard is the four-
register rule.
• Functions with four or fewer arguments are far more efficient to call
than functions with five or more arguments.
• For functions with four or fewer arguments, the compiler can pass all
the arguments in registers.
• For functions with more arguments, both the caller and callee must
access the stack for some arguments.

Example: the benefits of using a structure pointer.
This is a typical routine that inserts N bytes from array data into a queue. The queue is implemented as a cyclic buffer with start address Q_start (inclusive) and end address Q_end (exclusive).
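A sketch of the five-argument version, consistent with the description (queue_bytes_v1 is the name used in the discussion below):

char *queue_bytes_v1(
    char *Q_start,     /* queue buffer start address */
    char *Q_end,       /* queue buffer end address */
    char *Q_ptr,       /* current insertion point */
    char *data,        /* bytes to insert */
    unsigned int N)    /* number of bytes to insert (assumed nonzero) */
{
    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = Q_start;     /* wrap around the cyclic buffer */
        }
    } while (--N);
    return Q_ptr;                /* return the new insertion point */
}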

• The following code creates a Queue structure and passes this to the
function to reduce the number of function arguments.
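A sketch of the structure-based version (queue_bytes_v2), consistent with the description:

typedef struct {
    char *Q_start;     /* queue buffer start address */
    char *Q_end;       /* queue buffer end address */
    char *Q_ptr;       /* current insertion point */
} Queue;

void queue_bytes_v2(Queue *queue, char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;
    char *Q_end = queue->Q_end;

    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = queue->Q_start;
        }
    } while (--N);
    queue->Q_ptr = Q_ptr;        /* write back the updated insertion point */
}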



queue_bytes_v2 is one instruction longer than queue_bytes_v1, but it is in fact more efficient
overall. The second version has only three function arguments rather than five. Each call to the
function requires only three register setups. There is a net saving of two instructions in function
call overhead. There are likely further savings in the callee function, as it only needs to assign a
single register to the Queue structure pointer, rather than three registers in the nonstructured
case.



• The function uint_to_hex converts a 32-bit unsigned integer into an array of eight
hexadecimal digits.
• It uses a helper function nybble_to_hex, which converts a digit d in the range 0 to
15 to a hexadecimal digit.
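A sketch of the two functions, consistent with the description:

unsigned int nybble_to_hex(unsigned int d)
{
    /* convert 0 <= d <= 15 to the corresponding hexadecimal character */
    if (d <= 9)
    {
        return d + '0';
    }
    return d - 10 + 'A';
}

void uint_to_hex(char *out, unsigned int in)
{
    unsigned int i;

    for (i = 8; i != 0; i--)
    {
        in = (in << 4) | (in >> 28);            /* rotate in left by 4 bits */
        *(out++) = (char)nybble_to_hex(in & 15);
    }
}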



• When we compile this, we see that uint_to_hex doesn't call nybble_to_hex at all!
• In the compiled code, the compiler has inlined the nybble_to_hex code into uint_to_hex. This is more efficient than generating a function call.



• The compiler will only inline small functions. You can ask the compiler to
inline a function using the __inline keyword, although this keyword is only
a hint and the compiler may ignore it.
• Inlining large functions can lead to big increases in code size without much
performance improvement.

Calling Functions Efficiently:


• Try to restrict functions to four arguments. This will make them more
efficient to call. Use structures to group related arguments and pass
structure pointers instead of multiple arguments.
• Define small functions in the same source file and before the functions that
call them. The compiler can then optimize the function call or inline the
small function.
• Critical functions can be inlined using the __inline keyword.



Mixing C and Assembly
• Mixing C and assembly is quite common, especially in deeply embedded applications where programmers work close to the hardware level.
• There are two ways to add assembly to your high-level source code: the inline assembler and the embedded assembler.
• A related technique is inlining, where the __inline keyword is placed in the C or C++ code to mark a function that, when possible, should be expanded directly at the point of call rather than being called as a subroutine.
• This potentially avoids some of the overhead associated with branching and returning.
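As a sketch of the inline assembler, here is a hypothetical example using armcc-style __asm block syntax (gcc uses a different asm("...") notation); it enables IRQ interrupts by clearing the I bit of the cpsr:

__inline void enable_IRQ(void)
{
    int tmp;

    __asm
    {
        MRS  tmp, CPSR           /* read the current program status register */
        BIC  tmp, tmp, #0x80     /* clear the I bit */
        MSR  CPSR_c, tmp         /* write back the control field */
    }
}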



Instruction scheduling:
• Reordering the instructions in a code sequence to avoid processor
stalls.
• Since ARM implementations are pipelined, the timing of an
instruction can be affected by neighboring instructions
• It takes additional effort to optimize assembly routines so don’t
bother to optimize noncritical ones.



Cycle timing of common instructions:
The time taken to execute an instruction depends on the implementation pipeline.
• Instructions that are conditional on the value of the ARM condition codes in the cpsr take one cycle if the condition is not met.
• If the condition is met, then the following rules apply:
• ALU operations such as addition, subtraction, shift by an immediate value, and logical operations take one cycle.
• An ALU operation using a register-specified shift takes two cycles.
• If the instruction writes to the pc, then add two cycles.
• Load instructions that load N 32-bit words of memory, such as LDR and LDM, take N cycles to issue, but the result of the last word loaded is not available on the following cycle. The updated load address is available on the next cycle.
• If the instruction loads the pc, then add two cycles.
• Load instructions that load 16-bit or 8-bit data, such as LDRB, LDRSB, LDRH, and LDRSH, take one cycle to issue. The load result is not available on the following two cycles. The updated load address is available on the next cycle.



• Branch instructions take three cycles.
• Store instructions that store N values take N cycles.
• An STM or LDM of a single value is exceptional, taking two cycles.
• Multiply instructions take a varying number of cycles depending on the value of the second operand in the product:
  MUL, MLA: 2 cycles
  xMULL, xMLAL: 3 cycles
To schedule code efficiently on the ARM, we need to understand the ARM pipeline and dependencies. The ARM9TDMI processor performs five operations in parallel:
■ Fetch: fetch from memory the instruction at address pc. The instruction is loaded into the core and then proceeds down the core pipeline.
■ Decode: decode the instruction that was fetched in the previous cycle. The processor also reads the input operands from the register bank if they are not available via one of the forwarding paths.



■ ALU: execute the instruction that was decoded in the previous cycle. Note this instruction was originally fetched from address pc − 8 (ARM state) or pc − 4 (Thumb state). This involves calculating the answer for a data processing operation, or the address for a load, store, or branch operation.
■ LS1: load or store the data specified by a load or store instruction. If the instruction is not a load or store, then this stage has no effect.
■ LS2: extract and zero- or sign-extend the data loaded by a byte or halfword load instruction. If the instruction is not a load of an 8-bit byte or 16-bit halfword item, then this stage has no effect.



• After an instruction has completed the five stages of the pipeline, the core writes the result to the register file.
Note: pc points to the address of the instruction being fetched. The ALU is executing the instruction that was originally fetched from address pc − 8, in parallel with fetching the instruction at address pc.
• If an instruction requires the result of a previous instruction that is not available, then the processor stalls. This is called a pipeline hazard or pipeline interlock.
Example: no interlock in the pipeline.
ADD r0, r0, r1
ADD r0, r0, r2
This instruction pair takes two cycles. The ALU calculates r0 + r1 in one cycle. Therefore this result is available for the ALU to calculate r0 + r2 in the second cycle.



• This example shows a one-cycle interlock caused by a load instruction.
LDR r1, [r2, #4]
ADD r0, r0, r1
This instruction pair takes three cycles.
The ALU calculates the address r2 + 4 in the first cycle while decoding
the ADD instruction in parallel.
However, the ADD cannot proceed on the second cycle because the
load instruction has not yet loaded the value of r1. Therefore the
pipeline stalls for one cycle while the load instruction completes the
LS1 stage.
Now that r1 is ready, the processor executes the ADD in the ALU on the
third cycle.



The figure illustrates how this interlock affects the pipeline.
• The processor stalls the ADD instruction for one cycle in the ALU stage of the pipeline while the load instruction completes the LS1 stage (the stalled ADD is shown in italics in the figure).
• Since the LDR instruction proceeds down the pipeline while the ADD instruction is stalled, a gap opens up between them.
• This gap is called a pipeline bubble, marked with a dash.



• This example shows a one-cycle interlock caused by delayed load use.
LDRB r1, [r2, #1]
ADD r0, r0, r2
EOR r0, r0, r1

This instruction triplet takes four cycles.


Although the ADD proceeds on the cycle following the load byte, the EOR
instruction cannot start on the third cycle.
The r1 value is not ready until the load instruction completes the LS2 stage of
the pipeline.
The processor stalls the EOR instruction for one cycle. Note that the ADD
instruction does not affect the timing at all.
The ADD doesn’t cause any stalls since the ADD does not use r1, the result of
the load.



This example shows why a branch instruction takes three cycles.
The processor must flush the pipeline when jumping to a new address.
MOV r1, #1
B case1
AND r0, r0, r1
EOR r2, r2, r3
...
case1 SUB r0, r0, r1

The three executed instructions take a total of five cycles. The MOV instruction executes on
the first cycle. On the second cycle, the branch instruction calculates the destination address.
This causes the core to flush the pipeline and refill it using this new pc value. The refill takes
two cycles.
