TMS320C54x DSP Programmer's Guide: Literature Number: SPRU538 July 2001
IMPORTANT NOTICE

Texas Instruments and its subsidiaries (TI) reserve the right to make changes to their products or to discontinue any product or service without notice, and advise customers to obtain the latest version of relevant information to verify, before placing orders, that information being relied on is current and complete. All products are sold subject to the terms and conditions of sale supplied at the time of order acknowledgment, including those pertaining to warranty, patent infringement, and limitation of liability.

TI warrants performance of its products to the specifications applicable at the time of sale in accordance with TI's standard warranty. Testing and other quality control techniques are utilized to the extent TI deems necessary to support this warranty. Specific testing of all parameters of each device is not necessarily performed, except those mandated by government requirements.

Customers are responsible for their applications using TI components. In order to minimize risks associated with the customer's applications, adequate design and operating safeguards must be provided by the customer to minimize inherent or procedural hazards.

TI assumes no liability for applications assistance or customer product design. TI does not warrant or represent that any license, either express or implied, is granted under any patent right, copyright, mask work right, or other intellectual property right of TI covering or relating to any combination, machine, or process in which such products or services might be or are used. TI's publication of information regarding any third party's products or services does not constitute TI's approval, license, warranty or endorsement thereof.

Reproduction of information in TI data books or data sheets is permissible only if reproduction is without alteration and is accompanied by all associated warranties, conditions, limitations and notices. Representation or reproduction of this information with alteration voids all warranties provided for an associated TI product or service, is an unfair and deceptive business practice, and TI is not responsible nor liable for any such use.

Resale of TI's products or services with statements different from or beyond the parameters stated by TI for that product or service voids all express and any implied warranties for the associated TI product or service, is an unfair and deceptive business practice, and TI is not responsible nor liable for any such use.

Also see: Standard Terms and Conditions of Sale for Semiconductor Products. www.ti.com/sc/docs/stdterms.htm
Mailing Address: Texas Instruments Post Office Box 655303 Dallas, Texas 75265
Preface
Notational Conventions
This document uses the following conventions.
- The device number TMS320C54x is often abbreviated as C54x.
- Program listings, program examples, and interactive displays are shown in a special typeface similar to a typewriter's. Examples use a bold version of the special typeface for emphasis; interactive displays use a bold version of the special typeface to distinguish commands that you enter from items that the system displays (such as prompts, command output, error messages, etc.). Here is a sample program listing:

    0011 0005 0001    .field 1, 2
    0012 0005 0003    .field 3, 4
    0013 0005 0006    .field 6, 3
    0014 0006         .even

  Here is an example of a system prompt and a command that you might enter:

    C: csr -a /user/ti/simuboard/utilities

- In syntax descriptions, the instruction, command, or directive is in a bold typeface and parameters are in an italic typeface. Portions of a syntax that are in bold should be entered as shown; portions of a syntax that are in italics describe the type of information that should be entered. Here is an example of a directive syntax:

    .asect "section name", address

  .asect is the directive. This directive has two parameters, indicated by section name and address. When you use .asect, the first parameter must be an actual section name, enclosed in double quotes; the second parameter must be an address.
- Some directives can have a varying number of parameters. For example, the .byte directive can have up to 100 parameters. The syntax for this directive is:

    .byte value1 [, ... , valuen]

  This syntax shows that .byte must have at least one value parameter, but you have the option of supplying additional value parameters, separated by commas.
- In most cases, hexadecimal numbers are shown with the suffix h. For example, the following number is a hexadecimal 40 (decimal 64):

    40h

  Similarly, binary numbers are shown with the suffix b. For example, the following number is the decimal number 4 shown in binary form:

    0100b

- Bits are sometimes referenced with the following notation:

    Notation        Description                     Example
    Register(n-m)   Bits n through m of Register    AC0(15-0) represents the 16 least significant bits of the register AC0.
The information in a caution or a warning is provided for your protection. Please read each caution and warning carefully.
Trademarks
Code Composer Studio, TMS320C54x, C54x, TMS320C55x, and C55x are trademarks of Texas Instruments.
Contents

1 TMS320C54x Architectural Overview
  Lists some of the key features of the TMS320C54x DSP architecture.
  1.1 TMS320C54x Overview
  1.2 TMS320C54x Key Features

2 Improving System Performance
  Introduces features of the TMS320C54x DSP that improve system performance.
  2.1 Tips for Efficient Memory Allocation
  2.2 Memory Alignment Requirements
  2.3 Stack Initialization
  2.4 Overlay Management
  2.5 Memory-to-Memory Moves
  2.6 Efficient Power Management

3 Arithmetic and Logical Operations
  Shows how the TMS320C54x supports typical arithmetic and logical operations, including multiplication, addition, division, square roots, and extended-precision operations.
  3.1 Division and Modulus Algorithm
  3.2 Sines and Cosines
  3.3 Square Roots
  3.4 Extended-Precision Arithmetic
    3.4.1 Addition and Subtraction
    3.4.2 Multiplication
  3.5 Floating-Point Arithmetic
  3.6 Logical Operations

4 Application-Specific Instructions and Examples
  Shows examples of application-specific instructions that the TMS320C54x offers and the typical functions where they are used.
  4.1 Codebook Search for Excitation Signal in Speech Coding
  4.2 Viterbi Algorithm for Channel Decoding

5 TI C54x DSPLIB
  Introduces the features and the C functions of the TI TMS320C54x DSP function library.
  5.1 Features and Benefits
  5.2 DSPLIB Data Types
  5.3 DSPLIB Arguments
  5.4 Calling a DSPLIB Function from C
  5.5 Calling a DSPLIB Function from Assembly Language Source Code
  5.6 Where to Find Sample Code
  5.7 DSPLIB Functions
Figures

3-1 32-Bit Addition
3-2 32-Bit Subtraction
3-3 32-Bit Multiplication
3-4 IEEE Floating-Point Format
4-1 CELP-Based Speech Coder
4-2 Butterfly Structure of the Trellis Diagram
4-3 Pointer Management and Storage Scheme for Path Metrics

Tables

4-1 Code Generated by the Convolutional Encoder
Examples

2-1 Memory Alignment Example
2-2 Stack Initialization for Assembly Applications
2-3 Stack Initialization c_int00 routine
2-4 Memory-to-Memory Block Moves Using the RPT Instruction
3-1 Unsigned/Signed Integer Division Examples
3-2 Generation of a Sine Wave
3-3 Generation of a Cosine Wave
3-4 Square Root Computation
3-5 64-Bit Addition
3-6 64-Bit Subtraction
3-7 32-Bit Integer Multiplication
3-8 32-Bit Fractional Multiplication
3-9 Add Two Floating-Point Numbers
3-10 Multiply Two Floating-Point Numbers
3-11 Divide a Floating-Point Number by Another
3-12 Pack/Unpack Data in the Scrambler/Descrambler of a Digital Modem
4-1 Codebook Search
4-2 Viterbi Operator for Channel Coding
Equations

4-1 Optimum Code Vector Localization
4-2 Cross Correlation Variable (c_i)
4-3 Energy Variable (G_i)
4-4 Optimal Code Vector Condition
4-5 Polynomials for Convolutional Encoding
4-6 Branch Metric
Chapter 1

TMS320C54x Architectural Overview

Topic
1.1 TMS320C54x Overview
1.2 TMS320C54x Key Features
TMS320C54x Overview
- [...] buses, and four address buses for increased performance and versatility
- Advanced CPU design with a high degree of parallelism and application-[...]
- [...] power consumption
- Low power consumption and increased radiation hardness because of [...]
- CPU
  - Advanced multibus architecture with one program bus, three data buses, and four address buses
  - 40-bit arithmetic logic unit (ALU), including a 40-bit barrel shifter and two independent 40-bit accumulators
  - 17-bit x 17-bit parallel multiplier coupled to a 40-bit dedicated adder for nonpipelined single-cycle multiply/accumulate (MAC) operation
  - Compare, select, store unit (CSSU) for the add/compare selection of the Viterbi operator
  - Exponent encoder to compute the exponent of a 40-bit accumulator value in a single cycle
  - Two address generators, including eight auxiliary registers and two auxiliary register arithmetic units
  - Dual-CPU/core architecture on the 5420
- Instruction set
  - Single-instruction repeat and block repeat operations
  - Block memory move instructions for better program and data management
  - Instructions with a 32-bit long operand
  - Instructions with 2- or 3-operand simultaneous reads
  - Arithmetic instructions with parallel store and parallel load
  - Conditional-store instructions
  - Fast return from interrupt
Chapter 2

Improving System Performance

Topic
2.1 Tips for Efficient Memory Allocation
2.2 Memory Alignment Requirements
2.3 Stack Initialization
2.4 Overlay Management
2.5 Memory-to-Memory Moves
2.6 Efficient Power Management
The C54x can access a minimum of 64K words of program memory and 64K words of data memory. On-chip memory accesses are more efficient than off-chip accesses because the C54x has eight internal buses but only one external bus for off-chip accesses. As a result, an off-chip operation requires more cycles than the equivalent on-chip operation. When the DSP uses wait-state generators to interface to slower memories, the system cannot run at full speed. If on-chip memory consists of dual-access RAM (DARAM), accessing two operands from the same block does not incur a penalty. With single-access RAM (SARAM), however, accessing two operands from the same block incurs a cycle penalty.
- Tip: For random-access variables, use direct addressing and allocate them in the same 128-word page.

  Random-access variables use direct addressing mode. Data-page relative direct memory addressing makes efficient use of memory resources. Allocating all the random-access variables on a single data page saves some extra CPU cycles.

  Sometimes data variables have an associated lifetime. When that lifetime is over, the data variables are no longer needed. Thus, if two data variables have non-overlapping lifetimes, both can occupy the same physical memory. The UNION directive in the linker command file allows two or more data variables to share the same physical memory location.
- Tip: If required, reserve CPU resources for the exclusive use of interrupts.

  The actual lifetime of a variable determines whether it is retained across the application or only within a function. By careful organization of the code in an application, resources can be used optimally. Aggregate variables, such as arrays and structures, are accessed via pointers located within the program's data page, but the actual aggregate variables reside elsewhere in data memory. Depending upon the lifetime of the arrays or structures, these can also form unions accordingly.

  Interrupt-driven tasks require careful memory management. Often, programmers assume that all CPU resources are available when required. This may not be the case if tasks are interrupted periodically. These interrupts do not require many CPU resources, but they force the system to respond within a certain time. To ensure that interrupts are serviced within the specified time and the interrupted code resumes as soon as possible, you must use low-overhead interrupts. If the application requires frequent interrupts, you can set aside some of the CPU resources for these interrupts. When all CPU resources are used, simply saving and restoring the CPU's contents increases the overhead for an interrupt service routine (ISR). Dedicated auxiliary registers are useful for servicing interrupts.

  Allowing interrupts only at certain places in the code permits the various tasks of an application to reuse memory. If the code is fully interruptible (that is, interrupts can occur anywhere and interrupt response time is assured within a certain period), memory blocks must be kept separate from each other. On the other hand, if a context switch occurs at the completion of a function rather than in the middle of execution, the variables can be overlapped for efficiency. This allows variables to use the same physical memory addresses at different times.
- [...] operations; that is, the most significant word at an even address and the least significant word at an odd address.
- Circular buffers should be aligned at a 2^K-word boundary, where K is the smallest integer that satisfies 2^K > R and R is the size of the circular buffer in words. Use the .align directive to align buffers to the correct boundary. If an application uses circular buffers of different sizes, allocate the largest buffer first, the next largest second, and so forth. Example 2-1 shows this alignment scheme: the largest circular buffer is 1024 words and is therefore assigned first, and a 256-word buffer is assigned next. Unused memory can be used for other functions without conflict.
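As a rough illustration of the alignment rule for circular buffers, the following C sketch computes K for a buffer of R words and rounds a start address up to the matching boundary. It is a host-side model only; the helper names circ_buf_k and circ_buf_align are hypothetical and not part of any TI tool, and on a real target the assembler or linker performs the alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: smallest K such that 2^K > r,
 * where r is the circular buffer size in words. */
unsigned circ_buf_k(uint16_t r)
{
    unsigned k = 0;
    while ((1UL << k) <= r) {
        k++;
    }
    return k;
}

/* Hypothetical helper: round addr up to the next multiple of 2^K words. */
uint32_t circ_buf_align(uint32_t addr, uint16_t r)
{
    uint32_t boundary;
    assert(r > 0);
    boundary = (uint32_t)1 << circ_buf_k(r);
    return (addr + boundary - 1) & ~(boundary - 1);
}
```

Under this rule a 1024-word buffer gives K = 11 (a 2048-word boundary), because 2^10 is not strictly greater than 1024, and a 256-word buffer gives K = 9.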
Stack Initialization
Example 2-3 shows stack initialization by the c_int00 routine from the C run-time support library (rts.lib) when the application is written in C. The compiler uses the stack to allocate local variables, pass arguments, and save the processor status. The stack size is set by the linker, and the default size is 1K words. If 1K words of stack is more than necessary, allocate a smaller stack by using the stack directive in the linker command file and use the freed memory for other data variables.
    ADDM #(_STACK_SIZE-1), *(SP)    ; add size to get to the top
    ANDM #0FFFEh, *(SP)             ; make sure it is an even address
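The arithmetic behind these two instructions can be modeled on a host in C. The sketch below assumes that SP already holds the base address of the stack section before the ADDM executes, which this excerpt implies but does not show; the function name initial_sp is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two instructions:
 *   SP = SP + (size - 1)    move from the base to the top of the stack
 *   SP = SP & 0xFFFE        force an even address
 * stack_base is assumed to be the value already loaded into SP. */
uint16_t initial_sp(uint16_t stack_base, uint16_t stack_size)
{
    uint16_t sp;
    assert(stack_size > 0);
    sp = (uint16_t)(stack_base + stack_size - 1u);
    return (uint16_t)(sp & 0xFFFEu);
}
```

With a base of 0100h and the default 1K-word stack, the model gives 04FEh: the top word is 04FFh, and the mask rounds it down to an even address.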
Overlay Management
This is achieved by setting the OVLY bit in the PMST register. This is particularly useful in loading the coefficients of a filter, since program and data use the same physical memory.
- Overlay off-chip memory to achieve more than 64K words.

  If an application needs more than 64K words of either data or program memory, two options are available. The first is to use one of the C54x derivatives that provides more than 16 address lines to access more than 64K words of program space. The other is to use an external device that provides upper address bits beyond the 16-bit memory range. The DSP writes a value to a register located in its I/O space, and the data lines of that register supply the higher address bits. This implements bank switching to cross the 64K boundary. Some devices have a Bank Switch Control Register to select the memory bank boundary size. Since a bank switch requires action from the DSP, frequent switching between banks is not very efficient. It is more efficient to partition tasks within a bank and switch banks only when starting new tasks.
Memory-to-Memory Moves
Chapter 3

Arithmetic and Logical Operations

Topic
3.1 Division and Modulus Algorithm
3.2 Sines and Cosines
3.3 Square Roots
3.4 Extended-Precision Arithmetic
3.5 Floating-Point Arithmetic
3.6 Logical Operations
SUBC performs binary division like long division. For 16-bit by 16-bit integer division, the dividend is stored in the low part of accumulator A. The program repeats the SUBC command 16 times to produce a 16-bit quotient in the low part of accumulator A and a 16-bit remainder in the high part of accumulator A. For each SUBC subtraction that results in a negative answer, the accumulator is left-shifted by 1 bit. This corresponds to putting a 0 in the quotient when the divisor does not go into the dividend. For each subtraction that produces a zero or positive answer, the difference in the ALU output is left-shifted by 1 bit, 1 is added, and the result is stored in accumulator A. This corresponds to putting a 1 in the quotient when the divisor goes into the dividend.

Similarly, 32-bit by 16-bit integer division is implemented using two stages of 16-bit by 16-bit integer division. The first stage takes the upper 16 bits of the 32-bit dividend and the 16-bit divisor as inputs. The resulting quotient becomes the upper 16 bits of the final quotient. The remainder is left-shifted by 16 bits and added to the lower 16 bits of the original dividend. This sum and the 16-bit divisor become inputs to the second stage. The lower 16 bits of the resulting quotient form the lower 16 bits of the final quotient, and the resulting remainder is the final remainder.

Both the dividend and divisor must be positive when using SUBC. The division algorithm computes the quotient as follows:

1) The algorithm determines the sign of the quotient and stores this in accumulator B.
2) The program determines the quotient of the absolute value of the numerator and the denominator, using repeated SUBC commands.
3) The program takes the negative of the result of step 2, if appropriate, according to the value in accumulator B.

For unsigned division and modulus (types I and III), you must disable the sign extension mode (SXM = 0). For signed division and modulus (types II and IV), turn on sign extension mode (SXM = 1). The absolute value of the numerator must be greater than the absolute value of the denominator.
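To make the conditional-subtract step concrete, here is a minimal C model of 16-bit by 16-bit unsigned division built from 16 SUBC-like steps. It is a host-side sketch, not TI code: the names subc_step and divmod_u16 are hypothetical, and the 40-bit accumulator is approximated by a 32-bit unsigned value, which is sufficient for 16-bit operands.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one SUBC step: subtract the divisor aligned at
 * bit 15; on a zero or positive result keep the difference, shift left
 * and add 1, otherwise just shift the accumulator left. */
uint32_t subc_step(uint32_t acc, uint16_t den)
{
    uint32_t shifted_den = (uint32_t)den << 15;
    if (acc >= shifted_den) {
        return ((acc - shifted_den) << 1) + 1u;   /* quotient bit = 1 */
    }
    return acc << 1;                              /* quotient bit = 0 */
}

/* Sixteen steps leave the quotient in the low half of the accumulator
 * and the remainder in the high half. */
void divmod_u16(uint16_t num, uint16_t den, uint16_t *quot, uint16_t *rem)
{
    uint32_t acc = num;
    int i;
    assert(den != 0);
    for (i = 0; i < 16; i++) {
        acc = subc_step(acc, den);
    }
    *quot = (uint16_t)(acc & 0xFFFFu);
    *rem = (uint16_t)(acc >> 16);
}
```

For example, dividing 7 by 2 leaves 3 in the low half and 1 in the high half, matching a quotient stored with STL and a remainder stored with STH.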
;; NOTES: Sign extension mode must be turned off.
;;===========================================================================
        .def    DivModUI16
        .ref    d_Num
        .ref    d_Den
        .ref    d_Quot
        .ref    d_Rem
        .text
DivModUI16:
        RSBX    SXM                     ; sign extension mode off
        LD      @d_Num,A
        RPT     #(16-1)
        SUBC    @d_Den,A
        STL     A,@d_Quot
        STH     A,@d_Rem
        RET
;;===========================================================================
;;
;; Module Name: DivModI32
;;
;;===========================================================================
;;
;; Description: 32 Bit By 16 Bit Signed Integer Divide And Modulus.
;;
;;-------------------------------------------------------------------------;;
;; Usage ASM:
;;      .bss    d_NumH,1        ; 80000001h to 7FFFFFFFh
;;      .bss    d_NumL,1
;;      .bss    d_Den,1         ; 8000h to 7FFFh
;;      .bss    d_QuotH,1       ; 80000001h to 7FFFFFFFh
;;      .bss    d_QuotL,1
;;      .bss    d_Rem,1         ; 8000h to 7FFFh
;;
;;      CALL    DivModI32
;;
;;-------------------------------------------------------------------------;;
;; Input:       d_NumH
        LD      d_QuotH,16,A
        ADDS    d_QuotL,A
        NEG     A
        STH     A,d_QuotH
        STL     A,d_QuotL
DivModI32Skip:
        RET
;;===========================================================================
;;
;; Module Name: DivModI16
;;
;;===========================================================================
;;
;; Description: 16 Bit By 16 Bit Signed Integer Divide And Modulus.
;;
;;-------------------------------------------------------------------------;;
;; Usage ASM:
;;      .bss    d_Num,1         ; 8000h to 7FFFh (Q0.15 format)
;;      .bss    d_Den,1         ; 8000h to 7FFFh (Q0.15 format)
;;      .bss    d_Quot,1        ; 8000h to 7FFFh (Q0.15 format)
;;      .bss    d_Rem,1         ; 8000h to 7FFFh (Q0.15 format)
;;
;;      CALL    DivModI16
;;
;;-------------------------------------------------------------------------;;
;; Input:       d_Num
;;              d_Den
;;
;; Modifies:    AR2
;;              T
;;              accumulator A
;;              accumulator B
;;              SXM
;;
;; Output:      d_Quot
;;              d_Rem
;;
;;-------------------------------------------------------------------------;;
;; Algorithm:   Quot = Num/Den
;;              Rem  = Num%Den
;;
;;      Signed division is similar to unsigned division except that
;;      the sign of Num and Den must be taken into account.
;;      First the sign is determined by multiplying Num by Den.
;;      Then division is performed on the absolute values.
;;
;;      Num = n1|n0     Quot = q1|q0
The following recursive formulas generate the sine and cosine waves:

$$\sin(n\theta) = 2\cos(\theta)\,\sin\{(n-1)\theta\} - \sin\{(n-2)\theta\}$$

$$\cos(n\theta) = 2\cos(\theta)\,\cos\{(n-1)\theta\} - \cos\{(n-2)\theta\}$$

These equations use two steps to generate a sine or cosine wave. The first evaluates $\cos(\theta)$ and the second generates the signal itself, using one multiply and one subtract for a repeat counter, n.
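The second step of the recursion can be sketched in C with Q15 fixed-point values. This is an illustrative host-side model, not the routine from the examples: the function name sine_gen_q15 is hypothetical, cos(theta) and sin(theta) are assumed to be precomputed in Q15, and the division by 2^14 truncates toward zero, which may round differently from the DSP's shift-and-store sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fills out[0..n-1] with sin(k*theta) in Q15 using
 * y[k] = 2*cos(theta)*y[k-1] - y[k-2], seeded with sin(0) and sin(theta). */
void sine_gen_q15(int16_t cos_theta, int16_t sin_theta, int16_t *out, size_t n)
{
    size_t k;
    assert(out != NULL);
    for (k = 0; k < n; k++) {
        int32_t y;
        if (k == 0) {
            y = 0;                               /* sin(0) */
        } else if (k == 1) {
            y = sin_theta;                       /* sin(theta) */
        } else {
            /* 2*cos(theta)*y[k-1] in Q15 equals (c * y1) / 2^14 */
            y = ((int32_t)cos_theta * out[k - 1]) / 16384 - out[k - 2];
        }
        if (y > 32767) y = 32767;                /* saturate to 16 bits */
        if (y < -32768) y = -32768;
        out[k] = (int16_t)y;
    }
}
```

With theta = 60 degrees, cos(theta) is exactly 16384 in Q15 and the recursion reduces to y[k] = y[k-1] - y[k-2], so the output repeats with a period of six samples.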
Example 3-2 and Example 3-3 assume that the delayed values $\cos\{(n-1)\theta\}$ and $\cos\{(n-2)\theta\}$ (or $\sin\{(n-1)\theta\}$ and $\sin\{(n-2)\theta\}$) are precalculated and stored in memory. A Taylor series expansion can also be used to evaluate these delayed values for a given $\theta$.
        .text
start:
        LD      #d_cos_delay1,DP
        CALL    cos_start
        CALL    cos_prog                ; calculate cos(theta)
        STM     #d_cos_delay1,AR3
        RPTZ    A,#3h
        STL     A,*AR3+
        STM     #d_cos_delay1,AR3
        ST      #K_cos_delay_1,*AR3+
        ST      #K_cos_delay_2,*AR3
        STM     #d_cos_delay1,AR3
        ST      #K_theta,d_theta
        STM     #1,AR0
        STM     #K_2,BK
        STM     #K_256-1,BRC
cos_generate:
        RPTB    end_of_cose
        MPY     *AR2,*AR3+0%,A
        SUB     *AR3,15,A
        SFTA    A,1,A
        STH     A,*AR3
        PORTW   *AR3,56h                ; output values
end_of_cose
        NOP
        NOP
        B       cos_generate            ; next sample
        .end
Square Roots
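$$\sqrt{1+x} = 1 + \frac{x}{2} - 0.5\left(\frac{x}{2}\right)^2 + 0.5\left(\frac{x}{2}\right)^3 - 0.625\left(\frac{x}{2}\right)^4 + 0.875\left(\frac{x}{2}\right)^5$$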
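The coefficients that survive in this section (0.5, 0.625, and 0.875 applied to powers of x/2) match the truncated Taylor series of sqrt(1 + x). The C sketch below evaluates that series in floating point as a quick check of the approximation. It is an assumption-laden illustration rather than the fixed-point routine of Example 3-4, and the name sqrt1p_series is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: approximates sqrt(1 + x) for |x| < 1 with the
 * Taylor series truncated after the fifth-order term, written in powers
 * of h = x/2. */
double sqrt1p_series(double x)
{
    double h = x / 2.0;
    double h2 = h * h;
    double h3 = h2 * h;
    double h4 = h3 * h;
    double h5 = h4 * h;
    assert(x > -1.0 && x < 1.0);
    return 1.0 + h - 0.5 * h2 + 0.5 * h3 - 0.625 * h4 + 0.875 * h5;
}
```

The approximation is most accurate near x = 0 and degrades as |x| approaches 1, so an input would normally be scaled into a small range around 1 before the series is applied.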
Extended-Precision Arithmetic
3.4.1 Addition and Subtraction

A carry can also be generated when two data-memory operands are added or when a data-memory operand is added to an immediate operand. If a carry is not generated, the carry bit is cleared. The ADD instruction with a 16-bit shift is an exception because it only sets the carry bit. This allows the ALU to generate the appropriate carry when adding to the lower or upper half of the accumulator causes a carry. Figure 3-1 shows several 32-bit additions and their effect on the carry bit.
Figure 3-1. 32-Bit Addition
Example 3-5 adds two 64-bit numbers to obtain a 64-bit result. The partial sum of the 64-bit addition is efficiently performed by the DLD and DADD instructions, which handle 32-bit operands in a single cycle. For the upper half of a partial sum, the ADDC (ADD with carry) instruction uses the carry bit generated in the lower 32-bit partial sum. Each partial sum is stored in two memory locations by the DST (long-word store) instruction.
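The carry propagation between the two partial sums can be sketched in C with 32-bit halves. This is a host-side model under stated assumptions, not the listing from the example: the type u64_pair and the function add64 are hypothetical names, and the comments only indicate which instruction each step roughly corresponds to.

```c
#include <stdint.h>

/* Hypothetical 64-bit value held as two 32-bit halves. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} u64_pair;

/* Adds two 64-bit values as two 32-bit partial sums. */
u64_pair add64(u64_pair a, u64_pair b)
{
    u64_pair sum;
    uint32_t carry;
    sum.lo = a.lo + b.lo;                   /* lower partial sum (DADD)    */
    carry = (sum.lo < a.lo) ? 1u : 0u;      /* carry out of bit 31         */
    sum.hi = a.hi + b.hi + carry;           /* upper partial sum (ADDC)    */
    return sum;
}
```

Adding 1 to 00000000 FFFFFFFFh, for instance, wraps the lower half to zero and carries 1 into the upper half.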
Similar to addition, the carry bit is reset if a borrow is generated when an accumulator value is subtracted from:
- The other accumulator
- A data-memory operand
- An immediate operand

A borrow can also be generated when two data-memory operands are subtracted or when an immediate operand is subtracted from a data-memory operand. If a borrow is not generated, the carry bit is set. The SUB instruction with a 16-bit shift is an exception because it only resets the carry bit. This allows the ALU to generate the appropriate carry when subtracting from the lower or the upper half of the accumulator causes a borrow. Figure 3-2 shows several 32-bit subtractions and their effect on the carry bit.
Figure 3-2. 32-Bit Subtraction
Example 3-6 subtracts two 64-bit numbers on the C54x. The partial remainder of the 64-bit subtraction is efficiently performed by the DLD (long word load) and the DSUB (double precision subtract) instructions, which handle 32-bit operands in a single cycle. For the upper half of a partial remainder, the SUBB (SUB with borrow) instruction uses the borrow bit generated in the lower 32-bit partial remainder. Each partial remainder is stored in two consecutive memory locations by a DST.
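The same idea applies to subtraction, with a borrow instead of a carry. The C sketch below is a host-side model only; u64_parts and sub64 are hypothetical names. On the C54x the borrow is represented by a cleared carry bit, whereas the model keeps it in an ordinary variable.

```c
#include <stdint.h>

/* Hypothetical 64-bit value held as two 32-bit halves. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} u64_parts;

/* Subtracts b from a as two 32-bit partial remainders. */
u64_parts sub64(u64_parts a, u64_parts b)
{
    u64_parts diff;
    uint32_t borrow = (a.lo < b.lo) ? 1u : 0u;  /* borrow out of the low half   */
    diff.lo = a.lo - b.lo;                      /* lower partial result (DSUB)  */
    diff.hi = a.hi - b.hi - borrow;             /* upper partial result (SUBB)  */
    return diff;
}
```

Subtracting 1 from 00000001 00000000h borrows from the upper half and yields 00000000 FFFFFFFFh.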
3.4.2 Multiplication
The MPYU (unsigned multiply) and MACSU (signed/unsigned multiply and accumulate) instructions can also handle extended-precision calculations. Figure 3-3 shows how two 32-bit numbers are multiplied to obtain a 64-bit product. The MPYU instruction multiplies two unsigned 16-bit numbers and places the 32-bit result in one of the accumulators in a single cycle. The MACSU instruction multiplies a signed 16-bit number by an unsigned 16-bit number and accumulates the result in a single cycle. Efficiency is gained by generating partial products of the 16-bit portions of a 32-bit (or larger) value instead of having to split the value into 15-bit (or smaller) parts.
Figure 3-3. 32-Bit Multiplication
The program in Example 3-7 shows that a multiply of two 32-bit integer numbers requires one multiply, three multiply/accumulates, and two shifts. The product is a 64-bit integer number. Note in particular the use of the MACSU, MPYU, and LD instructions. The LD instruction can perform a right-shift in the accumulator by 16 bits in a single cycle.
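The decomposition into four 16-bit partial products can be checked with a short C model. This is a sketch under stated assumptions, not the listing from the example: the function name mul32_partial is hypothetical, each 32-bit operand is split into a signed upper half and an unsigned lower half, and the comments name the instruction type each product roughly corresponds to.

```c
#include <stdint.h>

/* Hypothetical helper: 32 x 32 -> 64-bit signed multiply built from
 * 16-bit partial products, with x = x1*2^16 + x0 and y = y1*2^16 + y0. */
int64_t mul32_partial(int32_t x, int32_t y)
{
    uint16_t x0 = (uint16_t)x;                        /* unsigned low halves */
    uint16_t y0 = (uint16_t)y;
    int32_t x1 = (x - (int32_t)x0) / 65536;           /* signed high halves  */
    int32_t y1 = (y - (int32_t)y0) / 65536;

    uint32_t p00 = (uint32_t)x0 * y0;                 /* unsigned x unsigned (MPYU)   */
    int64_t cross = (int64_t)x1 * y0                  /* signed x unsigned (MACSU)    */
                  + (int64_t)y1 * x0;                 /* signed x unsigned (MACSU)    */
    int64_t p11 = (int64_t)x1 * y1;                   /* signed x signed (MAC)        */

    return (int64_t)p00 + cross * 65536 + p11 * 4294967296LL;
}
```

The model uses one unsigned multiply and three further products that are accumulated, which mirrors the one multiply and three multiply/accumulates described above; the two 16-bit weightings play the role of the shifts.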
Example 3-8 performs fractional multiplication. The operands are in Q31 format, while the product is in Q30 format.
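The Q30 result format follows from the bit positions: a Q31 by Q31 product has 62 fractional bits, and keeping only the upper 32 bits of the 64-bit product drops 32 of them, leaving 30. A minimal C sketch of this, assuming the upper word is what is kept (the function name q31_mul_q30 is hypothetical):

```c
#include <stdint.h>

/* Hypothetical helper: multiplies two Q31 values and returns the upper
 * 32 bits of the Q62 product, which is a Q30 value. */
int32_t q31_mul_q30(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * b;             /* Q62 product                    */
    int64_t hi = p / 4294967296LL;          /* divide by 2^32, toward zero    */
    if (p % 4294967296LL < 0) {
        hi -= 1;                            /* adjust to floor, as when the   */
    }                                       /* upper word is simply taken     */
    return (int32_t)hi;
}
```

For example, 0.5 in Q31 is 40000000h; squaring it gives 10000000h, which is 0.25 in Q30.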
Floating-Point Arithmetic
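The values of the numbers represented in the IEEE floating-point format are as follows:

$$(-1)^s \times 2^{e-127} \times (01.f)$$

Special cases:

$$
\begin{array}{ll}
(-1)^s \times 0.0 & \text{if } e = 0 \text{ and } f = 0 \text{ (zero)} \\
(-1)^s \times 2^{-126} \times (0.f) & \text{if } e = 0 \text{ and } f \neq 0 \text{ (denormalized)} \\
(-1)^s \times \infty & \text{if } e = 255 \text{ and } f = 0 \text{ (infinity)} \\
\text{NaN (not a number)} & \text{if } e = 255 \text{ and } f \neq 0
\end{array}
$$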
Example 3-9 through Example 3-11 illustrate how the C54x performs floating-point addition, multiplication, and division.
; load floating #1 12
*
*;*****************************************************************************
*; CONVERSION OF FLOATING POINT FORMAT - UNPACK
*; Test OP1 for special case treatment of zero.
*; Split the MSW of OP1 in the accumulator.
*; Save the exponent on the stack [xxxx xxxx EEEE EEEE].
*; Add the implied one to the mantissa value.
*; Store the mantissa as a signed value
*;*****************************************************************************
*
        DLD     op1_msw,A           ; load the OP1 high word
        SFTA    A,8
        SFTA    A,-8                ; shift right by 8
        BC      op1_zero,AEQ        ; If op1 is 0, jump to special case
        LD      A,B                 ; Copy OP1 to acc B
        RSBX    SXM                 ; Reset for right shifts used for masking
        SFTL    A,1                 ; Remove sign bit
        STH     A,-8,op1_se         ; Store exponent to stack
        SFTL    A,8                 ; Remove exponent
        SFTL    A,-9
        ADD     #080h,16,A          ; Add implied 1 to mantissa
        XC      1,BLT               ; Negate OP1 mantissa for negative values
        NEG     A
        SSBX    SXM                 ; Make sure OP2 is sign-extended
        DST     A,op1_hm            ; Store mantissa
*
*;*****************************************************************************
*; CONVERSION OF FLOATING POINT FORMAT - UNPACK
*; Test OP2 for special case treatment of zero.
*; Split the MSW of OP2 in the accumulator.
*; Save the exponent on the stack [xxxx xxxx EEEE EEEE].
*; Add the implied one to the mantissa value.
*; Store the mantissa as a signed value
*;*****************************************************************************
*
        DLD     op2_msw,A           ; Load acc with op2
        BC      op2_zero,AEQ        ; If op2 is 0, jump to special case
        LD      A,B                 ; Copy OP2 to acc B
        SFTL    A,1                 ; Remove sign bit
        STH     A,-8,op2_se         ; Store exponent to stack
        RSBX    SXM                 ; Reset for right shifts used for masking
        SFTL    A,8                 ; Remove exponent
        SFTL    A,-9
        ADD     #080h,16,A          ; Add implied 1 to mantissa
        XC      1,BLT               ; Negate OP2 mantissa for negative values
        NEG     A
        B
        NOP
        NOP
        .text
float_div:
        LD      #res_hm,DP          ; initialize the page pointer
        LD      #K_divisor_high,A   ; load floating #2 12
        STL     A,op2_msw
        LD      #K_divisor_low,A
        STL     A,op2_lsw
        LD      #K_dividend_high,A  ; load floating #1 12
        STL     A,op1_msw
        LD      #K_dividend_low,A
        STL     A,op1_lsw
**********************************************************
        RSBX    C16                 ; Ensure long adds for later
*
*;*****************************************************************************
*; CONVERSION OF FLOATING POINT FORMAT - UNPACK
*; Test OP1 for special case treatment of zero.
*; Split the MSW of A in the accumulator.
*; Save the sign and exponent on the stack [xxxx xxxS EEEE EEEE].
*; Add the implied one to the mantissa value.
*; Store entire mantissa with a long word store
*;*****************************************************************************
        DLD     op1_msw,A           ; load acc a with OP1
        SFTA    A,8
        SFTA    A,-8
        BC      op1_zero,AEQ        ; if op1 is 0, jump to special case
        STH     A,-7,op1_se         ; store sign and exponent to stack
        STL     A,op1_lm            ; store low mantissa
        AND     #07Fh,16,A          ; mask off sign & exp to get high mantissa
        ADD     #080h,16,A          ; ADD implied 1 to mantissa
        STH     A,op1_hm            ; store mantissa to stack
*
*;*****************************************************************************
*; CONVERSION OF FLOATING POINT FORMAT - UNPACK
*; Test OP2 for special case treatment of zero.
*; Split the MSW of A in the accumulator.
*; Save the sign and exponent on the stack [xxxx xxxS EEEE EEEE].
*; Add the implied one to the mantissa value.
*; Store entire mantissa with a long word store
*;*****************************************************************************
        DLD     op2_msw,A           ; load acc a with OP2
        BC      op2_zero,AEQ        ; if OP2 is 0, divide by zero
        STH     A,-7,op2_se         ; store sign and exponent to stack
        STL     A,op2_lm            ; store low mantissa
        AND     #07Fh,16,A          ; mask off sign & exp to get high mantissa
*
*;*****************************************************************************
*; SIGN EVALUATION
*; Exclusive OR sign bits of OP1 and OP2 to determine sign of result.
*;*****************************************************************************
*
        LD      op1_se,A            ; load sign and exp of op1 to acc
        XOR     op2_se,A            ; xor with op2 to get sign of result
        AND     #00100h,A           ; mask to get sign
        STL     A,res_sign          ; save sign of result to stack
*
*;*****************************************************************************
*; EXPONENT SUMMATION
*; Find difference between operand exponents to determine the result exponent.
*; Since the subtraction process removes the bias it must be re-added in.
*;
*; Branch to one of three blocks of processing
*; Case 1: exp OP1 - exp OP2 results in underflow (exp < 0)
*; Case 2: exp OP1 - exp OP2 results in overflow (exp >= 0FFh)
*; Case 3: exp OP1 - exp OP2 results are in range (exp >= 0 & exp < 0FFh)
*; NOTE: Cases when result exp = 0 may result in underflow unless there
*;       is a carry in the result that increments the exponent to 1.
*;       Cases when result exp = 0FEh may result in overflow if there is a carry
*;       in the result that increments the exponent to 0FFh.
*;*****************************************************************************
*
        LD      op1_se,A            ; Load OP1 sign and exponent
        AND     #0FFh,A             ; Mask OP1 exponent
*
        LD      op2_se,B            ; Load OP2 sign and exponent
        AND     #0FFh,B             ; Mask OP2 exponent
*
        ADD     #07Fh,A             ; Add offset (difference eliminates offset)
        SUB     B,A                 ; Take difference between exponents
        STL     A,res_exp           ; Save result exponent on stack
*
        BC      underflow,ALT       ; branch to underflow handler if exp < 0
        SUB     #0FFh,A             ; test for overflow
        BC      overflow,AGT        ; branch to overflow if exp > 127
*
*;*****************************************************************************
*; DIVISION
*; Division is implemented by parts. The mantissas for both OP1 and OP2 are left shifted
*; in the 32-bit field to reduce the effect of secondary and tertiary contributions to
*; the final result. The left shifted results are identified as OP1HI, OP1LO, OP2HI,
*; and OP2LO where OP1HI and OP2HI have the xx most significant bits of the mantissas
*; and OP1LO and OP2LO contain the remaining bits of each mantissa. Let QHI and QLO
*; represent the two portions of the resultant mantissa. Then
*;
*;   QHI + QLO = (OP1HI + OP1LO) / (OP2HI + OP2LO)
*;             = ((OP1HI + OP1LO) / OP2HI) * (1 / (1 + OP2LO/OP2HI))
*;
*; With X = OP2LO/OP2HI, the series expansion of 1/(1 + X) is
*;
*;   1/(1 + X) = 1 - X + X^2 - X^3 + ...
*;
*; Since OP2HI contains the first xx significant bits of the OP2 mantissa,
*;
*;   X = OP2LO/OP2HI < 2^-yy
*;
*; Therefore the X^2 term and all subsequent terms are less than the least significant
*; bit of the 24-bit result and can be dropped. The result then becomes
*;
*;   QHI + QLO = ((OP1HI + OP1LO) / OP2HI) * (1 - OP2LO/OP2HI)
*;             = (Q'HI + Q'LO) * (1 - OP2LO/OP2HI)
*;
*; where Q'HI and Q'LO represent the first approximation of the result. Also, since
*; Q'LO and OP2LO/OP2HI are less significant than the 24th bit of the result, this
*; product term can be dropped so that
*;
*;   QHI + QLO = Q'HI + Q'LO - Q'HI * (OP2LO/OP2HI)
*;
*;******************************************************************************
        DLD     op1_hm,A            ; Load dividend mantissa
        SFTL    A,6                 ; Shift dividend in preparation for division
*
        DLD     op2_hm,B            ; Load divisor mantissa
        SFTL    B,7                 ; Shift divisor in preparation for division
        DST     B,op2_hm            ; Save off divisor
*
        RPT     #14                 ; QHI = OP1HI/OP2HI
        SUBC    op2_hm,A
        STL     A,res_hm            ; Save QHI
*
        SUBS    res_hm,A            ; Clear QHI from ACC
        RPT     #10                 ; QLO = OP1LO / OP2HI
        SUBC    op2_hm,A
        STL     A,5,res_lm          ; Save QLO
*
        LD      res_hm,T            ; T = QHI
        MPYU    op2_lm,A            ; Store QHI * OP2LO in acc A
        SFTL    A,1                 ;
*
        RPT     #11                 ; Calculate QHI * OP2LO / OP2HI
        SUBC    op2_hm,A            ; (correction factor)
        SFTL    A,4                 ; Left shift to bring it to proper range
        AND     #0FFFFh,A           ; Mask off correction factor
*
        NEG     A                   ; Subtract correction factor
        ADDS    res_lm,A            ; Add QLO
        ADD     res_hm,16,A         ; Add QHI
*
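The approximation used by the division-by-parts scheme can be checked numerically with a short C sketch. This is not a translation of the listing: it works in double-precision arithmetic instead of the repeated SUBC steps, and `div_by_parts` with its parameter names is hypothetical. It shows that dropping the X^2 and higher terms leaves an error smaller than the first approximation times X^2.

```c
/* First-order division by parts:
   a / (hi + lo) is approximated by (a / hi) * (1 - lo / hi). */
static double div_by_parts(double a, double hi, double lo)
{
    double q_first = a / hi;        /* first approximation of the quotient */
    double x = lo / hi;             /* X = OP2LO / OP2HI */

    return q_first - q_first * x;   /* X^2 and higher terms are dropped */
}
```

Because 1/(1 + X) minus (1 - X) equals X^2/(1 + X), the approximation is slightly low, and the shortfall shrinks quadratically as more significant bits are placed in the high part of the divisor.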
Logical Operations
The same polynomial sequence in the descrambler section reproduces the original 16-bit input sequence. The output of the descrambler is a 16-bit word in packed format.
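The scrambler and descrambler relationship can be sketched in C. The polynomial is not given in this passage, so the sketch assumes a hypothetical one, 1 + x^-18 + x^-23, purely for illustration; the function names and the 23-bit shift-register layout are likewise assumptions. Each function processes one 16-bit packed word, most significant bit first.

```c
#include <stdint.h>

/* Assumed (hypothetical) polynomial: 1 + x^-18 + x^-23.
   Bit 0 of the state holds the most recent bit. */
static uint16_t scramble16(uint16_t in, uint32_t *state)
{
    uint16_t out = 0;
    for (int bit = 15; bit >= 0; bit--) {
        uint32_t d = ((uint32_t)in >> bit) & 1u;
        uint32_t s = d ^ ((*state >> 17) & 1u) ^ ((*state >> 22) & 1u);
        *state = ((*state << 1) | s) & 0x7FFFFFu;   /* shift in the scrambled bit */
        out = (uint16_t)(out | (s << bit));
    }
    return out;
}

static uint16_t descramble16(uint16_t in, uint32_t *state)
{
    uint16_t out = 0;
    for (int bit = 15; bit >= 0; bit--) {
        uint32_t s = ((uint32_t)in >> bit) & 1u;
        uint32_t d = s ^ ((*state >> 17) & 1u) ^ ((*state >> 22) & 1u);
        *state = ((*state << 1) | s) & 0x7FFFFFu;   /* shift in the received bit */
        out = (uint16_t)(out | (d << bit));
    }
    return out;
}
```

Both sides shift the scrambled bit into their registers, so the two registers hold the same history and the same taps cancel, which is why the same polynomial recovers the original word.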
Chapter 4
Topic
4.1 Codebook Search for Excitation Signal in Speech Coding
4.2 Viterbi Algorithm for Channel Decoding
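[Figure: block diagram of the codebook search, in which the synthesis filter output g(n) is combined with the weighted input speech p(n) at a summing node.]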
To locate an optimum code vector, the codebook search uses Equation 4-1 to minimize the mean-square error.

$$E_i = \sum_{n=0}^{N-1} \{\, p(n) - \gamma_i\, g_i(n) \,\}^2 \qquad (N\text{: subframe length}) \tag{4-1}$$

The variable $p(n)$ is the weighted input speech, $g_i(n)$ is the zero-input response of the synthesis filter, and $\gamma_i$ is the gain of the codebook. The cross-correlation ($c_i$) of $p(n)$ and $g_i(n)$ is represented by Equation 4-2. The energy ($G_i$) of $g_i(n)$ is represented by Equation 4-3.

$$c_i = \sum_{n=0}^{N-1} g_i(n)\, p(n) \tag{4-2}$$

$$G_i = \sum_{n=0}^{N-1} g_i^2(n) \tag{4-3}$$

Equation 4-1 is minimized by maximizing $c_i^2 / G_i$. Therefore, assuming that a code vector with $i = opt$ is optimal, Equation 4-4 is always met for any $i$. The codebook search routine evaluates this equation for each code vector and finds the optimum one.

$$\frac{c_i^2}{G_i} \le \frac{c_{opt}^2}{G_{opt}} \tag{4-4}$$
Example 4-1 shows the implementation algorithm for codebook search on the C54x. The square (SQUR), multiply (MPYA), and conditional store (SRCCD, STRCD, SACCD) instructions are used to minimize the execution cycles. AR5 points to $c_i$ and AR2 points to $G_i$. AR3 points to the locations of $G_{opt}$ and $c_{opt}^2$. The value of i(opt) is stored at the location addressed by AR4.
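The search criterion can be sketched in C. This is a minimal illustration, not the routine of the example: `codebook_search` is a hypothetical name, the inputs are assumed to be precomputed arrays of cross-correlations and energies with every energy greater than zero, and the ratio test is cross-multiplied so that no division is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the index i that maximizes c[i]^2 / G[i].
   Assumes n >= 1 and G[i] > 0 for every i. */
static size_t codebook_search(const int16_t *c, const int16_t *G, size_t n)
{
    size_t  opt    = 0;
    int32_t c2_opt = (int32_t)c[0] * c[0];   /* c_opt^2 */
    int32_t g_opt  = G[0];                   /* G_opt   */

    for (size_t i = 1; i < n; i++) {
        int32_t c2 = (int32_t)c[i] * c[i];
        /* c_i^2 / G_i > c_opt^2 / G_opt  <=>  c_i^2 * G_opt > c_opt^2 * G_i */
        if ((int64_t)c2 * g_opt > (int64_t)c2_opt * G[i]) {
            opt    = i;       /* new optimum: remember index, c^2, and G */
            c2_opt = c2;
            g_opt  = G[i];
        }
    }
    return opt;
}
```

The three values updated inside the `if` correspond to what the conditional store instructions keep track of: the optimum index, its energy, and its squared cross-correlation.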
        SRCCD   *AR4,BGEQ
        STRCD   *AR3+,BGEQ
        SACCD   A,*AR3,BGEQ
        NOP
Srh_End:
        RET
        .end
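[Figure: Viterbi butterfly structure, in which old states 2J and 2J+1 connect to new states J and J+8 through branch metrics M and -M.]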
SD(2i) is the first symbol that represents a soft-decision input and SD(2i+1) is the second symbol. B(J,0) and B(J,1) correspond to the code generated by the convolutional encoder as shown in Table 4-1.
J 0 1 2 3 4 5 6 7 B(J,0) 1 B(J,1) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
The C54x can compute a butterfly quickly by setting the ALU to dual 16-bit mode. To determine the new path metric (J), two possible path metrics from 2J and 2J+1 are calculated in parallel with branch metrics (M and -M) using the DADST instruction. The path metrics are compared by the CMPS instruction. To calculate the new path metric (J+8), the DSADT instruction calculates two possible path metrics using branch metrics and old path metrics stored in the upper half and lower half of the accumulator. The CMPS instruction determines the new path metric.
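The add-compare-select step of one butterfly can be sketched in C. This is an illustration under stated assumptions, not the macro of the example: the function names are hypothetical, the larger metric is taken as the survivor, and the decision-bit convention (0 when the upper candidate wins, 1 otherwise) is an assumption about how the transition register is updated. Saturation and 16-bit wraparound are not modeled.

```c
#include <stdint.h>

/* Keep the larger of two candidate path metrics and shift a decision
   bit into the transition word (assumed: 0 = upper wins, 1 = lower wins). */
static int16_t compare_select(int16_t upper, int16_t lower, uint16_t *trn)
{
    *trn = (uint16_t)(*trn << 1);
    if (upper > lower) {
        return upper;
    }
    *trn |= 1u;
    return lower;
}

/* One butterfly: old_pm[0] is the metric of state 2J, old_pm[1] of state 2J+1,
   and m is the branch metric M. */
static void butterfly(const int16_t old_pm[2], int16_t m,
                      int16_t *new_j, int16_t *new_j8, uint16_t *trn)
{
    /* New state J: candidates old(2J) + M and old(2J+1) - M. */
    *new_j  = compare_select((int16_t)(old_pm[0] + m),
                             (int16_t)(old_pm[1] - m), trn);
    /* New state J+8: candidates old(2J) - M and old(2J+1) + M. */
    *new_j8 = compare_select((int16_t)(old_pm[0] - m),
                             (int16_t)(old_pm[1] + m), trn);
}
```

With old metrics 10 and 20 and a branch metric of 3, the candidates for state J are 13 and 17 and those for state J+8 are 7 and 23, so the survivors are 17 and 23 and two decision bits of 1 are recorded.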
The CMPS instruction compares the upper word and the lower word of the accumulator and stores the larger value in memory. The 16-bit transition register (TRN) is updated with every comparison so you can track the selected path metric. The TRN contents must be stored in memory after processing each symbol time interval. The back-track routine uses the stored information to find the optimal path.

Example 4-2 shows the Viterbi butterfly macro. A branch metric value is stored in T before calling the macro. During every butterfly cycle, two macros prevent T from receiving opposite sign values of the branch metrics.

Figure 4-3 illustrates pointer management and the storage scheme for the path metrics used in Example 4-2. In one symbol time interval, eight butterflies are calculated for the next 16 new states. This operation repeats over a number of symbol time intervals. At the end of the sequence of time intervals, the back-track routine is performed to find the optimal path out of the 16 paths calculated. This path represents the bit sequence to be decoded.
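Figure 4-3. Pointer Management and Storage Scheme for Path Metrics
[Figure: old-state path metrics (2J and 2J+1, addressed through pointer AR5) occupy relative locations 0 to 15; new-state path metrics occupy relative locations 16 to 31, with location 24 marked within the new-state block.]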
Chapter 5
TI C54x DSPLIB
The TI C54x DSPLIB is an optimized DSP function library for C programmers on TMS320C54x (C54x) DSP devices. It includes over 50 C-callable, assembly-optimized, general-purpose signal processing routines. These routines are typically used in computationally intensive real-time applications where optimal execution speed is critical. By using these routines you can achieve execution speeds considerably faster than equivalent code written in standard ANSI C. In addition, by providing ready-to-use DSP functions, TI DSPLIB can significantly shorten your DSP application development time. The TI DSPLIB includes commonly used DSP routines. Source code is provided to allow you to modify the functions to match your specific needs and is shipped as part of the C54x Code Composer Studio product under the c:\ti\C5400\dsplib\54x_src directory. Full documentation on C54x DSPLIB can be found in the TMS320C54x DSP Library Programmer's Reference (SPRU518).
Topic
5.1 Features and Benefits
5.2 DSPLIB Data Types
5.3 DSPLIB Arguments
5.4 Calling a DSPLIB Function from C
5.5 Calling a DSPLIB Function from Assembly Language Source Code
5.6 Where to Find Sample Code
5.7 DSPLIB Functions
that is predefined as DATA, in the dsplib.h header file. Certain DSPLIB functions use the following data type elements:
- Q.31 (LDATA): A Q.31 operand is represented by a long data type (32 bit)
Im) format.
- In-place computation is allowed (unless specifically noted): Source
available in your C54x DSP board. For example, the following code contains a call to the recip16 and q15tofl routines in DSPLIB:
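#include "dsplib.h"

#define NX 3    /* number of elements; matches the three values in x */

DATA  x[NX] = { 12398, 23167, 564 };
DATA  r[NX];
DATA  rexp[NX];
float rf1[NX];
float rf2[NX];

void main()
{
    short i;

    /* clear the result buffers */
    for (i = 0; i < NX; i++)
    {
        r[i] = 0;
        rexp[i] = 0;
    }

    recip16(x, r, rexp, NX);    /* reciprocal of each element of x */
    q15tofl(r, rf1, NX);        /* convert the Q15 results to floating point */

    /* scale each converted result by its exponent */
    for (i = 0; i < NX; i++)
    {
        rf2[i] = (float)rexp[i] * rf1[i];
    }
    return;
}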
In this example, the q15tofl DSPLIB function is used to convert Q15 fractional values to floating-point fractional values. However, in many applications, your data is always maintained in Q15 format so that the conversion between floating point and Q15 is not required.
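The Q15 scaling behind such a conversion can be sketched in plain C. This is not the DSPLIB implementation and does not use the DSPLIB API: `q15_to_float` and `float_to_q15` are hypothetical helpers that assume a Q15 value is a 16-bit integer representing a fraction in units of 2^-15.

```c
#include <stdint.h>

/* Interpret a Q15 value as a fraction in the range [-1.0, 1.0). */
static float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}

/* Convert a fraction to Q15, saturating at the ends of the range
   and truncating toward zero otherwise. */
static int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;

    if (scaled >= 32767.0f) {
        return 32767;
    }
    if (scaled <= -32768.0f) {
        return -32768;
    }
    return (int16_t)scaled;
}
```

For example, the Q15 value 16384 corresponds to 0.5. Keeping data in Q15 throughout an application avoids both conversions, which is the point made above.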
Calling a DSPLIB Function from C / Calling a DSPLIB Function from Assembly Language Source Code / Where to Find Sample Code
(raw) function as. This test.h file is generated by using Matlab scripts.
- test.c: contains function used to compare the output of araw function with
DSPLIB Functions
For specific DSPLIB function API descriptions, refer to the TMS320C54x DSP Library Programmer's Reference (SPRU518).