Reference Guide: TMS320C674x DSP CPU and Instruction Set
Preface
1 Introduction
1.1 Overview
1.2 DSP Features and Options
1.3 DSP Architecture
1.3.1 Central Processing Unit (CPU)
1.3.2 Internal Memory
1.3.3 Memory and Peripheral Options
2 CPU Data Paths and Control
2.1 Introduction
2.2 General-Purpose Register Files
2.3 Functional Units
2.4 Register File Cross Paths
2.5 Memory, Load, and Store Paths
2.6 Data Address Paths
2.7 Galois Field
2.7.1 Special Timing Considerations
2.8 Control Register File
2.8.1 Register Addresses for Accessing the Control Registers
2.8.2 Pipeline/Timing of Control Register Accesses
2.8.3 Addressing Mode Register (AMR)
2.8.4 Control Status Register (CSR)
2.8.5 Galois Field Polynomial Generator Function Register (GFPGFR)
2.8.6 Interrupt Clear Register (ICR)
2.8.7 Interrupt Enable Register (IER)
2.8.8 Interrupt Flag Register (IFR)
2.8.9 Interrupt Return Pointer Register (IRP)
2.8.10 Interrupt Set Register (ISR)
2.8.11 Interrupt Service Table Pointer Register (ISTP)
2.8.12 Nonmaskable Interrupt (NMI) Return Pointer Register (NRP)
2.8.13 E1 Phase Program Counter (PCE1)
2.9 Control Register File Extensions
2.9.1 Debug Interrupt Enable Register (DIER)
2.9.2 DSP Core Number Register (DNUM)
2.9.3 Exception Clear Register (ECR)
2.9.4 Exception Flag Register (EFR)
2.9.5 GMPY Polynomial A-Side Register (GPLYA)
2.9.6 GMPY Polynomial B-Side Register (GPLYB)
2.9.7 Internal Exception Report Register (IERR)
2.9.8 SPLOOP Inner Loop Count Register (ILC)
2.9.9 Interrupt Task State Register (ITSR)
2.9.10 NMI/Exception Task State Register (NTSR)
2.9.11 Restricted Entry Point Register (REP)
2.9.12 SPLOOP Reload Inner Loop Count Register (RILC)
2.9.13 Saturation Status Register (SSR)
7.9.4 Program Memory Fetch Enable Delay During Epilog
7.9.5 Stage Boundary and SPKERNEL(R) Position
7.9.6 Loop Buffer Reload
7.9.7 Restrictions on Accessing ILC and RILC
7.10 Loop Buffer Control Using the SPLOOPW Instruction
7.10.1 Initial Termination Condition Using the SPLOOPW Condition
7.10.2 Stage Boundary Termination Condition Using the SPLOOPW Condition
7.10.3 Interrupting the Loop Buffer When Using SPLOOPW
7.10.4 Under-Execution of Early Stages of SPLOOPW When Termination Condition Becomes True While Interrupt Draining
7.11 Using the SPMASK Instruction
7.11.1 Using SPMASK to Merge Setup Code Example
7.11.2 Some Points About the SPMASK to Merge Setup Code Example
7.11.3 Using SPMASK to Merge Reset Code Example
7.11.4 Some Points About the SPMASK to Merge Reset Code Example
7.11.5 Returning from an Interrupt
7.12 Program Memory Fetch Control
7.12.1 Program Memory Fetch Disable
7.12.2 Program Memory Fetch Enable
7.13 Interrupts
7.13.1 Interrupting the Loop Buffer
7.13.2 Returning to an SPLOOP(D/W) After an Interrupt
7.13.3 Exceptions
7.13.4 Branch to Interrupt, Pipe-Down Sequence
7.13.5 Return from Interrupt, Pipe-Up Sequence
7.13.6 Disabling Interrupts During Loop Buffer Operation
7.14 Branch Instructions
7.15 Instruction Resource Conflicts and SPMASK Operation
7.15.1 Program Memory and Loop Buffer Resource Conflicts
7.15.2 Restrictions on Stall Detection Within SPLOOP Operation
7.16 Restrictions on Cross Path Stalls
7.17 Restrictions on AMR-Related Stalls
7.18 Restrictions on Instructions Placed in the Loop Buffer
8 CPU Privilege
8.1 Overview
8.2 Execution Modes
8.2.1 Privilege Mode After Reset
8.2.2 Execution Mode Transitions
8.2.3 Supervisor Mode
8.2.4 User Mode
8.3 Interrupts and Exception Handling
8.3.1 Inhibiting Interrupts in User Mode
8.3.2 Privilege and Interrupts
8.3.3 Privilege and Exceptions
8.3.4 Privilege and Memory Protection
8.4 Operating System Entry
8.4.1 Entering User Mode from Supervisor Mode
8.4.2 Entering Supervisor Mode from User Mode
A Instruction Compatibility
B Mapping Between Instruction and Functional Unit
C .D Unit Instructions and Opcode Maps
C.1 Instructions Executing in the .D Functional Unit
C.2 Opcode Map Symbols and Meanings
List of Figures
1-1. TMS320C674x DSP Block Diagram
2-1. CPU Data Paths
2-2. Storage Scheme for 40-Bit Data in a Register Pair
2-3. Addressing Mode Register (AMR)
2-4. Control Status Register (CSR)
2-5. PWRD Field of Control Status Register (CSR)
2-6. Galois Field Polynomial Generator Function Register (GFPGFR)
2-7. Interrupt Clear Register (ICR)
2-8. Interrupt Enable Register (IER)
2-9. Interrupt Flag Register (IFR)
2-10. Interrupt Return Pointer Register (IRP)
2-11. Interrupt Set Register (ISR)
2-12. Interrupt Service Table Pointer Register (ISTP)
2-13. NMI Return Pointer Register (NRP)
2-14. E1 Phase Program Counter (PCE1)
2-15. Debug Interrupt Enable Register (DIER)
2-16. DSP Core Number Register (DNUM)
2-17. Exception Flag Register (EFR)
2-18. GMPY Polynomial A-Side Register (GPLYA)
2-19. GMPY Polynomial B-Side (GPLYB)
2-20. Internal Exception Report Register (IERR)
2-21. Inner Loop Count Register (ILC)
2-22. Interrupt Task State Register (ITSR)
2-23. NMI/Exception Task State Register (NTSR)
2-24. Reload Inner Loop Count Register (RILC)
2-25. Saturation Status Register (SSR)
2-26. Time Stamp Counter Register - Low Half (TSCL)
2-27. Time Stamp Counter Register - High Half (TSCH)
2-28. Task State Register (TSR)
2-29. Floating-Point Adder Configuration Register (FADCR)
2-30. Floating-Point Auxiliary Configuration Register (FAUCR)
2-31. Floating-Point Multiplier Configuration Register (FMCR)
3-1. Single-Precision Floating-Point Fields
3-2. Double-Precision Floating-Point Fields
3-3. Basic Format of a Fetch Packet
3-4. Examples of the Detectability of Write Conflicts by the Assembler
3-5. Compact Instruction Header Format
3-6. Layout Field in Compact Header Word
3-7. Expansion Field in Compact Header Word
3-8. P-bits Field in Compact Header Word
4-1. Pipeline Stages
4-2. Fetch Phases of the Pipeline
4-3. Decode Phases of the Pipeline
4-4. Execute Phases of the Pipeline
4-5. Pipeline Phases
4-6. Pipeline Operation: One Execute Packet per Fetch Packet
4-7. Pipeline Phases Block Diagram
List of Tables
2-1. 40-Bit/64-Bit Register Pairs
2-2. Functional Units and Operations Performed
2-3. Modulo 2 Arithmetic
2-4. Modulo 5 Arithmetic
2-5. Modulo Arithmetic for Field GF(2^3)
2-6. Control Registers
2-7. Addressing Mode Register (AMR) Field Descriptions
2-8. Block Size Calculations
2-9. Control Status Register (CSR) Field Descriptions
2-10. Galois Field Polynomial Generator Function Register (GFPGFR) Field Descriptions
2-11. Interrupt Clear Register (ICR) Field Descriptions
2-12. Interrupt Enable Register (IER) Field Descriptions
2-13. Interrupt Flag Register (IFR) Field Descriptions
2-14. Interrupt Set Register (ISR) Field Descriptions
2-15. Interrupt Service Table Pointer Register (ISTP) Field Descriptions
2-16. Control Register File Extensions
2-17. Debug Interrupt Enable Register (DIER) Field Descriptions
2-18. Exception Flag Register (EFR) Field Descriptions
2-19. Internal Exception Report Register (IERR) Field Descriptions
2-20. Interrupt Task State Register (ITSR) Field Descriptions
2-21. NMI/Exception Task State Register (NTSR) Field Descriptions
2-22. Saturation Status Register Field Descriptions
2-23. Task State Register (TSR) Field Descriptions
2-24. Control Register File Extensions for Floating-Point Operations
2-25. Floating-Point Adder Configuration Register (FADCR) Field Descriptions
2-26. Floating-Point Auxiliary Configuration Register (FAUCR) Field Descriptions
2-27. Floating-Point Multiplier Configuration Register (FMCR) Field Descriptions
3-1. Instruction Operation and Execution Notations
3-2. Instruction Syntax and Opcode Notations
3-3. IEEE Floating-Point Notations
3-4. Special Single-Precision Values
3-5. Hexadecimal and Decimal Representation for Selected Single-Precision Values
3-6. Special Double-Precision Values
3-7. Hexadecimal and Decimal Representation for Selected Double-Precision Values
3-8. Delay Slot and Functional Unit Latency
3-9. Registers That Can Be Tested by Conditional Operations
3-10. Indirect Address Generation for Load/Store
3-11. Address Generator Options for Load/Store
3-12. CPU Fetch Packet Types
3-13. Layout Field Description in Compact Instruction Packet Header
3-14. Expansion Field Description in Compact Instruction Packet Header
3-15. LD/ST Data Size Selection
3-16. P-bits Field Description in Compact Instruction Packet Header
3-17. Available Compact Instructions
3-18. Relationships Between Operands, Operand Size, Functional Units, and Opfields for Example Instruction (ADD)
3-19. Program Counter Values for Branch Using a Displacement Example
3-20. Program Counter Values for Branch Using a Register Example
3-21. Program Counter Values for B IRP Instruction Example
3-22. Program Counter Values for B NRP Instruction Example
3-23. Data Types Supported by LDB(U) Instruction
3-24. Data Types Supported by LDB(U) Instruction (15-Bit Offset)
3-25. Data Types Supported by LDH(U) Instruction
3-26. Data Types Supported by LDH(U) Instruction (15-Bit Offset)
3-27. Register Addresses for Accessing the Control Registers
3-28. Field Allocation in stg/cyc Field
3-29. Bit Allocations to Stage and Cycle in stg/cyc Field
4-1. Operations Occurring During Pipeline Phases
4-2. Execution Stage Length Description for Each Instruction Type - Part A
4-3. Execution Stage Length Description for Each Instruction Type - Part B
4-4. Execution Stage Length Description for Each Instruction Type - Part C
4-5. Execution Stage Length Description for Each Instruction Type - Part D
4-6. Single-Cycle Instruction Execution
4-7. Multiply Instruction Execution
4-8. Store Instruction Execution
4-9. Extended Multiply Instruction Execution
4-10. Load Instruction Execution
4-11. Branch Instruction Execution
4-12. Two-Cycle DP Instruction Execution
4-13. Four-Cycle Instruction Execution
4-14. INTDP Instruction Execution
4-15. DP Compare Instruction Execution
4-16. ADDDP/SUBDP Instruction Execution
4-17. MPYI Instruction Execution
4-18. MPYID Instruction Execution
4-19. MPYDP Instruction Execution
4-20. MPYSPDP Instruction Execution
4-21. MPYSP2DP Instruction Execution
4-22. Single-Cycle .S-Unit Instruction Constraints
4-23. DP Compare .S-Unit Instruction Constraints
4-24. 2-Cycle DP .S-Unit Instruction Constraints
4-25. ADDSP/SUBSP .S-Unit Instruction Constraints
4-26. ADDDP/SUBDP .S-Unit Instruction Constraints
4-27. Branch .S-Unit Instruction Constraints
4-28. 16 × 16 Multiply .M-Unit Instruction Constraints
4-29. 4-Cycle .M-Unit Instruction Constraints
4-30. MPYI .M-Unit Instruction Constraints
4-31. MPYID .M-Unit Instruction Constraints
4-32. MPYDP .M-Unit Instruction Constraints
4-33. MPYSP .M-Unit Instruction Constraints
4-34. MPYSPDP .M-Unit Instruction Constraints
4-35. MPYSP2DP .M-Unit Instruction Constraints
4-36. Single-Cycle .L-Unit Instruction Constraints
4-37. 4-Cycle .L-Unit Instruction Constraints
4-38. INTDP .L-Unit Instruction Constraints
4-39. ADDDP/SUBDP .L-Unit Instruction Constraints
Notational Conventions
This document uses the following conventions.
• Hexadecimal numbers are shown with the suffix h. For example, the following number is 40
hexadecimal (decimal 64): 40h.
TMS320C674x, TMS320C67x+, TMS320C64x+, C674x, XDS510, and XDS560 are trademarks of Texas
Instruments.
Windows is a registered trademark of Microsoft Corporation.
Introduction
1.1 Overview
The TMS320C674x™ DSP is the new generation floating-point DSP that combines the TMS320C67x+™
DSP and the TMS320C64x+™ DSP instruction set architectures into one core.
The C674x™ megamodule is the name used to designate the CPU together with the hardware providing
memory, bandwidth management, interrupt, memory protection, and power-down support. This document
describes the CPU architecture, pipeline, instruction set, and interrupts of the C674x DSP. The C674x
megamodule is not described in this document since it is fully covered in the TMS320C674x DSP
Megamodule Reference Guide (SPRUFK5).
[Figure 1-1. TMS320C674x DSP Block Diagram (labels include L1P Cache/SRAM and L1D Cache/SRAM)]
This chapter focuses on the CPU, providing information about the data paths and control registers. The
two register files and the data cross paths are described.
2.1 Introduction
The components of the data path for the CPU are shown in Figure 2-1. These components consist of:
• Two general-purpose register files (A and B)
• Eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1, and .D2)
• Two load-from-memory data paths (LD1 and LD2)
• Two store-to-memory data paths (ST1 and ST2)
• Two data address paths (DA1 and DA2)
• Two register file data cross paths (1X and 2X)
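The component list above can be captured as a small inventory, which is convenient when writing tools that reason about unit assignments. The following Python fragment is only an illustrative sketch: the dictionary layout and the helper name units_on_side are assumptions made for this example, and the mapping of the trailing digit to a data path (1 for data path A, 2 for data path B) follows the labels in Figure 2-1.

```python
# Inventory of the data path components named in the list above.
# The dictionary layout and helper name are illustrative assumptions.
DATA_PATH = {
    "register_files": ["A", "B"],
    "functional_units": [".L1", ".L2", ".S1", ".S2", ".M1", ".M2", ".D1", ".D2"],
    "load_paths": ["LD1", "LD2"],
    "store_paths": ["ST1", "ST2"],
    "address_paths": ["DA1", "DA2"],
    "cross_paths": ["1X", "2X"],
}


def units_on_side(side_digit):
    """Return the functional units whose name ends in '1' or '2'.

    Units ending in '1' sit on data path A and units ending in '2' on
    data path B, following the labels in Figure 2-1.
    """
    return [u for u in DATA_PATH["functional_units"] if u.endswith(side_digit)]


if __name__ == "__main__":
    print("Data path A units:", units_on_side("1"))
    print("Data path B units:", units_on_side("2"))
```

Each data path therefore holds one unit of each kind (.L, .S, .M, and .D), four per side.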
[Figure 2-1. CPU Data Paths: register files A and B, each divided into an even half (A0, A2, A4...A30; B0, B2, B4...B30) and an odd half (A1, A3, A5...A31; B1, B3, B5...B31), the functional units on data path A and data path B, the load, store, and data address paths, the cross paths, and the control register file]
39 32 31 0
40-bit data
Write to registers
Note that addition and subtraction results are the same, and in fact are equivalent to the XOR
(exclusive-OR) operation in binary. Also, the multiplication result is equal to the AND operation in binary.
These properties are unique to modulo 2 arithmetic, but modulo 2 arithmetic is used extensively in error
correction coding. Another more general property is that division by any nonzero element is now defined.
Division can always be performed, if every element other than zero has a multiplicative inverse:
$x \times x^{-1} = 1$
Another example, arithmetic modulo 5, illustrates this concept more clearly. The addition, subtraction, and
multiplication tables are given in Table 2-4.
In the rows of the multiplication table, element 1 appears in every nonzero row and column. Every nonzero
element can be multiplied by at least one other element for a result equal to 1. Therefore, division always
works and arithmetic over integers modulo 5 forms a field. Fields generated in this manner are called finite
fields or Galois fields and are written as GF(X), such as GF(2) or GF(5). They only work when the
arithmetic performed is modulo a prime number.
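The inverse property can be checked by brute force. The following is a minimal Python sketch; the helper name inverses_mod is illustrative only and is not part of any TI tool. It lists the multiplicative inverse of each nonzero element modulo n, and shows that a non-prime modulus such as 4 leaves an element without an inverse.

```python
def inverses_mod(n):
    """Map each nonzero integer modulo n to its multiplicative inverse, if one exists."""
    table = {}
    for x in range(1, n):
        for y in range(1, n):
            if (x * y) % n == 1:
                table[x] = y
                break
    return table


mod5 = inverses_mod(5)   # every nonzero element appears as a key
mod4 = inverses_mod(4)   # 2 is missing: 2 * y is never 1 modulo 4
```

With a prime modulus every nonzero element has an entry, which is the condition for division to be defined.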
Galois fields can also be formed where the elements are vectors instead of integers if polynomials are
used. Finite fields, therefore, can be found with a number of elements equal to any power of a prime
number. Typically, we are interested in implementing error correction coding systems using binary
arithmetic. All of the fields that are dealt with in Reed-Solomon coding systems are of the form GF(2^m).
This allows performing addition using XORs on the coefficients of the vectors, and multiplication using a
combination of ANDs and XORs.
A final example considers the field GF(2^3), which has 8 elements. This can be generated by arithmetic
modulo the (irreducible) polynomial P(x) = x^3 + x + 1. Elements of this field look like vectors of three bits.
Table 2-5 shows the addition and multiplication tables for field GF(2^3).
Note that the value 1 (001) appears in every nonzero row of the multiplication table, which indicates that
this is a valid field.
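The three-bit field can be reproduced in software with a shift-and-XOR multiply. This is a minimal Python sketch, not a description of the hardware; the function names and the encoding of the polynomial as the integer 0b1011 (x^3 + x + 1) are choices made for this illustration.

```python
def gf_add(a, b):
    """Addition (and subtraction) in a binary Galois field is a bitwise XOR."""
    return a ^ b


def gf_mul(a, b, poly=0b1011, m=3):
    """Multiply a and b in GF(2^m), reducing modulo poly (bit m of poly must be set)."""
    result = 0
    for _ in range(m):
        if b & 1:            # AND selects whether this partial product is used
            result ^= a      # XOR accumulates it
        b >>= 1
        a <<= 1              # multiply a by x
        if a & (1 << m):     # degree reached m: subtract (XOR) the polynomial
            a ^= poly
    return result


# Row 011 of the multiplication table, columns 000 through 111.
row_011 = [gf_mul(0b011, col) for col in range(8)]
```

Each nonzero row produced this way contains the value 1 exactly once, which is the inverse property noted above.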
The channel error can now be modeled as a vector of bits, with a one in every bit position that an error
has occurred, and a zero where no error has occurred. Once the error vector has been determined, it can
be subtracted from the received message to determine the correct code word.
The Galois field multiply hardware on the DSP is named GMPY4. The GMPY4 instruction performs four
parallel operations on 8-bit packed data on the .M unit. The Galois field multiplier can be programmed to
perform all Galois multiplies for fields of the form GF(2^m), where m can range between 1 and 8 using any
generator polynomial. The field size and the polynomial generator are controlled by the Galois field
polynomial generator function register (GFPGFR).
In addition to the GMPY4 instruction, the C674x DSP has the GMPY instruction that uses either the
GPLYA or GPLYB control register as a source for the polynomial (depending on whether the A or B side
functional unit is used) and produces a 32-bit result.
The GFPGFR, shown in Figure 2-6 and described in Table 2-10, contains the Galois field polynomial
generator and the field size control bits. These bits control the operation of the GMPY4 instruction.
GFPGFR can only be set via the MVC instruction. The default function after reset for the GMPY4
instruction is field size = 7h and polynomial = 1Dh.
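As a rough illustration of four parallel 8-bit Galois multiplies, the following Python sketch models only the reset configuration. It assumes, and this section does not state it, that field size 7h selects GF(2^8) and that POLY = 1Dh supplies the low eight coefficients of a generator polynomial with an implied x^8 term (11Dh). The function names are hypothetical, and the sketch is not a bit-exact model of the GMPY4 instruction.

```python
def gf256_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly = 0x11D is an assumed reading of POLY = 1Dh."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return result


def gmpy4_model(src1, src2):
    """Apply gf256_mul independently to each of the four packed bytes of two 32-bit words."""
    out = 0
    for lane in range(4):
        shift = 8 * lane
        a = (src1 >> shift) & 0xFF
        b = (src2 >> shift) & 0xFF
        out |= gf256_mul(a, b) << shift
    return out


packed = gmpy4_model(0x02020202, 0x80010003)
```

Each byte lane is independent, so no carries or reductions cross lane boundaries.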
Addition
+ 000 001 010 011 100 101 110 111
000 000 001 010 011 100 101 110 111
001 001 000 011 010 101 100 111 110
010 010 011 000 001 110 111 100 101
011 011 010 001 000 111 110 101 100
100 100 101 110 111 000 001 010 011
101 101 100 111 110 001 000 011 010
110 110 111 100 101 010 011 000 001
111 111 110 101 100 011 010 001 000
Multiplication
× 000 001 010 011 100 101 110 111
000 000 000 000 000 000 000 000 000
001 000 001 010 011 100 101 110 111
010 000 010 100 110 011 001 111 101
011 000 011 110 101 111 100 001 010
100 000 100 011 111 110 010 101 001
101 000 101 001 100 010 111 011 110
110 000 110 111 001 101 011 010 100
111 000 111 101 010 001 110 100 011
Pipeline Stage E1
Read src2
Written dst
Unit in use .S2
Even though MVC modifies the particular target control register in a single cycle, it can take extra clocks
to complete modification of the non-explicitly named register. For example, the MVC cannot modify bits in
the IFR directly. Instead, MVC can only write 1's into the ISR or the ICR to specify setting or clearing,
respectively, of the IFR bits. MVC completes this ISR/ICR write in a single (E1) cycle but the modification
of the IFR bits occurs one clock later. For more information on the manipulation of ISR, ICR, and IFR, see
Section 2.8.10, Section 2.8.6, and Section 2.8.8.
Saturating instructions, such as SADD, set the saturation flag bit (SAT) in CSR indirectly. As a result,
several of these instructions update the SAT bit one full clock cycle after their primary results are written to
the register file. For example, the SMPY instruction writes its result at the end of pipeline stage E2; its
primary result is available after one delay slot. In contrast, the SAT bit in CSR is updated one cycle later
than the result is written; this update occurs after two delay slots. (For the specific behavior of an
instruction, refer to the description of that individual instruction.)
The B IRP and B NRP instructions directly update the GIE and NMIE bits, respectively. Because these
branches directly modify CSR and IER, respectively, there are no delay slots between when the branch is
issued and when the control register updates take effect.
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
B7 MODE B6 MODE B5 MODE B4 MODE A7 MODE A6 MODE A5 MODE A4 MODE
R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
15 10 9 8 7 5 4 2 1 0
PWRD SAT EN PCC DCC PGIE GIE
R/SW-0 R/WC-0 R-x R/SW-0 R/SW-0 R/SW-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; SW = Writeable by the MVC instruction only in
supervisor mode; WC = Bit is cleared on write; -n = value after reset; -x = value is indeterminate after reset
(1)
See the device-specific datasheet for the default value of this field.
15 8 7 0
Reserved POLY
R-0 R/W-1Dh
LEGEND: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
Table 2-10. Galois Field Polynomial Generator Function Register (GFPGFR) Field Descriptions
Bit Field Value Description
31-27 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
26-24 SIZE 0-7h Field size.
23-8 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
7-0 POLY 0-FFh Polynomial generator.
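The field positions in Table 2-10 can be expressed as a pair of pack and unpack helpers. This is a minimal Python sketch for building or inspecting a GFPGFR value on a host; the function names are illustrative, and on the DSP the register itself is accessed only through the MVC instruction.

```python
def pack_gfpgfr(size, poly):
    """Build a GFPGFR value: SIZE in bits 26-24, POLY in bits 7-0, reserved bits 0."""
    if not 0 <= size <= 0x7:
        raise ValueError("SIZE must be in the range 0-7h")
    if not 0 <= poly <= 0xFF:
        raise ValueError("POLY must be in the range 0-FFh")
    return (size << 24) | poly


def unpack_gfpgfr(value):
    """Return (SIZE, POLY) from a 32-bit GFPGFR value."""
    return (value >> 24) & 0x7, value & 0xFF


reset_value = pack_gfpgfr(0x7, 0x1D)   # field size = 7h, polynomial = 1Dh
```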
NOTE: Any write to ICR (by the MVC instruction) effectively has one delay slot because the results
cannot be read (by the MVC instruction) in IFR until two cycles after the write to ICR.
A write to a bit in ICR is ignored if the same bit in the interrupt set register (ISR) is written
simultaneously.
15 14 13 12 11 10 9 8 7 6 5 4 3 0
IC15 IC14 IC13 IC12 IC11 IC10 IC9 IC8 IC7 IC6 IC5 IC4 Reserved
W-0 W-0 W-0 W-0 W-0 W-0 W-0 W-0 W-0 W-0 W-0 W-0 R-0
LEGEND: R = Read only; W = Writeable by the MVC instruction; -n = value after reset
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
IE15 IE14 IE13 IE12 IE11 IE10 IE9 IE8 IE7 IE6 IE5 IE4 Reserved NMIE 1
R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R-0 R/W-0 R-1
LEGEND: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
IF15 IF14 IF13 IF12 IF11 IF10 IF9 IF8 IF7 IF6 IF5 IF4 Reserved NMIF 0
R-0 R-0 R-0 R-0 R-0 R-0 R-0 R-0 R-0 R-0 R-0 R-0 R-0 R-0 R-0
LEGEND: R = Readable by the MVC instruction; -n = value after reset
NOTE: Any write to ISR (by the MVC instruction) effectively has one delay slot because the results
cannot be read (by the MVC instruction) in IFR until two cycles after the write to ISR.
A write to a bit in the interrupt clear register (ICR) is ignored if the same bit in ISR is written
simultaneously.
15 14 13 12 11 10 9 8 7 6 5 4 3 0
IS15 IS14 IS13 IS12 IS11 IS10 IS9 IS8 IS7 IS6 IS5 IS4 Reserved
W-0 W-0 W-0 W-0 W-0 W-0 W-0 W-0 W-0 W-0 W-0 W-0 R-0
LEGEND: R = Read only; W = Writeable by the MVC instruction; -n = value after reset
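The interaction between ISR, ICR, and IFR can be summarized in a small model. This Python sketch covers only the final IFR value for bits 15-4; it does not model the delay between the MVC write and the IFR update, and the function name is illustrative.

```python
MASKABLE_BITS = 0xFFF0   # IS15-IS4 / IC15-IC4; bits 3-0 of ISR and ICR are reserved


def apply_isr_icr(ifr, isr_write=0, icr_write=0):
    """Return the IFR value after simultaneous ISR and ICR writes.

    A 1 written to ISR sets the IFR bit; a 1 written to ICR clears it.
    When both target the same bit, the ICR write is ignored.
    """
    set_bits = isr_write & MASKABLE_BITS
    clear_bits = icr_write & MASKABLE_BITS & ~set_bits
    return (ifr & ~clear_bits) | set_bits


# INT4 pending; set INT5 while attempting to clear INT4 and INT5 at once.
new_ifr = apply_isr_icr(0x0010, isr_write=0x0020, icr_write=0x0030)
```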
15 10 9 5 4 3 2 1 0
ISTB HPEINT 0 0 0 0 0
R/W-S R-0 R-0 R-0 R-0 R-0 R-0
LEGEND: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset; S = See the device-specific
data manual for the default value of this field after reset
Table 2-15. Interrupt Service Table Pointer Register (ISTP) Field Descriptions
Bit Field Value Description
31-10 ISTB 0-3F FFFFh Interrupt service table base portion of the IST address. This field is cleared to a device-specific
default value on reset; therefore, upon startup the IST must reside at this specific address. See
the device-specific data manual for more information. After reset, you can relocate the IST by
writing a new value to ISTB. If relocated, the first ISFP (corresponding to RESET) is never
executed via interrupt processing, because reset clears the ISTB to its default value. See
Example 5-1.
9-5 HPEINT 0-1Fh Highest priority enabled interrupt that is currently pending. This field indicates the number
(related bit position in the IFR) of the highest priority interrupt (as defined in Table 5-1) that is
enabled by its bit in the IER. Thus, the ISTP can be used for manual branches to the highest
priority enabled interrupt. If no interrupt is pending and enabled, HPEINT contains the value 0.
The corresponding interrupt need not be enabled by NMIE (unless it is NMI) or by GIE.
4-0 0 0 Cleared to 0 (fetch packets must be aligned on 8-word (32-byte) boundaries).
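Because HPEINT occupies bits 9-5 and bits 4-0 are zero, the ISTP value as a whole is the byte address of the 32-byte fetch packet for the highest priority enabled interrupt. The following Python sketch composes and splits the register from the field positions in Table 2-15; the function names and the sample ISTB value are illustrative.

```python
def istp_value(istb, hpeint):
    """Compose ISTP from ISTB (bits 31-10) and HPEINT (bits 9-5); bits 4-0 stay 0."""
    return ((istb & 0x3FFFFF) << 10) | ((hpeint & 0x1F) << 5)


def istp_fields(value):
    """Split a 32-bit ISTP value into (ISTB, HPEINT)."""
    return (value >> 10) & 0x3FFFFF, (value >> 5) & 0x1F


# Table base 0800h (ISTB = 2) with interrupt 4 pending and enabled.
address = istp_value(0x2, 4)   # 0800h + 4 * 20h
```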
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
INT15 INT14 INT13 INT12 INT11 INT10 INT9 INT8 INT7 INT6 INT5 INT4 Reserved WSEL Rsvd
R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
15 8 7 0
Reserved DSP number
R-0 R-S
LEGEND: R = Readable by the MVC instruction; -n = value after reset; S = See the device-specific data manual for the default value of this
field after reset
15 2 1 0
Reserved IXF SXF
R-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC EFR instruction only in Supervisor mode; W = Clearable by the MVC ECR instruction only in
Supervisor mode; -n = value after reset
15 9 8 7 6 5 4 3 2 1 0
Reserved MSX LBX PRX RAX RCX OPX EPX FPX IFX
R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction only in Supervisor mode; W = Writeable by the MVC instruction only in Supervisor mode;
-n = value after reset
15 14 13 11 10 9 8 7 6 5 4 3 2 1 0
IB SPLX Reserved EXC INT Rsvd CXM Rsvd DBGM XEN GEE SGIE GIE
R/W-0 R/W-0 R-0 R/W-0 R/W-0 R-0 R/W-0 R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction only in Supervisor mode; W = Writeable by the MVC instruction only in Supervisor mode;
-n = value after reset
15 14 13 11 10 9 8 7 6 5 4 3 2 1 0
IB SPLX Reserved EXC INT Rsvd CXM Rsvd DBGM XEN GEE SGIE GIE
R/W-0 R/W-0 R-0 R/W-0 R/W-0 R-0 R/W-0 R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction only in Supervisor mode; W = Writeable by the MVC instruction only in Supervisor mode;
-n = value after reset
15 5 4 3 2 1 0
Reserved M2 M1 S2 S1 L2 L1
R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
2.9.14.1 Initialization
The counter is cleared to 0 after reset, and counting is disabled.
CAUTION
Reading TSCL in the cycle before a cross path stall may give an inaccurate
value in TSCH.
When reading the full 64-bit value, it must be ensured that no interrupts are serviced between the two
MVC instructions if an ISR is allowed to make use of the time stamp counter. There is no way for an ISR
to restore the previous value of TSCH (snapshot) if it reads TSCL, since a new snapshot is performed.
Two methods for reading the 64-bit count value in an uninterruptible manner are shown in Example 2-1
and Example 2-2. Example 2-1 uses the fact that interrupts are automatically disabled in the delay slots of
a branch to prevent an interrupt from happening between the TSCL read and the TSCH read.
Example 2-2 accomplishes the same task by explicitly disabling interrupts.
Example 2-1. Code to Read the 64-Bit TSC Value in Branch Delay Slot
        BNOP    TSC_Read_Done, 3
        MVC     TSCL,B0         ; Read the low half first; high half copied to TSCH
        MVC     TSCH,B1         ; Read the snapshot of the high half
TSC_Read_Done:
Example 2-2. Code to Read the 64-Bit TSC Value Using DINT/RINT
        DINT
||      MVC     TSCL,B0         ; Read the low half first; high half copied to TSCH
        RINT
||      MVC     TSCH,B1         ; Read the snapshot of the high half
TSC_Read_Done:
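The reason the two reads must not be interrupted can be seen in a small behavioral model. This Python sketch is an illustration only: the class and method names are hypothetical, and the model assumes nothing beyond what is stated above, namely that reading TSCL copies the high half into the TSCH snapshot.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


class TimeStampCounter:
    """Behavioral model of the TSCL/TSCH snapshot mechanism."""

    def __init__(self, count=0):
        self.count = count & MASK64
        self.snapshot = 0

    def tick(self, cycles=1):
        self.count = (self.count + cycles) & MASK64

    def read_tscl(self):
        self.snapshot = self.count >> 32   # high half is captured on every TSCL read
        return self.count & 0xFFFFFFFF

    def read_tsch(self):
        return self.snapshot               # returns the capture, not the live value


# Uninterrupted read: the pair is consistent even if the low half rolls over.
tsc = TimeStampCounter(0x1FFFFFFFF)
low = tsc.read_tscl()
tsc.tick()
value = (tsc.read_tsch() << 32) | low

# Interrupted read: an ISR reads TSCL in between and replaces the snapshot.
tsc = TimeStampCounter(0x1FFFFFFFF)
low = tsc.read_tscl()
tsc.tick()
tsc.read_tscl()                            # read performed by the ISR
corrupted = (tsc.read_tsch() << 32) | low
```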
15 14 13 11 10 9 8 7 6 5 4 3 2 1 0
IB SPLX Reserved EXC INT Rsvd CXM Rsvd DBGM XEN GEE SGIE GIE
R-0 R-0 R-0 R/C-0 R-0 R-0 R/W-0 R-0 R/W-0 R/W-0 R/S-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writeable in Supervisor mode; C = Clearable in Supervisor mode; S = Can be set in
Supervisor mode; -n = value after reset
NOTE: The ADDSP, ADDDP, SUBSP, and SUBDP instructions executing in the .S functional unit
use the rounding mode from and set the warning bits in FADCR. The warning bits in FADCR
are the logical-OR of the warnings produced on the .L functional unit and the warnings
produced by the ADDSP/ADDDP/SUBSP/SUBDP instructions on the .S functional unit (but
not other instructions executing on the .S functional unit).
15 11 10 9 8 7 6 5 4 3 2 1 0
Reserved RMODE UNDER INEX OVER INFO INVAL DEN2 DEN1 NAN2 NAN1
R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
Table 2-25. Floating-Point Adder Configuration Register (FADCR) Field Descriptions (continued)
Bit Field Value Description
18 DEN1 Denormalized number select for .L2 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
17 NAN2 NaN select for .L2 src2.
0 src2 is not NaN.
1 src2 is NaN.
16 NAN1 NaN select for .L2 src1.
0 src1 is not NaN.
1 src1 is NaN.
15-11 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
10-9 RMODE 0-3h Rounding mode select for .L1.
0 Round toward nearest representable floating-point number
1h Round toward 0 (truncate)
2h Round toward infinity (round up)
3h Round toward negative infinity (round down)
8 UNDER Result underflow status for .L1.
0 Result does not underflow.
1 Result underflows.
7 INEX Inexact results status for .L1.
0
1 Result differs from what would have been computed had the exponent range and precision been
unbounded; never set with INVAL.
6 OVER Result overflow status for .L1.
0 Result does not overflow.
1 Result overflows.
5 INFO Signed infinity for .L1.
0 Result is not signed infinity.
1 Result is signed infinity.
4 INVAL
0 A signaling NaN (SNaN) is not a source.
1 A signaling NaN (SNaN) is a source. NaN is a source in a floating-point to integer conversion or when
infinity is subtracted from infinity.
3 DEN2 Denormalized number select for .L1 src2.
0 src2 is not a denormalized number.
1 src2 is a denormalized number.
2 DEN1 Denormalized number select for .L1 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
1 NAN2 NaN select for .L1 src2.
0 src2 is not NaN.
1 src2 is NaN.
0 NAN1 NaN select for .L1 src1.
0 src1 is not NaN.
1 src1 is NaN.
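The .L1 fields in bits 10-0 can be decoded with a table of bit positions taken from the rows above. This is a minimal Python sketch for interpreting a FADCR value that has already been read with MVC; the names FADCR_L1_FIELDS, ROUNDING_MODES, and decode_fadcr_l1 are illustrative.

```python
# (field name, least significant bit, width) for the .L1 half of FADCR
FADCR_L1_FIELDS = [
    ("NAN1", 0, 1), ("NAN2", 1, 1), ("DEN1", 2, 1), ("DEN2", 3, 1),
    ("INVAL", 4, 1), ("INFO", 5, 1), ("OVER", 6, 1), ("INEX", 7, 1),
    ("UNDER", 8, 1), ("RMODE", 9, 2),
]

ROUNDING_MODES = {
    0: "round toward nearest",
    1: "round toward 0 (truncate)",
    2: "round toward infinity (round up)",
    3: "round toward negative infinity (round down)",
}


def decode_fadcr_l1(value):
    """Return a dict of the .L1 field values found in bits 10-0 of FADCR."""
    return {name: (value >> lsb) & ((1 << width) - 1)
            for name, lsb, width in FADCR_L1_FIELDS}


fields = decode_fadcr_l1(0x0480)           # bit 10 and bit 7 set
mode = ROUNDING_MODES[fields["RMODE"]]
```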
NOTE: The ADDSP, ADDDP, SUBSP, and SUBDP instructions executing in the .S functional unit
use the rounding mode from and set the warning bits in the floating-point adder configuration
register (FADCR). The warning bits in FADCR are the logical-OR of the warnings produced
on the .L functional unit and the warnings produced by the ADDSP/ADDDP/SUBSP/SUBDP
instructions on the .S functional unit (but not other instructions executing on the .S functional
unit).
15 11 10 9 8 7 6 5 4 3 2 1 0
Reserved DIV0 UNORD UND INEX OVER INFO INVAL DEN2 DEN1 NAN2 NAN1
R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
Table 2-26. Floating-Point Auxiliary Configuration Register (FAUCR) Field Descriptions (continued)
Bit Field Value Description
19 DEN2 Denormalized number select for .S2 src2.
0 src2 is not a denormalized number.
1 src2 is a denormalized number.
18 DEN1 Denormalized number select for .S2 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
17 NAN2 NaN select for .S2 src2.
0 src2 is not NaN.
1 src2 is NaN.
16 NAN1 NaN select for .S2 src1.
0 src1 is not NaN.
1 src1 is NaN.
15-11 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
10 DIV0 Source to reciprocal operation for .S1.
0 0 is not source to reciprocal operation.
1 0 is source to reciprocal operation.
9 UNORD Source to a compare operation for .S1
0 NaN is not a source to a compare operation.
1 NaN is a source to a compare operation.
8 UND Result underflow status for .S1.
0 Result does not underflow.
1 Result underflows.
7 INEX Inexact results status for .S1.
0
1 Result differs from what would have been computed had the exponent range and precision been
unbounded; never set with INVAL.
6 OVER Result overflow status for .S1.
0 Result does not overflow.
1 Result overflows.
5 INFO Signed infinity for .S1.
0 Result is not signed infinity.
1 Result is signed infinity.
4 INVAL
0 A signaling NaN (SNaN) is not a source.
1 A signaling NaN (SNaN) is a source. NaN is a source in a floating-point to integer conversion or when
infinity is subtracted from infinity.
3 DEN2 Denormalized number select for .S1 src2.
0 src2 is not a denormalized number.
1 src2 is a denormalized number.
2 DEN1 Denormalized number select for .S1 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
1 NAN2 NaN select for .S1 src2.
0 src2 is not NaN.
1 src2 is NaN.
0 NAN1 NaN select for .S1 src1.
0 src1 is not NaN.
1 src1 is NaN.
15 11 10 9 8 7 6 5 4 3 2 1 0
Reserved RMODE UNDER INEX OVER INFO INVAL DEN2 DEN1 NAN2 NAN1
R-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0 R/W-0
LEGEND: R = Readable by the MVC instruction; W = Writeable by the MVC instruction; -n = value after reset
Table 2-27. Floating-Point Multiplier Configuration Register (FMCR) Field Descriptions (continued)
Bit Field Value Description
16 NAN1 NaN select for .M2 src1.
0 src1 is not NaN.
1 src1 is NaN.
15-11 Reserved 0 Reserved. The reserved bit location is always read as 0. A value written to this field has no effect.
10-9 RMODE 0-3h Rounding mode select for .M1.
0 Round toward nearest representable floating-point number
1h Round toward 0 (truncate)
2h Round toward infinity (round up)
3h Round toward negative infinity (round down)
8 UNDER Result underflow status for .M1.
0 Result does not underflow.
1 Result underflows.
7 INEX Inexact results status for .M1.
0
1 Result differs from what would have been computed had the exponent range and precision been
unbounded; never set with INVAL.
6 OVER Result overflow status for .M1.
0 Result does not overflow.
1 Result overflows.
5 INFO Signed infinity for .M1.
0 Result is not signed infinity.
1 Result is signed infinity.
4 INVAL
0 A signaling NaN (SNaN) is not a source.
1 A signaling NaN (SNaN) is a source. NaN is a source in a floating-point to integer conversion or when
infinity is subtracted from infinity.
3 DEN2 Denormalized number select for .M1 src2.
0 src2 is not a denormalized number.
1 src2 is a denormalized number.
2 DEN1 Denormalized number select for .M1 src1.
0 src1 is not a denormalized number.
1 src1 is a denormalized number.
1 NAN2 NaN select for .M1 src2.
0 src2 is not NaN.
1 src2 is NaN.
0 NAN1 NaN select for .M1 src1.
0 src1 is not NaN.
1 src1 is NaN.
Instruction Set
This chapter describes the assembly language instructions of the TMS320C674x DSP. Also described are
parallel operations, conditional operations, resource constraints, and addressing modes.
The C674x DSP uses all of the instructions available to the TMS320C62x, TMS320C64x, TMS320C64x+,
TMS320C67x, and TMS320C67x+ DSPs. The C674x DSP instructions include 8-bit and 16-bit extensions,
nonaligned word loads and stores, and data packing/unpacking operations.
LEGEND: s = sign bit (0 = positive, 1 = negative); e = 8-bit exponent (0 < e < 255);
f = 23-bit fraction (0 < f < 1 × 2^-1 + 1 × 2^-2 + ... + 1 × 2^-23, or 0 < f < (2^23 - 1)/2^23)
The floating-point fields represent floating-point numbers within two ranges: normalized (e is between 0
and 255) and denormalized (e is 0). The following formulas define how to translate the s, e, and f fields
into a single-precision floating-point number.
Table 3-4 shows the s, e, and f values for special single-precision floating-point numbers.
Table 3-5 shows hexadecimal and decimal values for some single-precision floating-point numbers.
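The translation of the s, e, and f fields into a single-precision value can be sketched in Python as follows. The sketch applies the standard IEEE 754 single-precision interpretation (exponent bias 127, implied leading 1 for normalized numbers) and cross-checks it against the host's own decoding; the function names are illustrative, and the sketch says nothing about how individual instructions treat denormalized sources.

```python
import struct


def decode_sp(bits):
    """Translate a 32-bit pattern into a value using the s, e, and f fields."""
    s = (bits >> 31) & 0x1
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    sign = -1.0 if s else 1.0
    if e == 0:                        # denormalized (or zero): 0.f x 2^-126
        return sign * (f / 2**23) * 2.0**-126
    if e == 255:                      # special values: infinity or NaN
        return sign * float("inf") if f == 0 else float("nan")
    return sign * (1 + f / 2**23) * 2.0**(e - 127)   # normalized: 1.f x 2^(e-127)


def host_sp(bits):
    """Reference decoding of the same pattern using the struct module."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


one = decode_sp(0x3F800000)          # s = 0, e = 127, f = 0
minus_two = decode_sp(0xC0000000)    # s = 1, e = 128, f = 0
```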
The floating-point fields represent floating-point numbers within two ranges: normalized (e is between 0
and 2047) and denormalized (e is 0). The following formulas define how to translate the s, e, and f fields
into a double-precision floating-point number.
Table 3-6 shows the s, e, and f values for special double-precision floating-point numbers.
Table 3-7 shows hexadecimal and decimal values for some double-precision floating-point numbers.
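For reference, the standard IEEE 754 double-precision translation is written out here. It assumes the usual field widths of a 1-bit sign s, an 11-bit exponent e, and a 52-bit fraction f, with an exponent bias of 1023; these are properties of the IEEE format rather than values quoted from this section.

```latex
% Normalized: 0 < e < 2047
\text{value} = (-1)^{s} \times 2^{\,e-1023} \times 1.f

% Denormalized: e = 0, f \neq 0
\text{value} = (-1)^{s} \times 2^{-1022} \times 0.f
```

Here 1.f and 0.f denote the binary fraction f with an implied leading 1 or 0, respectively.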
The CPU supports compact 16-bit instructions. Unlike the normal 32-bit instructions, the p-bit information
for compact instructions is not contained within the instruction opcode. Instead, the p-bit is contained
within the p-bits field within the fetch packet header. See Section 3.10 for more information.
The execution of the individual noncompact instructions is partially controlled by a bit in each instruction,
the p-bit. The p-bit (bit 0) determines whether the instruction executes in parallel with another instruction.
The p-bits are scanned from left to right (lower to higher address). If the p-bit of instruction I is 1, then
instruction I + 1 is to be executed in parallel with (in the same cycle as) instruction I. If the p-bit of
instruction I is 0, then instruction I + 1 is executed in the cycle after instruction I. All instructions executing
in parallel constitute an execute packet. An execute packet can contain up to eight instructions. Each
instruction in an execute packet must use a different functional unit.
On the CPU, the execute packet can cross fetch packet boundaries, but will be limited to no more than
eight instructions in a fetch packet. The last instruction in an execute packet will be marked with its p-bit
cleared to zero. There are three types of p-bit patterns for fetch packets. These three p-bit patterns result
in the following execution sequences for the eight instructions:
• Fully serial
• Fully parallel
• Partially serial
Example 3-1 through Example 3-3 show the conversion of a p-bit sequence into a cycle-by-cycle
execution stream of instructions.
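The p-bit rule can be stated as a short grouping procedure. The following Python sketch splits one fetch packet into execute packets; the function name and the use of single letters for instructions are illustrative, and a trailing group whose last p-bit is 1 is returned as-is to indicate that it continues into the next fetch packet.

```python
def execute_packets(instructions, p_bits):
    """Group instructions into execute packets, scanning from lower to higher address.

    p_bits[i] == 1 means instruction i + 1 executes in parallel with instruction i;
    p_bits[i] == 0 marks the last instruction of an execute packet.
    """
    packets, current = [], []
    for instruction, p in zip(instructions, p_bits):
        current.append(instruction)
        if p == 0:
            packets.append(current)
            current = []
    if current:                       # packet continues in the next fetch packet
        packets.append(current)
    return packets


fetch_packet = list("ABCDEFGH")
fully_serial = execute_packets(fetch_packet, [0, 0, 0, 0, 0, 0, 0, 0])
fully_parallel = execute_packets(fetch_packet, [1, 1, 1, 1, 1, 1, 1, 0])
partially_serial = execute_packets(fetch_packet, [0, 0, 1, 1, 0, 1, 1, 0])
```

The partially serial pattern yields four execute packets: A, then B, then C, D, and E in parallel, then F, G, and H in parallel.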
Cycle/Execute Packet    Instructions
1                       A B C D E F G H
instruction B
instruction C
|| instruction D
|| instruction E
instruction F
|| instruction G
|| instruction H
Conditional instructions are represented in code by using square brackets, [ ], surrounding the condition
register name. The following execute packet contains two ADD instructions in parallel. The first ADD is
conditional on B0 being nonzero. The second ADD is conditional on B0 being zero. The character !
indicates the inverse of the condition.
[B0] ADD .L1 A1,A2,A3
|| [!B0] ADD .L2 B1,B2,B3
The above instructions are mutually exclusive: only one will execute. If they are scheduled in parallel,
mutually exclusive instructions are constrained as described in Section 3.8. If mutually exclusive
instructions share any resources as described in Section 3.8, they cannot be scheduled in parallel (put in
the same execute packet), even though only one will execute.
The act of making an instruction conditional is often called predication and the conditional register is often
called the predication register.
3.8.2 Constraints on the Same Functional Unit Writing in the Same Instruction Cycle
The .M unit has two 32-bit write ports; so the results of a 4-cycle 32-bit instruction and a 2-cycle 32-bit
instruction operating on the same .M unit can write their results on the same instruction cycle. Any other
combination of parallel writes on the .M unit will result in a conflict. On the C674x DSP this will result in an
exception.
On the C674x DSP, this will result in erroneous values being written to the destination registers.
For example, the following sequence is valid and results in both A2 and A5 being written by the .M1 unit
on the same cycle.
DOTP2 .M1 A0,A1,A2 ;This instruction has 3 delay slots
NOP
AVG2 .M1 A4,A5 ;This instruction has 1 delay slot
NOP ;Both A2 and A5 get written on this cycle
The following sequence is invalid. The attempt to write 96 bits of output through 64 bits of write port will
fail.
SMPY2 .M1 A5,A6,A9:A8 ;This instruction has 3 delay slots, but generates a 64-bit result
NOP
MPY .M1 A1,A2,A3 ;This instruction has 1 delay slot
NOP
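The two sequences above can be compared with a simplified write-port budget. This Python sketch is a deliberately coarse model: it only sums the result bits that reach the two 32-bit write ports of one .M unit in each cycle, with the write cycle taken as issue cycle plus delay slots. The function name and the tuple layout are assumptions made for this illustration.

```python
def write_port_conflicts(schedule, port_bits=64):
    """Return the cycles in which results on one .M unit exceed the write-port width.

    schedule is a list of (issue_cycle, delay_slots, result_bits) tuples.
    """
    bits_per_cycle = {}
    for issue_cycle, delay_slots, result_bits in schedule:
        write_cycle = issue_cycle + delay_slots
        bits_per_cycle[write_cycle] = bits_per_cycle.get(write_cycle, 0) + result_bits
    return sorted(c for c, bits in bits_per_cycle.items() if bits > port_bits)


# DOTP2 (3 delay slots, 32 bits) at cycle 0; AVG2 (1 delay slot, 32 bits) at cycle 2.
valid = write_port_conflicts([(0, 3, 32), (2, 1, 32)])

# SMPY2 (3 delay slots, 64 bits) at cycle 0; MPY (1 delay slot, 32 bits) at cycle 2.
invalid = write_port_conflicts([(0, 3, 64), (2, 1, 32)])
```

In the first schedule both results land on cycle 3 and total 64 bits; in the second they total 96 bits on the same cycle.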
The following execute packet is valid because all uses of the 1X cross path are for the same B register
operand, and all uses of the 2X cross path are for the same A register operand:
ADD .L1X A0,B1,A1 ; Instructions use the 1X with B1
|| SUB .S1X A2,B1,A2 ; 1X cross paths using B1
|| AND .D1 A4,A1,A3 ;
|| MPY .M1 A6,A1,A4 ;
|| ADD .L2 B0,B4,B2 ;
|| SUB .S2X B4,A4,B3 ; 2X cross paths using A4
|| AND .D2X B5,A4,B4 ; 2X cross paths using A4
|| MPY .M2 B6,B4,B5 ;
The following execute packet is invalid because more than two functional units use the same cross path
operand:
MV .L2X A0, B0 ; 1st cross path move
|| MV .S2X A0, B1 ; 2nd cross path move
|| MV .D2X A0, B2 ; 3rd cross path move
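The two rules illustrated by these execute packets (one common operand per cross path, and no more than two functional units sharing it) can be checked mechanically. The following Python sketch is an illustration only; the function name and the (unit, cross operand) input format are hypothetical, and the operand must be supplied by the caller because the sketch does not parse assembly.

```python
from collections import defaultdict


def cross_path_violations(packet, max_units=2):
    """Check one execute packet; packet is a list of (unit, cross_operand) pairs.

    A unit name ending in X (for example ".L1X") uses the cross path of its side;
    cross_operand is the register read through that path, or None if unused.
    """
    uses = defaultdict(list)
    for unit, operand in packet:
        if unit.upper().endswith("X"):
            uses[unit[-2] + "X"].append(operand)   # ".S2X" -> path "2X"
    problems = []
    for path, operands in sorted(uses.items()):
        if len(set(operands)) > 1:
            problems.append(path + ": more than one cross path operand")
        if len(operands) > max_units:
            problems.append(path + ": more than %d units use the operand" % max_units)
    return problems


valid_packet = [(".L1X", "B1"), (".S1X", "B1"), (".D1", None), (".M1", None),
                (".L2", None), (".S2X", "A4"), (".D2X", "A4"), (".M2", None)]
invalid_packet = [(".L2X", "A0"), (".S2X", "A0"), (".D2X", "A0")]
```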
The operand comes from a register file opposite of the destination, if the x bit in the instruction field is set.
It is possible to avoid the cross path stall by scheduling an instruction that reads an operand via the cross
path at least one cycle after the operand is updated. With appropriate scheduling, the DSP can provide
one cross path operand per data path per cycle with no stalls. In many cases, the TMS320C6000
Optimizing Compiler and Assembly Optimizer automatically perform this scheduling.
Figure 3-4 shows different multiple-write conflicts. For example, ADD and SUB in execute packet L1 write
to the same register. This conflict is easily detectable.
MPY in packet L2 and ADD in packet L3 might both write to B2 simultaneously; however, if a branch
instruction causes the execute packet after L2 to be something other than L3, a conflict would not occur.
Thus, the potential conflict in L2 and L3 might not be detected by the assembler. The instructions in L4 do
not constitute a write conflict because they are mutually exclusive. In contrast, because the instructions in
L5 may or may not be mutually exclusive, the assembler cannot determine a conflict. If the pipeline does
receive commands to perform multiple writes to the same register, the result is undefined.
3.8.11.3 DINT
A DINT instruction cannot be placed in parallel with the following instructions:
• MVC reg, TSR
• MVC reg, CSR
• B IRP
• B NRP
• IDLE
• NOP n (if n > 1)
• RINT
• SPKERNEL(R)
• SPLOOP(D/W)
• SPMASK(R)
• SWE
• SWENR
A DINT instruction can be placed in parallel with the NOP instruction.
3.8.11.4 IDLE
An IDLE instruction cannot be placed in parallel with the following instructions:
• DINT
• NOP n (if n > 1)
• RINT
• SPKERNEL(R)
• SPLOOP(D/W)
• SPMASK(R)
• SWE
• SWENR
An IDLE instruction can be placed in parallel with the NOP instruction.
3.8.11.5 NOP n
A NOP n (with n > 1) instruction cannot be placed in parallel with other multicycle NOP counts (ADDKPC,
BNOP, CALLP) with the exception of another NOP n where the NOP count is the same. A NOP n (with
n > 1) instruction cannot be placed in parallel with the following instructions:
• DINT
• IDLE
• RINT
• SPKERNEL(R)
• SPLOOP(D/W)
• SPMASK(R)
• SWE
• SWENR
3.8.11.6 RINT
A RINT instruction cannot be placed in parallel with the following instructions:
• MVC reg, TSR
• MVC reg, CSR
• B IRP
• B NRP
• DINT
• IDLE
• NOP n (if n > 1)
• SPKERNEL(R)
• SPLOOP(D/W)
• SPMASK(R)
• SWE
• SWENR
A RINT instruction can be placed in parallel with the NOP instruction.
3.8.11.7 SPKERNEL(R)
An SPKERNEL(R) instruction cannot be placed in parallel with the following instructions:
• DINT
• IDLE
• NOP n (if n > 1)
• RINT
• SPLOOP(D/W)
• SPMASK(R)
• SWE
• SWENR
An SPKERNEL(R) instruction can be placed in parallel with the NOP instruction.
3.8.11.8 SPLOOP(D/W)
An SPLOOP(D/W) instruction cannot be placed in parallel with the following instructions:
• DINT
• IDLE
• NOP n (if n > 1)
• RINT
• SPKERNEL(R)
• SPMASK(R)
• SWE
• SWENR
An SPLOOP(D/W) instruction can be placed in parallel with the NOP instruction.
3.8.11.9 SPMASK(R)
An SPMASK(R) instruction cannot be placed in parallel with the following instructions:
• DINT
• IDLE
• NOP n (if n > 1)
• RINT
• SPLOOP(D/W)
• SPKERNEL(R)
• SWE
• SWENR
An SPMASK(R) instruction can be placed in parallel with the NOP instruction.
3.8.11.10 SWE
An SWE instruction cannot be placed in parallel with the following instructions:
• DINT
• IDLE
• NOP n (if n > 1)
• RINT
• SPLOOP(D/W)
• SPKERNEL(R)
• SWENR
An SWE instruction can be placed in parallel with the NOP instruction.
3.8.11.11 SWENR
An SWENR instruction cannot be placed in parallel with the following instructions:
• DINT
• IDLE
• NOP n (if n > 1)
• RINT
• SPLOOP(D/W)
• SPKERNEL(R)
• SWE
An SWENR instruction can be placed in parallel with the NOP instruction.
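The restriction lists in Sections 3.8.11.3 through 3.8.11.11 can be collected into one lookup table. This Python sketch transcribes those lists and treats a pair as restricted if either instruction names the other; the table name, the string keys, and the symmetric treatment are choices made for this illustration. The additional rule for NOP n against other multicycle NOP counts (ADDKPC, BNOP, CALLP) is not modeled.

```python
# Instructions that restrict one another; "NOP n" means a NOP with n > 1.
GROUP = ["DINT", "IDLE", "NOP n", "RINT", "SPKERNEL(R)",
         "SPLOOP(D/W)", "SPMASK(R)", "SWE", "SWENR"]

# Entries that appear only in the DINT and RINT lists.
CONTROL = {"MVC reg, TSR", "MVC reg, CSR", "B IRP", "B NRP"}

FORBIDDEN = {name: set(GROUP) - {name} for name in GROUP}
FORBIDDEN["DINT"] |= CONTROL
FORBIDDEN["RINT"] |= CONTROL
# The SWE and SWENR lists do not name SPMASK(R); the SPMASK(R) list names both.
FORBIDDEN["SWE"].discard("SPMASK(R)")
FORBIDDEN["SWENR"].discard("SPMASK(R)")


def may_be_parallel(a, b):
    """Return True if neither instruction lists the other as a restricted partner."""
    return b not in FORBIDDEN.get(a, set()) and a not in FORBIDDEN.get(b, set())


dint_with_nop = may_be_parallel("DINT", "NOP")      # single-cycle NOP is allowed
dint_with_rint = may_be_parallel("DINT", "RINT")    # restricted in both lists
```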
DP compare No other instruction can use the functional unit on cycles I and I + 1.
ADDDP/SUBDP No other instruction can use the functional unit on cycles I and I + 1.
MPYI No other instruction can use the functional unit on cycles I, I + 1, I + 2, and I + 3.
MPYID No other instruction can use the functional unit on cycles I, I + 1, I + 2, and I + 3.
MPYDP No other instruction can use the functional unit on cycles I, I + 1, I + 2, and I + 3.
If a cross path is used to read a source in an instruction with a multicycle functional unit latency, you must
ensure that no other instruction executing on the same side uses the cross path.
An instruction of the following types scheduled on cycle I using a cross path to read a source has the
following constraints:
DP compare No other instruction on the same side can use the cross path on cycles I and I + 1.
ADDDP/SUBDP No other instruction on the same side can use the cross path on cycles I and I + 1.
MPYI No other instruction on the same side can use the cross path on cycles I, I + 1, I + 2,
and I + 3.
MPYID No other instruction on the same side can use the cross path on cycles I, I + 1, I + 2,
and I + 3.
MPYDP No other instruction on the same side can use the cross path on cycles I, I + 1, I + 2,
and I + 3.
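The cycle ranges in the two lists above are the same for the functional unit and for the cross path, so a single lookup can serve both checks. This is a minimal Python sketch; the dictionary and function names are illustrative.

```python
# Number of consecutive cycles, starting at issue cycle I, during which the
# functional unit (and the cross path, if used) is unavailable to other instructions.
BUSY_CYCLES = {
    "DP compare": 2,
    "ADDDP/SUBDP": 2,
    "MPYI": 4,
    "MPYID": 4,
    "MPYDP": 4,
}


def blocked_cycles(kind, issue_cycle):
    """Return the list of cycles blocked by an instruction of the given type."""
    return list(range(issue_cycle, issue_cycle + BUSY_CYCLES[kind]))


def collides(kind, issue_cycle, other_issue_cycle):
    """Return True if another instruction issued at other_issue_cycle would collide."""
    return other_issue_cycle in blocked_cycles(kind, issue_cycle)


mpydp_window = blocked_cycles("MPYDP", 10)   # cycles I through I + 3
```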
Other hazards exist because instructions have varying numbers of delay slots and need the functional
unit read and write ports for varying numbers of cycles. A read or write hazard exists when two instructions
on the same functional unit attempt to read or write, respectively, to the register file on the same cycle.
An instruction of the following types scheduled on cycle I has the following constraints:
All of the previous cases deal with double-precision floating-point instructions or the MPYI or MPYID
instructions, except for the 4-cycle case. The 4-cycle instructions include both single- and double-precision
floating-point instructions. Therefore, the 4-cycle case is important for the following single-precision
floating-point instructions:
• ADDSP
• SUBSP
• SPINT
• SPTRUNC
• INTSP
• MPYSP
Before LDW            1 cycle after LDW     5 cycles after LDW
A4 0000 0100h         A4 0000 0104h         A4 0000 0104h
A1 xxxx xxxxh         A1 xxxx xxxxh         A1 1234 5678h
mem 104h 1234 5678h   mem 104h 1234 5678h   mem 104h 1234 5678h
(1) Note: 9h words is 24h bytes. 24h bytes is 4 bytes beyond the 32-byte (20h) boundary 100h-11Fh; thus, it is wrapped
around to 104h (124h - 20h = 104h).
Before ADDAH          1 cycle after ADDAH
A4 0000 0100h         A4 0000 0106h
A1 0000 0013h         A1 0000 0013h
(1) Note: 13h halfwords is 26h bytes. 26h bytes is 6 bytes beyond the 32-byte (20h) boundary 100h-11Fh; thus, it is wrapped
around to 106h (126h - 20h = 106h).
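The wrap-around arithmetic in the two notes above can be sketched as follows. This is an illustrative model only: `circular_update` is a hypothetical helper name, and it assumes the block is aligned to its own (power-of-two) size, as the 100h-11Fh boundary in the notes suggests.

```python
def circular_update(base, offset_bytes, block_size):
    """Add a byte offset to base, wrapping inside a power-of-two block.

    The address bits above the block size are left unchanged; only the low
    bits (the position inside the block) take part in the addition.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block size must be a power of two")
    mask = block_size - 1
    return (base & ~mask) | ((base + offset_bytes) & mask)


# LDW note: 9h words = 24h bytes from 100h in a 32-byte (20h) block.
ldw_addr = circular_update(0x100, 0x9 * 4, 0x20)

# ADDAH note: 13h halfwords = 26h bytes from 100h in the same block.
addah_addr = circular_update(0x100, 0x13 * 2, 0x20)
```

Scaling the index by the element size (4 for words, 2 for halfwords) before the wrap mirrors the conversion to bytes that each note performs first.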
1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3
7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8
x x x x x x x x x a b c d e f g h i j k l m n o p x x x x x x x x x
The effect of circular buffering is to make it so that memory accesses and address updates in the 20h-2Fh
range stay completely inside this range. Effectively, the memory map behaves in this manner:
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 1 2 3 4 5 6 7 8
h i j k l m n o p a b c d e f g h i j k l m n o p a b c d e f g h i
Example 3-6 shows an LDNW performed with register A4 in circular mode and BK0 = 4, so the buffer size
is 32 bytes, 16 halfwords, or 8 words. The value in AMR for this example is 0004 0001h. The buffer occupies
addresses 0020h through 003Fh. The register A4 is initialized to the address 003Ah.
Before LDNW             1 cycle after LDNW      5 cycles after LDNW
A4 0000 003Ah           A4 0000 0022h           A4 0000 0022h
A1 xxxx xxxxh           A1 xxxx xxxxh           A1 5678 9ABCh
mem 0022h 5678 9ABCh    mem 0022h 5678 9ABCh    mem 0022h 5678 9ABCh
(1) Note: 2h words is 8h bytes. Starting at address 003Ah, 8h bytes reaches 0042h, which is 2 bytes beyond the end of the
32-byte (20h) buffer; thus, it is wrapped around to 0022h (003Ah + 8h - 20h = 0022h).
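The relationship between the AMR value and the buffer size in this example can be sketched as below. The field positions used here (A4 mode in bits 1-0, BK0 in bits 20-16) and the size rule 2^(BK0 + 1) bytes are assumptions that are consistent with the values quoted in the example (AMR = 0004 0001h, BK0 = 4, 32 bytes); see the AMR description in Section 2.8.3 for the authoritative layout. The helper names are hypothetical.

```python
def amr_fields(amr):
    """Extract the two AMR fields used by this example (assumed layout)."""
    return {
        "A4_mode": amr & 0x3,        # assumed: bits 1-0 select the A4 mode
        "BK0": (amr >> 16) & 0x1F,   # assumed: bits 20-16 hold BK0
    }


def block_size_bytes(bk):
    """Block size implied by a BK field value: 2**(bk + 1) bytes."""
    return 1 << (bk + 1)


amr = 0x00040001
fields = amr_fields(amr)
size = block_size_bytes(fields["BK0"])

# LDNW in the example advances A4 by 2 words (8 bytes) inside the buffer.
mask = size - 1
a4_before = 0x003A
a4_after = (a4_before & ~mask) | ((a4_before + 2 * 4) & mask)
```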
Within the other seven words of the fetch packet, each word may be composed of a single 32-bit opcode
or two 16-bit opcodes. The header word specifies which words contain compact opcodes and which
contain 32-bit opcodes.
The compiler will automatically code instructions as 16-bit compact instructions when possible.
There are a number of restrictions to the use of compact instructions:
• No dedicated predication field
• 3-bit register address field
• Very limited 3 operand instructions
• Subset of 32-bit instructions
Bits 27-21 (Layout field) indicate which words in the fetch packet contain 32-bit opcodes and which words
contain two 16-bit opcodes.
Bits 20-14 (Expansion field) contain information that contributes to the decoding of all compact
instructions in the fetch packet.
Bits 13-0 (p-bits field) specify which compact instructions are run in parallel.
Bit 20 (PROT) selects between protected and nonprotected mode for all LD instructions within the fetch
packet. When PROT is 1, four cycles of NOP are added after each LD instruction within the fetch packet
whether the LD is in 16-bit compact format or 32-bit format.
Bit 19 (RS) specifies which register set is used by compact instructions within the fetch packet. The
register set defines which subset of 8 registers on each side are data registers. The 3-bit register field in
the compact opcode indicates which one of eight registers is used. When RS is 1, the high register set
(A16-A23 and B16-B23) is used; when RS is 0, the low register set (A0-A7 and B0-B7) is used.
Bits 18-16 (DSZ) determine the two data sizes available to the compact versions of the LD and ST
instructions in a fetch packet. Bit 18 determines the primary data size that is either word (W) or
doubleword (DW). In the case of DW, an opcode bit selects between aligned (DW) and nonaligned (NDW)
accesses. Bits 17 and 16 determine the secondary data size: byte unsigned (BU), byte (B), halfword
unsigned (HU), halfword (H), word (W), or nonaligned word (NW). Table 3-15 describes how the bits map
to data size.
Bit 15 (BR). When BR is 1, instructions in the S unit are decoded as branches.
Bit 14 (SAT). When SAT is 1, the ADD, SUB, SHL, MPY, MPYH, MPYLH, and MPYHL instructions are
decoded as SADD, SUBS, SSHL, SMPY, SMPYH, SMPYLH, and SMPYHL, respectively.
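The header fields described above can be pulled apart with plain shifts and masks. The sketch below is illustrative: `decode_compact_header` is a hypothetical name, it only extracts the fields named in this section (bits 27-0), and it does not validate the remaining upper bits of the header word.

```python
def decode_compact_header(word):
    """Split a fetch packet header word into the fields named above."""
    return {
        "layout": (word >> 21) & 0x7F,     # bits 27-21
        "expansion": (word >> 14) & 0x7F,  # bits 20-14 (PROT..SAT together)
        "p_bits": word & 0x3FFF,           # bits 13-0
        "PROT": (word >> 20) & 1,          # 1: four NOP cycles after each LD
        "RS": (word >> 19) & 1,            # 1: high register set
        "DSZ": (word >> 16) & 0x7,         # bits 18-16, LD/ST data sizes
        "BR": (word >> 15) & 1,            # 1: .S unit decodes as branches
        "SAT": (word >> 14) & 1,           # 1: saturating forms are decoded
    }


# Hypothetical header value, built field by field for illustration only.
example = (0b1010101 << 21) | (1 << 20) | (0b101 << 16) | (1 << 14) | 0x2AAA
decoded = decode_compact_header(example)
```

Note that PROT, RS, DSZ, BR, and SAT are simply named sub-ranges of the 7-bit expansion field, so the `expansion` entry and the five individual entries always agree.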
Description Instruction execution and its effect on the rest of the processor or memory contents are
described. Any constraints on the operands imposed by the processor or the assembler
are discussed. The description parallels and supplements the information given by the
execution block.
Execution The execution describes the processing that takes place when the instruction is
executed. The symbols are defined in Table 3-1. For example:
Pipeline This section contains a table that shows the sources read from, the destinations written
to, and the functional unit used during each execution cycle of the instruction.
Instruction Type This section gives the type of instruction. See Section 4.2 for information about the
pipeline execution of this type of instruction.
Delay Slots This section gives the number of delay slots the instruction takes to execute. See
Section 3.4 for an explanation of delay slots.
Functional Unit Latency This section gives the number of cycles that the functional unit is in use during the
execution of the instruction.
Example Examples of instruction execution. If applicable, register and memory values are given
before and after instruction execution.
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x op 1 1 0 s p
3 1 5 5 1 7 1 1
Execution
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
ABS .L1 A1,A5
Example 2
ABS .L1 A1,A5
Example 3
ABS .L1 A1:A0,A5:A4
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 1 0 0 x 0 0 1 1 0 1 0 1 1 0 s p
3 1 5 5 1 1 1
Description The absolute values of the upper and lower halves of the src2 operand are placed in the
upper and lower halves of the dst.
31 16 15 0
a_hi a_lo ← src2
ABS2
↓ ↓
31 16 15 0
abs(a_hi) abs(a_lo) ← dst
Specifically, this instruction performs the following steps for each halfword of src2, then
writes its result to the appropriate halfword of dst:
1. If the value is between 0 and 2^15, then value → dst
2. If the value is less than 0 and not equal to -2^15, then -value → dst
3. If the value is equal to -2^15, then 2^15 - 1 → dst
Execution
if (cond) {
abs(lsb16(src2)) → lsb16(dst)
abs(msb16(src2)) → msb16(dst)
}
else nop
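The three per-halfword rules above can be modeled in a few lines. This is a behavioral sketch with hypothetical helper names, operating on a 32-bit register value given as a Python integer; it ignores predication (cond is assumed true).

```python
def _abs16(halfword):
    """Absolute value of one signed 16-bit field, saturating -2**15."""
    value = halfword & 0xFFFF
    if value & 0x8000:
        value -= 0x10000          # reinterpret as signed
    if value == -0x8000:
        return 0x7FFF             # rule 3: -2**15 maps to 2**15 - 1
    return abs(value)             # rules 1 and 2


def abs2(src2):
    """Model of ABS2: independent absolute value of each halfword."""
    hi = _abs16(src2 >> 16)
    lo = _abs16(src2)
    return (hi << 16) | lo
```

The saturating case matters because +2^15 is not representable in a signed 16-bit field.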
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
ABS2 .L1 A0,A2
Example 2
ABS2 .L1 A0,A2
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 1 0 1 1 0 0 1 0 0 0 s p
3 1 5 5 1 1 1
Description The absolute value of src2 is placed in dst. The 64-bit double-precision operand is read
in one cycle by using the src2 port for the 32 MSBs and the src1 port for the 32 LSBs.
The absolute value of src2 is determined as follows:
1. If src2 ≥ 0, then src2 →dst
2. If src2 < 0, then -src2 → dst
NOTE:
1. If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2
bits are set.
2. If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3. If src2 is denormalized, +0 is placed in dst and the INEX and DEN2
bits are set.
4. If src2 is +infinity or −infinity, +infinity is placed in dst and the INFO bit
is set.
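The non-NaN cases of the rules above can be sketched on the raw 64-bit pattern of the register pair. This is an assumption-laden model: `absdp` is a hypothetical name, the status bits (INEX, DEN2, INFO) are not modeled, and NaN inputs are rejected because the NaN_out pattern is not defined in this section.

```python
def absdp(bits):
    """Model of ABSDP on an IEEE 754 double given as a 64-bit integer."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0x7FF and fraction != 0:
        raise ValueError("NaN input yields NaN_out, which is not modeled here")
    if exponent == 0 and fraction != 0:
        return 0                      # denormalized input: +0 is returned
    return bits & ~(1 << 63)          # otherwise clear the sign bit


# -2.5 is C004 0000h 0000 0000h; its absolute value is 4004 0000h 0000 0000h.
result = absdp(0xC004000000000000)
```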
Execution
Pipeline
Pipeline Stage E1 E2
Read src2_l, src2_h
Written dst_l dst_h
Unit in use .S
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 1
A1:A0 C004 0000h 0000 0000h -2.5 A1:A0 C004 0000h 0000 0000h
A3:A2 xxxx xxxxh xxxx xxxxh A3:A2 4004 0000h 0000 0000h 2.5
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 1 1 1 1 0 0 1 0 0 0 s p
3 1 5 5 1 1 1
NOTE:
1. If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2
bits are set.
2. If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3. If src2 is denormalized, +0 is placed in dst and the INEX and DEN2
bits are set.
4. If src2 is +infinity or −infinity, +infinity is placed in dst and the INFO bit
is set.
Execution
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .S
Delay Slots 0
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
creg z dst src2 src1 x op 1 0 0 0 s p
3 1 5 5 5 1 6 1 1
Description for .L1, .L2 and .S1, .S2 Opcodes src2 is added to src1. The result is placed in dst.
31 29 28 27 23 22 18 17 13 12 7 6 5 4 3 2 1 0
creg z dst src2 src1 op 1 0 0 0 0 s p
3 1 5 5 5 6 1 1
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 0 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Opcode .D unit (if the cross path form is used with a constant)
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 0 1 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description for .D1, .D2 Opcodes src1 is added to src2. The result is placed in dst.
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, or .D
Delay Slots 0
Examples Example 1
ADD .L2X A1,B1,B2
Example 2
ADD .L1 A1,A3:A2,A5:A4
A3:A2 0000 00FFh FFFF FF12h -238 (1) A3:A2 0000 00FFh FFFF FF12h
A5:A4 0000 0000h 0000 0000h A5:A4 0000 0000h 0000 316Ch 12,652 (1)
(1) Signed 40-bit (long) integer
Example 3
ADD .L1 -13,A1,A6
Example 4
ADD .D1 A1,26,A6
Example 5
ADD .D1 B0,5,A2
Opcode
31 29 28 27 23 22 18 17 13 12 7 6 5 4 3 2 1 0
creg z dst src2 src1 op 1 0 0 0 0 s p
3 1 5 5 5 6 1 1
Description src1 is added to src2 using the byte addressing mode specified for src2. The addition
defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode can be
changed to circular mode by writing the appropriate value to the AMR (see
Section 2.8.3). The result is placed in dst.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Opcode
31 30 29 28 27 23 22 8 7 6 5 4 3 2 1 0
0 0 0 1 dst ucst15 y 0 1 1 1 1 s p
5 15 1 1 1
Description This instruction reads a register (baseR), B14 (y = 0) or B15 (y = 1), and adds a 15-bit
unsigned constant (ucst15) to it, writing the result to a register (dst). This instruction is
executed unconditionally; it cannot be predicated.
The offset, ucst15, is added to baseR. The result of the calculation is written into dst.
The addressing arithmetic is always performed in linear mode.
The s bit determines the unit used (D1 or D2) and the file the destination is written to:
s = 0 indicates the unit is D1 and dst is in the A register file; and s = 1 indicates the unit
is D2 and dst is in the B register file.
Pipeline
Pipeline Stage E1
Read B14/B15
Written dst
Unit in use .D
Delay Slots 0
Examples Example 1
ADDAB .D1 A4,A2,A4
Example 2
ADDAB .D1X B14,42h,A4
Example 3
ADDAB .D2 B14,7FFFh,B4
Opcode
31 29 28 27 23 22 18 17 13 12 7 6 5 4 3 2 1 0
creg z dst src2 src1 op 1 0 0 0 0 s p
3 1 5 5 5 6 1 1
Description src1 is added to src2 using the doubleword addressing mode specified for src2. The
addition defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode
can be changed to circular mode by writing the appropriate value to the AMR (see
Section 2.8.3). src1 is left shifted by 3 due to doubleword data sizes. The result is placed
in dst.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 7 6 5 4 3 2 1 0
creg z dst src2 src1 op 1 0 0 0 0 s p
3 1 5 5 5 6 1 1
Description src1 is added to src2 using the halfword addressing mode specified for src2. The
addition defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode
can be changed to circular mode by writing the appropriate value to the AMR (see
Section 2.8.3). src1 is left shifted by 1. The result is placed in dst.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Opcode
31 30 29 28 27 23 22 8 7 6 5 4 3 2 1 0
0 0 0 1 dst ucst15 y 1 0 1 1 1 s p
5 15 1 1 1
Description This instruction reads a register (baseR), B14 (y = 0) or B15 (y = 1), and adds a scaled
15-bit unsigned constant (ucst15) to it, writing the result to a register (dst). This
instruction is executed unconditionally; it cannot be predicated.
The offset, ucst15, is scaled by a left-shift of 1 and added to baseR. The result of the
calculation is written into dst. The addressing arithmetic is always performed in linear
mode.
The s bit determines the unit used (D1 or D2) and the file the destination is written to:
s = 0 indicates the unit is D1 and dst is in the A register file; and s = 1 indicates the unit
is D2 and dst is in the B register file.
Pipeline
Pipeline Stage E1
Read B14/B15
Written dst
Unit in use .D
Delay Slots 0
Examples Example 1
ADDAH .D1 A4,A2,A4
Example 2
ADDAH .D1X B14,42h,A4
Example 3
ADDAH .D2 B14,7FFFh,B4
Opcode
31 29 28 27 23 22 18 17 13 12 7 6 5 4 3 2 1 0
creg z dst src2 src1 op 1 0 0 0 0 s p
3 1 5 5 5 6 1 1
Description src1 is added to src2 using the word addressing mode specified for src2. The addition
defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode can be
changed to circular mode by writing the appropriate value to the AMR (see
Section 2.8.3). src1 is left shifted by 2. The result is placed in dst.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Opcode
31 30 29 28 27 23 22 8 7 6 5 4 3 2 1 0
0 0 0 1 dst ucst15 y 1 1 1 1 1 s p
5 15 1 1 1
Description This instruction reads a register (baseR), B14 (y = 0) or B15 (y = 1), and adds a scaled
15-bit unsigned constant (ucst15) to it, writing the result to a register (dst). This
instruction is executed unconditionally; it cannot be predicated.
The offset, ucst15, is scaled by a left-shift of 2 and added to baseR. The result of the
calculation is written into dst. The addressing arithmetic is always performed in linear
mode.
The s bit determines the unit used (D1 or D2) and the file the destination is written to:
s = 0 indicates the unit is D1 and dst is in the A register file; and s = 1 indicates the unit
is D2 and dst is in the B register file.
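The B14/B15 forms of ADDAB, ADDAH, and ADDAW differ only in how far ucst15 is shifted before the addition (0, 1, and 2 bits, respectively). A minimal sketch follows; the function name and the base-register value used in the usage lines are hypothetical, and the result is assumed to be truncated to 32 bits.

```python
SCALE_SHIFT = {"ADDAB": 0, "ADDAH": 1, "ADDAW": 2}  # byte, halfword, word


def add_scaled_ucst15(mnemonic, base, ucst15):
    """Linear-mode address arithmetic: base + (ucst15 << scale)."""
    if not 0 <= ucst15 <= 0x7FFF:
        raise ValueError("ucst15 must fit in 15 unsigned bits")
    return (base + (ucst15 << SCALE_SHIFT[mnemonic])) & 0xFFFFFFFF


# Hypothetical base value in B14; the largest constant reaches a different
# distance from the base for each data size.
b14 = 0x00100000
reach = {m: add_scaled_ucst15(m, b14, 0x7FFF) for m in SCALE_SHIFT}
```

With the maximum constant 7FFFh, the three forms reach 32 KB, 64 KB, and 128 KB (less one element) above the base register.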
Pipeline
Pipeline Stage E1
Read B14/B15
Written dst
Unit in use .D
Delay Slots 0
Examples Example 1
ADDAW .D1 A4,2,A4
Example 2
ADDAW .D1X B14,42h,A4
Example 3
ADDAW .D2 B14,7FFFh,B4
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
NOTE:
1. This instruction takes the rounding mode from and sets the warning
bits in the floating-point adder configuration register (FADCR), not in
the floating-point auxiliary configuration register (FAUCR) as for
other .S unit instructions.
2. If rounding is performed, the INEX bit is set.
3. If one source is SNaN or QNaN, the result is NaN_out. If either
source is SNaN, the INVAL bit is also set.
4. If one source is +infinity and the other is −infinity, the result is
NaN_out and the INVAL bit is set.
5. If one source is signed infinity and the other source is anything
except NaN or signed infinity of the opposite sign, the result is
signed infinity and the INFO bit is set.
6. If overflow occurs, the INEX and OVER bits are set and the results
are rounded as follows (LFPN is the largest floating-point number):
Execution
Pipeline
Pipeline Stage   E1               E2               E3   E4   E5   E6      E7
Read             src1_l, src2_l   src1_h, src2_h
Written                                                           dst_l   dst_h
Unit in use      .L or .S         .L or .S
The low half of the result is written out one cycle earlier than the high half. If dst is used
as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP, MPYDP, MPYSPDP,
MPYSP2DP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 6
B1:B0 4021 3333h 3333 3333h B1:B0 4021 3333h 3333 3333h 8.6
A3:A2 C004 0000h 0000 0000h A3:A2 C004 0000h 0000 0000h -2.5
A5:A4 xxxx xxxxh xxxx xxxxh A5:A4 4018 6666h 6666 6666h 6.1
Opcode
31 29 28 27 23 22 7 6 5 4 3 2 1 0
creg z dst cst16 1 0 1 0 0 s p
3 1 5 16 1 1
Description A 16-bit signed constant, cst16, is added to the dst register specified. The result is
placed in dst.
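As an illustration of the description above, the following sketch sign-extends the 16-bit constant and adds it to dst. The name `addk` is reused as a hypothetical Python function, and the result is assumed to wrap modulo 2^32 (no saturation is mentioned in this description).

```python
def addk(dst, cst16):
    """Model of ADDK: dst + sign-extended 16-bit constant, modulo 2**32."""
    cst16 &= 0xFFFF
    signed = cst16 - 0x10000 if cst16 & 0x8000 else cst16
    return (dst + signed) & 0xFFFFFFFF
```

A constant with bit 15 set is therefore a subtraction: adding FFFFh decrements the register by one.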
Execution
Pipeline
Pipeline Stage E1
Read cst16
Written dst
Unit in use .S
Delay Slots 0
Example ADDK .S1 15401,A1
Opcode
31 29 28 27 23 22 16 15 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src1 src2 0 0 0 0 1 0 1 1 0 0 0 s p
3 1 5 7 3 1 1
Description A 7-bit signed constant, src1, is shifted 2 bits to the left, then added to the address of the
first instruction of the fetch packet that contains the ADDKPC instruction (PCE1). The
result is placed in dst. The 3-bit unsigned constant, src2, specifies the number of NOP
cycles to insert after the current instruction. This instruction helps reduce the number of
instructions needed to set up the return address for a function call.
The following code:
B .S2 func
MVKL .S2 LABEL, B3
MVKH .S2 LABEL, B3
NOP 3
LABEL
could be replaced by:
B .S2 func
ADDKPC .S2 LABEL, B3, 4
LABEL
The 7-bit value coded as src1 is the difference between LABEL and PCE1 shifted right
by 2 bits. The address of LABEL must be within 9 bits of PCE1.
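The encoding rule for src1 and the value written to dst can be sketched as below. Both function names are hypothetical, the addresses in the usage lines are invented for illustration, and LABEL is assumed to be word aligned.

```python
def addkpc_src1(label, pce1):
    """Return the 7-bit src1 field: (LABEL - PCE1) >> 2, as signed 7 bits."""
    diff = label - pce1
    if diff % 4:
        raise ValueError("LABEL must be word aligned relative to PCE1")
    scst7 = diff >> 2
    if not -64 <= scst7 <= 63:
        raise ValueError("LABEL is not within 9 bits of PCE1")
    return scst7 & 0x7F


def addkpc_result(pce1, src1_field):
    """Value written to dst: PCE1 + (sign-extended src1 << 2)."""
    scst7 = src1_field - 0x80 if src1_field & 0x40 else src1_field
    return (pce1 + (scst7 << 2)) & 0xFFFFFFFF


# Hypothetical addresses: fetch packet at 004013A0h, LABEL 24h bytes later.
field = addkpc_src1(0x004013C4, 0x004013A0)
return_address = addkpc_result(0x004013A0, field)
```

The 7-bit signed field shifted left by 2 gives a byte range of -256 to +252, which is the "within 9 bits" limit stated above.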
Only one ADDKPC instruction can be executed per cycle. An ADDKPC instruction
cannot be paired with any relative branch instruction in the same execute packet. If an
ADDKPC and a relative branch are in the same execute packet, and if the ADDKPC
instruction is executed when the branch is taken, behavior is undefined.
The ADDKPC instruction cannot be paired with any other multicycle NOP instruction in
the same execute packet. Instructions that generate a multicycle NOP are: IDLE, BNOP,
and the multicycle NOP.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
NOTE:
1. This instruction takes the rounding mode from and sets the warning
bits in the floating-point adder configuration register (FADCR), not in
the floating-point auxiliary configuration register (FAUCR) as for
other .S unit instructions.
2. If rounding is performed, the INEX bit is set.
3. If one source is SNaN or QNaN, the result is NaN_out. If either
source is SNaN, the INVAL bit is also set.
4. If one source is +infinity and the other is −infinity, the result is
NaN_out and the INVAL bit is set.
5. If one source is signed infinity and the other source is anything
except NaN or signed infinity of the opposite sign, the result is
signed infinity and the INFO bit is set.
6. If overflow occurs, the INEX and OVER bits are set and the results
are rounded as follows (LFPN is the largest floating-point number):
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .L or .S
Delay Slots 3
Opcode
31 30 29 28 27 24 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst 0 src2 src1 x 0 0 0 1 1 0 0 1 1 0 s p
4 5 5 1 1 1
Execution
Delay Slots 0
Examples Example 1
ADDSUB .L1 A0,A1,A3:A2
Example 2
ADDSUB .L2X B0,A1,B3:B2
Opcode
31 30 29 28 27 24 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst 0 src2 src1 x 0 0 0 1 1 0 1 1 1 0 s p
4 5 5 1 1 1
Description For the ADD2 operation, the upper and lower halves of the src2 operand are added to
the upper and lower halves of the src1 operand. The values in src1 and src2 are treated
as signed, packed 16-bit data and the results are written in signed, packed 16-bit format
into dst_o.
For the SUB2 operation, the upper and lower halves of the src2 operand are subtracted
from the upper and lower halves of the src1 operand. The values in src1 and src2 are
treated as signed, packed 16-bit data and the results are written in signed, packed 16-bit
format into dst_e.
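The paired operation described above can be modeled as follows. This is a sketch under stated assumptions: the function name is hypothetical, each 16-bit result is assumed to wrap modulo 2^16 (no saturation is described here), and the two results are returned as a (dst_o, dst_e) tuple standing in for the odd and even registers of the destination pair.

```python
def _pack2(hi, lo):
    """Pack two 16-bit results into one 32-bit value."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def addsub2(src1, src2):
    """Model of ADDSUB2: ADD2 result in dst_o, SUB2 result in dst_e."""
    s1_hi, s1_lo = (src1 >> 16) & 0xFFFF, src1 & 0xFFFF
    s2_hi, s2_lo = (src2 >> 16) & 0xFFFF, src2 & 0xFFFF
    dst_o = _pack2(s1_hi + s2_hi, s1_lo + s2_lo)   # src1 + src2 per halfword
    dst_e = _pack2(s1_hi - s2_hi, s1_lo - s2_lo)   # src1 - src2 per halfword
    return dst_o, dst_e
```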
Execution
Delay Slots 0
Examples Example 1
ADDSUB2 .L1 A0,A1,A3:A2
Example 2
ADDSUB2 .L2X B0,A1,B3:B2
Example 3
ADDSUB2 .L1 A0,A1,A3:A2
Example 4
ADDSUB2 .L1 A0,A1,A3:A2
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
ADDU .L1 A1,A2,A5:A4
A5:A4 xxxx xxxxh A5:A4 0000 0001h 0000 316Ch 4,294,979,948 (2)
(1) Unsigned 32-bit integer
(2) Unsigned 40-bit (long) integer
Example 2
ADDU .L1 A1,A3:A2,A5:A4
A5:A4 0000 0000h 0000 0000h 0 A5:A4 0000 0000h 0000 316Ch 12,652 (2)
(1) Unsigned 32-bit integer
(2) Unsigned 40-bit (long) integer
ADD2 Add Two 16-Bit Integers on Upper and Lower Register Halves
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 0 0 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Opcode .L Unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 0 1 0 1 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .D unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 1 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The upper and lower halves of the src1 operand are added to the upper and lower
halves of the src2 operand. The values in src1 and src2 are treated as signed, packed
16-bit data and the results are written in signed, packed 16-bit format into dst.
For each pair of signed packed 16-bit values found in the src1 and src2, the sum
between the 16-bit value from src1 and the 16-bit value from src2 is calculated to
produce a 16-bit result. The result is placed in the corresponding positions in the dst.
The carry from the lower half add does not affect the upper half add.
31 16 15 0
a_hi a_lo ← src1
+ +
ADD2
= =
31 16 15 0
a_hi + b_hi a_lo + b_lo ← dst
Execution
if (cond) {
msb16(src1) + msb16(src2) → msb16(dst);
lsb16(src1) + lsb16(src2) → lsb16(dst)
}
else nop
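The execution block above can be modeled in a few lines. This is an illustrative sketch (cond is assumed true, and `add2` is a hypothetical Python function); the key point is that each halfword sum is truncated to 16 bits, so a carry out of the lower half never reaches the upper half.

```python
def add2(src1, src2):
    """Model of ADD2: two independent 16-bit additions."""
    hi = ((src1 >> 16) + (src2 >> 16)) & 0xFFFF
    lo = ((src1 & 0xFFFF) + (src2 & 0xFFFF)) & 0xFFFF
    return (hi << 16) | lo
```

For instance, adding 0001h to a lower half of FFFFh yields 0000h there while the upper half sum is unaffected.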
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S, .L, .D
Delay Slots 0
Examples Example 1
ADD2 .S1X A1,B1,A2
Example 2
ADD2 .L1 A0,A1,A2
ADD4 Add Without Saturation, Four 8-Bit Pairs for Four 8-Bit Results
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 0 1 0 1 1 1 0 s p
3 1 5 5 5 1 1 1
Description Performs 2s-complement addition between packed 8-bit quantities. The values in src1
and src2 are treated as packed 8-bit data and the results are written into dst in a packed
8-bit format.
For each pair of packed 8-bit values in src1 and src2, the sum between the 8-bit value
from src1 and the 8-bit value from src2 is calculated to produce an 8-bit result. No
saturation is performed. The carry from one 8-bit add does not affect the add of any
other 8-bit add. The result is placed in the corresponding positions in dst:
• The sum of src1 byte0 and src2 byte0 is placed in byte0 of dst.
• The sum of src1 byte1 and src2 byte1 is placed in byte1 of dst.
• The sum of src1 byte2 and src2 byte2 is placed in byte2 of dst.
• The sum of src1 byte3 and src2 byte3 is placed in byte3 of dst.
31 24 23 16 15 8 7 0
a_3 a_2 a_1 a_0 ← src1
+ + + +
ADD4
= = = =
31 24 23 16 15 8 7 0
a_3 + b_3 a_2 + b_2 a_1 + b_1 a_0 + b_0 ← dst
Execution
if (cond) {
byte0(src1) + byte0(src2) → byte0(dst);
byte1(src1) + byte1(src2) → byte1(dst);
byte2(src1) + byte2(src2) → byte2(dst);
byte3(src1) + byte3(src2) → byte3(dst)
}
else nop
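A byte-lane model of the execution block above is sketched below. It assumes cond is true; `add4` is a hypothetical Python function, and each 8-bit sum is truncated so that carries never cross lanes.

```python
def add4(src1, src2):
    """Model of ADD4: four independent 8-bit additions without saturation."""
    dst = 0
    for shift in (0, 8, 16, 24):            # byte0 .. byte3
        lane_sum = ((src1 >> shift) & 0xFF) + ((src2 >> shift) & 0xFF)
        dst |= (lane_sum & 0xFF) << shift   # drop the carry out of the lane
    return dst
```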
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
ADD4 .L1 A0,A1,A2
Example 2
ADD4 .L1 A0,A1,A2
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
creg z dst src2 src1 x op 1 0 0 0 s p
3 1 5 5 5 1 6 1 1
Opcode .D unit
31 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 op 1 1 0 0 s p
3 1 5 5 5 1 4 1 1
Description Performs a bitwise AND operation between src1 and src2. The result is placed in dst.
The scst5 operands are sign extended to 32 bits.
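The sign extension of the 5-bit constant form can be illustrated as follows. The helper names are hypothetical; the constant is passed as its raw 5-bit field, and the register operand as a 32-bit value.

```python
def sext5(field):
    """Sign-extend a 5-bit field to a Python integer."""
    field &= 0x1F
    return field - 0x20 if field & 0x10 else field


def and_scst5(scst5_field, src2):
    """Model of the AND constant form: sign-extended scst5 AND src2."""
    return (sext5(scst5_field) & src2) & 0xFFFFFFFF
```

Because of the sign extension, a constant of -1 (field value 1Fh) becomes an all-ones mask and leaves src2 unchanged, while a constant of 15 keeps only the low four bits.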
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, or .D
Delay Slots 0
Examples Example 1
AND .L1X A1,B1,A2
Example 2
AND .L1 15,A1,A3
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 1 1 0 0 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 1 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Opcode .D unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 0 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a bitwise logical AND operation between src1 and the bitwise logical inverse of
src2. The result is placed in dst.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, or .D
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 0 1 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs an averaging operation on packed 16-bit data. For each pair of signed 16-bit
values found in src1 and src2, AVG2 calculates the average of the two values and
returns a signed 16-bit quantity in the corresponding position in the dst.
The averaging operation is performed by adding 1 to the sum of the two 16-bit numbers
being averaged. The result is then right-shifted by 1 to produce a 16-bit result.
No overflow conditions exist.
31 16 15 0
sa_1 sa_0 ← src1
AVG2
↓ ↓
31 16 15 0
(sa_1 + sb_1 + 1) >> 1 (sa_0 + sb_0 + 1) >> 1 ← dst
Execution
if (cond) {
((lsb16(src1) + lsb16(src2) + 1) >> 1) → lsb16(dst);
((msb16(src1) + msb16(src2) + 1) >> 1) → msb16(dst)
}
else nop
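The rounding average in the execution block above can be modeled as below. This sketch assumes cond is true and uses a hypothetical function name; the intermediate sum is held in a wider integer, which is why no overflow can occur.

```python
def avg2(src1, src2):
    """Model of AVG2: (a + b + 1) >> 1 on each signed 16-bit pair."""
    def s16(x):
        x &= 0xFFFF
        return x - 0x10000 if x & 0x8000 else x

    hi = (s16(src1 >> 16) + s16(src2 >> 16) + 1) >> 1
    lo = (s16(src1) + s16(src2) + 1) >> 1
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
```

Python's `>>` on a negative integer is an arithmetic shift, which matches the signed interpretation used here.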
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 0 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs an averaging operation on packed 8-bit data. The values in src1 and src2 are
treated as unsigned, packed 8-bit data and the results are written in unsigned, packed
8-bit format. For each unsigned, packed 8-bit value found in src1 and src2, AVGU4
calculates the average of the two values and returns an unsigned, 8-bit quantity in the
corresponding positions in the dst.
The averaging operation is performed by adding 1 to the sum of the two 8-bit numbers
being averaged. The result is then right-shifted by 1 to produce an 8-bit result.
No overflow conditions exist.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ← src1
AVGU4
↓ ↓ ↓ ↓
31 24 23 16 15 8 7 0
(ua_3 + ub_3 + 1) >> 1 (ua_2 + ub_2 + 1) >> 1 (ua_1 + ub_1 + 1) >> 1 (ua_0 + ub_0 + 1) >> 1 ← dst
Execution
if (cond) {
((ubyte0(src1) + ubyte0(src2) + 1) >> 1) → ubyte0(dst);
((ubyte1(src1) + ubyte1(src2) + 1) >> 1) → ubyte1(dst);
((ubyte2(src1) + ubyte2(src2) + 1) >> 1) → ubyte2(dst);
((ubyte3(src1) + ubyte3(src2) + 1) >> 1) → ubyte3(dst)
}
else nop
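A byte-lane model of the execution block above is sketched below, assuming cond is true. `avgu4` is a hypothetical Python function; the 9-bit intermediate sum is kept in full before the shift, so the result always fits in 8 bits.

```python
def avgu4(src1, src2):
    """Model of AVGU4: (a + b + 1) >> 1 on each unsigned 8-bit pair."""
    dst = 0
    for shift in (0, 8, 16, 24):            # ubyte0 .. ubyte3
        a = (src1 >> shift) & 0xFF
        b = (src2 >> shift) & 0xFF
        dst |= ((a + b + 1) >> 1) << shift
    return dst
```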
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
A0 1A 2E 5F 4Eh 26 46 95 78 A0 1A 2E 5F 4Eh
unsigned
Opcode
31 29 28 27 7 6 5 4 3 2 1 0
creg z cst21 0 0 1 0 0 s p
3 1 21 1 1
Description A 21-bit signed constant, cst21, is shifted left by 2 bits and is added to the address of the
first instruction of the fetch packet that contains the branch instruction. The result is
placed in the program fetch counter (PFC). The assembler/linker automatically computes
the correct value for cst21 by the following formula:
cst21 = (label - PCE1) >> 2
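The formula above and its inverse can be sketched as follows. The function names are hypothetical, and the label is assumed to be word aligned; the usage lines take their addresses from the displacement example in this section, where the branch sits in the fetch packet at 0000 0000h and LOOP is at 0000 000Ch.

```python
def encode_cst21(label, pce1):
    """cst21 = (label - PCE1) >> 2, returned as a 21-bit field."""
    diff = label - pce1
    if diff % 4:
        raise ValueError("label must be word aligned relative to PCE1")
    cst = diff >> 2
    if not -(1 << 20) <= cst < (1 << 20):
        raise ValueError("label is out of range for a 21-bit displacement")
    return cst & 0x1FFFFF


def branch_target(pce1, cst21):
    """Address loaded into PFC: PCE1 + (sign-extended cst21 << 2)."""
    signed = cst21 - (1 << 21) if cst21 & (1 << 20) else cst21
    return (pce1 + (signed << 2)) & 0xFFFFFFFF


cst21 = encode_cst21(0x0000000C, 0x00000000)
target = branch_target(0x00000000, cst21)
```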
If two branches are in the same execute packet and both are taken, behavior is
undefined.
Two conditional branches can be in the same execute packet if one branch uses a
displacement and the other uses a register, IRP, or NRP. As long as only one branch
has a true condition, the code executes in a well-defined way.
NOTE:
1. PCE1 (program counter) represents the address of the first
instruction in the fetch packet in the E1 stage of the pipeline.
PFC is the program fetch counter.
2. The execute packets in the delay slots of a branch cannot be
interrupted. This is true regardless of whether the branch is taken.
3. See Section 3.5.2 for information on branching into the middle of an
execute packet.
4. A branch to an execute packet that spans two fetch packets will
cause a stall while the second fetch packet is fetched.
5. A relative branch instruction cannot be in the same execute packet
as an ADDKPC instruction.
Execution
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read
Written
Branch taken ✓
Unit in use .S
Delay Slots 5
Example Table 3-19 gives the program counter values and actions for the following code example.
0000 0000 B .S1 LOOP
0000 0004 ADD .L1 A1, A2, A3
0000 0008 || ADD .L2 B1, B2, B3
0000 000C LOOP: MPY .M1X A3, B3, A4
0000 0010 || SUB .D1 A5, A6, A6
0000 0014 MPY .M1 A3, A6, A5
0000 0018 MPY .M1 A6, A7, A8
0000 001C SHR .S1 A4, 15, A4
0000 0020 ADD .D1 A4, A6, A4
Table 3-19. Program Counter Values for Branch Using a Displacement Example
Cycle     Program Counter Value   Action
Cycle 0   0000 0000h              Branch command executes (target code fetched)
Cycle 1   0000 0004h
Cycle 2   0000 000Ch
Cycle 3   0000 0014h
Cycle 4   0000 0018h
Cycle 5   0000 001Ch
Cycle 6   0000 000Ch              Branch target code executes
Cycle 7   0000 0014h
Opcode
31 29 28 27 26 25 24 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z 0 0 0 0 0 src2 0 0 0 0 0 x 0 0 1 1 0 1 1 0 0 0 1 p
3 1 5 1 1
NOTE:
1. This instruction executes on .S2 only. PFC is the program fetch counter.
2. The execute packets in the delay slots of a branch cannot be
interrupted. This is true regardless of whether the branch is taken.
3. See Section 3.5.2 for information on branching into the middle of an
execute packet.
4. A branch to an execute packet that spans two fetch packets will
cause a stall while the second fetch packet is fetched.
Execution
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read src2
Written
Branch taken ✓
Unit in use .S2
Delay Slots 5
Example Table 3-20 gives the program counter values and actions for the following code example.
In this example, the B10 register holds the value 1000 000Ch.
1000 0000 B .S2 B10
1000 0004 ADD .L1 A1, A2, A3
1000 0008 || ADD .L2 B1, B2, B3
1000 000C MPY .M1X A3, B3, A4
1000 0010 || SUB .D1 A5, A6, A6
1000 0014 MPY .M1 A3, A6, A5
1000 0018 MPY .M1 A6, A7, A8
1000 001C SHR .S1 A4, 15, A4
1000 0020 ADD .D1 A4, A6, A4
Table 3-20. Program Counter Values for Branch Using a Register Example
Cycle     Program Counter Value   Action
Cycle 0   1000 0000h              Branch command executes (target code fetched)
Cycle 1   1000 0004h
Cycle 2   1000 000Ch
Cycle 3   1000 0014h
Cycle 4   1000 0018h
Cycle 5   1000 001Ch
Cycle 6   1000 000Ch              Branch target code executes
Cycle 7   1000 0014h
Opcode
31 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 p
3 1 1
Description IRP is placed in the program fetch counter (PFC). This instruction also moves the PGIE
bit value to the GIE bit. The PGIE bit is unchanged.
If two branches are in the same execute packet and are both taken, behavior is
undefined.
Two conditional branches can be in the same execute packet if one branch uses a
displacement and the other uses a register, IRP, or NRP. As long as only one branch
has a true condition, the code executes in a well-defined way.
NOTE:
1. This instruction executes on .S2 only. PFC is the program fetch
counter.
2. Refer to Chapter 5 for more information on IRP, PGIE, and GIE.
3. The execute packets in the delay slots of a branch cannot be
interrupted. This is true regardless of whether the branch is taken.
4. See Section 3.5.2 for information on branching into the middle of an
execute packet.
5. A branch to an execute packet that spans two fetch packets will
cause a stall while the second fetch packet is fetched.
Execution
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read IRP
Written
Branch taken ✓
Unit in use .S2
Delay Slots 5
Example Table 3-21 gives the program counter values and actions for the following code example.
Given that an interrupt occurred at
PC = 0000 1000 IRP = 0000 1000
0000 0020 B .S2 IRP
0000 0024 ADD .S1 A0, A2, A1
0000 0028 MPY .M1 A1, A0, A1
0000 002C NOP
0000 0030 SHR .S1 A1, 15, A1
0000 0034 ADD .L1 A1, A2, A1
0000 0038 ADD .L2 B1, B2, B3
Opcode
31 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 p
3 1 1
Description NRP is placed in the program fetch counter (PFC). This instruction also sets the NMIE
bit. The PGIE bit is unchanged.
If two branches are in the same execute packet and are both taken, behavior is
undefined.
Two conditional branches can be in the same execute packet if one branch uses a
displacement and the other uses a register, IRP, or NRP. As long as only one branch
has a true condition, the code executes in a well-defined way.
NOTE:
1. This instruction executes on .S2 only. PFC is the program fetch counter.
2. Refer to Chapter 5 for more information on NRP and NMIE.
3. The execute packets in the delay slots of a branch cannot be
interrupted. This is true regardless of whether the branch is taken.
4. See Section 3.5.2 for information on branching into the middle of an
execute packet.
5. A branch to an execute packet that spans two fetch packets will
cause a stall while the second fetch packet is fetched.
Execution
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read NRP
Written
Branch taken ✓
Unit in use .S2
Delay Slots 5
Example Table 3-22 gives the program counter values and actions for the following code example.
Given that an interrupt occurred at
PC = 0000 1000 NRP = 0000 1000
0000 0020 B .S2 NRP
0000 0024 ADD .S1 A0, A2, A1
0000 0028 MPY .M1 A1, A0, A1
0000 002C NOP
0000 0030 SHR .S1 A1, 15, A1
0000 0034 ADD .L1 A1, A2, A1
0000 0038 ADD .L2 B1, B2, B3
Opcode
31 29 28 27 23 22 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src 1 0 0 0 0 0 0 1 0 0 0 s p
3 1 5 10 1 1
Description If the predication and decrement register (dst) is positive (greater than or equal to 0), the
BDEC instruction performs a relative branch and decrements dst by 1. The instruction
performs the relative branch using a 10-bit signed constant, scst10, in src. The constant
is shifted 2 bits to the left, then added to the address of the first instruction of the fetch
packet that contains the BDEC instruction (PCE1). The result is placed in the program
fetch counter (PFC).
This instruction helps reduce the number of instructions needed to decrement a register
and conditionally branch based upon the value of the register. Note also that any register
can be used as dst, which frees the predicate registers (A0-A2 and B0-B2) for other uses.
The following code:
CMPLT .L1 A10,0,A1
[!A1] SUB .L1 A10,1,A10
||[!A1] B .S1 func
NOP 5
could be replaced by:
BDEC .S1 func,A10
NOP 5
NOTE:
1. Only one BDEC instruction can be executed per cycle. The BDEC
instruction can be predicated by using any conventional condition
register. The conditions are effectively ANDed together. If two
branches are in the same execute packet, and if both are taken,
behavior is undefined.
2. See Section 3.5.2 for information on branching into the middle of an
execute packet.
3. A branch to an execute packet that spans two fetch packets will
cause a stall while the second fetch packet is fetched.
4. The BDEC instruction cannot be in the same execute packet as an
ADDKPC instruction.
Execution
if (cond) {
if (dst >= 0), PFC = (PCE1 + (se(scst10) << 2));
if (dst >= 0), dst = dst - 1;
else nop
}
else nop
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read dst
Written dst, PC
Branch taken ✓
Unit in use .S
Delay Slots 5
Examples Example 1
BDEC .S1 100h,A10
Example 2
BDEC .S1 300h,A10 ; 300h is sign extended
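The branch-target arithmetic in the BDEC Description can be sketched in a few lines of Python. This is an illustrative model only: the helper names `sign_extend` and `bdec` and the sample PCE1 value are assumptions made here, and pipeline timing (the 5 delay slots) is not modelled.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def bdec(pce1, scst10, dst, cond=True):
    """Return (new_pfc_or_None, new_dst) for one BDEC (hypothetical model).

    pce1   : address of the first instruction of the fetch packet
    scst10 : raw 10-bit constant field
    dst    : signed value held in the decrement register
    """
    if cond and dst >= 0:
        # Constant is sign extended, shifted left by 2, then added to PCE1.
        pfc = (pce1 + (sign_extend(scst10, 10) << 2)) & 0xFFFFFFFF
        return pfc, dst - 1
    return None, dst  # branch not taken, dst unchanged


# Assumed PCE1 of 0100 0000h, with the constants used in Examples 1 and 2.
for field in (0x100, 0x300):
    pfc, new_dst = bdec(0x01000000, field, 10)
    print(f"scst10={field:03X}h -> PFC={pfc:08X}h, dst={new_dst}")
```

With the 300h constant, bit 9 is set, so the sign-extended value is negative and the branch goes backward.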
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 1 1 1 1 0 x 0 0 0 0 1 1 1 1 0 0 s p
3 1 5 5 1 1 1
Description Performs a bit-count operation on 8-bit quantities. The value in src2 is treated as packed
8-bit data, and the result is written in packed 8-bit format. For each of the 8-bit quantities
in src2, the count of the number of 1 bits in that value is written to the corresponding
position in dst.
31 24 23 16 15 8 7 0
ub_3 ub_2 ub_1 ub_0 ← src2
BITC4
↓ ↓ ↓ ↓
31 24 23 16 15 8 7 0
bit_count(ub_3) bit_count(ub_2) bit_count(ub_1) bit_count(ub_0) ← dst
Execution
if (cond) {
bit_count(src2(ubyte0)) → ubyte0(dst);
bit_count(src2(ubyte1)) → ubyte1(dst);
bit_count(src2(ubyte2)) → ubyte2(dst);
bit_count(src2(ubyte3)) → ubyte3(dst)
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst
Unit in use .M
Delay Slots 1
Example BITC4 .M1 A1,A2
A1 9E 52 6E 30h A1 9E 52 6E 30h
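The per-byte bit count can be checked with a small Python model. This is a sketch of the Execution pseudo-code above, not vendor code; the function name `bitc4` is chosen here for convenience.

```python
def bitc4(src2):
    """Count the 1 bits in each byte of src2 and pack the four counts."""
    dst = 0
    for i in range(4):
        byte = (src2 >> (8 * i)) & 0xFF
        dst |= bin(byte).count("1") << (8 * i)
    return dst


# A1 value taken from the example above.
print(f"{bitc4(0x9E526E30):08X}h")
```

Each count is at most 8, so it always fits in its destination byte.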
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 1 1 1 1 1 x 0 0 0 0 1 1 1 1 0 0 s p
3 1 5 5 1 1 1
Description Implements a bit-reversal function that reverses the order of bits in a 32-bit word. This
means that bit 0 of the source becomes bit 31 of the result, bit 1 of the source becomes
bit 30 of the result, bit 2 becomes bit 29, and so on.
31 0
abcd efgh ijkl mnop qrst uvwx yzAB CDEF ← src2
BITR
31 0
FEDC BAzy xwvu tsrq ponm lkji hgfe dcba ← dst
Execution
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst
Unit in use .M
Delay Slots 1
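A minimal Python sketch of the 32-bit bit reversal described for BITR follows. The function name `bitr` and the sample input are assumptions for illustration.

```python
def bitr(src2):
    """Reverse the bit order of a 32-bit word (bit 0 <-> bit 31, and so on)."""
    bits = f"{src2 & 0xFFFFFFFF:032b}"  # MSB-first string of 32 bits
    return int(bits[::-1], 2)


value = 0x00000001  # hypothetical input
print(f"{bitr(value):08X}h")
```

Applying the operation twice returns the original word, which is a convenient sanity check.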
Opcode
31 29 28 27 16 15 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z src2 src1 0 0 0 0 1 0 0 1 0 0 0 s p
3 1 12 3 1 1
Description The constant displacement form of the BNOP instruction performs a relative branch with
NOP instructions. The instruction performs the relative branch using the 12-bit signed
constant specified by src2. The constant is shifted 2 bits to the left, then added to the
address of the first instruction of the fetch packet that contains the BNOP instruction
(PCE1). The result is placed in the program fetch counter (PFC).
The 3-bit unsigned constant specified in src1 gives the number of delay slot NOP
instructions to be inserted, from 0 to 7. With src1 = 0, no NOP cycles are inserted.
This instruction helps reduce the number of instructions to perform a branch when NOP
instructions are required to fill the delay slots of a branch.
The following code:
B .S1 LABEL
NOP N
LABEL: ADD
could be replaced by:
BNOP .S1 LABEL,N
LABEL: ADD
NOTE:
1. BNOP instructions may be predicated. The predication condition
controls whether or not the branch is taken, but does not affect the
insertion of NOPs. BNOP always inserts the number of NOPs
specified by N, regardless of the predication condition.
2. The execute packets in the delay slots of a branch cannot be
interrupted. This is true regardless of whether the branch is taken.
3. See Section 3.5.2 for information on branching into the middle of an
execute packet.
4. A branch to an execute packet that spans two fetch packets will
cause a stall while the second fetch packet is fetched.
Only one branch instruction can be executed per cycle. If two branches are in the same
execute packet, and if both are taken, the behavior is undefined. It should also be noted
that when a predicated BNOP instruction is used with a NOP count greater than 5, the
CPU inserts the full delay slots requested when the predicated condition is false.
For example, the following set of instructions will insert 7 cycles of NOPs:
ZERO .L1 A0
[A0] BNOP .S1 LABEL,7 ; branch is not taken and
; 7 cycles of NOPs are inserted
Conversely, when a predicated BNOP instruction is used with a NOP count greater than
5 and the predication condition is true, the branch will be taken and the multi-cycle NOP
is terminated when the branch is taken.
For example, in the following set of instructions, only 5 cycles of NOP are inserted:
MVK .D1 1,A0
[A0] BNOP .S1 LABEL,7 ; branch is taken and
; 5 cycles of NOPs are inserted
The BNOP instruction cannot be paired with any other multicycle NOP instruction in the
same execute packet. Instructions that generate a multicycle NOP are: IDLE, ADDKPC,
CALLP, and the multicycle NOP.
The BNOP instruction does not require the use of the .S unit. If no unit is specified, then
it may be scheduled in parallel with instructions executing on both the .S1 and .S2 units.
If either the .S1 or .S2 unit is specified for BNOP, then the .S unit specified is not
available for another instruction in the same execute packet. This is enforced by the
assembler.
It is possible to branch into the middle of a 32-bit instruction. The only case that will be
detected and result in an exception is when the 32-bit instruction is contained in a
compact header-based fetch packet. The header cannot be the target of a branch
instruction. In the event that the header is the target of a branch, an exception will be
raised.
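Execution (if instruction is within compact instruction fetch packet)
if (cond) {
PFC = (PCE1 + (se(scst12) << 1));
nop (src1)
}
else nop (src1 + 1)
Execution (if instruction is not within compact instruction fetch packet)
if (cond) {
PFC = (PCE1 + (se(scst12) << 2));
nop (src1)
}
else nop (src1 + 1)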
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read src2
Written PC
Branch taken ✓
Unit in use .S
Delay Slots 5
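The interaction between predication and the NOP count for the displacement form of BNOP can be sketched as follows. This is a hedged behavioural model built from the text above: the function and parameter names are hypothetical, the `compact` flag simply selects the `<< 1` variant of the Execution pseudo-code, and the returned cycle count follows the prose (full count when the branch is not taken, at most 5 when it is taken).

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def bnop_disp(pce1, scst12, nop_count, cond=True, compact=False):
    """Return (pfc_or_None, inserted_nop_cycles) for BNOP with a displacement."""
    if not 0 <= nop_count <= 7:
        raise ValueError("src1 is a 3-bit unsigned constant (0 to 7)")
    if cond:
        shift = 1 if compact else 2
        pfc = (pce1 + (sign_extend(scst12, 12) << shift)) & 0xFFFFFFFF
        # A taken branch terminates the multi-cycle NOP after the 5 delay slots.
        return pfc, min(nop_count, 5)
    return None, nop_count  # branch not taken: all requested NOP cycles run


# Mirrors the two predicated examples above (A0 = 0, then A0 = 1).
print(bnop_disp(0x1000, 0x004, 7, cond=False))
print(bnop_disp(0x1000, 0x004, 7, cond=True))
```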
Opcode
31 29 28 27 26 25 24 23 22 18 17 16 15 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z 0 0 0 0 1 src2 0 0 src1 x 0 0 1 1 0 1 1 0 0 0 1 p
3 1 5 3 1 1
Description The register form of the BNOP instruction performs an absolute branch with NOP
instructions. The register specified in src2 is placed in the program fetch counter (PFC).
For branch targets residing in compact header-based fetch packets, the 31
most-significant bits of the register are used to determine the branch target. For branch
targets not residing in compact header-based fetch packets, the 30 most-significant bits
of the register are used to determine the branch target.
The 3-bit unsigned constant specified in src1 gives the number of delay slot NOP
instructions to be inserted, from 0 to 7. With src1 = 0, no NOP cycles are inserted.
This instruction helps reduce the number of instructions to perform a branch when NOP
instructions are required to fill the delay slots of a branch.
The following code:
B .S2 B3
NOP N
could be replaced by:
BNOP .S2 B3,N
NOTE:
1. BNOP instructions may be predicated. The predication condition
controls whether or not the branch is taken, but does not affect the
insertion of NOPs. BNOP always inserts the number of NOPs
specified by N, regardless of the predication condition.
2. The execute packets in the delay slots of a branch cannot be
interrupted. This is true regardless of whether the branch is taken.
3. See Section 3.5.2 for information on branching into the middle of an
execute packet.
4. A branch to an execute packet that spans two fetch packets will
cause a stall while the second fetch packet is fetched.
Only one branch instruction can be executed per cycle. If two branches are in the same
execute packet, and if both are taken, the behavior is undefined. It should also be noted
that when a predicated BNOP instruction is used with a NOP count greater than 5, the
CPU inserts the full delay slots requested when the predicated condition is false.
For example, the following set of instructions will insert 7 cycles of NOPs:
ZERO .L1 A0
[A0] BNOP .S2 B3,7 ; branch is not taken and 7 cycles of NOPs are inserted
Conversely, when a predicated BNOP instruction is used with a NOP count greater than
5 and the predication condition is true, the branch will be taken and the multi-cycle NOP
is terminated when the branch is taken.
For example, in the following set of instructions only 5 cycles of NOP are inserted:
MVK .D1 1,A0
[A0] BNOP .S2 B3,7 ; branch is taken and 5 cycles of NOPs are inserted
The BNOP instruction cannot be paired with any other multicycle NOP instruction in the
same execute packet. Instructions that generate a multicycle NOP are: IDLE, ADDKPC,
CALLP, and the multicycle NOP.
Execution
if (cond) {
src2 → PFC;
nop (src1)
}
else nop (src1 + 1)
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read src2
Written PC
Branch taken ✓
Unit in use .S2
Delay Slots 5
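The statement that only the 31 or 30 most-significant bits of the register determine the target can be illustrated with a mask. The sketch below assumes that the unused low-order bits are simply ignored; the function name and the sample register value are hypothetical.

```python
def bnop_reg_target(reg_value, compact_target=False):
    """Return the branch target derived from the src2 register value.

    compact_target=True  : 31 MSBs are used (bit 0 ignored, assumed)
    compact_target=False : 30 MSBs are used (bits 1:0 ignored, assumed)
    """
    mask = 0xFFFFFFFE if compact_target else 0xFFFFFFFC
    return reg_value & mask


b3 = 0x10000C03  # hypothetical register contents
print(f"{bnop_reg_target(b3):08X}h")
print(f"{bnop_reg_target(b3, compact_target=True):08X}h")
```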
Opcode
31 29 28 27 23 22 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src 0 0 0 0 0 0 0 1 0 0 0 s p
3 1 5 10 1 1
Description If the predication register (dst) is positive (greater than or equal to 0), the BPOS
instruction performs a relative branch. If dst is negative, the BPOS instruction takes no
other action.
The instruction performs the relative branch using a 10-bit signed constant, scst10, in
src. The constant is shifted 2 bits to the left, then added to the address of the first
instruction of the fetch packet that contains the BPOS instruction (PCE1). The result is
placed in the program fetch counter (PFC).
Any register can be used as dst, which frees the predicate registers (A0-A2 and B0-B2) for
other uses.
NOTE:
1. Only one BPOS instruction can be executed per cycle. The BPOS
instruction can be predicated by using any conventional condition
register. The conditions are effectively ANDed together. If two
branches are in the same execute packet, and if both are taken,
behavior is undefined.
2. The execute packets in the delay slots of a branch cannot be
interrupted. This is true regardless of whether the branch is taken.
3. See Section 3.5.2 for information on branching into the middle of an
execute packet.
4. A branch to an execute packet that spans two fetch packets will
cause a stall while the second fetch packet is fetched.
5. The BPOS instruction cannot be in the same execute packet as an
ADDKPC instruction.
Execution
if (cond) {
if (dst >= 0), PFC = (PCE1 + (se(scst10) << 2));
else nop
}
else nop
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read dst
Written PC
Branch taken ✓
Unit in use .S
Delay Slots 5
Example BPOS .S1 200h,A10
Opcode
31 30 29 28 27 7 6 5 4 3 2 1 0
0 0 0 1 cst21 0 0 1 0 0 s p
21 1 1
Description A 21-bit signed constant, cst21, is shifted left by 2 bits and is added to the address of the
first instruction of the fetch packet that contains the branch instruction. The result is
placed in the program fetch counter (PFC). The assembler/linker automatically computes
the correct value for cst21 by the following formula:
cst21 = (label - PCE1) >> 2
The address of the execute packet immediately following the execute packet containing
the CALLP instruction is placed in A3 if the .S1 unit is used, or in B3 if the .S2 unit is
used. This write occurs in E1. An implied NOP 5 is inserted into the instruction pipeline,
occupying E2-E6.
Since this branch is taken unconditionally, it cannot be placed in the same execute
packet as another branch. Additionally, no other branches should be pending when the
CALLP instruction is executed.
CALLP, like other relative branch instructions, cannot have an ADDKPC instruction in
the same execute packet with it.
NOTE:
1. PCE1 (program counter) represents the address of the first
instruction in the fetch packet in the E1 stage of the pipeline. PFC is
the program fetch counter. retPC represents the address of the first
instruction of the execute packet in the DC stage of the pipeline.
2. The execute packets in the delay slots of a branch cannot be
interrupted. This is true regardless of whether the branch is taken.
Execution
Pipeline
Target Instruction
Pipeline Stage E1 PS PW PR DP DC E1
Read
Written A3/B3
Branch taken ✓
Unit in use .S
Delay Slots 5
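The cst21 formula in the CALLP Description and the corresponding target calculation can be checked against each other with a short Python sketch. The function names and the addresses are assumptions for illustration; the model assumes that label minus PCE1 is a multiple of 4 and does not cover the return-address write to A3/B3.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def callp_encode(label, pce1):
    """Compute the 21-bit field: cst21 = (label - PCE1) >> 2."""
    cst21 = (label - pce1) >> 2
    if not -(1 << 20) <= cst21 < (1 << 20):
        raise ValueError("target is out of range for a 21-bit signed constant")
    return cst21 & 0x1FFFFF


def callp_target(cst21, pce1):
    """Recover the PFC value: PCE1 + (sign-extended cst21 << 2)."""
    return (pce1 + (sign_extend(cst21, 21) << 2)) & 0xFFFFFFFF


field = callp_encode(label=0x0F00, pce1=0x1000)  # hypothetical backward call
print(f"cst21={field:06X}h -> PFC={callp_target(field, 0x1000):08X}h")
```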
31 29 28 27 23 22 18 17 13 12 8 7 6 5 4 3 2 1 0
creg z dst src2 csta cstb 1 1 0 0 1 0 s p
3 1 5 5 5 5 1 1
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 1 1 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description For cstb ≥ csta, the field in src2 as specified by csta to cstb is cleared to all 0s in dst.
The csta and cstb operands may be specified as constants or in the 10 LSBs of the src1
register, with cstb being bits 0−4 (src1 4..0) and csta being bits 5−9 (src1 9..5). csta is the
LSB of the field and cstb is the MSB of the field. In other words, csta and cstb represent
the beginning and ending bits, respectively, of the field to be cleared to all 0s in dst. The
LSB location of src2 is bit 0 and the MSB location of src2 is bit 31.
In the following example, csta is 15 and cstb is 23. For the register version of the
instruction, only the 10 LSBs of the src1 register are valid. If any of the 22 MSBs are
non-zero, the result is invalid.
cstb
csta
src2 X X X X X X X X 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
dst X X X X X X X X 0 0 0 0 0 0 0 0 0 X X X X X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
For cstb < csta, the src2 register is copied to dst. The csta and cstb operands may be
specified as constants or in the 10 LSBs of the src1 register, with cstb being bits 0−4
(src1 4..0) and csta being bits 5−9 (src1 9..5).
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
CLR .S1 A1,4,19,A2
Example 2
CLR .S2 B1,B3,B2
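The field-clearing rule can be modelled in Python for both the constant and the register forms. This is a sketch under the assumptions stated in the comments; the names `clr` and `clr_reg` and the sample operand values are hypothetical.

```python
def clr(src2, csta, cstb):
    """Clear bits csta..cstb (inclusive) of src2; copy src2 if cstb < csta."""
    src2 &= 0xFFFFFFFF
    if cstb < csta:
        return src2
    width = cstb - csta + 1
    mask = ((1 << width) - 1) << csta
    return src2 & ~mask & 0xFFFFFFFF


def clr_reg(src2, src1):
    """Register form: cstb = src1 bits 4..0, csta = src1 bits 9..5."""
    if src1 >> 10:
        # The text says the result is invalid if any of the 22 MSBs is non-zero.
        raise ValueError("only the 10 LSBs of src1 may be non-zero")
    return clr(src2, (src1 >> 5) & 0x1F, src1 & 0x1F)


# csta = 15, cstb = 23, as in the diagram above (all-ones source assumed).
print(f"{clr(0xFFFFFFFF, 15, 23):08X}h")
print(f"{clr_reg(0xFFFFFFFF, (15 << 5) | 23):08X}h")
```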
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Description Compares src1 to src2. If src1 equals src2, then 1 is written to dst; otherwise, 0 is written
to dst.
Execution
if (cond) {
if (src1 == src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
CMPEQ .L1X A1,B1,A2
Example 2
CMPEQ .L1 Ch,A1,A2
Example 3
CMPEQ .L2X A1,B3:B2,B1
B3:B2 0000 00FFh F23A 3789h B3:B2 0000 00FFh F23A 3789h
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 1 0 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Performs equality comparisons on packed 16-bit data. Each 16-bit value in src1 is
compared against the corresponding 16-bit value in src2, returning either a 1 if equal or
a 0 if not equal. The equality results are packed into the two least-significant bits of dst.
The result for the lower pair of values is placed in bit 0, and the result for the upper pair
of values is placed in bit 1. The remaining bits of dst are cleared to 0.
31 16 15 0
a_hi a_lo ← src1
CMPEQ2
↓↑ ↓↑
31 16 15 0
b_hi b_lo ← src2
a_lo = = b_lo
a_hi = = b_hi
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 = = dst
31 2 1 0
Execution
if (cond) {
if (lsb16(src1) == lsb16(src2)), 1 → dst 0
else 0 → dst 0;
if (msb16(src1) == msb16(src2)), 1 → dst 1
else 0 → dst 1
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
CMPEQ2 .S1 A3,A4,A5
Example 2
CMPEQ2 .S2 B2,B8,B15
Example 3
CMPEQ2 .S2 B2,B8,B15
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 1 0 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Performs equality comparisons on packed 8-bit data. Each 8-bit value in src1 is
compared against the corresponding 8-bit value in src2, returning either a 1 if equal or a
0 if not equal. The equality comparison results are packed into the four least-significant
bits of dst.
The 8-bit values in each input are numbered from 0 to 3, starting with the
least-significant byte, then working towards the most-significant byte. The comparison
results for byte 0 are written to bit 0 of the result. Likewise the results for byte 1 to 3 are
written to bits 1 to 3 of the result, respectively, as shown in the diagram below. The
remaining bits of dst are cleared to 0.
31 24 23 16 15 8 7 0
sa_3 sa_2 sa_1 sa_0 ← src1
CMPEQ4
↓↑ ↓↑ ↓↑ ↓↑
31 24 23 16 15 8 7 0
sb_3 sb_2 sb_1 sb_0 ← src2
sa_0 = = sb_0
sa_1 = = sb_1
sa_2 = = sb_2
sa_3 = = sb_3
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 = = = = dst
31 4 3 2 1 0
Execution
if (cond) {
if (sbyte0(src1) == sbyte0(src2)), 1 → dst 0
else 0 → dst 0;
if (sbyte1(src1) == sbyte1(src2)), 1 → dst 1
else 0 → dst 1;
if (sbyte2(src1) == sbyte2(src2)), 1 → dst 2
else 0 → dst 2;
if (sbyte3(src1) == sbyte3(src2)), 1 → dst 3
else 0 → dst 3
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
CMPEQ4 .S1 A3,A4,A5
A3 02 3A 4E 1Ch A3 02 3A 4E 1Ch
A4 02 B8 4E 76h A4 02 B8 4E 76h
Example 2
CMPEQ4 .S2 B2,B8,B13
B2 F2 3A 37 89h B2 F2 3A 37 89h
B8 04 B8 37 89h B8 04 B8 37 89h
B13 xxxx xxxxh B13 0000 0003h false, false, true, true
Example 3
CMPEQ4 .S2 B2,B8,B13
B2 01 B6 24 51h B2 01 B6 24 51h
B8 05 B6 24 51h B8 05 B6 24 51h
B13 xxxx xxxxh B13 0000 0007h false, true, true, true
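The three CMPEQ4 examples can be reproduced with a small Python model of the Execution pseudo-code. The function name `cmpeq4` is chosen here for illustration; the operand values are the register contents listed in the examples above.

```python
def cmpeq4(src1, src2):
    """Compare the four bytes of src1 and src2; bit i of dst is 1 if byte i matches."""
    dst = 0
    for i in range(4):
        a = (src1 >> (8 * i)) & 0xFF
        b = (src2 >> (8 * i)) & 0xFF
        if a == b:
            dst |= 1 << i
    return dst


examples = [
    (0x023A4E1C, 0x02B84E76),  # Example 1: A3, A4
    (0xF23A3789, 0x04B83789),  # Example 2: B2, B8
    (0x01B62451, 0x05B62451),  # Example 3: B2, B8
]
for src1, src2 in examples:
    print(f"{cmpeq4(src1, src2):08X}h")
```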
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 0 0 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Compares src1 to src2. If src1 equals src2, then 1 is written to dst; otherwise, 0 is written
to dst.
Special cases of inputs:
NOTE:
1. In the case of NaN compared with itself, the result is false.
2. No configuration bits other than those in the preceding table are set,
except the NaNn and DENn bits when appropriate.
Execution
if (cond) {
if (src1 == src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src1_l, src2_l src1_h, src2_h
Written dst
Unit in use .S .S
Delay Slots 1
A1:A0 4021 3333h 3333 3333h A1:A0 4021 3333h 3333 3333h 8.6
A3:A2 C004 0000h 0000 0000h A3:A2 C004 0000h 0000 0000h -2.5
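The register-pair values in the example, and the NaN rule in the NOTE, can be explored with the Python sketch below. It relies on Python floats being IEEE 754 doubles; the function names are hypothetical, and denormal handling and status bits (NaNn, DENn) are not modelled.

```python
import struct


def pair_to_double(hi, lo):
    """Build a float from a register pair (hi = odd register, lo = even register)."""
    raw = struct.pack(">II", hi & 0xFFFFFFFF, lo & 0xFFFFFFFF)
    return struct.unpack(">d", raw)[0]


def cmpeqdp(src1, src2):
    """src1 and src2 are (hi, lo) tuples; returns 1 if the doubles are equal, else 0."""
    return 1 if pair_to_double(*src1) == pair_to_double(*src2) else 0


a1_a0 = (0x40213333, 0x33333333)  # 8.6, from the example above
a3_a2 = (0xC0040000, 0x00000000)  # -2.5, from the example above
quiet_nan = (0x7FF80000, 0x00000000)  # assumed NaN bit pattern

print(pair_to_double(*a1_a0), pair_to_double(*a3_a2), cmpeqdp(a1_a0, a3_a2))
print(cmpeqdp(quiet_nan, quiet_nan))
```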
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 0 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Compares src1 to src2. If src1 equals src2, then 1 is written to dst; otherwise, 0 is written
to dst.
Special cases of inputs:
NOTE:
1. In the case of NaN compared with itself, the result is false.
2. No configuration bits other than those in the preceding table are set,
except the NaNn and DENn bits when appropriate.
Execution
if (cond) {
if (src1 == src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Description Performs a signed comparison of src1 to src2. If src1 is greater than src2, then a 1 is
written to dst; otherwise, a 0 is written to dst.
NOTE: The CMPGT instruction allows using a 5-bit constant as src1. If src2 is a
5-bit constant, as in
CMPGT .L1 A4, 5, A0
then the assembler converts this instruction to
CMPLT .L1 5, A4, A0
These two instructions are equivalent, with the second instruction using
the conventional operand types for src1 and src2.
Similarly, the CMPGT instruction allows a cross path operand to be used
as src2. If src1 is a cross path operand, as in
CMPGT .L1x B4, A5, A0
then the assembler converts this instruction to
CMPLT .L1x A5, B4, A0
In both of these operations the listing file (.lst) will have the first
implementation, and the second implementation will appear in the
debugger.
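The operand swap described in the NOTE rests on the identity that "a is greater than b" is the same test as "b is less than a". A minimal Python sketch, with hypothetical helper names and sample register values, shows the equivalence for signed 32-bit operands.

```python
def to_s32(x):
    """Interpret x as a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def cmpgt(src1, src2):
    """Signed compare: 1 if src1 > src2, else 0."""
    return 1 if to_s32(src1) > to_s32(src2) else 0


def cmplt(src1, src2):
    """Signed compare: 1 if src1 < src2, else 0."""
    return 1 if to_s32(src1) < to_s32(src2) else 0


# Hypothetical A4 contents; the constant 5 moves from src2 to src1 in the swap.
for a4 in (0x00000004, 0x00000005, 0x00000006, 0xFFFFFFFF):
    print(f"A4={a4:08X}h  CMPGT A4,5 -> {cmpgt(a4, 5)}  CMPLT 5,A4 -> {cmplt(5, a4)}")
```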
Execution
if (cond) {
if (src1 > src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
CMPGT .L1X A1,B1,A2
Example 2
CMPGT .L1X A1,B1,A2
Example 3
CMPGT .L1 8,A1,A2
Example 4
CMPGT .L1X A1,B1,A2
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 0 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Performs comparisons for greater than values on signed, packed 16-bit data. Each
signed 16-bit value in src1 is compared against the corresponding signed 16-bit value in
src2, returning a 1 if src1 is greater than src2 or returning a 0 if it is not greater. The
comparison results are packed into the two least-significant bits of dst. The result for the
lower pair of values is placed in bit 0, and the result for the upper pair of values is
placed in bit 1. The remaining bits of dst are cleared to 0.
31 16 15 0
a_hi a_lo ← src1
CMPGT2
↓↑ ↓↑
31 16 15 0
b_hi b_lo ← src2
Execution
if (cond) {
if (lsb16(src1) > lsb16(src2)), 1 → dst 0
else 0 → dst 0;
if (msb16(src1) > msb16(src2)), 1 → dst 1
else 0 → dst 1
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
CMPGT2 .S1 A3,A4,A5
Example 2
CMPGT2 .S2 B2,B8,B15
Example 3
CMPGT2 .S2 B2, B8, B15
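Because CMPGT2 treats each halfword as signed, a halfword such as FFFFh compares as -1 rather than 65535. The Python sketch below models the Execution pseudo-code; the function names and operand values are assumptions for illustration.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def cmpgt2(src1, src2):
    """Bit 0: lower halfwords, bit 1: upper halfwords (signed greater-than)."""
    lo = s16(src1) > s16(src2)
    hi = s16(src1 >> 16) > s16(src2 >> 16)
    return (int(hi) << 1) | int(lo)


# Hypothetical operands: upper halves 0001h vs FFFFh (-1), lower halves FFFFh (-1) vs 0001h.
print(f"{cmpgt2(0x0001FFFF, 0xFFFF0001):08X}h")
```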
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 0 0 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Compares src1 to src2. If src1 is greater than src2, then 1 is written to dst; otherwise, 0
is written to dst.
Special cases of inputs:
NOTE: No configuration bits other than those in the preceding table are set,
except the NaNn and DENn bits when appropriate.
Execution
if (cond) {
if (src1 > src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src1_l, src2_l src1_h, src2_h
Written dst
Unit in use .S .S
Delay Slots 1
A1:A0 4021 3333h 3333 3333h 8.6 A1:A0 4021 3333h 3333 3333h
A3:A2 C004 0000h 0000 0000h -2.5 A3:A2 C004 0000h 0000 0000h
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 0 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Compares src1 to src2. If src1 is greater than src2, then 1 is written to dst; otherwise, 0
is written to dst.
Special cases of inputs:
NOTE: No configuration bits other than those in the preceding table are set,
except the NaNn and DENn bits when appropriate.
Execution
if (cond) {
if (src1 > src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Description Performs an unsigned comparison of src1 to src2. If src1 is greater than src2, then a 1 is
written to dst; otherwise, a 0 is written to dst.
Execution
if (cond) {
if (src1 > src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
CMPGTU .L1 A1,A2,A3
Example 2
CMPGTU .L1 0Ah,A1,A2
Example 3
CMPGTU .L1 0Eh,A3:A2,A4
A3:A2 0000 0000h 0000 000Ah 10 (1) A3:A2 0000 0000h 0000 000Ah
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 0 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Performs comparisons for greater than values on packed 8-bit data. Each unsigned 8-bit
value in src1 is compared against the corresponding unsigned 8-bit value in src2,
returning a 1 if the byte in src1 is greater than the corresponding byte in src2 or a 0 if it is
not greater. The comparison results are packed into the four least-significant bits of dst.
The 8-bit values in each input are numbered from 0 to 3, starting with the
least-significant byte, then working towards the most-significant byte. The comparison
results for byte 0 are written to bit 0 of the result. Likewise, the results for byte 1 to 3 are
written to bits 1 to 3 of the result, respectively, as shown in the diagram below. The
remaining bits of dst are cleared to 0.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ← src1
CMPGTU4
↓↑ ↓↑ ↓↑ ↓↑
31 24 23 16 15 8 7 0
ub_3 ub_2 ub_1 ub_0 ← src2
Execution
if (cond) {
if (ubyte0(src1) > ubyte0(src2)), 1 → dst 0
else 0 → dst 0;
if (ubyte1(src1) > ubyte1(src2)), 1 → dst 1
else 0 → dst 1;
if (ubyte2(src1) > ubyte2(src2)), 1 → dst 2
else 0 → dst 2;
if (ubyte3(src1) > ubyte3(src2)), 1 → dst 3
else 0 → dst 3
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
CMPGTU4 .S1 A3,A4,A5
Example 2
CMPGTU4 .S2 B2,B8,B13
B13 xxxx xxxxh B13 0000 000Eh true, true, true, false
Example 3
CMPGTU4 .S2 B2,B8,B13
B13 xxxx xxxxh B13 0000 0002h false, false, true, false
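The unsigned byte-wise comparison can be modelled in Python as follows. This is a sketch of the Execution pseudo-code; the function name and the operand values are hypothetical and are not the (unlisted) register contents of the examples above.

```python
def cmpgtu4(src1, src2):
    """Bit i of dst is 1 if unsigned byte i of src1 is greater than byte i of src2."""
    dst = 0
    for i in range(4):
        a = (src1 >> (8 * i)) & 0xFF
        b = (src2 >> (8 * i)) & 0xFF
        if a > b:
            dst |= 1 << i
    return dst


# Byte 3: FFh (255) > 01h; bytes 2 and 1 are equal; byte 0: 04h is not > 05h.
print(f"{cmpgtu4(0xFF020304, 0x01020305):08X}h")
```

Note that FFh counts as 255 here; a signed comparison would treat it as -1 and give a different result for byte 3.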
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Description Performs a signed comparison of src1 to src2. If src1 is less than src2, then 1 is written
to dst; otherwise, 0 is written to dst.
NOTE: The CMPLT instruction allows using a 5-bit constant as src1. If src2 is a
5-bit constant, as in
CMPLT .L1 A4, 5, A0
then the assembler converts this instruction to
CMPGT .L1 5, A4, A0
These two instructions are equivalent, with the second instruction using
the conventional operand types for src1 and src2.
Similarly, the CMPLT instruction allows a cross path operand to be used
as src2. If src1 is a cross path operand, as in
CMPLT .L1x B4, A5, A0
then the assembler converts this instruction to
CMPGT .L1x A5, B4, A0
In both of these operations the listing file (.lst) will have the first
implementation, and the second implementation will appear in the
debugger.
Execution
if (cond) {
if (src1 < src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
CMPLT .L1 A1,A2,A3
Example 2
CMPLT .L1 A1,A2,A3
Example 3
CMPLT .L1 9,A1,A2
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 0 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Execution
if (cond) {
if (lsb16(src2) < lsb16(src1)), 1 → dst 0
else 0 → dst 0;
if (msb16(src2) < msb16(src1)), 1 → dst 1
else 0 → dst 1
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
CMPLT2 .S1 A4,A3,A5; assembler treats as CMPGT2 A3,A4,A5
Example 2
CMPLT2 .S2 B8,B2,B15; assembler treats as CMPGT2 B2,B8,B15
Example 3
CMPLT2 .S2 B8,B2,B12; assembler treats as CMPGT2 B2,B8,B12
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 0 1 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Compares src1 to src2. If src1 is less than src2, then 1 is written to dst; otherwise, 0 is
written to dst.
Special cases of inputs:
NOTE: No configuration bits other than those in the preceding table are set,
except the NaNn and DENn bits when appropriate.
Execution
if (cond) {
if (src1 < src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src1_l, src2_l src1_h, src2_h
Written dst
Unit in use .S .S
Delay Slots 1
A1:A0 4021 3333h 3333 3333h 8.6 A1:A0 4021 3333h 3333 3333h
B3:B2 C004 0000h 0000 0000h -2.5 B3:B2 C004 0000h 0000 0000h
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 1 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Compares src1 to src2. If src1 is less than src2, then 1 is written to dst; otherwise, 0 is
written to dst.
Special cases of inputs:
NOTE: No configuration bits other than those in the preceding table are set,
except the NaNn and DENn bits when appropriate.
Execution
if (cond) {
if (src1 < src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Description Performs an unsigned comparison of src1 to src2. If src1 is less than src2, then 1 is
written to dst; otherwise, 0 is written to dst.
Execution
if (cond) {
if (src1 < src2), 1 → dst
else 0 → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
CMPLTU .L1 A1,A2,A3
Example 2
CMPLTU .L1 14,A1,A2
Example 3
CMPLTU .L1 A1,A5:A4,A2
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 0 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Execution
if (cond) {
if (ubyte0(src2) < ubyte0(src1)), 1 → dst 0
else 0 → dst 0;
if (ubyte1(src2) < ubyte1(src1)), 1 → dst 1
else 0 → dst 1;
if (ubyte2(src2) < ubyte2(src1)), 1 → dst 2
else 0 → dst 2;
if (ubyte3(src2) < ubyte3(src1)), 1 → dst 3
else 0 → dst 3
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
CMPLTU4 .S1 A4,A3,A5; assembler treats as CMPGTU4 A3,A4,A5
Example 2
CMPLTU4 .S2 B8,B2,B13; assembler treats as CMPGTU4 B2,B8,B13
Example 3
CMPLTU4 .S2 B8,B2,B13; assembler treats as CMPGTU4 B2,B8,B13
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 0 1 0 1 0 1 1 0 0 s p
5 5 5 1 1 1
Description Returns two dot-products between two pairs of signed, packed 16-bit values. The values
in src1 and src2 are treated as signed, packed 16-bit quantities. The signed results are
written to a 64-bit register pair.
The product of the lower halfwords of src1 and src2 is subtracted from the product of the
upper halfwords of src1 and src2. The result is written to dst_o.
The product of the upper halfword of src1 and the lower halfword of src2 is added to the
product of the lower halfword of src1 and the upper halfword of src2. The result is written
to dst_e.
If the result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written one
cycle after the result is written to dst_e.
This instruction executes unconditionally.
NOTE: In the overflow case, where all four halfwords in src1 and src2 are
8000h, the saturation value 7FFF FFFFh is written into the 32-bit dst_e
register.
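The two dot-products described above can be modeled in C as a rough sketch. The names cmpy_model, cmpy_result, and s16 are hypothetical, and status bits (SSR, CSR.SAT) are not modeled. Only dst_e needs saturation, which matches the overflow case in the note.

```c
#include <stdint.h>

typedef struct { int32_t dst_o; int32_t dst_e; } cmpy_result;  /* hypothetical */

/* Interpret the low 16 bits of a word as a signed halfword. */
static int32_t s16(uint32_t halfword)
{
    int32_t v = (int32_t)(halfword & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

cmpy_result cmpy_model(uint32_t src1, uint32_t src2)
{
    int64_t a_hi = s16(src1 >> 16), a_lo = s16(src1);
    int64_t b_hi = s16(src2 >> 16), b_lo = s16(src2);

    int64_t odd  = a_hi * b_hi - a_lo * b_lo;   /* written to dst_o */
    int64_t even = a_hi * b_lo + a_lo * b_hi;   /* written to dst_e */

    /* Only the all-8000h case exceeds 32 bits; saturate to 7FFF FFFFh. */
    if (even > INT32_MAX)
        even = INT32_MAX;

    cmpy_result r = { (int32_t)odd, (int32_t)even };
    return r;
}
```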
Execution
Delay Slots 3
Examples Example 1
CMPY .M1 A0,A1,A3:A2
(1)
Before instruction 4 cycles after instruction
Example 2
CMPY .M2X B0,A1,B3:B2
(1)
Before instruction 4 cycles after instruction
Example 3
CMPY .M1 A0,A1,A3:A2
(1)
Before instruction 4 cycles after instruction
CMPYR Complex Multiply Two Pairs, Signed, Packed 16-Bit With Rounding
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 0 1 0 1 1 1 1 0 0 s p
5 5 5 1 1 1
Description Performs two dot-products between two pairs of signed, packed 16-bit values. The
values in src1 and src2 are treated as signed, packed 16-bit quantities. The signed
results are rounded with saturation, shifted, packed and written to a 32-bit register.
The product of the lower halfwords of src1 and src2 is subtracted from the product of the
upper halfwords of src1 and src2. The result is rounded by adding 2^15 to it. The 16
most-significant bits of the rounded value are written to the upper half of dst.
The product of the upper halfword of src1 and the lower halfword of src2 is added to the
product of the lower halfword of src1 and the upper halfword of src2. The result is
rounded by adding 2^15 to it. The 16 most-significant bits of the rounded value are written
to the lower half of dst.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written one
cycle after the result is written to dst.
This instruction executes unconditionally.
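The round-and-pack step described above can be sketched in C. This is an illustrative model under stated assumptions: cmpyr_model and its helpers are hypothetical names, the rounding constant is 8000h (2 to the power 15), and status bits are not modeled.

```c
#include <stdint.h>

static int32_t s16(uint32_t halfword)
{
    int32_t v = (int32_t)(halfword & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* Saturate a wide intermediate to the signed 32-bit range. */
static int64_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return v;
}

/* Upper 16 bits of a value already within the signed 32-bit range. */
static uint32_t top16(int64_t v)
{
    return ((uint32_t)v >> 16) & 0xFFFFu;
}

uint32_t cmpyr_model(uint32_t src1, uint32_t src2)
{
    int64_t a_hi = s16(src1 >> 16), a_lo = s16(src1);
    int64_t b_hi = s16(src2 >> 16), b_lo = s16(src2);

    int64_t upper = a_hi * b_hi - a_lo * b_lo;  /* goes to upper half of dst */
    int64_t lower = a_hi * b_lo + a_lo * b_hi;  /* goes to lower half of dst */

    return (top16(sat32(upper + 0x8000)) << 16) | top16(sat32(lower + 0x8000));
}
```

With both operands equal to 8000 8000h, the lower sum reaches 2 to the power 31, so the rounded value saturates and the lower half of dst becomes 7FFFh.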
Execution
Delay Slots 3
Examples Example 1
CMPYR .M1 A0,A1,A2
Before instruction 4 cycles after instruction (1)
A1 0900 0200h
(1) CSR.SAT and SSR.M1 unchanged by operation
Example 2
CMPYR .M2X B0,A1,B2
Before instruction 4 cycles after instruction (1)
A1 7FFF 8000h
(1) CSR.SAT and SSR.M2 unchanged by operation
Example 3
CMPYR .M1 A0,A1,A2
A1 8000 8000h
Example 4
CMPYR .M2 B0,B1,B2
B1 8000 8001h
CMPYR1 Complex Multiply Two Pairs, Signed, Packed 16-Bit With Rounding
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 0 1 1 0 0 1 1 0 0 s p
5 5 5 1 1 1
Description Performs two dot-products between two pairs of signed, packed 16-bit values. The
values in src1 and src2 are treated as signed, packed 16-bit quantities. The signed
results are rounded with saturation to 31 bits, shifted, packed and written to a 32-bit
register.
The product of the lower halfwords of src1 and src2 is subtracted from the product of the
upper halfwords of src1 and src2. The intermediate result is rounded by adding 2^14 to it.
This value is shifted left by 1 with saturation. The 16 most-significant bits of the shifted
value are written to the upper half of dst.
The product of the upper halfword of src1 and the lower halfword of src2 is added to the
product of the lower halfword of src1 and the upper halfword of src2. The intermediate
result is rounded by adding 2^14 to it. This value is shifted left by 1 with saturation. The 16
most-significant bits of the shifted value are written to the lower half of dst.
If either result saturates in the rounding or shifting process, the M1 or M2 bit in SSR and
the SAT bit in CSR are written one cycle after the results are written to dst.
This instruction executes unconditionally.
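The difference from CMPYR is the smaller rounding constant followed by a saturating left shift. A minimal C sketch under the same assumptions (hypothetical names, status bits not modeled) follows; it doubles the rounded value in 64-bit arithmetic and then saturates to 32 bits, which yields the same upper 16 bits as saturating to 31 bits before the shift.

```c
#include <stdint.h>

static int32_t s16(uint32_t halfword)
{
    int32_t v = (int32_t)(halfword & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

static int64_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return v;
}

static uint32_t top16(int64_t v)
{
    return ((uint32_t)v >> 16) & 0xFFFFu;
}

/* Round by adding 4000h, shift left by 1 with saturation, keep the top 16 bits. */
static uint32_t round_shift_pack(int64_t sum)
{
    return top16(sat32((sum + 0x4000) * 2));
}

uint32_t cmpyr1_model(uint32_t src1, uint32_t src2)
{
    int64_t a_hi = s16(src1 >> 16), a_lo = s16(src1);
    int64_t b_hi = s16(src2 >> 16), b_lo = s16(src2);

    int64_t upper = a_hi * b_hi - a_lo * b_lo;
    int64_t lower = a_hi * b_lo + a_lo * b_hi;

    return (round_shift_pack(upper) << 16) | round_shift_pack(lower);
}
```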
Execution
Delay Slots 3
Examples Example 1
CMPYR1 .M1 A0,A1,A2
Before instruction 4 cycles after instruction (1)
A1 0900 0200h
(1) CSR.SAT and SSR.M1 unchanged by operation
Example 2
CMPYR1 .M2X B0,A1,B2
A1 7FFF 8000h
Example 3
CMPYR1 .M1 A0,A1,A2
A1 8000 8000h
Example 4
CMPYR1 .M2 B0,B1,B2
B1 8000 8001h
DDOTP4 Double Dot Product, Signed, Packed 16-Bit and Signed, Packed 8-Bit
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 1 1 0 0 0 1 1 0 0 s p
5 5 5 1 1 1
dst_o dst_e
d1 x c3 + d0 x c2 d1 x c1 + d0 x c0
Execution
Delay Slots 3
Examples Example 1
DDOTP4 .M1 A4,A5,A9:A8
Example 2
DDOTP4 .M1X A4,B5,A9:A8
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 1 0 1 1 1 1 1 0 0 s p
5 5 5 1 1 1
Description Returns two dot-products between two pairs of signed, packed 16-bit values. The values
in src1_e, src1_o, and src2 are treated as signed, packed 16-bit quantities. The signed
results are written to a 64-bit register pair.
The product of the lower halfwords of src1_o and src2 is added to the product of the
upper halfwords of src1_o and src2. The result is then written to dst_o.
The product of the upper halfword of src2 and the lower halfword of src1_o is added to
the product of the lower halfword of src2 and the upper halfword of src1_e. The result is
then written to dst_e.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written one
cycle after the results are written to dst_o:dst_e.
This instruction executes unconditionally.
src1_o src1_e src2
d3 d2 d1 d0 c1 c0
MSB16 LSB16 MSB16 LSB16 MSB16 LSB16
32 32
dst_o dst_e
d3 x c1 + d2 x c0 d2 x c1 + d1 x c0
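The halfword labels in the diagram above (d3, d2, d1, c1, c0) map onto C as in the following sketch. The function and type names are hypothetical, and only the arithmetic is modeled, not the SSR or CSR status bits.

```c
#include <stdint.h>

typedef struct { int32_t dst_o; int32_t dst_e; } ddotp_result;  /* hypothetical */

static int32_t s16(uint32_t halfword)
{
    int32_t v = (int32_t)(halfword & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

ddotp_result ddotph2_model(uint32_t src1_o, uint32_t src1_e, uint32_t src2)
{
    int64_t d3 = s16(src1_o >> 16);   /* MSB16 of src1_o */
    int64_t d2 = s16(src1_o);         /* LSB16 of src1_o */
    int64_t d1 = s16(src1_e >> 16);   /* MSB16 of src1_e */
    int64_t c1 = s16(src2 >> 16);
    int64_t c0 = s16(src2);

    ddotp_result r;
    r.dst_o = sat32(d3 * c1 + d2 * c0);
    r.dst_e = sat32(d2 * c1 + d1 * c0);
    return r;
}
```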
Execution
Delay Slots 3
Examples Example 1
DDOTPH2 .M1 A5:A4,A6,A9:A8
(1)
Before instruction 4 cycles after instruction
Example 2
DDOTPH2 .M1 A5:A4,A6,A9:A8
A6 8000 8000h
Example 3
DDOTPH2 .M2X B5:B4,A6,B9:B8
Before instruction 4 cycles after instruction (1)
A6 340B F73Bh
(1) CSR.SAT and SSR.M2 unchanged by operation
DDOTPH2R Double Dot Product With Rounding, Two Pairs, Signed, Packed 16-Bit
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 1 0 1 0 1 1 1 0 0 s p
5 5 5 1 1 1
Description Returns two dot-products between two pairs of signed, packed 16-bit values. The values
in src1_e, src1_o, and src2 are treated as signed, packed 16-bit quantities. The signed
results are rounded, shifted right by 16 and packed into a 32-bit register.
The product of the lower halfwords of src1_o and src2 is added to the product of the
upper halfwords of src1_o and src2. The result is rounded by adding 2^15 to it and
saturated if appropriate. The 16 most-significant bits of the result are written to the 16
most-significant bits of dst.
The product of the upper halfword of src2 and the lower halfword of src1_o is added to
the product of the lower halfword of src2 and the upper halfword of src1_e. The result is
rounded by adding 2^15 to it and saturated if appropriate. The 16 most-significant bits of
the result are written to the 16 least-significant bits of dst.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written one
cycle after the results are written to dst.
This instruction executes unconditionally.
Execution
msb16(sat((msb16(src1_o) × msb16(src2)) +
(lsb16(src1_o) × lsb16(src2)) + 0000 8000h)) → msb16(dst)
msb16(sat((lsb16(src1_o) × msb16(src2)) +
(msb16(src1_e) × lsb16(src2)) + 0000 8000h)) → lsb16(dst)
Delay Slots 3
Examples Example 1
DDOTPH2R .M1 A5:A4,A6,A8
Before instruction 4 cycles after instruction (1)
A5 BBAE D169h
A6 340B F73Bh
(1) CSR.SAT and SSR.M1 unchanged by operation
Example 2
DDOTPH2R .M1 A5:A4,A6,A8
A5 1234 8000h
A6 8000 8001h
Example 3
DDOTPH2R .M2 B5:B4,B6,B8
B5 8000 8000h
B6 8000 8001h
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 1 0 1 1 0 1 1 0 0 s p
5 5 5 1 1 1
Description Returns two dot-products between two pairs of signed, packed 16-bit values. The values
in src1_e, src1_o, and src2 are treated as signed, packed 16-bit quantities. The signed
results are written to a 64-bit register pair.
The product of the lower halfwords of src1_e and src2 is added to the product of the
upper halfwords of src1_e and src2. The result is then written to dst_e.
The product of the upper halfword of src2 and the lower halfword of src1_o is added to
the product of the lower halfword of src2 and the upper halfword of src1_e. The result is
then written to dst_o.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written one
cycle after the results are written to dst_o:dst_e.
src1_o src1_e src2
d3 d2 d1 d0 c1 c0
MSB16 LSB16 MSB16 LSB16 MSB16 LSB16
32 32
dst_o dst_e
d2 x c1 + d1 x c0 d1 x c1 + d0 x c0
Execution
Delay Slots 3
Examples Example 1
DDOTPL2 .M1 A5:A4,A6,A9:A8
(1)
Before instruction 4 cycles after instruction
Example 2
DDOTPL2 .M1 A5:A4,A6,A9:A8
Before instruction 4 cycles after instruction (1)
A6 340B F73Bh
(1) CSR.SAT and SSR.M1 unchanged by operation
Example 3
DDOTPL2 .M1 A5:A4,A6,A9:A8
A6 8000 8000h
DDOTPL2R Double Dot Product With Rounding, Two Pairs, Signed Packed 16-Bit
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 1 0 1 0 0 1 1 0 0 s p
5 5 5 1 1 1
Description Returns two dot-products between two pairs of signed, packed 16-bit values. The values
in src1_e, src1_o, and src2 are treated as signed, packed 16-bit quantities. The signed
results are rounded, shifted right by 16 and packed into a 32-bit register.
The product of the lower halfwords of src1_e and src2 is added to the product of the
upper halfwords of src1_e and src2. The result is rounded by adding 2^15 to it and
saturated if appropriate. The 16 most-significant bits of the result are written to the 16
least-significant bits of dst.
The product of the upper halfword of src2 and the lower halfword of src1_o is added to
the product of the lower halfword of src2 and the upper halfword of src1_e. The result is
rounded by adding 2^15 to it and saturated if appropriate. The 16 most-significant bits of
the result are written to the 16 most-significant bits of dst.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written one
cycle after the results are written to dst.
Execution
msb16(sat((msb16(src1_e) × msb16(src2)) +
(lsb16(src1_e) × lsb16(src2)) + 0000 8000h)) → lsb16(dst)
msb16(sat((lsb16(src1_o) × msb16(src2)) +
(msb16(src1_e) × lsb16(src2)) + 0000 8000h)) → msb16(dst)
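The execution block above translates almost line for line into C. The sketch below is illustrative only: ddotpl2r_model and its helpers are hypothetical names, and the status-bit updates are omitted.

```c
#include <stdint.h>

static int32_t s16(uint32_t halfword)
{
    int32_t v = (int32_t)(halfword & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

static int64_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return v;
}

/* msb16() of a value that already fits in 32 bits. */
static uint32_t msb16_of(int64_t v)
{
    return ((uint32_t)v >> 16) & 0xFFFFu;
}

uint32_t ddotpl2r_model(uint32_t src1_o, uint32_t src1_e, uint32_t src2)
{
    int64_t d2 = s16(src1_o);         /* lsb16(src1_o) */
    int64_t d1 = s16(src1_e >> 16);   /* msb16(src1_e) */
    int64_t d0 = s16(src1_e);         /* lsb16(src1_e) */
    int64_t c1 = s16(src2 >> 16);
    int64_t c0 = s16(src2);

    uint32_t lo = msb16_of(sat32(d1 * c1 + d0 * c0 + 0x8000));  /* lsb16(dst) */
    uint32_t hi = msb16_of(sat32(d2 * c1 + d1 * c0 + 0x8000));  /* msb16(dst) */
    return (hi << 16) | lo;
}
```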
Delay Slots 3
Examples Example 1
DDOTPL2R .M1 A5:A4,A6,A8
Before instruction 4 cycles after instruction (1)
A5 BBAE D169h
A6 340B F73Bh
(1) CSR.SAT and SSR.M1 unchanged by operation
Example 2
DDOTPL2R .M1 A5:A4,A6,A8
A5 1234 8000h
A6 8000 8001h
Example 3
DDOTPL2R .M2 B5:B4,B6,B8
B5 8000 8000h
B6 8000 8001h
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 1 1 1 0 1 x 0 0 0 0 1 1 1 1 0 0 s p
3 1 5 5 1 1 1
Description Performs a deinterleave and pack operation on the bits in src2. The odd and even bits of
src2 are extracted into two separate, 16-bit quantities. These 16-bit quantities are then
packed such that the even bits are placed in the lower halfword, and the odd bits are
placed in the upper halfword.
As a result, bits 0, 2, 4, ... , 28, 30 of src2 are placed in bits 0, 1, 2, ... , 14, 15 of dst.
Likewise, bits 1, 3, 5, ... , 29, 31 of src2 are placed in bits 16, 17, 18, ... , 30, 31 of dst.
31 0
aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP ← src2
DEAL
↓ ↓
31 0
abcd efgh ijkl mnop ABCD EFGH IJKL MNOP ← dst
NOTE: The DEAL instruction is the exact inverse of the SHFL instruction
(see SHFL).
Execution
if (cond) {
src2 31,29,27...1 → dst 31,30,29...16
src2 30,28,26...0 → dst 15,14,13...0
}
else nop
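The bit movement in the execution block above can be expressed as a short loop. The following C sketch uses the hypothetical name deal_model and models only the data path, not the condition field.

```c
#include <stdint.h>

/* Even-numbered bits of src2 are packed into dst bits 15-0 and
 * odd-numbered bits into dst bits 31-16, preserving their order. */
uint32_t deal_model(uint32_t src2)
{
    uint32_t dst = 0;
    for (int i = 0; i < 16; i++) {
        uint32_t even_bit = (src2 >> (2 * i)) & 1u;
        uint32_t odd_bit  = (src2 >> (2 * i + 1)) & 1u;
        dst |= even_bit << i;
        dst |= odd_bit << (16 + i);
    }
    return dst;
}
```

An input of AAAA AAAAh has every odd bit set, so the result is FFFF 0000h; 5555 5555h gives 0000 FFFFh.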
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst
Unit in use .M
Delay Slots 1
Syntax DINT
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 p
1
Description Disables interrupts in the current cycle, copies the contents of the GIE bit in TSR into the
SGIE bit in TSR, and clears the GIE bit in both TSR and CSR. The PGIE bit in CSR is
unchanged.
The CPU will not service a maskable interrupt in the cycle immediately following the
DINT instruction. This behavior differs from writes to GIE using the MVC instruction. See
section 5.2 for details.
The DINT instruction cannot be placed in parallel with the following instructions: MVC
reg, TSR; MVC reg, CSR; B IRP; B NRP; NOP n; RINT; SPKERNEL; SPKERNELR;
SPLOOP; SPLOOPD; SPLOOPW; SPMASK; or SPMASKR.
This instruction executes unconditionally.
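NOTE: The use of the DINT and RINT instructions in a nested manner, like the
following code:
DINT
DINT
RINT
RINT
leaves interrupts disabled. The first DINT leaves TSR.GIE cleared to 0, so the
second DINT leaves TSR.SGIE cleared to 0. The RINT instructions, therefore,
copy zero to TSR.GIE (leaving interrupts disabled).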
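The effect of nesting can be traced with a small state model. In the C sketch below, dint_model follows the DINT description above; rint_model is an assumption, taken to copy SGIE back into GIE and clear SGIE. The type and function names are hypothetical, and the mirrored GIE bit in CSR is not modeled.

```c
/* Hypothetical container for the two TSR bits of interest. */
typedef struct { unsigned gie; unsigned sgie; } tsr_bits;

/* DINT: save GIE into SGIE, then clear GIE. */
void dint_model(tsr_bits *t)
{
    t->sgie = t->gie;
    t->gie = 0u;
}

/* RINT (assumed behavior): restore GIE from SGIE, then clear SGIE. */
void rint_model(tsr_bits *t)
{
    t->gie = t->sgie;
    t->sgie = 0u;
}
```

Starting with GIE = 1, a single DINT/RINT pair restores GIE to 1, whereas the nested sequence DINT, DINT, RINT, RINT ends with GIE = 0 because the second DINT overwrites the saved value.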
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 1 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is written to the odd register of the register pair specified by dst and
the src2 operand is written to the even register of the register pair specified by dst.
Execution
if (cond) {
src2 → dst_e
src1 → dst_o
}
else nop
Delay Slots 0
Examples Example 1
DMV .S1 A0,A1,A3:A2
Example 2
DMV .S2X B0,A1,B3:B2
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 op 1 1 0 0 s p
3 1 5 5 5 1 5 1 1
Description Returns the dot-product between two pairs of signed, packed 16-bit values. The values
in src1 and src2 are treated as signed, packed 16-bit quantities. The signed result is
written either to a single 32-bit register, or sign-extended into a 64-bit register pair.
The product of the lower halfwords of src1 and src2 is added to the product of the upper
halfwords of src1 and src2. The result is then written to the dst.
If the result is sign-extended into a 64-bit register pair, the upper word of the register pair
always contains either all 0s or all 1s, depending on whether the result is positive or
negative, respectively.
31            16 15             0
      a_hi            a_lo          ← src1
              DOTP2
      b_hi            b_lo          ← src2
                =
63            32 31                              0
    0 or F         a_hi × b_hi + a_lo × b_lo     ← dst_o:dst_e
The 32-bit result version returns the same results that the 64-bit result version does in
the lower 32 bits. The upper 32-bits are discarded.
31            16 15             0
      a_hi            a_lo          ← src1
              DOTP2
      b_hi            b_lo          ← src2
                =
31                              0
    a_hi × b_hi + a_lo × b_lo       ← dst
NOTE: In the overflow case, where all four halfwords in src1 and src2 are
8000h, the value 8000 0000h is written into the 32-bit dst and
0000 0000 8000 0000h is written into the 64-bit dst.
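Both result widths, including the overflow case in the note, can be sketched in C. The function names are hypothetical. The 64-bit form keeps the full sum; the 32-bit form keeps only its low word, so the all-8000h case yields 8000 0000h.

```c
#include <stdint.h>

static int32_t s16(uint32_t halfword)
{
    int32_t v = (int32_t)(halfword & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* 64-bit result form: the sum, sign-extended into dst_o:dst_e. */
int64_t dotp2_model64(uint32_t src1, uint32_t src2)
{
    int64_t a_hi = s16(src1 >> 16), a_lo = s16(src1);
    int64_t b_hi = s16(src2 >> 16), b_lo = s16(src2);
    return a_hi * b_hi + a_lo * b_lo;
}

/* 32-bit result form: the lower 32 bits of the same sum. */
uint32_t dotp2_model32(uint32_t src1, uint32_t src2)
{
    return (uint32_t)dotp2_model64(src1, src2);
}
```

As a check against the results listed in Examples 2 and 4, the input words 6A32 1193h with B174 6CA4h and 1234 3497h with 21FF 50A7h (inputs assumed here, since the register tables are not reproduced) give -421,529,900 and 318,526,541 respectively.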
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
DOTP2 .M1 A5,A6,A8
Example 2
DOTP2 .M1 A5,A6,A9:A8
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 FFFF FFFFh E6DF F6D4h
-421,529,900
Example 3
DOTP2 .M2 B2,B5,B8
Example 4
DOTP2 .M2 B2,B5,B9:B8
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 0000 0000h 12FC 544Dh
318,526,541
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 0 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Returns the dot-product between two pairs of signed, packed 16-bit values where the
second product is negated. The values in src1 and src2 are treated as signed, packed
16-bit quantities. The signed result is written to a single 32-bit register.
The product of the lower halfwords of src1 and src2 is subtracted from the product of the
upper halfwords of src1 and src2. The result is then written to dst.
31            16 15             0
      a_hi            a_lo          ← src1
              DOTPN2
      b_hi            b_lo          ← src2
                =
31                              0
    a_hi × b_hi - a_lo × b_lo       ← dst
NOTE: Unlike DOTP2, no overflow case exists for this instruction.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
DOTPN2 .M1 A5,A6,A8
Example 2
DOTPN2 .M2 B2,B5,B8
DOTPNRSU2 Dot Product With Negate, Shift and Round, Signed by Unsigned, Packed 16-Bit
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 1 1 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Returns the dot-product between two pairs of packed 16-bit values, where the second
product is negated. This instruction takes the result of the dot-product and performs an
additional round and shift step. The values in src1 are treated as signed, packed 16-bit
quantities; whereas, the values in src2 are treated as unsigned, packed 16-bit quantities.
The results are written to dst.
The product of the lower halfwords of src1 and src2 is subtracted from the product of the
upper halfwords of src1 and src2. The value 2^15 is then added to this sum, producing an
intermediate 33-bit result. The intermediate result is signed shifted right by 16, producing
a rounded, shifted result that is sign extended and placed in dst.
The intermediate results of the DOTPNRSU2 instruction are maintained to a 33-bit
precision, ensuring that no overflow may occur during the subtracting and rounding
steps.
31            16 15             0
      sa_hi           sa_lo         ← src1
            DOTPNRSU2
      ub_hi           ub_lo         ← src2
                =
31                                                          0
    (((sa_hi × ub_hi) - (sa_lo × ub_lo)) + 8000h) >> 16     ← dst
Execution
if (cond) {
int33 = (smsb16(src1) × umsb16(src2)) -
(slsb16(src1) × ulsb16(src2)) + 8000h;
int33 >> 16 → dst
}
else nop
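The need for a 33-bit intermediate is easy to see in a C model. The sketch below uses hypothetical names and a 64-bit intermediate in place of int33. The signed right shift is written as a floor division so that the result does not depend on how a given compiler shifts negative values.

```c
#include <stdint.h>

static int32_t s16(uint32_t halfword)
{
    int32_t v = (int32_t)(halfword & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* Signed shift right by 16, expressed as floor(v / 65536). */
static int64_t asr16(int64_t v)
{
    return (v >= 0) ? v / 65536 : -((-v + 65535) / 65536);
}

uint32_t dotpnrsu2_model(uint32_t src1, uint32_t src2)
{
    int64_t sa_hi = s16(src1 >> 16);          /* signed halfwords of src1 */
    int64_t sa_lo = s16(src1);
    int64_t ub_hi = src2 >> 16;               /* unsigned halfwords of src2 */
    int64_t ub_lo = src2 & 0xFFFFu;

    int64_t int33 = sa_hi * ub_hi - sa_lo * ub_lo + 0x8000;
    return (uint32_t)asr16(int33);            /* register bit pattern of dst */
}
```

With src1 = 7FFF 8000h and src2 = FFFF FFFFh the intermediate is 4,294,868,993, which does not fit in 32 bits; the shifted result is 0000 FFFEh.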
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
DOTPNRSU2 .M1 A5, A6, A8
Example 2
DOTPNRSU2 .M2 B2, B5, B8
Example 3
DOTPNRSU2 .M2 B12, B23, B11
DOTPNRUS2 Dot Product With Negate, Shift and Round, Unsigned by Signed, Packed 16-Bit
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 1 1 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The DOTPNRUS2 pseudo-operation performs the dot-product between two pairs of
packed 16-bit values, where the second product is negated. This instruction takes the
result of the dot-product and performs an additional round and shift step. The values in
src1 are treated as signed, packed 16-bit quantities; whereas, the values in src2 are
treated as unsigned, packed 16-bit quantities. The results are written to dst. The
assembler uses the DOTPNRSU2 src1, src2, dst instruction to perform this task (see
DOTPNRSU2).
The product of the lower halfwords of src1 and src2 is subtracted from the product of the
upper halfwords of src1 and src2. The value 2^15 is then added to this sum, producing an
intermediate 32 or 33-bit result. The intermediate result is signed shifted right by 16,
producing a rounded, shifted result that is sign extended and placed in dst.
The intermediate results of the DOTPNRUS2 pseudo-operation are maintained to a
33-bit precision, ensuring that no overflow may occur during the subtracting and
rounding steps.
Execution
if (cond) {
int33 = (smsb16(src1) × umsb16(src2)) -
(slsb16(src1) × ulsb16(src2)) + 8000h;
int33 >> 16 → dst
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
DOTPRSU2 Dot Product With Shift and Round, Signed by Unsigned, Packed 16-Bit
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 1 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Returns the dot-product between two pairs of packed 16-bit values. This instruction
takes the result of the dot-product and performs an additional round and shift step. The
values in src1 are treated as signed packed 16-bit quantities; whereas, the values in
src2 are treated as unsigned packed 16-bit quantities. The results are written to dst.
The product of the lower halfwords of src1 and src2 is added to the product of the upper
halfwords of src1 and src2. The value 2^15 is then added to this sum, producing an
intermediate 32 or 33-bit result. The intermediate result is signed shifted right by 16,
producing a rounded, shifted result that is sign extended and placed in dst.
The intermediate results of the DOTPRSU2 instruction are maintained to a 33-bit
precision, ensuring that no overflow may occur during the adding and rounding
steps.
31            16 15             0
      sa_hi           sa_lo         ← src1
            DOTPRSU2
      ub_hi           ub_lo         ← src2
                =
31                                                          0
    (((sa_hi × ub_hi) + (sa_lo × ub_lo)) + 8000h) >> 16     ← dst
Execution
if (cond) {
int33 = (smsb16(src1) × umsb16(src2)) +
(slsb16(src1) × ulsb16(src2)) + 8000h;
int33 >> 16 → dst
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
DOTPRSU2 .M1 A5, A6, A8
Example 2
DOTPRSU2 .M2 B2, B5, B8
Example 3
DOTPRSU2 .M2 B12, B23, B11
DOTPRUS2 Dot Product With Shift and Round, Unsigned by Signed, Packed 16-Bit
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 1 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The DOTPRUS2 pseudo-operation returns the dot-product between two pairs of packed
16-bit values. This instruction takes the result of the dot-product, and performs an
additional round and shift step. The values in src1 are treated as signed packed 16-bit
quantities; whereas, the values in src2 are treated as unsigned packed 16-bit quantities.
The results are written to dst. The assembler uses the DOTPRSU2 (.unit) src1, src2, dst
instruction to perform this task (see DOTPRSU2).
The product of the lower halfwords of src1 and src2 is added to the product of the upper
halfwords of src1 and src2. The value 2^15 is then added to this sum, producing an
intermediate 32-bit result. The intermediate result is signed shifted right by 16, producing
a rounded, shifted result that is sign extended and placed in dst.
The intermediate results of the DOTPRUS2 pseudo-operation are maintained to a 33-bit
precision, ensuring that no overflow may occur during the adding and rounding
steps.
Execution
if (cond) {
int33 = (umsb16(src2) × smsb16(src1)) +
(ulsb16(src2) × slsb16(src1)) + 8000h;
int33 >> 16 → dst
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 0 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Returns the dot-product between four sets of packed 8-bit values. The values in src1 are
treated as signed packed 8-bit quantities; whereas, the values in src2 are treated as
unsigned 8-bit packed data. The signed result is written into dst.
For each pair of 8-bit quantities in src1 and src2, the signed 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2. The four products are summed
together, and the resulting dot product is written as a signed 32-bit result to dst.
31      24 23      16 15       8 7        0
   sa_3       sa_2       sa_1       sa_0     ← src1
                  DOTPSU4
   ub_3       ub_2       ub_1       ub_0     ← src2
                     =
31                                                                 0
   (sa_3 × ub_3) + (sa_2 × ub_2) + (sa_1 × ub_1) + (sa_0 × ub_0)   ← dst
Execution
if (cond) {
(sbyte0(src1) × ubyte0(src2)) +
(sbyte1(src1) × ubyte1(src2)) +
(sbyte2(src1) × ubyte2(src2)) +
(sbyte3(src1) × ubyte3(src2)) → dst
}
else nop
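The mixed-sign byte products in the execution block above can be modeled as follows. This is a minimal C sketch; dotpsu4_model is a hypothetical name.

```c
#include <stdint.h>

/* Sum of four products: signed byte n of src1 times unsigned byte n of src2. */
int32_t dotpsu4_model(uint32_t src1, uint32_t src2)
{
    int32_t sum = 0;
    for (int n = 0; n < 4; n++) {
        int32_t sa = (int32_t)((src1 >> (8 * n)) & 0xFFu);
        if (sa >= 0x80)
            sa -= 0x100;                       /* sign-extend the src1 byte */
        int32_t ub = (int32_t)((src2 >> (8 * n)) & 0xFFu);
        sum += sa * ub;
    }
    return sum;
}
```

The extreme case, src1 = 8080 8080h with src2 = FFFF FFFFh, gives 4 × (-128 × 255) = -130,560, well inside the 32-bit range.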
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
DOTPSU4 .M1 A5, A6, A8
Example 2
DOTPSU4 .M2 B2, B5, B8
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 0 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The DOTPUS4 pseudo-operation returns the dot-product between four sets of packed
8-bit values. The values in src1 are treated as signed packed 8-bit quantities; whereas,
the values in src2 are treated as unsigned 8-bit packed data. The signed result is written
into dst. The assembler uses the DOTPSU4 (.unit) src1, src2, dst instruction to perform
this task (see DOTPSU4).
For each pair of 8-bit quantities in src1 and src2, the signed 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2. The four products are summed
together, and the resulting dot-product is written as a signed 32-bit result to dst.
Execution
if (cond) {
(ubyte0(src2) × sbyte0(src1)) +
(ubyte1(src2) × sbyte1(src1)) +
(ubyte2(src2) × sbyte2(src1)) +
(ubyte3(src2) × sbyte3(src1)) → dst
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 1 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Returns the dot-product between four sets of packed 8-bit values. The values in both
src1 and src2 are treated as unsigned, 8-bit packed data. The unsigned result is written
into dst.
For each pair of 8-bit quantities in src1 and src2, the unsigned 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2. The four products are summed
together, and the resulting dot-product is written as a 32-bit result to dst.
31      24 23      16 15       8 7        0
   ua_3       ua_2       ua_1       ua_0     ← src1
                   DOTPU4
   ub_3       ub_2       ub_1       ub_0     ← src2
                     =
31                                                                 0
   (ua_3 × ub_3) + (ua_2 × ub_2) + (ua_1 × ub_1) + (ua_0 × ub_0)   ← dst
Execution
if (cond) {
(ubyte0(src1) × ubyte0(src2)) +
(ubyte1(src1) × ubyte1(src2)) +
(ubyte2(src1) × ubyte2(src2)) +
(ubyte3(src1) × ubyte3(src2)) → dst
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Opcode
31 30 29 28 27 24 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst 0 src2 src1 x 0 1 1 0 1 0 0 1 1 0 s p
4 5 5 1 1 1
Execution
lsb16(src1) → msb16(dst_e)
lsb16(src2) → lsb16(dst_e)
msb16(src1) → msb16(dst_o)
msb16(src2) → lsb16(dst_o)
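The four halfword moves listed above amount to two shift-and-mask expressions. A minimal C sketch, with hypothetical type and function names:

```c
#include <stdint.h>

typedef struct { uint32_t dst_o; uint32_t dst_e; } dpack_result;  /* hypothetical */

dpack_result dpack2_model(uint32_t src1, uint32_t src2)
{
    dpack_result r;
    /* lsb16(src1) → msb16(dst_e), lsb16(src2) → lsb16(dst_e) */
    r.dst_e = (src1 << 16) | (src2 & 0xFFFFu);
    /* msb16(src1) → msb16(dst_o), msb16(src2) → lsb16(dst_o) */
    r.dst_o = (src1 & 0xFFFF0000u) | (src2 >> 16);
    return r;
}
```

For src1 = 1111 2222h and src2 = 3333 4444h, this yields dst_o:dst_e = 1111 3333h:2222 4444h.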
Delay Slots 0
Example DPACK2 .L1 A0,A1,A3:A2
Opcode
31 30 29 28 27 24 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst 0 src2 src1 x 0 1 1 0 0 1 1 1 1 0 s p
4 5 5 1 1 1
Execution
lsb16(src1) → msb16(dst_e)
msb16(src2) → lsb16(dst_e)
msb16(src1) → lsb16(dst_o)
lsb16(src2) → msb16(dst_o)
Delay Slots 0
Examples Example 1
DPACKX2 .L1 A0,A1,A3:A2
Example 2
DPACKX2 .L1X A0,B0,A3:A2
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 0 0 0 1 0 0 0 1 1 0 s p
3 1 5 5 1 1 1
Description The 64-bit double-precision value in src2 is converted to an integer and placed in dst.
The operand is read in one cycle by using the src2 port for the 32 MSBs and the src1
port for the 32 LSBs.
NOTE:
1. If src2 is NaN, the maximum signed integer (7FFF FFFFh or
8000 0000h) is placed in dst and the INVAL bit is set.
2. If src2 is signed infinity or if overflow occurs, the maximum signed
integer (7FFF FFFFh or 8000 0000h) is placed in dst and the INEX
and OVER bits are set. Overflow occurs if src2 is greater than
2^31 − 1 or less than −2^31.
3. If src2 is denormalized, 0000 0000h is placed in dst and the INEX
and DEN2 bits are set.
4. If rounding is performed, the INEX bit is set.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2_l,
src2_h
Written dst
Unit in use .L
Delay Slots 3
A1:A0 4021 3333h 3333 3333h 8.6 A1:A0 4021 3333h 3333 3333h
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 0 0 0 1 0 0 1 1 1 0 s p
3 1 5 5 1 1 1
Description The double-precision 64-bit value in src2 is converted to a single-precision value and
placed in dst. The operand is read in one cycle by using the src2 port for the 32 MSBs
and the src1 port for the 32 LSBs.
NOTE:
1. If rounding is performed, the INEX bit is set.
2. If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2
bits are set.
3. If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
4. If src2 is a signed denormalized number, signed 0 is placed in dst
and the INEX and DEN2 bits are set.
5. If src2 is signed infinity, the result is signed infinity and the INFO bit is
set.
6. If overflow occurs, the INEX and OVER bits are set and the results
are set as follows (LFPN is the largest floating-point number):
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2_l,
src2_h
Written dst
Unit in use .L
Delay Slots 3
A1:A0 4021 3333h 3333 3333h 8.6 A1:A0 4021 3333h 3333 3333h
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 0 0 0 0 0 0 1 1 1 0 s p
3 1 5 5 1 1 1
Description The 64-bit double-precision value in src2 is converted to an integer and placed in dst.
This instruction operates like DPINT except that the rounding modes in the floating-point
adder configuration register (FADCR) are ignored; round toward zero (truncate) is
always used. The 64-bit operand is read in one cycle by using the src2 port for the
32 MSBs and the src1 port for the 32 LSBs.
NOTE:
1. If src2 is NaN, the maximum signed integer (7FFF FFFFh or
8000 0000h) is placed in dst and the INVAL bit is set.
2. If src2 is signed infinity or if overflow occurs, the maximum signed
integer (7FFF FFFFh or 8000 0000h) is placed in dst and the INEX
and OVER bits are set. Overflow occurs if src2 is greater than
2^31 − 1 or less than −2^31.
3. If src2 is denormalized, 0000 0000h is placed in dst and the INEX
and DEN2 bits are set.
4. If rounding is performed, the INEX bit is set.
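The truncating conversion and its saturation rules can be approximated in C. The sketch below is a simplification under stated assumptions: dptrunc_model is a hypothetical name, the INVAL, INEX, OVER, and DEN2 status bits are not modeled, and a NaN input always returns 7FFF FFFFh, whereas the note above allows either extreme.

```c
#include <stdint.h>

int32_t dptrunc_model(double src2)
{
    if (src2 != src2)                 /* NaN compares unequal to itself */
        return INT32_MAX;             /* simplification, see lead-in */
    if (src2 > 2147483647.0)          /* greater than 2^31 - 1, or +infinity */
        return INT32_MAX;
    if (src2 < -2147483648.0)         /* less than -2^31, or -infinity */
        return INT32_MIN;
    return (int32_t)src2;             /* C conversion truncates toward zero */
}
```

For the operand 8.6 shown in the example data, the truncated result is 8.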
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2_l,
src2_h
Written dst
Unit in use .L
Delay Slots 3
A1:A0 4021 3333h 3333 3333h 8.6 A1:A0 4021 3333h 3333 3333h
31 29 28 27 23 22 18 17 13 12 8 7 6 5 4 3 2 1 0
creg z dst src2 csta cstb 0 1 0 0 1 0 s p
3 1 5 5 5 5 1 1
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 1 1 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description The field in src2, specified by csta and cstb, is extracted and sign-extended to 32 bits.
The extract is performed by a shift left followed by a signed shift right. csta and cstb are
the shift left amount and shift right amount, respectively. This can be thought of in terms
of the LSB and MSB of the field to be extracted. Then csta = 31 - MSB of the field and
cstb = csta + LSB of the field. The shift left and shift right amounts may also be specified
as the ten LSBs of the src1 register with cstb being bits 0-4 and csta bits 5-9. In the
example below, csta is 12 and cstb is 11 + 12 = 23. Only the ten LSBs are valid for the
register version of the instruction. If any of the 22 MSBs are non-zero, the result is
invalid.
src2 1) X X X X X X X X X X X X 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
2) 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X 0 0 0 0 0 0 0 0 0 0 0 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
dst 3) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 1
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
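The shift-left, signed-shift-right sequence in the figure can be reproduced in C. In this sketch the names ext_model and ext_reg_model are hypothetical, and the result is returned as a 32-bit register pattern. The sign fill is done explicitly so that the code does not rely on compiler-specific right-shift behavior.

```c
#include <stdint.h>

/* Constant form: shift left by csta, then arithmetic shift right by cstb. */
uint32_t ext_model(uint32_t src2, unsigned csta, unsigned cstb)
{
    uint32_t shifted = src2 << (csta & 31u);
    unsigned r = cstb & 31u;
    uint32_t result = shifted >> r;
    if (shifted & 0x80000000u)
        result |= ~(UINT32_C(0xFFFFFFFF) >> r);   /* replicate the sign bit */
    return result;
}

/* Register form: cstb is taken from src1 bits 0-4 and csta from bits 5-9. */
uint32_t ext_reg_model(uint32_t src2, uint32_t src1)
{
    return ext_model(src2, (src1 >> 5) & 31u, src1 & 31u);
}
```

Placing the 9-bit field 1 0100 1101b from the figure at bits 19-11 gives src2 = 000A 6800h; with csta = 12 and cstb = 23 the result is FFFF FF4Dh, matching row 3 of the figure.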
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
EXT .S1 A1,10,19,A2
Example 2
EXT .S1 A1,A2,A3
31 29 28 27 23 22 18 17 13 12 8 7 6 5 4 3 2 1 0
creg z dst src2 csta cstb 0 0 0 0 1 0 s p
3 1 5 5 5 5 1 1
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 0 1 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description The field in src2, specified by csta and cstb, is extracted and zero extended to 32 bits.
The extract is performed by a shift left followed by an unsigned shift right. csta and cstb
are the amounts to shift left and shift right, respectively. This can be thought of in terms
of the LSB and MSB of the field to be extracted. Then csta = 31 - MSB of the field and
cstb = csta + LSB of the field. The shift left and shift right amounts may also be specified
as the ten LSBs of the src1 register with cstb being bits 0-4 and csta bits 5-9. In the
example below, csta is 12 and cstb is 11 + 12 = 23. Only the ten LSBs are valid for the
register version of the instruction. If any of the 22 MSBs are non-zero, the result is
invalid.
src2 1) X X X X X X X X X X X X 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
2) 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X 0 0 0 0 0 0 0 0 0 0 0 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
dst 3) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 1
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
EXTU .S1 A1,10,19,A2
Example 2
EXTU .S1 A1,A2,A3
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 1 1 1 1 1 1 1 0 0 s p
5 5 5 1 1 1
Description Performs a Galois field multiply, where src1 is 32 bits and src2 is limited to 9 bits. This
utilizes the existing hardware and produces a 32-bit result. This multiply connects all
levels of the gmpy4 together and only extends out by 8 bits; the resulting data is XORed
down by the 32-bit polynomial.
The polynomial used comes from either the GPLYA or GPLYB control register
depending on which side (A or B) the instruction executes. If the A-side M1 unit is used,
the polynomial comes from GPLYA; if the B-side M2 unit, the polynomial comes from
GPLYB.
This instruction executes unconditionally.
typedef unsigned int uword; // 32-bit unsigned word
typedef unsigned int uint;

uword gmpy(uword src1, uword src2, uword polynomial)
{
    // the multiply is always between GF(2^9) and GF(2^32)
    // so no size information is needed
    uint pp;
    uint mask, tpp;
    uint i;
    pp = 0;
    mask = 0x00000100; // multiply by computing
                       // partial products.
    for ( i=0; i<8; i++ ){
        if ( src2 & mask ) pp ^= src1; // add src1 for this bit of src2 (MSB first)
        mask >>= 1;
        tpp = pp << 1;
        if (pp & 0x80000000) pp = polynomial ^ tpp; // reduce when bit 31 shifts out
        else pp = tpp;
    }
    if ( src2 & 0x1 ) pp ^= src1;
    return pp;
}
Execution
if (unit = M1)
GMPY_poly = GPLYA
lsb9(src2) gmpy src1 → dst
else if (unit = M2)
GMPY_poly = GPLYB
lsb9(src2) gmpy src1 → dst
Delay Slots 3
A1 0000 0126h
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg 1 dst src2 src1 x 0 1 0 0 0 1 1 1 0 0 s p
3 5 5 5 1 1 1
Description Performs the Galois field multiply on four values in src1 with four parallel values in src2.
The four products are packed into dst. The values in both src1 and src2 are treated as
unsigned, 8-bit packed data.
For each pair of 8-bit quantities in src1 and src2, the unsigned, 8-bit value from src1 is
Galois field multiplied (gmpy) with the unsigned, 8-bit value from src2. The product of
src1 byte 0 and src2 byte 0 is written to byte0 of dst. The product of src1 byte 1 and src2
byte 1 is written to byte1 of dst. The product of src1 byte 2 and src2 byte 2 is written to
byte2 of dst. The product of src1 byte 3 and src2 byte 3 is written to the most-significant
byte in dst.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ← src1
GMPY4
ub_3 ub_2 ub_1 ub_0 ← src2
= = = =
31 0
ua_3 gmpy ub_3 ua_2 gmpy ub_2 ua_1 gmpy ub_1 ua_0 gmpy ub_0 ← dst
The size and polynomial are controlled by the Galois field polynomial generator function
register (GFPGFR). All registers in the control register file can be written using the MVC
instruction (see MVC).
The default field generator polynomial is 1Dh, and the default size is 7. This setting is
used for many communications standards.
Note that the GMPY4 instruction is commutative, so:
GMPY4 .M1 A10,A12,A13
is equivalent to:
GMPY4 .M1 A12,A10,A13
Execution
if (cond) {
(ubyte0(src1) gmpy ubyte0(src2)) → ubyte0(dst);
(ubyte1(src1) gmpy ubyte1(src2)) → ubyte1(dst);
(ubyte2(src1) gmpy ubyte2(src2)) → ubyte2(dst);
(ubyte3(src1) gmpy ubyte3(src2)) → ubyte3(dst)
}
else nop
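The per-byte operation can be sketched in C. This is an illustrative model only, assuming the default GFPGFR settings described above (generator polynomial 1Dh, size 7, that is, an 8-bit field); other size settings are not modeled. The names gf256_mul and gmpy4_model are hypothetical.

```c
#include <stdint.h>

/* Multiply two bytes in GF(2^8). poly holds the low 8 bits of the
 * field generator polynomial (1Dh by default). */
uint8_t gf256_mul(uint8_t a, uint8_t b, uint8_t poly)
{
    uint8_t product = 0;
    for (int bit = 0; bit < 8; bit++) {
        if (b & 1u)
            product ^= a;              /* add (XOR) the current multiple of a */
        b >>= 1;
        uint8_t carry = a & 0x80u;     /* does the next shift overflow 8 bits? */
        a = (uint8_t)(a << 1);
        if (carry)
            a ^= poly;                 /* reduce by the generator polynomial */
    }
    return product;
}

/* Hypothetical model of GMPY4: four independent byte-lane products. */
uint32_t gmpy4_model(uint32_t src1, uint32_t src2, uint8_t poly)
{
    uint32_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t a = (uint8_t)(src1 >> (8 * lane));
        uint8_t b = (uint8_t)(src2 >> (8 * lane));
        dst |= (uint32_t)gf256_mul(a, b, poly) << (8 * lane);
    }
    return dst;
}
```

Because each lane is a field multiplication, swapping src1 and src2 in this model gives the same result, which mirrors the commutativity noted above.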
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
GMPY4 .M1 A5,A6,A7; polynomial = 0x1d
A5 45 23 00 01h 69 35 0 1 A5 45 23 00 01h
unsigned
A6 57 34 00 01h 87 52 0 1 A6 57 34 00 01h
unsigned
Example 2
GMPY4 .M1 A5,A6,A7; field size is 0x7
Syntax IDLE
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 p
1
Description Performs an infinite multicycle NOP that terminates when an interrupt is serviced or
when a branch occurs because the IDLE instruction is in the delay slots of a branch.
The IDLE instruction cannot be paired with any other multicycle NOP instruction in the
same execute packet. Instructions that generate a multicycle NOP are: ADDKPC,
BNOP, and the multicycle NOP.
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 0 1 1 1 0 0 1 1 1 0 s p
3 1 5 5 1 1 1
Description The signed integer value in src2 is converted to a double-precision value and placed in
dst.
You cannot set configuration bits with this instruction.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read src2
Written dst_l dst_h
Unit in use .L
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 4
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 41B9 6511h 2700 0000h
4.2605393 E08
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 0 1 1 1 0 1 1 1 1 0 s p
3 1 5 5 1 1 1
Description The unsigned integer value in src2 is converted to a double-precision value and placed
in dst.
You cannot set configuration bits with this instruction.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read src2
Written dst_l dst_h
Unit in use .L
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 4
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 41EF FFFFh FBC0 0000h
4.2949673 E09
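The way a converted value lands in the dst_h:dst_l register pair can be checked on a host machine. The sketch below assumes the host uses IEEE 754 double precision; the names reg_pair, intdp_model, and intdpu_model are hypothetical, and the model ignores pipeline timing.

```c
#include <stdint.h>
#include <string.h>

/* A 64-bit result held in an odd:even register pair. */
typedef struct {
    uint32_t odd;   /* upper word: sign, exponent, top of the mantissa */
    uint32_t even;  /* lower word: remainder of the mantissa */
} reg_pair;

static reg_pair split_double(double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);   /* well-defined reinterpretation */
    reg_pair pair = { (uint32_t)(bits >> 32), (uint32_t)bits };
    return pair;
}

/* Hypothetical models: signed and unsigned 32-bit integer to double. */
reg_pair intdp_model(int32_t src2)   { return split_double((double)src2); }
reg_pair intdpu_model(uint32_t src2) { return split_double((double)src2); }
```

For instance, the bit pattern 41B9 6511h 2700 0000h decodes to the integer 1965 1127h (426,053,927), and 41EF FFFFh FBC0 0000h decodes to FFFF FFDEh read as unsigned (4,294,967,262). Every 32-bit integer fits in the 53-bit double-precision significand, so neither conversion rounds.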
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 1 0 0 1 0 1 0 1 1 0 s p
3 1 5 5 1 1 1
Description The signed integer value in src2 is converted to a single-precision value and placed in
dst.
The only configuration bit that can be set is the INEX bit and only if the mantissa is
rounded.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2
Written dst
Unit in use .L
Delay Slots 3
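The rounding condition tied to INEX can be illustrated on a host machine: a single-precision value has a 24-bit significand, so integers that need more significant bits must be rounded. The sketch below assumes an IEEE 754 host; intsp_is_inexact is a hypothetical helper, not a device register read.

```c
#include <stdint.h>

/* Returns 1 if converting src2 to single precision must round the mantissa
 * (the case in which INEX can be set), 0 if the conversion is exact. */
int intsp_is_inexact(int32_t src2)
{
    volatile float converted = (float)src2;    /* volatile forces a true 32-bit float */
    return (double)converted != (double)src2;  /* double holds every int32 exactly */
}
```

Values up to 2^24 in magnitude convert exactly; 2^24 + 1 is the smallest positive integer that does not.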
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 1 0 0 1 0 0 1 1 1 0 s p
3 1 5 5 1 1 1
Description The unsigned integer value in src2 is converted to a single-precision value and placed in
dst.
The only configuration bit that can be set is the INEX bit and only if the mantissa is
rounded.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2
Written dst
Unit in use .L
Delay Slots 3
LDB(U) Load Byte From Memory With a 5-Bit Unsigned Constant Offset or Register Offset
Syntax
Opcode
31 29 28 27 23 22 18 17 13 12 9 8 7 6 4 3 2 1 0
creg z dst baseR offsetR/ucst5 mode 0 y op 0 1 s p
3 1 5 5 5 4 1 3 1 1
Description Loads a byte from memory to a general-purpose register (dst). Table 3-23 summarizes
the data types supported by loads. Table 3-11 describes the addressing generator
options. The memory address is formed from a base address register (baseR) and an
optional offset that is either a register (offsetR) or a 5-bit unsigned constant (ucst5). If an
offset is not given, the assembler assigns an offset of zero.
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
offsetR/ucst5 is scaled by a left-shift of 0 bits. After scaling, offsetR/ucst5 is added to or
subtracted from baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is the address to be accessed in memory.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
For LDB(U), the values are loaded into the 8 LSBs of dst. For LDB, the upper 24 bits of
dst are sign-extended; for LDBU, the upper 24 bits of dst are zero-filled. The s bit
determines which file dst will be loaded into: s = 0 indicates dst will be loaded in the A
register file and s = 1 indicates dst will be loaded in the B register file.
Increments and decrements default to 1 and offsets default to 0 when no bracketed
register or constant is specified. Loads that do not modify baseR can use the
syntax *R. Square brackets, [ ], indicate that the ucst5 offset is left-shifted by 0.
Parentheses, ( ), can be used to set a nonscaled, constant offset. You must type either
brackets or parentheses around the specified offset, if you use the optional offset
parameter.
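The address generation and byte extension rules above can be summarized in a small C model. This is a sketch under stated assumptions: linear addressing only (no AMR circular mode), and the names addr_mode, gen_address, ldb_extend, and ldbu_extend are hypothetical.

```c
#include <stdint.h>

/* Address generator options named in the text (hypothetical enum). */
typedef enum {
    NEG_OFFSET, POS_OFFSET,   /* *-R[n],  *+R[n]  : baseR unchanged     */
    PRE_DEC,    PRE_INC,      /* *--R[n], *++R[n] : modify, then access */
    POST_DEC,   POST_INC      /* *R--[n], *R++[n] : access, then modify */
} addr_mode;

/* Returns the access address and updates *base_reg as the mode requires.
 * shift is the offset scaling, which is 0 for byte loads. */
uint32_t gen_address(uint32_t *base_reg, uint32_t offset, unsigned shift, addr_mode mode)
{
    uint32_t scaled = offset << shift;
    uint32_t old = *base_reg;
    int subtract = (mode == NEG_OFFSET || mode == PRE_DEC || mode == POST_DEC);
    uint32_t updated = subtract ? old - scaled : old + scaled;

    if (mode != NEG_OFFSET && mode != POS_OFFSET)
        *base_reg = updated;              /* increment/decrement forms write baseR */
    return (mode == POST_DEC || mode == POST_INC) ? old : updated;
}

/* LDB sign-extends the byte into dst; LDBU zero-fills the upper 24 bits. */
uint32_t ldb_extend(uint8_t byte)  { return (byte & 0x80u) ? (0xFFFFFF00u | byte) : byte; }
uint32_t ldbu_extend(uint8_t byte) { return byte; }
```

The same gen_address model applies to the halfword, word, and doubleword forms by passing a shift of 1, 2, or 3.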
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR
Written baseR dst
Unit in use .D
Examples Example 1
LDB .D1 *-A5[4],A7
Example 2
LDB .D1 *++A4[5],A8
mem 4000h 0112 2334h mem 4000h 0112 2334h mem 4000h 0112 2334h
mem 4004h 4556 6778h mem 4004h 4556 6778h mem 4004h 4556 6778h
Example 3
LDB .D1 *A4++[5],A8
mem 4000h 0112 2334h mem 4000h 0112 2334h mem 4000h 0112 2334h
mem 4004h 4556 6778h mem 4004h 4556 6778h mem 4004h 4556 6778h
Example 4
LDB .D1 *++A4[A12],A8
mem 4000h 0112 2334h mem 4000h 0112 2334h mem 4000h 0112 2334h
mem 4004h 4556 6778h mem 4004h 4556 6778h mem 4004h 4556 6778h
LDB(U) Load Byte From Memory With a 15-Bit Unsigned Constant Offset
Opcode
31 29 28 27 23 22 8 7 6 4 3 2 1 0
creg z dst ucst15 y op 1 1 s p
3 1 5 15 1 3 1 1
Description Loads a byte from memory to a general-purpose register (dst). Table 3-24 summarizes
the data types supported by loads. The memory address is formed from a base address
register B14 (y = 0) or B15 (y = 1) and an offset, which is a 15-bit unsigned constant
(ucst15). The assembler selects this format only when the constant is larger than five
bits in magnitude. This instruction operates only on the .D2 unit.
The offset, ucst15, is scaled by a left shift of 0 bits. After scaling, ucst15 is added to
baseR. Subtraction is not supported. The result of the calculation is the address sent to
memory. The addressing arithmetic is always performed in linear mode.
For LDB(U), the values are loaded into the 8 LSBs of dst. For LDB, the upper 24 bits of
dst are sign-extended; for LDBU, the upper 24 bits of dst are zero-filled. The s bit
determines which file dst will be loaded into: s = 0 indicates dst will be loaded in the A
register file and s = 1 indicates dst will be loaded in the B register file.
Square brackets, [ ], indicate that the ucst15 offset is left-shifted by 0. Parentheses, ( ),
can be used to set a nonscaled, constant offset. You must type either brackets or
parentheses around the specified offset, if you use the optional offset parameter.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read B14/B15
Written dst
Unit in use .D2
Delay Slots 4
B1 0000 0012h
LDDW Load Doubleword From Memory With a 5-Bit Unsigned Constant Offset or
Register Offset
Syntax
Opcode
31 29 28 27 23 22 18 17 13 12 9 8 7 6 5 4 3 2 1 0
creg z dst baseR offsetR/ucst5 mode 1 y 1 1 0 0 1 s p
3 1 5 5 5 4 1 1 1
Description Loads a 64-bit quantity from memory into a register pair dst_o:dst_e. Table 3-11
describes the addressing generator options. The memory address is formed from a base
address register (baseR) and an optional offset that is either a register (offsetR) or a
5-bit unsigned constant (ucst5).
Both offsetR and baseR must be in the same register file and on the same side as the .D
unit used. The y bit in the opcode determines the .D unit and the register file used: y = 0
selects the .D1 unit and the baseR and offsetR from the A register file, and y = 1 selects
the .D2 unit and baseR and offsetR from the B register file. The s bit determines the
register file into which the dst is loaded: s = 0 indicates that dst is in the A register file,
and s = 1 indicates that dst is in the B register file. The dst field must always be an even
value because the LDDW instruction loads register pairs. Therefore, bit 23 is always
zero.
The offsetR/ucst5 is scaled by a left-shift of 3 to correctly represent doublewords. After
scaling, offsetR/ucst5 is added to or subtracted from baseR. For the preincrement,
predecrement, positive offset, and negative offset address generator options, the result
of the calculation is the address to be accessed in memory. For postincrement or
postdecrement addressing, the value of baseR before the addition or subtraction
is the address to be accessed in memory.
Increments and decrements default to 1 and offsets default to 0 when no bracketed
register, bracketed constant, or constant enclosed in parentheses is specified. Square
brackets, [ ], indicate that ucst5 is left shifted by 3. Parentheses, ( ), indicate that ucst5 is
not left shifted. In other words, parentheses indicate a byte offset rather than a
doubleword offset. You must type either brackets or parentheses around the specified
offset if you use the optional offset parameter.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
The destination register pair must consist of a consecutive even and odd register pair
from the same register file. The instruction can be used to load a double-precision
floating-point value (64 bits), a pair of single-precision floating-point words (32 bits), or a
pair of 32-bit integers. The 32 least-significant bits are loaded into the even-numbered
register and the 32 most-significant bits (containing the sign bit and exponent) are
loaded into the next register (which is always an odd-numbered register). The register pair
syntax places the odd register first, followed by a colon, then the even register (that is,
A1:A0, B1:B0, A3:A2, B3:B2, etc.).
All 64 bits of the double-precision floating point value are stored in big- or little-endian
byte order, depending on the mode selected. When the LDDW instruction is used to load
two 32-bit single-precision floating-point values or two 32-bit integer values, the order is
dependent on the endian mode used. In little-endian mode, the first 32-bit word in
memory is loaded into the even register. In big-endian mode, the first 32-bit word in
memory is loaded into the odd register. Regardless of the endian mode, the doubleword
address must be on a doubleword boundary (the three LSBs are zero).
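The endian-dependent placement of the two memory words can be sketched in C. This is an illustrative model that assumes an IEEE 754 host for the double-precision check; reg_pair, lddw_model, pair_to_double, and lddw_address_ok are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t odd, even; } reg_pair;  /* dst_o:dst_e */

/* first_word is the 32-bit word at the doubleword address and second_word
 * is the word 4 bytes above it. little_endian selects the endian mode. */
reg_pair lddw_model(uint32_t first_word, uint32_t second_word, int little_endian)
{
    reg_pair dst;
    if (little_endian) { dst.even = first_word;  dst.odd = second_word; }
    else               { dst.even = second_word; dst.odd = first_word;  }
    return dst;
}

/* Reinterprets a register pair as a double-precision value. */
double pair_to_double(reg_pair pair)
{
    uint64_t bits = ((uint64_t)pair.odd << 32) | pair.even;
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* The address must be doubleword aligned: the three LSBs are zero. */
int lddw_address_ok(uint32_t address) { return (address & 7u) == 0; }
```

In little-endian mode, memory words 3333 3333h followed by 4021 3333h produce the pair 4021 3333h:3333 3333h, which is the double-precision encoding of 8.6. In big-endian mode the same pair results when the words appear in memory in the opposite order.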
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR
Written baseR dst
Unit in use .D
Delay Slots 4
Examples Example 1
LDDW .D2 *+B10[1],A1:A0
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 4021 3333h 3333 3333h
mem 18h 3333 3333h 4021 3333h 8.6 mem 18h 3333 3333h 4021 3333h
Little-endian mode
Example 2
LDDW .D1 *++A10[1],A1:A0
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 xxxx xxxxh xxxx xxxxh
mem 18h 4021 3333h 3333 3333h 8.6 mem 18h 4021 3333h 3333 3333h
Example 3
LDDW .D1 *A4++[5],A9:A8
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 xxxx xxxxh xxxx xxxxh
mem 40B0h 0112 2334h 4556 6778h mem 40B0h 0112 2334h 4556 6778h
A4 0000 40B0h
Example 4
LDDW .D1 *++A4[A12],A9:A8
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 xxxx xxxxh xxxx xxxxh
mem 40E0h 0112 2334h 4556 6778h 8 mem 40E0h 0112 2334h 4556 6778h
A4 0000 40E0h
Example 5
LDDW .D1 *++A4(16),A9:A8
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 xxxx xxxxh xxxx xxxxh
mem 40C0h 4556 6778h 899A ABBCh mem 40C0h 4556 6778h 899A ABBCh
A4 0000 40C0h
LDH(U) Load Halfword From Memory With a 5-Bit Unsigned Constant Offset or
Register Offset
Syntax
Opcode
31 29 28 27 23 22 18 17 13 12 9 8 7 6 4 3 2 1 0
creg z dst baseR offsetR/ucst5 mode 0 y op 0 1 s p
3 1 5 5 5 4 1 3 1 1
Description Loads a halfword from memory to a general-purpose register (dst). Table 3-25
summarizes the data types supported by halfword loads. Table 3-11 describes the
addressing generator options. The memory address is formed from a base address
register (baseR) and an optional offset that is either a register (offsetR) or a 5-bit
unsigned constant (ucst5). If an offset is not given, the assembler assigns an offset of
zero.
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
offsetR/ucst5 is scaled by a left-shift of 1 bit. After scaling, offsetR/ucst5 is added to or
subtracted from baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is the address to be accessed in memory.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
For LDH(U), the values are loaded into the 16 LSBs of dst. For LDH, the upper 16 bits of
dst are sign-extended; for LDHU, the upper 16 bits of dst are zero-filled. The s bit
determines which file dst will be loaded into: s = 0 indicates dst will be loaded in the A
register file and s = 1 indicates dst will be loaded in the B register file.
Increments and decrements default to 1 and offsets default to 0 when no bracketed
register or constant is specified. Loads that do not modify baseR can use the
syntax *R. Square brackets, [ ], indicate that the ucst5 offset is left-shifted by 1.
Parentheses, ( ), can be used to set a nonscaled, constant offset. You must type either
brackets or parentheses around the specified offset, if you use the optional offset
parameter.
Halfword addresses must be aligned on halfword (LSB is 0) boundaries.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR
Written baseR dst
Unit in use .D
LDH(U) Load Halfword From Memory With a 15-Bit Unsigned Constant Offset
Opcode
31 29 28 27 23 22 8 7 6 4 3 2 1 0
creg z dst ucst15 y op 1 1 s p
3 1 5 15 1 3 1 1
Description Loads a halfword from memory to a general-purpose register (dst). Table 3-26
summarizes the data types supported by loads. The memory address is formed from a
base address register B14 (y = 0) or B15 (y = 1) and an offset, which is a 15-bit
unsigned constant (ucst15). The assembler selects this format only when the constant is
larger than five bits in magnitude. This instruction operates only on the .D2 unit.
The offset, ucst15, is scaled by a left shift of 1 bit. After scaling, ucst15 is added to
baseR. Subtraction is not supported. The result of the calculation is the address sent to
memory. The addressing arithmetic is always performed in linear mode.
For LDH(U), the values are loaded into the 16 LSBs of dst. For LDH, the upper 16 bits of
dst are sign-extended; for LDHU, the upper 16 bits of dst are zero-filled. The s bit
determines which file dst will be loaded into: s = 0 indicates dst will be loaded in the A
register file and s = 1 indicates dst will be loaded in the B register file.
Square brackets, [ ], indicate that the ucst15 offset is left-shifted by 1. Parentheses, ( ),
can be used to set a nonscaled, constant offset. You must type either brackets or
parentheses around the specified offset, if you use the optional offset parameter.
Halfword addresses must be aligned on halfword (LSB is 0) boundaries.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read B14/B15
Written dst
Unit in use .D2
Delay Slots 4
LDNDW Load Nonaligned Doubleword From Memory With Constant or Register Offset
Syntax
Opcode
31 29 28 27 24 23 22 18 17 13 12 9 8 7 6 5 4 3 2 1 0
creg z dst sc baseR offsetR/ucst5 mode 1 y 0 1 0 0 1 s p
3 1 4 1 5 5 4 1 1 1
Description Loads a 64-bit quantity from memory into a register pair, dst_o:dst_e. Table 3-11
describes the addressing generator options. The LDNDW instruction may read a 64-bit
value from any byte boundary. Thus alignment to a 64-bit boundary is not required. The
memory address is formed from a base address register (baseR) and an optional offset
that is either a register (offsetR) or a 5-bit unsigned constant (ucst5).
Both offsetR and baseR must be in the same register file, and on the same side, as the
.D unit used. The y bit in the opcode determines the .D unit and register file used: y = 0
selects the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the
.D2 unit and baseR and offsetR from the B register file.
The LDNDW instruction supports both scaled offsets and nonscaled offsets. The sc field
is used to indicate whether the offsetR/ucst5 is scaled or not. If sc is 1 (scaled), the
offsetR/ucst5 is shifted left 3 bits before adding or subtracting from the baseR. If sc is 0
(nonscaled), the offsetR/ucst5 is not shifted before adding or subtracting from the baseR.
For the preincrement, predecrement, positive offset, and negative offset address
generator options, the result of the calculation is the address to be accessed in memory.
For postincrement or postdecrement addressing, the value of baseR before the addition
or subtraction is the address to be accessed from memory.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
The dst field of the instruction selects a register pair, a consecutive even-numbered and
odd-numbered register pair from the same register file. The instruction can be used to
load a pair of 32-bit integers. The 32 least-significant bits are loaded into the
even-numbered register and the 32 most-significant bits are loaded into the next register
(that is always an odd-numbered register).
The dst can be in either register file, regardless of the .D unit or baseR or offsetR used.
The s bit determines which file dst will be loaded into: s = 0 indicates dst will be in the A
register file and s = 1 indicates dst will be loaded in the B register file.
Assembler Notes When no bracketed register or constant is specified, the assembler defaults increments
and decrements to 1 and offsets to 0. Loads that do not modify baseR can
use the assembler syntax *R. Square brackets, [ ], indicate that the ucst5 offset is
left-shifted by 3 for doubleword loads.
Parentheses, ( ), can be used to tell the assembler that the offset is a non-scaled offset.
For example, LDNDW (.unit) *+baseR (14), dst represents an offset of 14 bytes, and the
assembler writes out the instruction with offsetC = 14 and sc = 0.
LDNDW (.unit) *+baseR [16], dst represents an offset of 16 doublewords, or 128 bytes,
and the assembler writes out the instruction with offsetC = 16 and sc = 1.
Either brackets or parentheses must be typed around the specified offset if the optional
offset parameter is used.
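The two assembler notations map onto the offset field and the sc bit as follows. This is a hypothetical sketch of the rule stated above, not the assembler's actual implementation; ldndw_offset and the function names are illustrative.

```c
#include <stdint.h>

/* Offset field value and sc bit as emitted for one LDNDW operand. */
typedef struct { unsigned offset_field; unsigned sc; } ldndw_offset;

/* Parentheses: byte offset, emitted unscaled (sc = 0). */
ldndw_offset ldndw_paren(unsigned bytes)
{
    ldndw_offset e = { bytes, 0 };
    return e;
}

/* Brackets: doubleword offset, emitted scaled (sc = 1). */
ldndw_offset ldndw_bracket(unsigned doublewords)
{
    ldndw_offset e = { doublewords, 1 };
    return e;
}

/* Byte distance applied to baseR for a given encoding. */
uint32_t ldndw_byte_offset(ldndw_offset e)
{
    return e.sc ? ((uint32_t)e.offset_field << 3) : e.offset_field;
}
```

With this model, (14) yields a 14-byte offset with sc = 0, and [16] yields 16 << 3 = 128 bytes with sc = 1, matching the two cases in the text.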
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR
Written baseR dst
Unit in use .D
Examples Example 1
LDNDW .D1 *A0++, A3:A2
A3:A2 xxxx xxxxh xxxx xxxxh A3:A2 xxxx xxxxh xxxx xxxxh
A0 0000 1009h
Byte Memory 100C 100B 100A 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000
Address
Data Value 11 05 69 34 5E 1C 4F 29 A8 12 B6 C5 D4
Example 2
LDNDW .D1 *A0++, A3:A2
A3:A2 xxxx xxxxh xxxx xxxxh A3:A2 xxxx xxxxh xxxx xxxxh
A0 0000 100Bh
Byte Memory 100C 100B 100A 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000
Address
Data Value 11 05 69 34 5E 1C 4F 29 A8 12 B6 C5 D4
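A nonaligned doubleword load over the byte table in these examples can be modeled as below. The sketch assumes little-endian mode, and it infers starting addresses of 1001h and 1003h from the final A0 values shown (1009h and 100Bh) minus the 8-byte post-increment. The names memory_1000h, reg_pair, and ldndw_postinc are hypothetical.

```c
#include <stdint.h>

/* The 13 bytes listed in the examples; index 0 is address 1000h. */
static const uint8_t memory_1000h[13] = {
    0xD4, 0xC5, 0xB6, 0x12, 0xA8, 0x29, 0x4F, 0x1C, 0x5E, 0x34, 0x69, 0x05, 0x11
};

typedef struct { uint32_t odd, even; } reg_pair;

/* Little-endian nonaligned 64-bit load with *R++ addressing.
 * The caller must keep the 8-byte window inside 1000h-100Ch. */
reg_pair ldndw_postinc(uint32_t *base_reg)
{
    uint32_t index = *base_reg - 0x1000u;
    uint64_t value = 0;
    for (int i = 7; i >= 0; i--)          /* lowest address is the least-significant byte */
        value = (value << 8) | memory_1000h[index + (uint32_t)i];
    *base_reg += 8;                       /* default increment of one doubleword */
    reg_pair dst = { (uint32_t)(value >> 32), (uint32_t)value };
    return dst;
}
```

Starting at 1001h, the model returns 5E1C 4F29h:A812 B6C5h; starting at 1003h, it returns 6934 5E1Ch:4F29 A812h. Neither address is doubleword aligned, which is the case LDNDW exists to handle.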
LDNW Load Nonaligned Word From Memory With Constant or Register Offset
Syntax
Opcode
31 29 28 27 23 22 18 17 13 12 9 8 7 6 5 4 3 2 1 0
creg z dst baseR offsetR/ucst5 mode 1 y 0 1 1 0 1 s p
3 1 5 5 5 4 1 1 1
Description Loads a 32-bit quantity from memory into a 32-bit register, dst. Table 3-11 describes the
addressing generator options. The LDNW instruction may read a 32-bit value from any
byte boundary. Thus alignment to a 32-bit boundary is not required. The memory
address is formed from a base address register (baseR), and an optional offset that is
either a register (offsetR) or a 5-bit unsigned constant (ucst5). If an offset is not given,
the assembler assigns an offset of zero.
Both offsetR and baseR must be in the same register file, and on the same side, as the
.D unit used. The y bit in the opcode determines the .D unit and register file used: y = 0
selects the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the
.D2 unit and baseR and offsetR from the B register file.
The offsetR/ucst5 is scaled by a left shift of 2 bits. After scaling, offsetR/ucst5 is added
to, or subtracted from, baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is the address to be accessed from memory.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
The dst can be in either register file, regardless of the .D unit or baseR or offsetR used.
The s bit determines which file dst will be loaded into: s = 0 indicates dst will be in the A
register file and s = 1 indicates dst will be loaded in the B register file.
Assembler Notes When no bracketed register or constant is specified, the assembler defaults increments
and decrements to 1 and offsets to 0. Loads that do not modify baseR can
use the assembler syntax *R. Square brackets, [ ], indicate that the ucst5 offset is
left-shifted by 2 for word loads.
Parentheses, ( ), can be used to tell the assembler that the offset is a nonscaled,
constant offset. The assembler right shifts the constant by 2 bits for word loads before
using it for the ucst5 field. After scaling by the LDNW instruction, this results in the same
constant offset as the assembler source if the least-significant two bits are zeros.
For example, LDNW (.unit) *+baseR (12), dst represents an offset of 12 bytes (3 words),
and the assembler writes out the instruction with ucst5 = 3.
LDNW (.unit) *+baseR [12], dst represents an offset of 12 words, or 48 bytes, and the
assembler writes out the instruction with ucst5 = 12.
Either brackets or parentheses must be typed around the specified offset if the optional
offset parameter is used.
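The relationship between the assembler source offset and the encoded ucst5 field can be written out as a short C sketch. It models only the rule stated above; the function names are hypothetical and do not correspond to any assembler API.

```c
#include <stdint.h>

/* ucst5 field for a parenthesized (byte) offset: shifted right by 2. */
unsigned ldnw_field_from_paren(unsigned bytes)   { return (bytes >> 2) & 0x1Fu; }

/* ucst5 field for a bracketed (word) offset: used as written. */
unsigned ldnw_field_from_bracket(unsigned words) { return words & 0x1Fu; }

/* Byte offset the instruction applies: the field scaled by a left shift of 2. */
uint32_t ldnw_applied_bytes(unsigned ucst5)      { return (uint32_t)ucst5 << 2; }
```

Here (12) encodes ucst5 = 3 and applies 12 bytes, while [12] encodes ucst5 = 12 and applies 48 bytes. A parenthesized offset such as 14, whose two least-significant bits are not zero, would round-trip to 12 in this model, which is why the text restricts the equivalence to offsets that are multiples of 4.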
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR
Written baseR dst
Unit in use .D
Examples Example 1
LDNW .D1 *A0++, A2
mem 1000h 12B6 C5D4h mem 1000h 12B6 C5D4h mem 1000h 12B6 C5D4h
mem 1004h 1C4F 29A8h mem 1004h 1C4F 29A8h mem 1004h 1C4F 29A8h
Byte Memory Address 1007 1006 1005 1004 1003 1002 1001 1000
Data Value 1C 4F 29 A8 12 B6 C5 D4
Example 2
LDNW .D1 *A0++, A2
mem 1000h 12B6 C5D4h mem 1000h 12B6 C5D4h mem 1000h 12B6 C5D4h
mem 1004h 1C4F 29A8h mem 1004h 1C4F 29A8h mem 1004h 1C4F 29A8h
Byte Memory Address 1007 1006 1005 1004 1003 1002 1001 1000
Data Value 1C 4F 29 A8 12 B6 C5 D4
LDW Load Word From Memory With a 5-Bit Unsigned Constant Offset or Register Offset
Syntax
Opcode
31 29 28 27 23 22 18 17 13 12 9 8 7 6 5 4 3 2 1 0
creg z dst baseR offsetR/ucst5 mode 0 y 1 1 0 0 1 s p
3 1 5 5 5 4 1 1 1
Description Loads a word from memory to a general-purpose register (dst). Table 3-11 describes the
addressing generator options. The memory address is formed from a base address
register (baseR) and an optional offset that is either a register (offsetR) or a 5-bit
unsigned constant (ucst5). If an offset is not given, the assembler assigns an offset of
zero.
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
offsetR/ucst5 is scaled by a left-shift of 2 bits. After scaling, offsetR/ucst5 is added to or
subtracted from baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is the address to be accessed in memory.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
For LDW, the loaded word fills all 32 bits of dst. dst can be in either register file, regardless of the .D
unit or baseR or offsetR used. The s bit determines which file dst will be loaded into:
s = 0 indicates dst will be loaded in the A register file and s = 1 indicates dst will be
loaded in the B register file.
Increments and decrements default to 1 and offsets default to 0 when no bracketed
register or constant is specified. Loads that do not modify baseR can use the
syntax *R. Square brackets, [ ], indicate that the ucst5 offset is left-shifted by 2.
Parentheses, ( ), can be used to set a nonscaled, constant offset. For example,
LDW (.unit) *+baseR (12), dst represents an offset of 12 bytes; whereas, LDW (.unit)
*+baseR [12], dst represents an offset of 12 words, or 48 bytes. You must type either
brackets or parentheses around the specified offset, if you use the optional offset
parameter.
Word addresses must be aligned on word (two LSBs are 0) boundaries.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read baseR, offsetR
Written baseR dst
Unit in use .D
Examples Example 1
LDW .D1 *A10,B1
mem 100h 21F3 1996h mem 100h 21F3 1996h mem 100h 21F3 1996h
Example 2
LDW .D1 *A4++[1],A6
mem 100h 0798 F25Ah mem 100h 0798 F25Ah mem 100h 0798 F25Ah
mem 104h 1970 19F3h mem 104h 1970 19F3h mem 104h 1970 19F3h
Example 3
LDW .D1 *++A4[1],A6
mem 104h 0217 6991h mem 104h 0217 6991h mem 104h 0217 6991h
Example 4
LDW .D1 *++A4[A12],A8
mem 40C8h DCCB BAA8h mem 40C8h DCCB BAA8h mem 40C8h DCCB BAA8h
Example 5
LDW .D1 *++A4(8),A8
mem 40B8h 9AAB BCCDh mem 40B8h 9AAB BCCDh mem 40B8h 9AAB BCCDh
LDW Load Word From Memory With a 15-Bit Unsigned Constant Offset
Opcode
31 29 28 27 23 22 8 7 6 5 4 3 2 1 0
creg z dst ucst15 y 1 1 0 1 1 s p
3 1 5 15 1 1 1
Description Loads a word from memory to a general-purpose register (dst). The memory address is
formed from a base address register B14 (y = 0) or B15 (y = 1) and an offset, which is a
15-bit unsigned constant (ucst15). The assembler selects this format only when the
constant is larger than five bits in magnitude. This instruction operates only on the .D2
unit.
The offset, ucst15, is scaled by a left shift of 2 bits. After scaling, ucst15 is added to
baseR. Subtraction is not supported. The result of the calculation is the address sent to
memory. The addressing arithmetic is always performed in linear mode.
For LDW, the loaded word fills all 32 bits of dst. dst can be in either register file. The s bit determines
which file dst will be loaded into: s = 0 indicates dst will be loaded in the A register file
and s = 1 indicates dst will be loaded in the B register file.
Square brackets, [ ], indicate that the ucst15 offset is left-shifted by 2. Parentheses, ( ),
can be used to set a nonscaled, constant offset. For example,
LDW (.unit) *+B14/B15(60), dst represents an offset of 60 bytes; whereas,
LDW (.unit) *+B14/B15[60], dst represents an offset of 60 words, or 240 bytes. You must
type either brackets or parentheses around the specified offset, if you use the optional
offset parameter.
Word addresses must be aligned on word (two LSBs are 0) boundaries.
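The effective address for the 15-bit constant form can be sketched as follows. This is a minimal model of the rule described above, with hypothetical names (ucst15_address, word_aligned) and with the B14 and B15 contents passed in as plain arguments.

```c
#include <stdint.h>

/* Effective address for the ucst15 forms: the base is B14 (y = 0) or B15 (y = 1).
 * shift is 0 for LDB(U), 1 for LDH(U), and 2 for LDW. Addition only, linear mode. */
uint32_t ucst15_address(uint32_t b14, uint32_t b15, unsigned y,
                        uint32_t ucst15, unsigned shift)
{
    uint32_t base = y ? b15 : b14;
    return base + ((ucst15 & 0x7FFFu) << shift);
}

/* Word accesses require the two LSBs of the address to be zero. */
int word_aligned(uint32_t address) { return (address & 3u) == 0; }
```

For example, a bracketed offset of [60] with LDW adds 60 << 2 = 240 bytes to the selected base register, in line with the 60-word case described above.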
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read B14/B15
Written dst
Unit in use .D2
Delay Slots 4
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1/cst5 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Description The LSB of the src1 operand determines whether to search for a leftmost 1 or 0 in src2.
The number of bits to the left of the first 1 or 0 when searching for a 1 or 0, respectively,
is placed in dst.
The following diagram illustrates the operation of LMBD for several cases.
When searching for 0 in src2, LMBD returns 0:
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x
When searching for 1 in src2, LMBD returns 4:
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x
When searching for 0 in src2, LMBD returns 32:
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Execution
if (cond) {
if (src1_0 == 0), lmb0(src2) → dst
if (src1_0 == 1), lmb1(src2) → dst
}
else nop
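A minimal Python sketch of the leftmost-bit search follows. The function name lmbd is hypothetical, and returning 32 when no matching bit exists is an assumption consistent with the all-ones case in the diagram above.

```python
def lmbd(src1, src2):
    """Illustrative model of LMBD: count the bits to the left of the first match."""
    target = src1 & 1                   # LSB of src1 selects a search for 1 or 0
    src2 &= 0xFFFFFFFF
    for count in range(32):             # scan from bit 31 down to bit 0
        if ((src2 >> (31 - count)) & 1) == target:
            return count
    return 32                           # assumption: no matching bit was found
```

For example, lmbd(1, 0x08000000) searches for a 1, finds it at bit 27, and returns 4 because four bits lie to its left.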
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Example LMBD .L1 A1,A2,A3
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 0 0 1 0 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 1 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a maximum operation on signed, packed 16-bit values. For each pair of signed
16-bit values in src1 and src2, MAX2 places the larger value in the corresponding
position in dst.
31             16 15              0
      a_hi             a_lo          ← src1
               MAX2
      b_hi             b_lo          ← src2
       ↓                ↓
31             16 15              0
(a_hi > b_hi) ? a_hi:b_hi   (a_lo > b_lo) ? a_lo:b_lo   ← dst
Execution
if (cond) {
if (lsb16(src1) >= lsb16(src2)), lsb16(src1) → lsb16(dst)
else lsb16(src2) → lsb16(dst);
if (msb16(src1) >= msb16(src2)), msb16(src1) → msb16(dst)
else msb16(src2) → msb16(dst)
}
else nop
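The per-halfword comparison can be modeled in Python as shown below. This is a sketch for illustration; the names max2 and _s16 are hypothetical helpers, not part of any TI tool.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def max2(src1, src2):
    """Illustrative model of MAX2 on two packed signed 16-bit values."""
    lo = max(_s16(src1), _s16(src2))
    hi = max(_s16(src1 >> 16), _s16(src2 >> 16))
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
```

Because the halfwords are signed, 8000h (-32768) loses to 7FFFh (32767), and FFFFh (-1) loses to 0001h.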
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
MAX2 .L1 A2, A8, A9
Example 2
MAX2 .L2X A2, B8, B12
Example 3
MAX2 .S1 A2, A8, A9
Example 4
MAX2 .S2X A2, B8, B12
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 0 0 1 1 1 1 0 s p
3 1 5 5 5 1 1 1
Description Performs a maximum operation on unsigned, packed 8-bit values. For each pair of
unsigned 8-bit values in src1 and src2, MAXU4 places the larger value in the
corresponding position in dst.
31      24 23      16 15       8 7        0
   ua_3       ua_2       ua_1      ua_0      ← src1
                  MAXU4
   ub_3       ub_2       ub_1      ub_0      ← src2
    ↓          ↓          ↓         ↓
31      24 23      16 15       8 7        0
ua_3 > ub_3 ? ua_3:ub_3   ua_2 > ub_2 ? ua_2:ub_2   ua_1 > ub_1 ? ua_1:ub_1   ua_0 > ub_0 ? ua_0:ub_0   ← dst
Execution
if (cond) {
if (ubyte0(src1) >= ubyte0(src2)), ubyte0(src1) → ubyte0(dst)
else ubyte0(src2) → ubyte0(dst);
if (ubyte1(src1) >= ubyte1(src2)), ubyte1(src1) → ubyte1(dst)
else ubyte1(src2) → ubyte1(dst);
if (ubyte2(src1) >= ubyte2(src2)), ubyte2(src1) → ubyte2(dst)
else ubyte2(src2) → ubyte2(dst);
if (ubyte3(src1) >= ubyte3(src2)), ubyte3(src1) → ubyte3(dst)
else ubyte3(src2) → ubyte3(dst)
}
else nop
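A short Python sketch of the four unsigned byte comparisons is given below; the function name maxu4 is a hypothetical name chosen for this illustration.

```python
def maxu4(src1, src2):
    """Illustrative model of MAXU4 on four packed unsigned 8-bit values."""
    dst = 0
    for shift in (0, 8, 16, 24):        # byte 0 through byte 3
        a = (src1 >> shift) & 0xFF
        b = (src2 >> shift) & 0xFF
        dst |= max(a, b) << shift
    return dst
```

Here FFh compares as 255 rather than -1, which is the difference from a signed comparison.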
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
MAXU4 .L1 A2, A8, A9
Example 2
MAXU4 .L2X A2, B8, B12
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 0 0 0 1 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 1 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a minimum operation on signed, packed 16-bit values. For each pair of signed
16-bit values in src1 and src2, MIN2 places the smaller value in the
corresponding position in dst.
31             16 15              0
      a_hi             a_lo          ← src1
               MIN2
      b_hi             b_lo          ← src2
       ↓                ↓
31             16 15              0
(a_hi < b_hi) ? a_hi:b_hi   (a_lo < b_lo) ? a_lo:b_lo   ← dst
Execution
if (cond) {
if (lsb16(src1) <= lsb16(src2)), lsb16(src1) → lsb16(dst)
else lsb16(src2) → lsb16(dst);
if (msb16(src1) <= msb16(src2)), msb16(src1) → msb16(dst)
else msb16(src2) → msb16(dst)
}
else nop
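The signed halfword minimum can be sketched in Python as follows. The names min2 and _s16 are hypothetical helpers used only for this illustration.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def min2(src1, src2):
    """Illustrative model of MIN2 on two packed signed 16-bit values."""
    lo = min(_s16(src1), _s16(src2))
    hi = min(_s16(src1 >> 16), _s16(src2 >> 16))
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
```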
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
MIN2 .L1 A2, A8, A9
Example 2
MIN2 .L2X A2, B8, B12
Example 3
MIN2 .S1 A2, A8, A9
Example 4
MIN2 .S2X A2, B8, B12
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 1 0 0 0 1 1 0 s p
3 1 5 5 5 1 1 1
Description Performs a minimum operation on unsigned, packed 8-bit values. For each pair of
unsigned 8-bit values in src1 and src2, MINU4 places the smaller value in the
corresponding position in dst.
31      24 23      16 15       8 7        0
   ua_3       ua_2       ua_1      ua_0      ← src1
                  MINU4
   ub_3       ub_2       ub_1      ub_0      ← src2
    ↓          ↓          ↓         ↓
31      24 23      16 15       8 7        0
ua_3 < ub_3 ? ua_3:ub_3   ua_2 < ub_2 ? ua_2:ub_2   ua_1 < ub_1 ? ua_1:ub_1   ua_0 < ub_0 ? ua_0:ub_0   ← dst
Execution
if (cond) {
if (ubyte0(src1) <= ubyte0(src2)), ubyte0(src1) → ubyte0(dst)
else ubyte0(src2) → ubyte0(dst);
if (ubyte1(src1) <= ubyte1(src2)), ubyte1(src1) → ubyte1(dst)
else ubyte1(src2) → ubyte1(dst);
if (ubyte2(src1) <= ubyte2(src2)), ubyte2(src1) → ubyte2(dst)
else ubyte2(src2) → ubyte2(dst);
if (ubyte3(src1) <= ubyte3(src2)), ubyte3(src1) → ubyte3(dst)
else ubyte3(src2) → ubyte3(dst)
}
else nop
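An equivalent Python sketch for the unsigned byte minimum is shown below; minu4 is a hypothetical name for this illustration.

```python
def minu4(src1, src2):
    """Illustrative model of MINU4 on four packed unsigned 8-bit values."""
    dst = 0
    for shift in (0, 8, 16, 24):        # byte 0 through byte 3
        a = (src1 >> shift) & 0xFF
        b = (src2 >> shift) & 0xFF
        dst |= min(a, b) << shift
    return dst
```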
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
MINU4 .L1 A2, A8, A9
Example 2
MINU4 .L2 B2, B8, B12
Opcode
31 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
creg z dst src2 src1 x op 0 0 0 0 0 s p
3 1 5 5 5 1 5 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are signed by default.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Examples Example 1
MPY .M1 A1,A2,A3
Example 2
MPY .M1 13,A1,A2
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 1 0 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst.
NOTE:
1. If one source is SNaN or QNaN, the result is a signed NaN_out. If
either source is SNaN, the INVAL bit is set also. The sign of
NaN_out is the exclusive-OR of the input signs.
2. Signed infinity multiplied by signed infinity or a normalized number
(other than signed 0) returns signed infinity. Signed infinity multiplied
by signed 0 returns a signed NaN_out and sets the INVAL bit.
3. If one or both sources are signed 0, the result is signed 0 unless the
other source is NaN or signed infinity, in which case the result is
signed NaN_out.
4. A denormalized source is treated as signed 0 and the DENn bit is
set. The INEX bit is set except when the other source is signed
infinity, signed NaN, or signed 0. Therefore, a signed infinity
multiplied by a denormalized number gives a signed NaN_out and
sets the INVAL bit.
5. If rounding is performed, the INEX bit is set.
Execution
Pipeline
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 9
Register   Before instruction             After instruction
A1:A0      4021 3333h 3333 3333h  8.6     4021 3333h 3333 3333h
A3:A2      C004 0000h 0000 0000h  -2.5    C004 0000h 0000 0000h
A5:A4      xxxx xxxxh xxxx xxxxh          C035 8000h 0000 0000h  -21.5
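The register-pair values listed above can be checked on a host machine with a short Python sketch. It uses the host's IEEE 754 double arithmetic, so it does not model the DSP rounding modes or denormal handling described in the notes; the helper names dp_from_pair and pair_from_dp are hypothetical.

```python
import struct

def dp_from_pair(hi_word, lo_word):
    """Decode an odd:even register pair as an IEEE 754 double (hypothetical helper)."""
    return struct.unpack(">d", struct.pack(">II", hi_word, lo_word))[0]

def pair_from_dp(value):
    """Encode a double as (odd register, even register) 32-bit words."""
    return struct.unpack(">II", struct.pack(">d", value))

a = dp_from_pair(0x40213333, 0x33333333)    # A1:A0
b = dp_from_pair(0xC0040000, 0x00000000)    # A3:A2
product = a * b                             # host multiply, round to nearest
```

Encoding product with pair_from_dp gives the words C035 8000h and 0000 0000h, matching the A5:A4 result.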
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 0 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are signed by default.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a 16-bit by 32-bit multiply. The upper half of src1 is used as a signed 16-bit
input. The value in src2 is treated as a signed 32-bit value. The result is written into the
lower 48 bits of a 64-bit register pair, dst_o:dst_e, and sign extended to 64 bits.
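The 16-bit by 32-bit product and its sign extension can be sketched in Python as follows. The names mpyhi, _s16, and _s32 are hypothetical, and the operand values mentioned afterward are assumptions chosen for illustration.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def _s32(x):
    """Interpret the low 32 bits of x as a signed value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def mpyhi(src1, src2):
    """Illustrative model of MPYHI; returns (dst_o, dst_e)."""
    product = _s16(src1 >> 16) * _s32(src2)     # signed result fits in 48 bits
    value = product & 0xFFFFFFFFFFFFFFFF        # two's complement, sign-extended to 64 bits
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF
```

With the assumed inputs src1 = 6A32 1193h and src2 = B174 6CA4h, the model returns FFFF DF6Ah for dst_o and DDB9 2008h for dst_e, which is -35,824,897,286,136 as a signed 64-bit value.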
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
MPYHI .M1 A5,A6,A9:A8
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 FFFF DF6Ah DDB9 2008h
-35,824,897,286,136
Example 2
MPYHI .M2 B2,B5,B9:B8
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 0000 026Ah DB88 1FECh
2,657,972,920,300
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 0 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a 16-bit by 32-bit multiply. The upper half of src1 is treated as a signed 16-bit
input. The value in src2 is treated as a signed 32-bit value. The product is then rounded
to a 32-bit result by adding the value 2^14 (4000h) and then this sum is right shifted by 15. The
lower 32 bits of the result are written into dst.
31             16 15              0
      a_hi             a_lo          ← src1
        ×
              MPYHIR
31                                0
            b_hi:b_lo                ← src2
        =
31                                0
((a_hi × b_hi:b_lo) + 4000h) >> 15   ← dst
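The rounding step can be sketched in Python as shown below. The names mpyhir, _s16, and _s32 are hypothetical; the sketch relies on Python's arithmetic right shift for negative integers.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def _s32(x):
    """Interpret the low 32 bits of x as a signed value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def mpyhir(src1, src2):
    """Illustrative model of MPYHIR: ((a_hi * src2) + 4000h) >> 15, low 32 bits."""
    product = _s16(src1 >> 16) * _s32(src2)
    rounded = (product + (1 << 14)) >> 15       # add half an LSB, then shift
    return rounded & 0xFFFFFFFF                 # keep the lower 32 bits
```

Adding 4000h before the shift rounds the product to the nearest multiple of 2^15 instead of truncating it: a product of 4000h becomes 1, while a product of 3FFFh becomes 0.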
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 0 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are signed by default.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 1 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are unsigned by default.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The signed operand src1 is multiplied by the unsigned operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 1 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The signed operand src1 is multiplied by the unsigned operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 1 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are unsigned by default.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 0 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The unsigned operand src1 is multiplied by the signed operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 0 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The unsigned operand src1 is multiplied by the signed operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
creg z dst src2 src1 x op 0 0 0 0 0 s p
3 1 5 5 5 1 5 1 1
Description The src1 operand is multiplied by the src2 operand. The lower 32 bits of the result are
placed in dst.
Execution
Pipeline
Pipeline Stage   E1          E2          E3          E4          E5   E6   E7   E8   E9
Read             src1, src2  src1, src2  src1, src2  src1, src2
Written                                                                              dst
Unit in use      .M          .M          .M          .M
Delay Slots 8
Opcode
31 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
creg z dst src2 src1 x op 0 0 0 0 0 s p
3 1 5 5 5 1 5 1 1
Description The src1 operand is multiplied by the src2 operand. The 64-bit result is placed in the dst
register pair.
Execution
Pipeline
Pipeline Stage   E1          E2          E3          E4          E5     E6     E7     E8     E9     E10
Read             src1, src2  src1, src2  src1, src2  src1, src2
Written                                                                                      dst_l  dst_h
Unit in use      .M          .M          .M          .M
A5:A4 xxxx xxxxh xxxx xxxxh A5:A4 0000 0381h CBCA 6558h
3,856,004,703,576
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The MPYIH pseudo-operation performs a 16-bit by 32-bit multiply. The upper half of src1
is used as a signed 16-bit input. The value in src2 is treated as a signed 32-bit value.
The result is written into the lower 48 bits of a 64-bit register pair, dst_o:dst_e, and sign
extended to 64 bits. The assembler uses the MPYHI (.unit) src1, src2, dst instruction to
perform this operation (see MPYHI).
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 0 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The MPYIHR pseudo-operation performs a 16-bit by 32-bit multiply. The upper half of
src1 is treated as a signed 16-bit input. The value in src2 is treated as a signed 32-bit
value. The product is then rounded to a 32-bit result by adding the value 2^14 (4000h) and then
this sum is right shifted by 15. The lower 32 bits of the result are written into dst. The
assembler uses the MPYHIR (.unit) src1, src2, dst instruction to perform this operation
(see MPYHIR).
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The MPYIL pseudo-operation performs a 16-bit by 32-bit multiply. The lower half of src1
is used as a signed 16-bit input. The value in src2 is treated as a signed 32-bit value.
The result is written into the lower 48 bits of a 64-bit register pair, dst_o:dst_e, and sign
extended to 64 bits. The assembler uses the MPYLI (.unit) src1, src2, dst instruction to
perform this operation (see MPYLI).
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 1 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The MPYILR pseudo-operation performs a 16-bit by 32-bit multiply. The lower half of
src1 is used as a signed 16-bit input. The value in src2 is treated as a signed 32-bit
value. The product is then rounded to a 32-bit result by adding the value 2^14 (4000h) and then
this sum is right shifted by 15. The lower 32 bits of the result are written into dst. The
assembler uses the MPYLIR (.unit) src1, src2, dst instruction to perform this operation
(see MPYLIR).
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 0 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are signed by default.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 1 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are unsigned by default.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a 16-bit by 32-bit multiply. The lower half of src1 is used as a signed 16-bit
input. The value in src2 is treated as a signed 32-bit value. The result is written into the
lower 48 bits of a 64-bit register pair, dst_o:dst_e, and sign extended to 64 bits.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
MPYLI .M1 A5,A6,A9:A8
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 FFFF FA9Bh A111 462Ch
-5,928,647,571,924
Example 2
MPYLI .M2 B2,B5,B9:B8
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 0000 06FBh E9FA 7E81h
7,679,032,065,665
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 1 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a 16-bit by 32-bit multiply. The lower half of src1 is treated as a signed 16-bit
input. The value in src2 is treated as a signed 32-bit value. The product is then rounded
to a 32-bit result by adding the value 2^14 (4000h) and then this sum is right shifted by 15. The
lower 32 bits of the result are written into dst.
31             16 15              0
      a_hi             a_lo          ← src1
                         ×
              MPYLIR
31                                0
            b_hi:b_lo                ← src2
                         =
31                                0
((a_lo × b_hi:b_lo) + 4000h) >> 15   ← dst
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 1 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The signed operand src1 is multiplied by the unsigned operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 0 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The unsigned operand src1 is multiplied by the signed operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 0 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst.
NOTE:
1. If one source is SNaN or QNaN, the result is a signed NaN_out. If
either source is SNaN, the INVAL bit is set also. The sign of
NaN_out is the exclusive-OR of the input signs.
2. Signed infinity multiplied by signed infinity or a normalized number
(other than signed 0) returns signed infinity. Signed infinity multiplied
by signed 0 returns a signed NaN_out and sets the INVAL bit.
3. If one or both sources are signed 0, the result is signed 0 unless the
other source is NaN or signed infinity, in which case the result is
signed NaN_out.
4. A denormalized source is treated as signed 0 and the DENn bit is
set. The INEX bit is set except when the other source is signed
infinity, signed NaN, or signed 0. Therefore, a signed infinity
multiplied by a denormalized number gives a signed NaN_out and
sets the INVAL bit.
5. If rounding is performed, the INEX bit is set.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The single-precision src1 operand is multiplied by the double-precision src2 operand to
produce a double-precision result. The result is placed in dst.
NOTE:
1. If one source is SNaN or QNaN, the result is a signed NaN_out. If
either source is SNaN, the INVAL bit is set also. The sign of
NaN_out is the exclusive-OR of the input signs.
2. Signed infinity multiplied by signed infinity or a normalized number
(other than signed 0) returns signed infinity. Signed infinity multiplied
by signed 0 returns a signed NaN_out and sets the INVAL bit.
3. If one or both sources are signed 0, the result is signed 0 unless the
other source is NaN or signed infinity, in which case the result is
signed NaN_out.
4. A denormalized source is treated as signed 0 and the DENn bit is
set. The INEX bit is set except when the other source is signed
infinity, signed NaN, or signed 0. Therefore, a signed infinity
multiplied by a denormalized number gives a signed NaN_out and
sets the INVAL bit.
5. If rounding is performed, the INEX bit is set.
Execution
Pipeline
Pipeline Stage   E1            E2            E3     E4     E5     E6     E7
Read             src1, src2_l  src1, src2_h
Written                                                           dst_l  dst_h
Unit in use      .M            .M
The low half of the result is written out one cycle earlier than the high half. If dst is used
as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP, MPYDP, MPYSPDP,
MPYSP2DP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 6
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 1 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is multiplied by the src2 operand to produce a double-precision result.
The result is placed in dst.
NOTE:
1. If one source is SNaN or QNaN, the result is a signed NaN_out. If
either source is SNaN, the INVAL bit is set also. The sign of
NaN_out is the exclusive-OR of the input signs.
2. Signed infinity multiplied by signed infinity or a normalized number
(other than signed 0) returns signed infinity. Signed infinity multiplied
by signed 0 returns a signed NaN_out and sets the INVAL bit.
3. If one or both sources are signed 0, the result is signed 0 unless the
other source is NaN or signed infinity, in which case the result is
signed NaN_out.
4. A denormalized source is treated as signed 0 and the DENn bit is
set. The INEX bit is set except when the other source is signed
infinity, signed NaN, or signed 0. Therefore, a signed infinity
multiplied by a denormalized number gives a signed NaN_out and
sets the INVAL bit.
5. If rounding is performed, the INEX bit is set.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5
Read src1, src2
Written dst_l dst_h
Unit in use .M
The low half of the result is written out one cycle earlier than the high half. If dst is used
as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP, MPYDP, MPYSPDP,
MPYSP2DP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 4
Opcode
31 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
creg z dst src2 src1 x op 0 0 0 0 0 s p
3 1 5 5 5 1 5 1 1
Description The signed operand src1 is multiplied by the unsigned operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
MPYSU4 Multiply Signed × Unsigned, Four 8-Bit Pairs for Four 16-Bit Results
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 1 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Returns the product between four sets of packed 8-bit values producing four signed
16-bit results. The four signed 16-bit results are packed into a 64-bit register pair,
dst_o:dst_e. The values in src1 are treated as signed 8-bit packed quantities; whereas,
the values in src2 are treated as unsigned 8-bit packed data.
For each pair of 8-bit quantities in src1 and src2, the signed 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2:
• The product of src1 byte 0 and src2 byte 0 is written to the lower half of dst_e.
• The product of src1 byte 1 and src2 byte 1 is written to the upper half of dst_e.
• The product of src1 byte 2 and src2 byte 2 is written to the lower half of dst_o.
• The product of src1 byte 3 and src2 byte 3 is written to the upper half of dst_o.
31      24 23      16 15       8 7        0
   sa_3       sa_2       sa_1      sa_0      ← src1
    ×          ×          ×         ×
                 MPYSU4
   ub_3       ub_2       ub_1      ub_0      ← src2
    =          =          =         =
63          48 47          32 31          16 15           0
sa_3 × ub_3    sa_2 × ub_2    sa_1 × ub_1    sa_0 × ub_0   ← dst_o:dst_e
Execution
if (cond) {
(sbyte0(src1) × ubyte0(src2)) → lsb16(dst_e);
(sbyte1(src1) × ubyte1(src2)) → msb16(dst_e);
(sbyte2(src1) × ubyte2(src2)) → lsb16(dst_o);
(sbyte3(src1) × ubyte3(src2)) → msb16(dst_o)
}
else nop
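The four byte products and their packing can be modeled with the Python sketch below. The function name mpysu4 is hypothetical, and the operand values mentioned afterward are assumptions chosen for illustration.

```python
def mpysu4(src1, src2):
    """Illustrative model of MPYSU4; returns (dst_o, dst_e)."""
    halves = []
    for shift in (0, 8, 16, 24):                # byte 0 through byte 3
        a = (src1 >> shift) & 0xFF
        a = a - 0x100 if a & 0x80 else a        # src1 bytes are signed
        b = (src2 >> shift) & 0xFF              # src2 bytes are unsigned
        halves.append((a * b) & 0xFFFF)         # signed 16-bit product
    dst_e = (halves[1] << 16) | halves[0]       # bytes 1 and 0
    dst_o = (halves[3] << 16) | halves[2]       # bytes 3 and 2
    return dst_o, dst_e
```

With the assumed inputs src1 = 6A32 1193h and src2 = B174 6CA4h, the model returns 494A 16A8h and 072C BA2Ch. The low halfword BA2Ch is -17876, the product of the signed byte 93h (-109) and the unsigned byte A4h (164).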
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
MPYSU4 .M1 A5,A6,A9:A8
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 494A 16A8h 072C BA2Ch
18762 5800 1836 -17876
signed
Example 2
MPYSU4 .M2 B5,B6,B9:B8
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 2FFD FCA4h 00A0 0440h
12285 -860 160 1088
signed
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 1 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The src1 operand is multiplied by the src2 operand. The result is placed in dst. The
source operands are unsigned by default.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
MPYU4 Multiply Unsigned × Unsigned, Four 8-Bit Pairs for Four 16-Bit Results
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 1 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Returns the product between four sets of packed 8-bit values producing four unsigned
16-bit results that are packed into a 64-bit register pair, dst_o:dst_e. The values in both
src1 and src2 are treated as unsigned 8-bit packed data.
For each pair of 8-bit quantities in src1 and src2, the unsigned 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2:
• The product of src1 byte 0 and src2 byte 0 is written to the lower half of dst_e.
• The product of src1 byte 1 and src2 byte 1 is written to the upper half of dst_e.
• The product of src1 byte 2 and src2 byte 2 is written to the lower half of dst_o.
• The product of src1 byte 3 and src2 byte 3 is written to the upper half of dst_o.
31      24 23      16 15       8 7        0
   ua_3       ua_2       ua_1      ua_0      ← src1
    ×          ×          ×         ×
                  MPYU4
   ub_3       ub_2       ub_1      ub_0      ← src2
    =          =          =         =
63          48 47          32 31          16 15           0
ua_3 × ub_3    ua_2 × ub_2    ua_1 × ub_1    ua_0 × ub_0   ← dst_o:dst_e
Execution
if (cond) {
(ubyte0(src1) × ubyte0(src2)) → lsb16(dst_e);
(ubyte1(src1) × ubyte1(src2)) → msb16(dst_e);
(ubyte2(src1) × ubyte2(src2)) → lsb16(dst_o);
(ubyte3(src1) × ubyte3(src2)) → msb16(dst_o)
}
else nop
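A Python sketch of the unsigned variant follows; mpyu4 is a hypothetical name used only for this illustration.

```python
def mpyu4(src1, src2):
    """Illustrative model of MPYU4; returns (dst_o, dst_e)."""
    halves = []
    for shift in (0, 8, 16, 24):                # byte 0 through byte 3
        a = (src1 >> shift) & 0xFF              # both operands are unsigned
        b = (src2 >> shift) & 0xFF
        halves.append(a * b)                    # at most 255 * 255, fits in 16 bits
    dst_e = (halves[1] << 16) | halves[0]
    dst_o = (halves[3] << 16) | halves[2]
    return dst_o, dst_e
```

The largest possible product, FFh × FFh, is FE01h (65025), so no unsigned 16-bit result can overflow.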
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
MPYU4 .M1 A5,A6,A9:A8
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 47E8 16A8h 212C 6231h
18408 5800 8492 25137
unsigned
Example 2
MPYU4 .M2 B2,B5,B9:B8
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 2E77 4D44h 00A0 21BCh
11895 19780 160 8636
unsigned
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 1 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The unsigned operand src1 is multiplied by the signed operand src2. The result is placed
in dst. The S is needed in the mnemonic to specify a signed operand when both signed
and unsigned operands are used.
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
MPYUS4 Multiply Unsigned × Signed, Four 8-Bit Pairs for Four 16-Bit Results
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 1 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The MPYUS4 pseudo-operation returns the product between four sets of packed 8-bit
values, producing four signed 16-bit results. The four signed 16-bit results are packed
into a 64-bit register pair, dst_o:dst_e. The values in src1 are treated as signed 8-bit
packed quantities; whereas, the values in src2 are treated as unsigned 8-bit packed
data. The assembler uses the MPYSU4 (.unit) src1, src2, dst instruction to perform this
operation (see MPYSU4).
For each pair of 8-bit quantities in src1 and src2, the signed 8-bit value from src1 is
multiplied with the unsigned 8-bit value from src2:
• The product of src1 byte 0 and src2 byte 0 is written to the lower half of dst_e.
• The product of src1 byte 1 and src2 byte 1 is written to the upper half of dst_e.
• The product of src1 byte 2 and src2 byte 2 is written to the lower half of dst_o.
• The product of src1 byte 3 and src2 byte 3 is written to the upper half of dst_o.
Execution
if (cond) {
(ubyte0(src2) × sbyte0(src1)) → lsb16(dst_e);
(ubyte1(src2) × sbyte1(src1)) → msb16(dst_e);
(ubyte2(src2) × sbyte2(src1)) → lsb16(dst_o);
(ubyte3(src2) × sbyte3(src1)) → msb16(dst_o)
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 0 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs two 16-bit by 16-bit multiplications between two pairs of signed, packed 16-bit
values. The values in src1 and src2 are treated as signed, packed 16-bit quantities. The
two 32-bit results are written into a 64-bit register pair.
The product of the lower halfwords of src1 and src2 is written to the even destination
register, dst_e. The product of the upper halfwords of src1 and src2 is written to the odd
destination register, dst_o.
This instruction helps reduce the number of instructions required to perform two 16-bit by
16-bit multiplies on both the lower and upper halves of two registers.
31             16 15              0
      a_hi             a_lo          ← src1
       ×                ×
               MPY2
      b_hi             b_lo          ← src2
       =                =
63             32 31              0
  a_hi × b_hi      a_lo × b_lo       ← dst_o:dst_e
Execution
if (cond) {
lsb16(src1) × lsb16(src2) → dst_e;
msb16(src1) × msb16(src2) → dst_o
}
else nop
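The two halfword multiplies can be sketched in Python as follows. The names mpy2 and _s16 are hypothetical, and the operand values mentioned afterward are assumptions chosen for illustration.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def mpy2(src1, src2):
    """Illustrative model of MPY2; returns (dst_o, dst_e)."""
    dst_e = (_s16(src1) * _s16(src2)) & 0xFFFFFFFF              # lower halfwords
    dst_o = (_s16(src1 >> 16) * _s16(src2 >> 16)) & 0xFFFFFFFF  # upper halfwords
    return dst_o, dst_e
```

With the assumed inputs src1 = 6A32 1193h and src2 = B174 6CA4h, the model returns DF6A B0A8h (-546,656,088) for dst_o and 0775 462Ch (125,126,188) for dst_e.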
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
MPY2 .M1 A5,A6, A9:A8
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 DF6A B0A8h 0775 462Ch
-546,656,088 125,126,188
Example 2
MPY2 .M2 B2, B5, B9:B8
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 026A D5CCh 1091 7E81h
40,555,980 277,970,561
MPY2IR Multiply Two 16-Bit × 32-Bit, Shifted by 15 to Produce a Rounded 32-Bit Result
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 0 1 1 1 1 1 1 0 0 s p
5 5 5 1 1 1
Description Performs two 16-bit by 32-bit multiplies. The upper and lower halves of src1 are treated
as 16-bit signed inputs. The value in src2 is treated as a 32-bit signed value. The
products are then rounded to a 32-bit result by adding the value 2^14 (4000h) and then these sums
are right shifted by 15. The lower 32 bits of the two results are written into dst_o:dst_e.
If either result saturates, the M1 or M2 bit in SSR and the SAT bit in CSR are written one
cycle after the results are written to dst_o:dst_e.
This instruction executes unconditionally and cannot be predicated.
NOTE: In the overflow case, where the 16-bit input to the MPYIR operation is
8000h and the 32-bit input is 8000 0000h, the saturation value
7FFF FFFFh is written into the corresponding 32-bit dst register.
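The rounding and the single saturation case can be sketched in Python as shown below. The names mpy2ir, _round_shift_sat, _s16, and _s32 are hypothetical. Two assumptions are made: the upper-half product goes to dst_o and the lower-half product to dst_e, and the SSR and CSR status bits are not modeled.

```python
def _s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def _s32(x):
    """Interpret the low 32 bits of x as a signed value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def _round_shift_sat(a16, b32):
    """Round ((a16 * b32) + 4000h) >> 15 and clamp the positive overflow case."""
    result = (a16 * b32 + (1 << 14)) >> 15
    return min(result, 0x7FFFFFFF) & 0xFFFFFFFF

def mpy2ir(src1, src2):
    """Illustrative model of MPY2IR; returns (dst_o, dst_e)."""
    b = _s32(src2)
    dst_o = _round_shift_sat(_s16(src1 >> 16), b)   # assumption: upper half to dst_o
    dst_e = _round_shift_sat(_s16(src1), b)         # assumption: lower half to dst_e
    return dst_o, dst_e
```

Only 8000h × 8000 0000h can overflow: the product is 2^46, and the rounded, shifted value 2^31 does not fit in a signed 32-bit register, so 7FFF FFFFh is returned.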
Execution
Delay Slots 3
Examples Example 1
MPY2IR .M2 B2,B5,B9:B8
Example 2
MPY2IR .M1X A2,B5,A9:A8
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 0 0 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 and src2 are signed 32-bit values. Only the
lower 32 bits of the 64-bit result are written to dst.
Execution
Delay Slots 3
MPY32 Multiply Signed 32-Bit × Signed 32-Bit Into Signed 64-Bit Result
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 0 0 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 and src2 are signed 32-bit values. The signed
64-bit result is written to the register pair specified by dst.
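A minimal Python sketch of the signed 32-bit by 32-bit multiply follows. The names mpy32 and _s32 are hypothetical, and placing the upper word in the odd register of the pair is an assumption based on the dst_o:dst_e convention used elsewhere in this chapter.

```python
def _s32(x):
    """Interpret the low 32 bits of x as a signed value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def mpy32(src1, src2):
    """Illustrative model of MPY32 with a 64-bit result; returns (dst_o, dst_e)."""
    product = _s32(src1) * _s32(src2)
    value = product & 0xFFFFFFFFFFFFFFFF        # two's complement 64-bit result
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF
```

The most negative operand squared, 8000 0000h × 8000 0000h, gives 2^62, which still fits in the signed 64-bit result as 4000 0000h:0000 0000h.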
Execution
Delay Slots 3
MPY32SU Multiply Signed 32-Bit × Unsigned 32-Bit Into Signed 64-Bit Result
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 1 0 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 is a signed 32-bit value and src2 is an unsigned
32-bit value. The signed 64-bit result is written to the register pair specified by dst.
Execution
Delay Slots 3
MPY32U Multiply Unsigned 32-Bit × Unsigned 32-Bit Into Unsigned 64-Bit Result
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 0 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 and src2 are unsigned 32-bit values. The
unsigned 64-bit result is written to the register pair specified by dst.
Execution
Delay Slots 3
MPY32US Multiply Unsigned 32-Bit × Signed 32-Bit Into Signed 64-Bit Result
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 0 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 is an unsigned 32-bit value and src2 is a signed
32-bit value. The signed 64-bit result is written to the register pair specified by dst.
Execution
Delay Slots 3
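The 32-bit multiply variants above differ only in how each operand is interpreted and how much of the product is kept. A minimal C sketch of that arithmetic follows; the function names are hypothetical and simply mirror the mnemonics, and the 3 delay slots are not modelled.

```c
#include <stdint.h>

/* Signed x signed, only the lower 32 bits of the 64-bit product are kept. */
uint32_t mpy32_low(int32_t src1, int32_t src2)
{
    return (uint32_t)((int64_t)src1 * src2);
}

/* Signed x signed, full signed 64-bit result (register pair). */
int64_t mpy32_full(int32_t src1, int32_t src2)
{
    return (int64_t)src1 * src2;
}

/* Signed x unsigned, signed 64-bit result. */
int64_t mpy32su(int32_t src1, uint32_t src2)
{
    return (int64_t)src1 * (int64_t)src2;
}

/* Unsigned x unsigned, unsigned 64-bit result. */
uint64_t mpy32u(uint32_t src1, uint32_t src2)
{
    return (uint64_t)src1 * src2;
}

/* Unsigned x signed, signed 64-bit result. */
int64_t mpy32us(uint32_t src1, int32_t src2)
{
    return (int64_t)src1 * (int64_t)src2;
}
```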
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 s p
3 1 5 5 1 1
31 29 28 27 23 22 18 17 16 15 14 13 12 11 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x op 1 1 0 s p
3 1 5 5 1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 0 0 0 1 1 0 1 0 0 0 s p
3 1 5 5 1 1 1
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 s p
3 1 5 5 1 1
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 1 0 0 0 1 1 1 1 0 0 s p
3 1 5 5 1 1 1
Description The MV pseudo-operation moves a value from one register to another. The assembler
will either use the ADD (.unit) 0, src2, dst instruction (see ADD) or the OR (.unit) 0, src2,
dst instruction (see OR) to perform this operation.
Execution
Delay Slots 0
Opcode
Operands when moving from the control file to the register file:
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst crlo crhi x 0 0 1 1 1 1 1 0 0 0 1 p
3 1 5 5 5 1 1
Description The contents of the control file specified by the crhi and crlo fields are moved to the
register file specified by the dst field. Valid assembler values for crlo and crhi are shown
in Table 3-27.
Operands when moving from the register file to the control file:
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z crlo src2 crhi x 0 0 1 1 1 0 1 0 0 0 1 p
3 1 5 5 5 1 1
Description The contents of the register file specified by the src2 field are moved to the control file
specified by the crhi and crlo fields. Valid assembler values for crlo and crhi are shown in
Table 3-27.
Execution
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .S2
Delay Slots 0
Example MVC .S2 B1,AMR
NOTE: The six MSBs of the AMR are reserved and therefore are not written to.
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 1 1 0 1 0 x 0 0 0 0 1 1 1 1 0 0 s p
3 1 5 5 1 1 1
Description Moves data from the src2 register to the dst register over 4 cycles. This is done using
the multiplier path.
MVD .M2x A0, B0 ;
NOP ;
NOP ;
NOP ; B0 = A0
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2
Written dst
Unit in use .M
Delay Slots 3
Example MVD .M2X A5,B8
Opcode .S unit
31 29 28 27 23 22 7 6 5 4 3 2 1 0
creg z dst cst16 0 1 0 1 0 s p
3 1 5 16 1 1
Opcode .L unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst cst5 0 0 1 0 1 x 0 0 1 1 0 1 0 1 1 0 s p
3 1 5 5 1 1 1
Opcode .D unit
31 29 28 27 23 22 21 20 19 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst 0 0 0 0 0 cst5 0 0 0 0 0 0 1 0 0 0 0 s p
3 1 5 5 1 1
Description The constant cst is sign extended and placed in dst. The .S unit form allows for a 16-bit
signed constant.
Since many nonaddress constants fall into a 5-bit signed constant range, this allows the
flexibility to schedule the MVK instruction on the .L or .D units. In the .D unit form, the
constant is in the position normally used by src1, as for address math.
In most cases, the C6000 assembler and linker issue a warning or an error when a
constant is outside the range supported by the instruction. In the case of MVK .S, a
warning is issued whenever the constant is outside the signed 16-bit range, -32768 to
32767 (or FFFF 8000h to 0000 7FFFh).
For example:
MVK .S1 0x00008000X, A0
Execution
Pipeline
Pipeline Stage E1
Read
Written dst
Unit in use .L, .S, or .D
Delay Slots 0
Examples Example 1
MVK .L2 -5,B8
Example 2
MVK .D2 14,B8
Opcode
31 29 28 27 23 22 7 6 5 4 3 2 1 0
creg z dst cst16 h 1 0 1 0 s p
3 1 5 16 1 1 1
Description The 16-bit constant, cst16, is loaded into the upper 16 bits of dst. The 16 LSBs of dst
are unchanged. For the MVKH instruction, the assembler encodes the 16 MSBs of a
32-bit constant into the cst16 field of the opcode. For the MVKLH instruction, the
assembler encodes the 16 LSBs of a constant into the cst16 field of the opcode.
NOTE: Use the MVK instruction (see MVK) to load 16-bit constants. The
assembler generates a warning for any constant over 16 bits. To load
32-bit constants, such as 1234 5678h, use the following pair of
instructions:
MVKL 0x12345678
MVKH 0x12345678
Pipeline
Pipeline Stage E1
Read
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
MVKH .S1 0A329123h,A1
Example 2
MVKLH .S1 7A8h,A1
Opcode
31 29 28 27 23 22 7 6 5 4 3 2 1 0
creg z dst cst16 0 1 0 1 0 s p
3 1 5 16 1 1
Description The 16-bit constant, cst16, is sign extended and placed in dst.
The MVKL instruction is equivalent to the MVK instruction (see MVK), except that the
MVKL instruction disables the constant range checking normally performed by the
assembler/linker. This allows the MVKL instruction to be paired with the MVKH
instruction (see MVKH) to generate 32-bit constants.
To load 32-bit constants, such as 1234 ABCDh, use the following pair of instructions:
MVKL .S1 0x0ABCD, A4
MVKLH .S1 0x1234, A4
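How the constant-load instructions combine into a 32-bit value can be sketched in C. The helpers below are hypothetical models of the register effect only (they take the current dst value and the assembler-level constant), not of the opcode encoding.

```c
#include <stdint.h>

/* MVKL: the 16 LSBs of the constant, sign extended to 32 bits. */
uint32_t mvkl(uint32_t cst)
{
    uint32_t lo = cst & 0xFFFFu;
    return (lo & 0x8000u) ? (lo | 0xFFFF0000u) : lo;
}

/* MVKH: the 16 MSBs of the constant replace the upper half of dst. */
uint32_t mvkh(uint32_t dst, uint32_t cst)
{
    return (cst & 0xFFFF0000u) | (dst & 0xFFFFu);
}

/* MVKLH: the 16 LSBs of the constant replace the upper half of dst. */
uint32_t mvklh(uint32_t dst, uint32_t cst)
{
    return ((cst & 0xFFFFu) << 16) | (dst & 0xFFFFu);
}
```

Note that mvkl(0xABCD) alone yields FFFF ABCDh because of the sign extension; the following MVKH or MVKLH overwrites the upper halfword, which is why the pair produces the intended 32-bit constant.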
Execution
Pipeline
Pipeline Stage E1
Read
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
MVKL .S1 5678h,A8
Example 2
MVKL .S1 0C678h,A8
NEG Negate
Opcode .S unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 0 1 0 1 1 0 1 0 0 0 s p
3 1 5 5 1 1 1
Opcode .L unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x op 1 1 0 s p
3 1 5 5 1 7 1 1
Description The NEG pseudo-operation negates src2 and places the result in dst. The assembler
uses the SUB (.unit) 0, src2, dst instruction to perform this operation (see SUB).
Execution
Delay Slots 0
NOP No Operation
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 src 0 0 0 0 0 0 0 0 0 0 0 0 p
4 1
Description src is encoded as count - 1. For src + 1 cycles, no operation is performed. The maximum
value for count is 9. NOP with no operand is treated like NOP 1 with src encoded as
0000.
A multicycle NOP will not finish if a branch is completed first. For example, if a branch is
initiated on cycle n and a NOP 5 instruction is initiated on cycle n + 3, the branch is
complete on cycle n + 6 and the NOP is executed only from cycle n + 3 to cycle n + 5. A
single-cycle NOP in parallel with other instructions does not affect operation.
A multicycle NOP instruction cannot be paired with any other multicycle NOP instruction
in the same execute packet. Instructions that generate a multicycle NOP are: ADDKPC,
BNOP, CALLP, and IDLE.
Delay Slots 0
Examples Example 1
NOP
MVK .S1 125h,A1
Example 2
MVK .S1 1,A1
MVKLH .S1 0,A1
NOP 5
ADD .L1 A1,A2,A1
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x op 1 1 0 s p
3 1 5 5 1 7 1 1
Description The number of redundant sign bits of src2 is placed in dst. Several examples are shown
in the following diagram.
In this case, NORM returns 0:
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x
In this case, NORM returns 3:
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x
In this case, NORM returns 30:
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
In this case, NORM returns 31:
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
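Counting redundant sign bits for the 32-bit case can be sketched in C as follows. norm32 is a hypothetical helper name; the 40-bit register-pair form of NORM is not covered by this sketch.

```c
#include <stdint.h>

/* Number of bits directly below bit 31 that repeat the sign bit. */
int norm32(uint32_t src2)
{
    uint32_t sign = src2 >> 31;
    int count = 0;

    for (int bit = 30; bit >= 0; bit--) {
        if (((src2 >> bit) & 1u) != sign)
            break;              /* first bit that differs from the sign */
        count++;
    }
    return count;
}
```

The four bit patterns in the diagram correspond to return values of 0, 3, 30, and 31.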
Execution
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
NORM .L1 A1,A2
Example 2
NORM .L1 A1,A2
Example 3
NORM .L1 A1:A0,A3
Opcode .L unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 1 1 1 1 1 x 1 1 0 1 1 1 0 1 1 0 s p
3 1 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 1 1 1 1 1 x 0 0 1 0 1 0 1 0 0 0 s p
3 1 5 5 1 1 1
Opcode .D unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 1 1 1 1 1 x 1 0 1 1 1 1 1 1 0 0 s p
3 1 5 5 1 1 1
Description The NOT pseudo-operation performs a bitwise NOT on the src2 operand and places the
result in dst. The assembler uses the XOR (.unit) -1, src2, dst instruction to perform this
operation (see XOR).
Execution
Delay Slots 0
OR Bitwise OR
Opcode .D unit
31 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 op 1 1 0 0 s p
3 1 5 5 5 1 4 1 1
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
creg z dst src2 src1 x op 1 0 0 0 s p
3 1 5 5 5 1 6 1 1
Description Performs a bitwise OR operation between src1 and src2. The result is placed in dst. The
scst5 operands are sign extended to 32 bits.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, or .D
Delay Slots 0
Examples Example 1
OR .S1 A3,A4,A5
Example 2
OR .D2 -12,B2,B8
PACK2 Pack Two 16 LSBs Into Upper and Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 0 0 0 0 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 1 1 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Moves the lower halfwords from src1 and src2 and packs them both into dst. The lower
halfword of src1 is placed in the upper halfword of dst. The lower halfword of src2 is
placed in the lower halfword of dst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be used
by the packed arithmetic operations, such as ADD2 (see ADD2).
31 16 15 0
a_hi a_lo ← src1
PACK2
31 16 15 0
a_lo b_lo ← dst
Execution
if (cond) {
lsb16(src2) → lsb16(dst);
lsb16(src1) → msb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
PACK2 .L1 A2,A8,A9
Example 2
PACK2 .S2 B2,B8,B12
PACKH2 Pack Two 16 MSBs Into Upper and Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 1 1 1 0 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 0 0 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Moves the upper halfwords from src1 and src2 and packs them both into dst. The upper
halfword of src1 is placed in the upper halfword of dst. The upper halfword of src2 is
placed in the lower halfword of dst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be used
by the packed arithmetic operations, such as ADD2 (see ADD2).
31 16 15 0
a_hi a_lo ← src1
PACKH2
31 16 15 0
a_hi b_hi ← dst
Execution
if (cond) {
msb16(src2) → lsb16(dst);
msb16(src1) → msb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
PACKH2 .L1 A2,A8,A9
Example 2
PACKH2 .S2 B2,B8,B12
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 1 0 0 1 1 1 0 s p
3 1 5 5 5 1 1 1
Description Moves the high bytes of the two halfwords in src1 and src2, and packs them into dst.
The bytes from src1 are packed into the most-significant bytes of dst, and the bytes from
src2 are packed into the least-significant bytes of dst.
• The high byte of the upper halfword of src1 is moved to the upper byte of the upper
halfword of dst. The high byte of the lower halfword of src1 is moved to the lower
byte of the upper halfword of dst.
• The high byte of the upper halfword of src2 is moved to the upper byte of the lower
halfword of dst. The high byte of the lower halfword of src2 is moved to the lower
byte of the lower halfword of dst.
31 24 23 16 15 8 7 0
a_3 a_2 a_1 a_0 ← src1
PACKH4
31 24 23 16 15 8 7 0
a_3 a_1 b_3 b_1 ← dst
Execution
if (cond) {
byte3(src1) → byte3(dst);
byte1(src1) → byte2(dst);
byte3(src2) → byte1(dst);
byte1(src2) → byte0(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
PACKH4 .L1 A2,A8,A9
A2 37 89 F2 3Ah A2 37 89 F2 3Ah
A8 04 B8 49 75h A8 04 B8 49 75h
Example 2
PACKH4 .L2 B2,B8,B12
B2 01 24 24 51h B2 01 24 24 51h
B8 01 A6 A0 51h B8 01 A6 A0 51h
PACKHL2 Pack 16 MSB Into Upper and 16 LSB Into Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 1 1 0 0 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 0 0 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Moves the upper halfword from src1 and the lower halfword from src2 and packs them
both into dst. The upper halfword of src1 is placed in the upper halfword of dst. The
lower halfword of src2 is placed in the lower halfword of dst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be used
by the packed arithmetic operations, such as ADD2 (see ADD2).
31 16 15 0
a_hi a_lo ← src1
PACKHL2
31 16 15 0
a_hi b_lo ← dst
Execution
if (cond) {
lsb16(src2) → lsb16(dst);
msb16(src1) → msb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
PACKHL2 .L1 A2,A8,A9
Example 2
PACKHL2 .S2 B2,B8,B12
PACKLH2 Pack 16 LSB Into Upper and 16 MSB Into Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 1 0 1 1 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 0 0 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Moves the lower halfword from src1, and the upper halfword from src2, and packs them
both into dst. The lower halfword of src1 is placed in the upper halfword of dst. The
upper halfword of src2 is placed in the lower halfword of dst.
This instruction is useful for manipulating and preparing pairs of 16-bit values to be used
by the packed arithmetic operations, such as ADD2 (see ADD2).
31 16 15 0
a_hi a_lo ← src1
PACKLH2
31 16 15 0
a_lo b_hi ← dst
Execution
if (cond) {
msb16(src2) → lsb16(dst);
lsb16(src1) → msb16(dst)
}
else nop
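The four halfword pack instructions (PACK2, PACKH2, PACKHL2, PACKLH2) differ only in which half of each source they select. A C sketch of the Execution pseudocode, with hypothetical function names mirroring the mnemonics:

```c
#include <stdint.h>

static uint32_t lsb16(uint32_t r) { return r & 0xFFFFu; }  /* lower halfword */
static uint32_t msb16(uint32_t r) { return r >> 16; }      /* upper halfword */

/* lsb16(src1) -> msb16(dst), lsb16(src2) -> lsb16(dst) */
uint32_t pack2(uint32_t src1, uint32_t src2)
{
    return (lsb16(src1) << 16) | lsb16(src2);
}

/* msb16(src1) -> msb16(dst), msb16(src2) -> lsb16(dst) */
uint32_t packh2(uint32_t src1, uint32_t src2)
{
    return (msb16(src1) << 16) | msb16(src2);
}

/* msb16(src1) -> msb16(dst), lsb16(src2) -> lsb16(dst) */
uint32_t packhl2(uint32_t src1, uint32_t src2)
{
    return (msb16(src1) << 16) | lsb16(src2);
}

/* lsb16(src1) -> msb16(dst), msb16(src2) -> lsb16(dst) */
uint32_t packlh2(uint32_t src1, uint32_t src2)
{
    return (lsb16(src1) << 16) | msb16(src2);
}
```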
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
PACKLH2 .L1 A2,A8,A9
Example 2
PACKLH2 .S2 B2,B8,B12
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 1 0 0 0 1 1 0 s p
3 1 5 5 5 1 1 1
Description Moves the low bytes of the two halfwords in src1 and src2, and packs them into dst. The
bytes from src1 are packed into the most-significant bytes of dst, and the bytes from src2
are packed into the least-significant bytes of dst.
• The low byte of the upper halfword of src1 is moved to the upper byte of the upper
halfword of dst. The low byte of the lower halfword of src1 is moved to the lower byte
of the upper halfword of dst.
• The low byte of the upper halfword of src2 is moved to the upper byte of the lower
halfword of dst. The low byte of the lower halfword of src2 is moved to the lower byte
of the lower halfword of dst.
31 24 23 16 15 8 7 0
a_3 a_2 a_1 a_0 ← src1
PACKL4
31 24 23 16 15 8 7 0
a_2 a_0 b_2 b_0 ← dst
Execution
if (cond) {
byte2(src1) → byte3(dst);
byte0(src1) → byte2(dst);
byte2(src2) → byte1(dst);
byte0(src2) → byte0(dst)
}
else nop
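The byte selection performed by PACKH4 and PACKL4 can be sketched in C. The function names are hypothetical. The inputs in the usage note are the register values from the examples; the results are what this sketch computes from the Execution pseudocode, not values quoted from the examples.

```c
#include <stdint.h>

/* Byte n of a 32-bit register, where byte 0 is the least significant. */
static uint32_t byte_of(uint32_t r, int n)
{
    return (r >> (8 * n)) & 0xFFu;
}

/* High byte of each halfword: bytes 3 and 1 of src1, then of src2. */
uint32_t packh4(uint32_t src1, uint32_t src2)
{
    return (byte_of(src1, 3) << 24) | (byte_of(src1, 1) << 16) |
           (byte_of(src2, 3) << 8)  |  byte_of(src2, 1);
}

/* Low byte of each halfword: bytes 2 and 0 of src1, then of src2. */
uint32_t packl4(uint32_t src1, uint32_t src2)
{
    return (byte_of(src1, 2) << 24) | (byte_of(src1, 0) << 16) |
           (byte_of(src2, 2) << 8)  |  byte_of(src2, 0);
}
```

With src1 = 3789 F23Ah and src2 = 04B8 4975h, the sketch gives 37F2 0449h for packh4 and 893A B875h for packl4.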
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
PACKL4 .L1 A2,A8,A9
A2 37 89 F2 3Ah A2 37 89 F2 3Ah
A8 04 B8 49 75h A8 04 B8 49 75h
Example 2
PACKL4 .L2 B2,B8,B12
B2 01 24 24 51h B2 01 24 24 51h
B8 01 A6 A0 51h B8 01 A6 A0 51h
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 1 0 1 1 0 1 1 0 0 0 s p
3 1 5 5 1 1 1
NOTE:
1. If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2
bits are set.
2. If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3. If src2 is a signed denormalized number, signed infinity is placed in
dst and the DIV0, INFO, OVER, INEX, and DEN2 bits are set.
4. If src2 is signed 0, signed infinity is placed in dst and the DIV0 and
INFO bits are set.
5. If src2 is signed infinity, signed 0 is placed in dst.
6. If the result underflows, signed 0 is placed in dst and the INEX and
UNDER bits are set. Underflow occurs when 2^1022 < src2 < infinity.
Execution
Pipeline
Pipeline Stage E1 E2
Read src2_l, src2_h
Written dst_l dst_h
Unit in use .S
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 1
A1:A0 4010 0000h 0000 0000h A1:A0 4010 0000h 0000 0000h 4.00
A3:A2 xxxx xxxxh xxxx xxxxh A3:A2 3FD0 0000h 0000 0000h 0.25
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 1 1 1 1 0 1 1 0 0 0 s p
3 1 5 5 1 1 1
Description The single-precision floating-point reciprocal approximation value of src2 is placed in dst.
The RCPSP instruction provides the correct exponent, and the mantissa is accurate to
the eighth binary position (therefore, mantissa error is less than 2^-8). This estimate can
be used as a seed value for an algorithm to compute the reciprocal to greater accuracy.
The Newton-Raphson algorithm can further extend the mantissa's precision:
x[n + 1] = x[n](2 - v × x[n])
where v = the number whose reciprocal is to be found.
x[0], the seed value for the algorithm, is given by RCPSP. For each iteration, the
accuracy doubles. Thus, with one iteration, accuracy is 16 bits in the mantissa; with the
second iteration, the accuracy is the full 23 bits.
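The refinement can be sketched in portable C. Because RCPSP itself is not available off-target, rcpsp_seed below is an assumed stand-in that truncates the true reciprocal to 8 mantissa bits; the estimate returned by the real instruction will differ. All function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for RCPSP (assumption): correct exponent, mantissa cut to 8 bits. */
float rcpsp_seed(float v)
{
    float r = 1.0f / v;
    uint32_t bits;

    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u;        /* keep sign, exponent, top 8 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step: x[n + 1] = x[n] * (2 - v * x[n]). */
float recip_step(float v, float x)
{
    return x * (2.0f - v * x);
}

/* Reciprocal estimate after the given number of refinement steps. */
float recip_refined(float v, int steps)
{
    float x = rcpsp_seed(v);

    for (int i = 0; i < steps; i++)
        x = recip_step(v, x);
    return x;
}

/* |1 - v * x|, the relative error of x as an estimate of 1/v. */
float recip_error(float v, float x)
{
    float e = 1.0f - v * x;
    return e < 0.0f ? -e : e;
}
```

With v = 3, the error of the seed is below 2^-8, one step brings it below 2^-16, and a second step reaches the limit of single precision.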
NOTE:
1. If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2
bits are set.
2. If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3. If src2 is a signed denormalized number, signed infinity is placed in
dst and the DIV0, INFO, OVER, INEX, and DEN2 bits are set.
4. If src2 is signed 0, signed infinity is placed in dst and the DIV0 and
INFO bits are set.
5. If src2 is signed infinity, signed 0 is placed in dst.
6. If the result underflows, signed 0 is placed in dst and the INEX and
UNDER bits are set. Underflow occurs when 2^126 < src2 < infinity.
Execution
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .S
Delay Slots 0
Syntax RINT
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 p
1
Description Copies the contents of the SGIE bit in TSR into the GIE bit in TSR and CSR, and clears
the SGIE bit in TSR. The value of the SGIE bit in TSR is used for the current cycle as
the GIE indication; if restoring the GIE bit to 1, interrupts are enabled and can be taken
after the E1 phase containing the RINT instruction.
The CPU may service a maskable interrupt in the cycle immediately following the RINT
instruction. See section 5.2 for details.
The RINT instruction cannot be placed in parallel with: MVC reg, TSR; MVC reg, CSR;
B IRP; B NRP; NOP n; DINT; SPKERNEL; SPKERNELR; SPLOOP; SPLOOPD;
SPLOOPW; SPMASK; or SPMASKR.
This instruction executes unconditionally and cannot be predicated.
NOTE: The use of the DINT and RINT instructions in a nested manner, like the
following code:
DINT
DINT
RINT
RINT
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 op 1 1 0 0 s p
3 1 5 5 5 1 5 1 1
Description Rotates the 32-bit value of src2 to the left, and places the result in dst. The number of
bits to rotate is given in the 5 least-significant bits of src1. Bits 5 through 31 of src1 are
ignored and may be non-zero.
In the following figure, src1 is equal to 8.
31 24 23 16 15 8 7 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ← src2
ROTL
31 0
ijklmnopqrstuvwxyzABCDEFabcdefgh ← dst
(for src1 = 8)
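A C sketch of the rotate, assuming nothing beyond the description above (rotl32 is a hypothetical helper name, and the extra delay slot is not modelled):

```c
#include <stdint.h>

/* Rotate src2 left by the amount held in the 5 LSBs of src1. */
uint32_t rotl32(uint32_t src2, uint32_t src1)
{
    uint32_t n = src1 & 31u;    /* bits 5 through 31 of src1 are ignored */

    if (n == 0)
        return src2;            /* avoids an undefined 32-bit shift in C */
    return (src2 << n) | (src2 >> (32u - n));
}
```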
Execution
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Examples Example 1
ROTL .M2 B2,B4,B5
Example 2
ROTL .M1 A4,10h,A5
RPACK2 Shift With Saturation and Pack Two 16 MSBs Into Upper and Lower Register Halves
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 1 1 1 0 1 1 1 1 0 0 s p
5 5 5 1 1 1
Description src1 and src2 are shifted left by 1 with saturation. The 16 most-significant bits of the
shifted src1 value are placed in the 16 most-significant bits of dst. The 16
most-significant bits of the shifted src2 value are placed in the 16 least-significant bits of
dst.
If either value saturates, the S1 or S2 bit in SSR and the SAT bit in CSR are written one
cycle after the result is written to dst.
This instruction executes unconditionally and cannot be predicated.
31 16 15 0
a_hi a_lo ← src1
RPACK2
↓ ↓
31 16 15 0
sat(a_hi << 1) sat(b_hi << 1) ← dst
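The shift, saturate, and pack sequence can be sketched in C. The names are hypothetical, and the SSR and CSR side effects are not modelled.

```c
#include <stdint.h>

/* Shift left by 1 with saturation to the signed 32-bit range. */
static int32_t sat_shl1(int32_t v)
{
    int64_t s = (int64_t)v * 2;

    if (s > INT32_MAX)
        return INT32_MAX;
    if (s < INT32_MIN)
        return INT32_MIN;
    return (int32_t)s;
}

/* 16 MSBs of each shifted source: src1 -> upper half, src2 -> lower half. */
uint32_t rpack2(int32_t src1, int32_t src2)
{
    uint32_t hi = (uint32_t)sat_shl1(src1) >> 16;
    uint32_t lo = (uint32_t)sat_shl1(src2) >> 16;

    return (hi << 16) | lo;
}
```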
Execution
Delay Slots 0
Examples Example 1
RPACK2 .S1 A0,A1,A2
A1 1234 5678h
Example 2
RPACK2 .S2X B0,A1,B2
A1 1234 5678h
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 1 0 1 1 1 0 1 0 0 0 s p
3 1 5 5 1 1 1
NOTE:
1. If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2
bits are set.
2. If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3. If src2 is a negative, nonzero, nondenormalized number, NaN_out is
placed in dst and the INVAL bit is set.
4. If src2 is a signed denormalized number, signed infinity is placed in
dst and the DIV0, INEX, and DEN2 bits are set.
5. If src2 is signed 0, signed infinity is placed in dst and the DIV0 and
INFO bits are set. The Newton-Raphson approximation cannot be
used to calculate the square root of 0 because infinity multiplied by 0
is invalid.
6. If src2 is positive infinity, positive 0 is placed in dst.
Execution
Pipeline
Pipeline Stage E1 E2
Read src2_l, src2_h
Written dst_l dst_h
Unit in use .S
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 1
A1:A0 4010 0000h 0000 0000h A1:A0 4010 0000h 0000 0000h 4.0
A3:A2 xxxx xxxxh xxxx xxxxh A3:A2 3FE0 0000h 0000 0000h 0.5
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 1 1 1 1 1 0 1 0 0 0 s p
3 1 5 5 1 1 1
NOTE:
1. If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2
bits are set.
2. If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3. If src2 is a negative, nonzero, nondenormalized number, NaN_out is
placed in dst and the INVAL bit is set.
4. If src2 is a signed denormalized number, signed infinity is placed in
dst and the DIV0, INEX, and DEN2 bits are set.
5. If src2 is signed 0, signed infinity is placed in dst and the DIV0 and
INFO bits are set. The Newton-Raphson approximation cannot be
used to calculate the square root of 0 because infinity multiplied by 0
is invalid.
6. If src2 is positive infinity, positive 0 is placed in dst.
Execution
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
RSQRSP .S1 A1,A2
Example 2
RSQRSP .S2X A1,B2
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 0 0 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description src1 is added to src2, and the sum is saturated if an overflow occurs, according to the
following rules:
1. If the dst is an int and src1 + src2 > 2^31 - 1, then the result is 2^31 - 1.
2. If the dst is an int and src1 + src2 < -2^31, then the result is -2^31.
3. If the dst is a long and src1 + src2 > 2^39 - 1, then the result is 2^39 - 1.
4. If the dst is a long and src1 + src2 < -2^39, then the result is -2^39.
The result is placed in dst. If a saturate occurs, the SAT bit in the control status register
(CSR) is set one cycle after dst is written.
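The saturation rules can be sketched in C for both the int and long forms. The names are hypothetical, the SAT bit is not modelled, and sadd40 assumes its operands already lie within the signed 40-bit range.

```c
#include <stdint.h>

/* int form: saturate to the signed 32-bit range. */
int32_t sadd32(int32_t src1, int32_t src2)
{
    int64_t sum = (int64_t)src1 + src2;

    if (sum > INT32_MAX)
        return INT32_MAX;
    if (sum < INT32_MIN)
        return INT32_MIN;
    return (int32_t)sum;
}

/* long form: saturate to the signed 40-bit range. */
int64_t sadd40(int64_t src1, int64_t src2)
{
    const int64_t max40 = ((int64_t)1 << 39) - 1;
    const int64_t min40 = -((int64_t)1 << 39);
    int64_t sum = src1 + src2;

    if (sum > max40)
        return max40;
    if (sum < min40)
        return min40;
    return sum;
}
```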
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
SADD .L1 A1,A2,A3
A1 5A2E 51A3h
A2 012A 3FA2h
A3 5B58 9145h
Example 2
SADD .L1 A1,A2,A3
A1 4367 71F2h
A2 5A2E 51A3h
A3 7FFF FFFFh
Example 3
SADD .L1X B2,A5:A4,A7:A6
A5:A4 0000 0000h 7C83 39B1h A5:A4 0000 0000h 7C83 39B1h
(2,088,974,769)
A7:A6 xxxx xxxxh xxxx xxxxh A7:A6 0000 0000h 8DAD 7953h
(2,376,956,243)
B2 112A 3FA2h
SADD2 Add Two Signed 16-Bit Integers on Upper and Lower Register Halves With Saturation
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 0 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs 2s-complement addition between signed, packed 16-bit quantities in src1 and
src2. The results are placed in a signed, packed 16-bit format into dst.
For each pair of 16-bit quantities in src1 and src2, the sum between the signed 16-bit
value from src1 and the signed 16-bit value from src2 is calculated and saturated to
produce a signed 16-bit result. The result is placed in the corresponding position in dst.
Saturation is performed on each 16-bit result independently. For each sum, the following
tests are applied:
• If the sum is in the range -2^15 to 2^15 - 1, inclusive, then no saturation is performed
and the sum is left unchanged.
• If the sum is greater than 2^15 - 1, then the result is set to 2^15 - 1.
• If the sum is less than -2^15, then the result is set to -2^15.
31 16 15 0
a_hi a_lo ← src1
SADD2
↓ ↓
31 16 15 0
sat(a_hi + b_hi) sat(a_lo + b_lo) ← dst
Execution
if (cond) {
sat(msb16(src1) + msb16(src2)) → msb16(dst);
sat(lsb16(src1) + lsb16(src2)) → lsb16(dst)
}
else nop
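The per-halfword saturating add can be sketched in C as follows; the helper names are hypothetical.

```c
#include <stdint.h>

/* Interpret a 16-bit field (0..FFFFh) as a signed value. */
static int32_t s16(uint32_t h)
{
    return (h & 0x8000u) ? (int32_t)h - 0x10000 : (int32_t)h;
}

/* Saturate to -2^15 .. 2^15 - 1 and return the 16-bit field. */
static uint32_t sat_s16(int32_t v)
{
    if (v > 32767)
        v = 32767;
    if (v < -32768)
        v = -32768;
    return (uint32_t)v & 0xFFFFu;
}

/* Each half is added and saturated independently. */
uint32_t sadd2(uint32_t src1, uint32_t src2)
{
    uint32_t hi = sat_s16(s16(src1 >> 16) + s16(src2 >> 16));
    uint32_t lo = sat_s16(s16(src1 & 0xFFFFu) + s16(src2 & 0xFFFFu));

    return (hi << 16) | lo;
}
```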
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SADD2 .S1 A2,A8,A9
Example 2
SADD2 .S2 B2,B8,B12
Opcode
31 30 29 28 27 24 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst 0 src2 src1 x 0 0 0 1 1 1 0 1 1 0 s p
4 5 5 1 1 1
Execution
Delay Slots 0
Examples Example 1
SADDSUB .L1 A0,A1,A3:A2
Example 2
SADDSUB .L2X B0,A1,B3:B2
Example 3
SADDSUB .L1X A0,B1,A3:A2
Opcode
31 30 29 28 27 24 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst 0 src2 src1 x 0 0 0 1 1 1 1 1 1 0 s p
4 5 5 1 1 1
Execution
Delay Slots 0
Examples Example 1
SADDSUB2 .L1 A0,A1,A3:A2
Example 2
SADDSUB2 .L2X B0,A1,B3:B2
SADDSU2 Add Two Signed and Unsigned 16-Bit Integers on Register Halves With Saturation
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 0 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Execution
if (cond) {
sat(smsb16(src2) + umsb16(src1)) → umsb16(dst);
sat(slsb16(src2) + ulsb16(src1)) → ulsb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
SADDUS2 Add Two Unsigned and Signed 16-Bit Integers on Register Halves With Saturation
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 0 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs 2s-complement addition between unsigned and signed, packed 16-bit
quantities. The values in src1 are treated as unsigned, packed 16-bit quantities; and the
values in src2 are treated as signed, packed 16-bit quantities. The results are placed in
an unsigned, packed 16-bit format into dst.
For each pair of 16-bit quantities in src1 and src2, the sum between the unsigned 16-bit
value from src1 and the signed 16-bit value from src2 is calculated and saturated to
produce an unsigned 16-bit result. The result is placed in the corresponding position in dst.
Saturation is performed on each 16-bit result independently. For each sum, the following
tests are applied:
• If the sum is in the range 0 to 2^16 - 1, inclusive, then no saturation is performed and
the sum is left unchanged.
• If the sum is greater than 2^16 - 1, then the result is set to 2^16 - 1.
• If the sum is less than 0, then the result is cleared to 0.
31 16 15 0
ua_hi ua_lo ← src1
SADDUS2
↓ ↓
31 16 15 0
sat(ua_hi + sb_hi) sat(ua_lo + sb_lo) ← dst
Execution
if (cond) {
sat(umsb16(src1) + smsb16(src2)) → umsb16(dst);
sat(ulsb16(src1) + slsb16(src2)) → ulsb16(dst)
}
else nop
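A C sketch of the mixed unsigned and signed add, with hypothetical helper names. Each half of src1 is read as unsigned, each half of src2 as signed, and each sum is clamped to the unsigned 16-bit range.

```c
#include <stdint.h>

/* Interpret a 16-bit field (0..FFFFh) as a signed value. */
static int32_t s16(uint32_t h)
{
    return (h & 0x8000u) ? (int32_t)h - 0x10000 : (int32_t)h;
}

/* Saturate to 0 .. 2^16 - 1. */
static uint32_t sat_u16(int32_t v)
{
    if (v > 65535)
        v = 65535;
    if (v < 0)
        v = 0;
    return (uint32_t)v;
}

uint32_t saddus2(uint32_t src1, uint32_t src2)
{
    uint32_t hi = sat_u16((int32_t)(src1 >> 16) + s16(src2 >> 16));
    uint32_t lo = sat_u16((int32_t)(src1 & 0xFFFFu) + s16(src2 & 0xFFFFu));

    return (hi << 16) | lo;
}
```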
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SADDUS2 .S1 A2, A8, A9
Example 2
SADDUS2 .S2 B2, B8, B12
SADDU4 Add With Saturation, Four Unsigned 8-Bit Pairs for Four 8-Bit Results
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 0 1 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs 2s-complement addition between unsigned, packed 8-bit quantities. The values
in src1 and src2 are treated as unsigned, packed 8-bit quantities and the results are
written into dst in an unsigned, packed 8-bit format.
For each pair of 8-bit quantities in src1 and src2, the sum between the unsigned 8-bit
value from src1 and the unsigned 8-bit value from src2 is calculated and saturated to
produce an unsigned 8-bit result. The result is placed in the corresponding position in
dst.
Saturation is performed on each 8-bit result independently. For each sum, the following
tests are applied:
• If the sum is in the range 0 to 2^8 - 1, inclusive, then no saturation is performed and
the sum is left unchanged.
• If the sum is greater than 2^8 - 1, then the result is set to 2^8 - 1.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ← src1
SADDU4
↓ ↓ ↓ ↓
31 24 23 16 15 8 7 0
sat(ua_3 + ub_3) sat(ua_2 + ub_2) sat(ua_1 + ub_1) sat(ua_0 + ub_0) ← dst
Execution
if (cond) {
sat(ubyte0(src1) + ubyte0(src2)) → ubyte0(dst);
sat(ubyte1(src1) + ubyte1(src2)) → ubyte1(dst);
sat(ubyte2(src1) + ubyte2(src2)) → ubyte2(dst);
sat(ubyte3(src1) + ubyte3(src2)) → ubyte3(dst)
}
else nop
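The four independent byte additions can be sketched in C (saddu4 is a hypothetical helper name):

```c
#include <stdint.h>

/* Add four unsigned bytes pairwise, clamping each sum to 2^8 - 1. */
uint32_t saddu4(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;

    for (int i = 0; i < 4; i++) {
        uint32_t sum = ((src1 >> (8 * i)) & 0xFFu) +
                       ((src2 >> (8 * i)) & 0xFFu);
        if (sum > 255u)
            sum = 255u;         /* no carry leaks into the next byte */
        dst |= sum << (8 * i);
    }
    return dst;
}
```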
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SADDU4 .S1 A2, A8, A9
Example 2
SADDU4 .S2 B2, B8, B12
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 1 0 0 0 0 0 0 1 1 0 s p
3 1 5 5 1 1 1
Description A 40-bit src2 value is converted to a 32-bit value. If the value in src2 lies outside the range
that can be represented in 32 bits, src2 is saturated. The result is placed in dst. If a saturate
occurs, the SAT bit in the control status register (CSR) is set one cycle after dst is
written.
Execution
if (cond) {
if (src2 > (2^31 - 1)), (2^31 - 1) → dst
else if (src2 < -2^31), -2^31 → dst
else src2 31..0 → dst
}
else nop
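The Execution pseudocode above can be sketched in C. sat40 is a hypothetical helper that takes the 40-bit register-pair value in the low 40 bits of a 64-bit integer; the SAT bit is not modelled.

```c
#include <stdint.h>

/* Saturate a signed 40-bit value to the signed 32-bit range. */
int32_t sat40(int64_t src2)
{
    /* Sign-extend from bit 39 so the comparisons see the 40-bit value. */
    src2 &= 0xFFFFFFFFFFLL;
    if (src2 & ((int64_t)1 << 39))
        src2 -= ((int64_t)1 << 40);

    if (src2 > INT32_MAX)
        return INT32_MAX;
    if (src2 < INT32_MIN)
        return INT32_MIN;
    return (int32_t)src2;
}
```

A 40-bit value such as 00 A190 7321h is positive and larger than 2^31 - 1, so it saturates, while FF A190 7321h is a negative value that already fits in 32 bits and passes through unchanged.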
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
SAT .L2 B1:B0,B5
B1:B0 0000 001Fh 3413 539Ah B1:B0 0000 001Fh 3413 539Ah
B5 7FFF FFFFh
Example 2
SAT .L2 B1:B0,B5
B1:B0 0000 0000h A190 7321h B1:B0 0000 0000h A190 7321h
B5 7FFF FFFFh
Example 3
SAT .L2 B1:B0,B5
B1:B0 0000 00FFh A190 7321h B1:B0 0000 00FFh A190 7321h
B5 A190 7321h
31 29 28 27 23 22 18 17 13 12 8 7 6 5 4 3 2 1 0
creg z dst src2 csta cstb 1 0 0 0 1 0 s p
3 1 5 5 5 5 1 1
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 1 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description For cstb ≥ csta, the field in src2 as specified by csta to cstb is set to all 1s in dst. The
csta and cstb operands may be specified as constants or in the 10 LSBs of the src1
register, with cstb being bits 0-4 (src1 4..0) and csta being bits 5-9 (src1 9..5). csta is the
LSB of the field and cstb is the MSB of the field. In other words, csta and cstb represent
the beginning and ending bits, respectively, of the field to be set to all 1s in dst. The LSB
location of src2 is bit 0 and the MSB location of src2 is bit 31.
In the following example, csta is 15 and cstb is 23. For the register version of the
instruction, only the 10 LSBs of the src1 register are valid. If any of the 22 MSBs are
non-zero, the result is invalid.
cstb
csta
src2 X X X X X X X X 1 0 1 0 0 1 1 0 1 X X X X X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
dst X X X X X X X X 1 1 1 1 1 1 1 1 1 X X X X X X X X X X X X X X X
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
For cstb < csta, the src2 register is copied to dst. The csta and cstb operands may be
specified as constants or in the 10 LSBs of the src1 register, with cstb being bits 0−4
(src1 4..0) and csta being bits 5−9 (src1 9..5).
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SET .S1 A0,7,21,A1
Example 2
SET .S2 B0,B1,B2
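The field-setting rule, including the cstb < csta case, can be sketched as follows. This is an illustrative Python model, not TI code; the names set_field and set_field_reg are hypothetical, and raising an error for non-zero upper bits of src1 is a modeling choice standing in for the "result is invalid" statement above.

```python
def set_field(src2: int, csta: int, cstb: int) -> int:
    """Illustrative model of SET (constant form): set bits csta..cstb of src2 to 1."""
    if cstb < csta:
        return src2 & 0xFFFF_FFFF           # src2 is copied to dst unchanged
    width = cstb - csta + 1
    mask = ((1 << width) - 1) << csta
    return (src2 | mask) & 0xFFFF_FFFF


def set_field_reg(src2: int, src1: int) -> int:
    """Register form: cstb is src1 bits 4..0, csta is src1 bits 9..5."""
    if src1 >> 10:
        raise ValueError("bits 31..10 of src1 must be zero")  # modeling assumption
    cstb = src1 & 0x1F
    csta = (src1 >> 5) & 0x1F
    return set_field(src2, csta, cstb)


# The csta = 15, cstb = 23 case from the description, with the X bits taken as 0.
print(f"{set_field(0x00A68000, 15, 23):08X}")
```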
SHFL Shuffle
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 1 1 1 0 0 x 0 0 0 0 1 1 1 1 0 0 s p
3 1 5 5 1 1 1
Description Performs an interleave operation on the two halfwords in src2. The bits in the lower
halfword of src2 are placed in the even bit positions in dst, and the bits in the upper
halfword of src2 are placed in the odd bit positions in dst.
As a result, bits 0, 1, 2, ..., 14, 15 of src2 are placed in bits 0, 2, 4, ..., 28, 30 of dst.
Likewise, bits 16, 17, 18, ..., 30, 31 of src2 are placed in bits 1, 3, 5, ..., 29, 31 of dst.
31 16 15 0
abcdefghijklmnop ABCDEFGHIJKLMNOP ← src2
SHFL
31 16 15 0
aAbBcCdDeEfFgGhH iIjJkKlLmMnNoOpP ← dst
NOTE: The SHFL instruction is the exact inverse of the DEAL instruction
(see DEAL).
Execution
if (cond) {
src2 31,30,29...16 → dst 31,29,27...1
src2 15,14,13...0 → dst 30,28,26...0
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst
Unit in use .M
Delay Slots 1
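The interleave performed by SHFL can be written as a bit loop. The following Python sketch is illustrative only; the name shfl is an assumption and pipeline timing is not modeled.

```python
def shfl(src2: int) -> int:
    """Illustrative model of SHFL: interleave the two halfwords of src2."""
    dst = 0
    for i in range(16):
        low_bit = (src2 >> i) & 1            # bit i of the lower halfword
        high_bit = (src2 >> (16 + i)) & 1    # bit i of the upper halfword
        dst |= low_bit << (2 * i)            # lower halfword fills even positions
        dst |= high_bit << (2 * i + 1)       # upper halfword fills odd positions
    return dst


print(f"{shfl(0x0000FFFF):08X}")   # all even bits set
print(f"{shfl(0xFFFF0000):08X}")   # all odd bits set
```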
SHFL3 3-Way Bit Interleave On Three 16-Bit Values Into a 48-Bit Result
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 1 1 0 1 1 0 1 1 0 s p
5 5 5 1 1 1
Description Performs a 3-way bit interleave on three 16-bit values, creating a 48-bit result.
This instruction executes unconditionally and cannot be predicated.
31 16 15 0
a15 a14 a13 ... a2 a1 a0 b15 b14 b13 ... b2 b1 b0 ← src1
SHFL3
31 16 15 0
0 0 0 ... 0 0 0 a15 b15 d15 ... b11 d11 a10 ← dst_o
Execution
Delay Slots 0
Example SHFL3 .L1 A0,A1,A3:A2
Opcode
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
creg z dst src2 src1 x op 1 0 0 0 s p
3 1 5 5 5 1 6 1 1
Description The src2 operand is shifted to the left by the src1 operand. The result is placed in dst.
When a register is used, the six LSBs specify the shift amount and valid values are 0-40.
When an immediate is used, valid shift amounts are 0-31. If src2 is a register pair, only
the bottom 40 bits of the register pair are shifted. The upper 24 bits of the register pair
are unused.
If 39 < src1 < 64, src2 is shifted to the left by 40. Only the six LSBs of src1 are used by
the shifter, so any bits set above bit 5 do not affect execution.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SHL .S1 A0,4,A1
Example 2
SHL .S2 B0,B1,B2
Example 3
SHL .S2 B1:B0,B2,B3:B2
B1:B0 0000 0009h 4197 51A5h B1:B0 0000 0009h 4197 51A5h
B3:B2 xxxx xxxxh xxxx xxxxh B3:B2 0000 0094h 0000 0000h
Example 4
SHL .S1 A5:A4,0,A1:A0
A5:A4 FFFF FFFFh FFFF FFFFh A5:A4 FFFF FFFFh FFFF FFFFh
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 0000 00FFh FFFF FFFFh
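The shift-amount rules for the register-pair form can be sketched in Python. This is an illustrative model under stated assumptions: the name shl40 is hypothetical, the register pair is passed as one integer, and only the 40-bit source and destination form is covered.

```python
MASK40 = (1 << 40) - 1


def shl40(src2: int, src1: int) -> int:
    """Illustrative model of SHL on a register pair (40-bit source and result)."""
    amount = src1 & 0x3F                 # only the six LSBs reach the shifter
    if amount > 39:
        amount = 40                      # 39 < src1 < 64 behaves as a shift by 40
    return ((src2 & MASK40) << amount) & MASK40


# Example 4: a shift by 0 still discards the upper 24 bits of the source pair.
print(f"{shl40(0xFFFFFFFF_FFFFFFFF, 0):010X}")
```

A shift by 40 always produces zero, since every one of the 40 source bits is shifted out.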
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 0 0 0 1 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Shifts the contents of src2 left by 1 byte, and then the most-significant byte of src1 is
merged into the least-significant byte position. The result is placed in dst.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ← src1
SHLMB
31 24 23 16 15 8 7 0
ub_2 ub_1 ub_0 ua_3 ← dst
Execution
if (cond) {
ubyte2(src2) → ubyte3(dst);
ubyte1(src2) → ubyte2(dst);
ubyte0(src2) → ubyte1(dst);
ubyte3(src1) → ubyte0(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
SHLMB .L1 A2, A8, A9
Example 2
SHLMB .S2 B2,B8, B12
Opcode
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
creg z dst src2 src1 x op 1 0 0 0 s p
3 1 5 5 5 1 6 1 1
Description The src2 operand is shifted to the right by the src1 operand. The sign-extended result is
placed in dst. When a register is used, the six LSBs specify the shift amount and valid
values are 0-40. When an immediate value is used, valid shift amounts are 0-31. If src2
is a register pair, only the bottom 40 bits of the register pair are shifted. The upper 24
bits of the register pair are unused.
If 39 < src1 < 64, src2 is shifted to the right by 40. Only the six LSBs of src1 are used by
the shifter, so any bits set above bit 5 do not affect execution.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SHR .S1 A0,8,A1
Example 2
SHR .S2 B0,B1,B2
Example 3
SHR .S2 B1:B0,B2,B3:B2
B1:B0 0000 0012h 1492 5A41h B1:B0 0000 0012h 1492 5A41h
B3:B2 xxxx xxxxh xxxx xxxxh B3:B2 0000 0000h 0000 090Ah
Example 4
SHR .S1 A5:A4,0,A1:A0
A5:A4 FFFF FFFFh FFFF FFFFh A5:A4 FFFF FFFFh FFFF FFFFh
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 0000 00FFh FFFF FFFFh
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 1 1 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 0 0 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Performs an arithmetic shift right on signed, packed 16-bit quantities. The values in src2
are treated as signed, packed 16-bit quantities. The lower 5 bits of src1 are treated as
the shift amount. The results are placed in a signed, packed 16-bit format into dst.
For each signed 16-bit quantity in src2, the quantity is shifted right by the number of bits
specified in the lower 5 bits of src1. Bits 5 through 31 of src1 are ignored and may be
non-zero. The shifted quantity is sign-extended, and placed in the corresponding position
in dst. Bits shifted out of the least-significant bit of the signed 16-bit quantity are
discarded.
31 16 15 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ← src2
SHR2
31 16 15 0
aaaaaaaa abcdefgh qqqqqqqq qrstuvwx ← dst
(for src1 = 8)
NOTE: If the shift amount specified in src1 is in the range 16 to 31, the behavior
is identical to a shift value of 15.
Execution
if (cond) {
smsb16(src2) >> src1 → smsb16(dst);
slsb16(src2) >> src1 → slsb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SHR2 .S2 B2,B4,B5
Example 2
SHR2 .S1 A4,0fh,A5 ; shift value is 15
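The handling of the shift amount, including the note that 16 to 31 behaves like 15, can be sketched as follows. The Python below is an illustrative model, not TI code; the names shr2 and _signed16 are hypothetical, and the argument order follows the assembler operand order (src2, then src1).

```python
def _signed16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def shr2(src2: int, src1: int) -> int:
    """Illustrative model of SHR2: arithmetic shift right of two signed halfwords."""
    amount = min(src1 & 0x1F, 15)        # bits 31..5 ignored; 16..31 act like 15
    hi = (_signed16(src2 >> 16) >> amount) & 0xFFFF
    lo = (_signed16(src2) >> amount) & 0xFFFF
    return (hi << 16) | lo


# src1 = 8: the upper halfword 8000h sign-extends to FF80h, 7F00h becomes 007Fh.
print(f"{shr2(0x80007F00, 8):08X}")
```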
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 0 0 1 0 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Shifts the contents of src2 right by 1 byte, and then the least-significant byte of src1 is
merged into the most-significant byte position. The result is placed in dst.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ← src1
SHRMB
31 24 23 16 15 8 7 0
ua_0 ub_3 ub_2 ub_1 ← dst
Execution
if (cond) {
ubyte0(src1) → ubyte3(dst);
ubyte3(src2) → ubyte2(dst);
ubyte2(src2) → ubyte1(dst);
ubyte1(src2) → ubyte0(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
SHRMB .L1 A2,A8,A9
Example 2
SHRMB .S2 B2,B8,B12
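SHRMB and its counterpart SHLMB differ only in the direction of the byte shift and in which byte of src1 is merged. A minimal Python sketch of both, with hypothetical function names and registers modeled as plain integers:

```python
def shlmb(src1: int, src2: int) -> int:
    """Illustrative model of SHLMB: src2 shifted left one byte, MSB of src1 merged in."""
    return ((src2 << 8) & 0xFFFF_FF00) | ((src1 >> 24) & 0xFF)


def shrmb(src1: int, src2: int) -> int:
    """Illustrative model of SHRMB: src2 shifted right one byte, LSB of src1 merged in."""
    return ((src2 >> 8) & 0x00FF_FFFF) | ((src1 & 0xFF) << 24)


src1, src2 = 0xAABBCCDD, 0x11223344
print(f"{shlmb(src1, src2):08X}")   # bytes 22 33 44 from src2, then AA from src1
print(f"{shrmb(src1, src2):08X}")   # DD from src1, then bytes 11 22 33 from src2
```

Used together, the two instructions slide a 4-byte window across a pair of registers one byte at a time, which is one way to handle byte-misaligned data.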
Opcode
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
creg z dst src2 src1 x op 1 0 0 0 s p
3 1 5 5 5 1 6 1 1
Description The src2 operand is shifted to the right by the src1 operand. The zero-extended result is
placed in dst. When a register is used, the six LSBs specify the shift amount and valid
values are 0-40. When an immediate value is used, valid shift amounts are 0-31. If src2
is a register pair, only the bottom 40 bits of the register pair are shifted. The upper 24
bits of the register pair are unused.
If 39 < src1 < 64, src2 is shifted to the right by 40. Only the six LSBs of src1 are used by
the shifter, so any bits set above bit 5 do not affect execution.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SHRU .S1 A0,8,A1
Example 2
SHRU .S1 A5:A4,0,A1:A0
A5:A4 FFFF FFFFh FFFF FFFFh A5:A4 FFFF FFFFh FFFF FFFFh
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 0000 00FFh FFFF FFFFh
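The difference between the sign-extending SHR and the zero-extending SHRU on a register pair shows up only when bit 39 of the source is set. The following Python sketch is illustrative; the names shr40 and shru40 are hypothetical and the register pair is passed as one integer.

```python
MASK40 = (1 << 40) - 1


def _amount(src1: int) -> int:
    """Six LSBs of src1; values 40..63 behave as a shift by 40."""
    return min(src1 & 0x3F, 40)


def shr40(src2: int, src1: int) -> int:
    """Illustrative model of SHR on a register pair: sign bit 39 is replicated."""
    value = src2 & MASK40
    if value & (1 << 39):
        value -= 1 << 40
    return (value >> _amount(src1)) & MASK40


def shru40(src2: int, src1: int) -> int:
    """Illustrative model of SHRU on a register pair: zeros are shifted in."""
    return (src2 & MASK40) >> _amount(src1)


print(f"{shr40(0x80_0000_0000, 4):010X}")    # sign bits fill from the top
print(f"{shru40(0x80_0000_0000, 4):010X}")   # zeros fill from the top
```

With a shift amount of 0, as in the last example of each instruction, both models return the bottom 40 bits of the source unchanged.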
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 1 0 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 0 0 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description Performs a logical (zero-filling) shift right on unsigned, packed 16-bit quantities. The values in
src2 are treated as unsigned, packed 16-bit quantities. The lower 5 bits of src1 are
treated as the shift amount. The results are placed in an unsigned, packed 16-bit format
into dst.
For each unsigned 16-bit quantity in src2, the quantity is shifted right by the number of
bits specified in the lower 5 bits of src1. Bits 5 through 31 of src1 are ignored and may
be non-zero. The shifted quantity is zero-extended, and placed in the corresponding
position in dst. Bits shifted out of the least-significant bit of the unsigned 16-bit quantity are
discarded.
NOTE: If the shift amount specified in src1 is in the range of 16 to 31, the dst
will be cleared to all zeros.
31 16 15 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ← src2
SHRU2
31 16 15 0
00000000 abcdefgh 00000000 qrstuvwx ← dst
(for src1 = 8)
Execution
if (cond) {
umsb16(src2) >> src1 → umsb16(dst);
ulsb16(src2) >> src1 → ulsb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SHRU2 .S2 B2,B4,B5
Example 2
SHRU2 .S1 A4,0Fh,A5 ; Shift value is 15
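Unlike the signed packed shift, a shift amount of 16 to 31 clears the result here. A minimal Python sketch of that behavior, with the hypothetical name shru2 and arguments in assembler operand order (src2, then src1):

```python
def shru2(src2: int, src1: int) -> int:
    """Illustrative model of SHRU2: zero-filling shift right of two unsigned halfwords."""
    amount = src1 & 0x1F                 # bits 31..5 of src1 are ignored
    if amount >= 16:
        return 0                         # every bit of both halfwords is shifted out
    hi = ((src2 >> 16) & 0xFFFF) >> amount
    lo = (src2 & 0xFFFF) >> amount
    return (hi << 16) | lo


# src1 = 8: each halfword keeps its upper byte, moved into the lower byte.
print(f"{shru2(0x80007F00, 8):08X}")
```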
SMPY Multiply Signed 16 LSB × Signed 16 LSB With Left Shift and Saturation
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 1 0 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The 16 least-significant bits of the src1 operand are multiplied by the 16 least-significant bits
of the src2 operand. The result is left shifted by 1 and placed in dst. If the left-shifted
result is 8000 0000h, then the result is saturated to 7FFF FFFFh. If a saturation occurs,
the SAT bit in CSR is set one cycle after dst is written. The source operands are signed
by default.
Execution
if (cond) {
if (((lsb16(src1) × lsb16(src2)) << 1) != 8000 0000h),
((lsb16(src1) × lsb16(src2)) << 1) → dst
else 7FFF FFFFh → dst
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
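The single saturating case of SMPY arises only when both 16-bit inputs are 8000h. The Python sketch below is illustrative, not TI code; the name smpy and the two selector flags are assumptions, with the flags choosing the upper halfword so that the same model also covers the SMPYH, SMPYHL, and SMPYLH variants.

```python
def _signed16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def smpy(src1: int, src2: int, src1_high: bool = False, src2_high: bool = False):
    """Illustrative model of SMPY and its halfword variants.

    Returns (dst, saturated); saturated mirrors the condition that sets SAT in CSR.
    """
    a = _signed16(src1 >> 16 if src1_high else src1)
    b = _signed16(src2 >> 16 if src2_high else src2)
    product = (a * b) << 1
    if product == 0x8000_0000:           # only (-32768) x (-32768) gets here
        return 0x7FFF_FFFF, True
    return product & 0xFFFF_FFFF, False


print(smpy(0x8000, 0x8000))              # the saturating case
print(smpy(0x4000, 0x4000))              # 0.5 x 0.5 in Q15 gives 0.25 in Q31
```

The left shift by 1 is what makes the result a Q31 value when the inputs are read as Q15 fractions.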
SMPYH Multiply Signed 16 MSB × Signed 16 MSB With Left Shift and Saturation
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 1 0 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The 16 most-significant bits of the src1 operand are multiplied by the 16 most-significant bits
of the src2 operand. The result is left shifted by 1 and placed in dst. If the left-shifted
result is 8000 0000h, then the result is saturated to 7FFF FFFFh. If a saturation occurs,
the SAT bit in CSR is set one cycle after dst is written. The source operands are signed
by default.
Execution
if (cond) {
if (((msb16(src1) × msb16(src2)) << 1) != 8000 0000h),
((msb16(src1) × msb16(src2)) << 1) → dst
else 7FFF FFFFh → dst
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
SMPYHL Multiply Signed 16 MSB × Signed 16 LSB With Left Shift and Saturation
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 1 0 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The 16 most-significant bits of the src1 operand are multiplied by the 16 least-significant
bits of the src2 operand. The result is left shifted by 1 and placed in dst. If the left-shifted
result is 8000 0000h, then the result is saturated to 7FFF FFFFh. If a saturation occurs,
the SAT bit in CSR is set one cycle after dst is written.
Execution
if (cond) {
if (((msb16(src1) × lsb16(src2)) << 1) != 8000 0000h),
((msb16(src1) × lsb16(src2)) << 1) → dst
else 7FFF FFFFh → dst
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
SMPYLH Multiply Signed 16 LSB × Signed 16 MSB With Left Shift and Saturation
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 1 0 0 0 0 0 0 s p
3 1 5 5 5 1 1 1
Description The 16 least-significant bits of the src1 operand are multiplied by the 16 most-significant
bits of the src2 operand. The result is left shifted by 1 and placed in dst. If the left-shifted
result is 8000 0000h, then the result is saturated to 7FFF FFFFh. If a saturation occurs,
the SAT bit in CSR is set one cycle after dst is written.
Execution
if (cond) {
if (((lsb16(src1) × msb16(src2)) << 1) != 8000 0000h),
((lsb16(src1) × msb16(src2)) << 1) → dst
else 7FFF FFFFh → dst
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
SMPY2 Multiply Signed by Signed, 16 LSB × 16 LSB and 16 MSB × 16 MSB With Left Shift
and Saturation
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 0 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Performs two 16-bit by 16-bit multiplies between two pairs of signed, packed 16-bit
values, with an additional left-shift and saturate. The values in src1 and src2 are treated
as signed, packed 16-bit quantities. The two 32-bit results are written into a 64-bit
register pair.
The SMPY2 instruction produces two 16 × 16 products. Each product is shifted left by 1.
If the left-shifted result is 8000 0000h, the output value is saturated to 7FFF FFFFh.
The saturated product of the lower halfwords of src1 and src2 is written to the even
destination register, dst_e. The saturated product of the upper halfwords of src1 and
src2 is written to the odd destination register, dst_o.
31 16 15 0
a_hi a_lo ← src1
× ×
SMPY2
63 32 31 0
sat((a_hi × b_hi) << 1) sat((a_lo × b_lo) << 1) ← dst_o:dst_e
NOTE: If either product saturates, the SAT bit is set in CSR one cycle after the
cycle that the result is written to dst_o:dst_e. If neither product saturates,
the SAT bit in CSR remains unaffected.
The SMPY2 instruction helps reduce the number of instructions required to perform two
16-bit by 16-bit saturated multiplies on both the lower and upper halves of two registers.
The following code:
SMPY .M1 A0, A1, A2
SMPYH .M1 A0, A1, A3
may be replaced by:
SMPY2 .M1 A0, A1, A3:A2
Execution
if (cond) {
sat((lsb16(src1) × lsb16(src2)) << 1) → dst_e;
sat((msb16(src1) × msb16(src2)) << 1) → dst_o
}
else nop
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .M
Delay Slots 3
Examples Example 1
SMPY2 .M1 A5,A6,A9:A8
A9:A8 xxxx xxxxh xxxx xxxxh A9:A8 BED5 6150h 0EEA 8C58h
-1,093,312,176 250,252,376
Example 2
SMPY2 .M2 B2, B5, B9:B8
B9:B8 xxxx xxxxh xxxx xxxxh B9:B8 04D5 AB98h 2122 FD02h
81,111,960 555,941,122
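The two products and their placement in dst_o:dst_e can be modeled as follows. This Python sketch is illustrative; the name smpy2 is hypothetical, the SAT bit is not modeled, and the input words in the usage lines are arbitrary values chosen for illustration rather than taken from the examples above.

```python
def _signed16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def _sat_shift(a: int, b: int) -> int:
    """(a x b) << 1, with the single overflowing result 8000 0000h clamped."""
    product = (a * b) << 1
    return 0x7FFF_FFFF if product == 0x8000_0000 else product & 0xFFFF_FFFF


def smpy2(src1: int, src2: int):
    """Illustrative model of SMPY2. Returns (dst_o, dst_e)."""
    dst_e = _sat_shift(_signed16(src1), _signed16(src2))              # lower halfwords
    dst_o = _sat_shift(_signed16(src1 >> 16), _signed16(src2 >> 16))  # upper halfwords
    return dst_o, dst_e


dst_o, dst_e = smpy2(0x80004000, 0x80004000)
print(f"{dst_o:08X} {dst_e:08X}")   # upper product saturates, lower does not
```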
SMPY32 Multiply Signed 32-Bit × Signed 32-Bit Into 64-Bit Result With Left Shift and
Saturation
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 1 1 0 0 1 1 1 0 0 s p
5 5 5 1 1 1
Description Performs a 32-bit by 32-bit multiply. src1 and src2 are signed 32-bit values. The 64-bit
result is shifted left by 1 with saturation, and the 32 most-significant bits of the shifted
value are written to dst.
If the result saturates either on the multiply or the shift, the M1 or M2 bit in SSR and the
SAT bit in CSR are written one cycle after the results are written to dst.
This instruction executes unconditionally and cannot be predicated.
NOTE: When both inputs are 8000 0000h, the shifted result cannot be
represented as a 32-bit signed value. In this case, the saturation value
7FFF FFFFh is written into dst.
Execution
Delay Slots 3
Examples Example 1
SMPY32 .M1 A0,A1,A2
A1 1234 5678h
Example 2
SMPY32 .M1 A0,A1,A2
A1 8000 0000h
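The saturating case described in the note can be reproduced with a short model. The Python below is a sketch under stated assumptions: the name smpy32 is hypothetical, and the returned flag stands in for the SSR and CSR updates, which are not modeled further.

```python
def _signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a two's-complement number."""
    value &= 0xFFFF_FFFF
    return value - (1 << 32) if value & 0x8000_0000 else value


def smpy32(src1: int, src2: int):
    """Illustrative model of SMPY32. Returns (dst, saturated)."""
    shifted = (_signed32(src1) * _signed32(src2)) << 1
    if shifted > 2**63 - 1:              # happens only for 8000 0000h x 8000 0000h
        return 0x7FFF_FFFF, True
    return (shifted >> 32) & 0xFFFF_FFFF, False   # upper 32 bits of the 64-bit value


print(smpy32(0x80000000, 0x80000000))    # saturates
print(smpy32(0x40000000, 0x40000000))    # 0.5 x 0.5 in Q31 gives 0.25 in Q31
```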
SPACK2 Saturate and Pack Two 16 LSBs Into Upper and Lower Register Halves
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 0 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Takes two signed 32-bit quantities in src1 and src2 and saturates them to signed 16-bit
quantities. The signed 16-bit results are then packed into a signed, packed 16-bit format
and written to dst. Specifically, the saturated 16-bit signed value of src1 is written to the
upper halfword of dst, and the saturated 16-bit signed value of src2 is written to the
lower halfword of dst.
Saturation is performed on each input value independently. The input values start as
signed 32-bit quantities, and are saturated to 16-bit quantities according to the following
rules:
• If the value is in the range -2^15 to 2^15 - 1, inclusive, then no saturation is performed
and the value is truncated to 16 bits.
• If the value is greater than 2^15 - 1, then the result is set to 2^15 - 1.
• If the value is less than -2^15, then the result is set to -2^15.
31 16 15 0
00000000 ABCDEFGH IJKLMNOP QRSTUVWX ← src1
SPACK2
31 16 15 0
01111111 11111111 00YZ1234 56789ABC ← dst
The SPACK2 instruction is useful in code that manipulates 16-bit data at 32-bit precision
for its intermediate steps, but that requires the final results to be in a 16-bit
representation. The saturate step ensures that any values outside the signed 16-bit
range are clamped to the high or low end of the range before being truncated to 16 bits.
Execution
if (cond) {
if (src2 > 0000 7FFFh), 7FFFh → lsb16(dst) or
if (src2 < FFFF 8000h), 8000h → lsb16(dst)
else truncate(src2) → lsb16(dst);
if (src1 > 0000 7FFFh), 7FFFh → msb16(dst) or
if (src1 < FFFF 8000h), 8000h → msb16(dst)
else truncate(src1) → msb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SPACK2 .S1 A2,A8,A9
Example 2
SPACK2 .S2 B2,B8,B12
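The clamp-then-pack behavior can be sketched in Python. This is an illustrative model, not TI code; the names spack2 and _sat16 are hypothetical and registers are passed as 32-bit patterns.

```python
def _sat16(value: int) -> int:
    """Saturate a 32-bit two's-complement pattern to a signed 16-bit pattern."""
    value &= 0xFFFF_FFFF
    if value & 0x8000_0000:
        value -= 1 << 32
    clamped = max(-2**15, min(2**15 - 1, value))
    return clamped & 0xFFFF


def spack2(src1: int, src2: int) -> int:
    """Illustrative model of SPACK2: src1 fills the upper halfword, src2 the lower."""
    return (_sat16(src1) << 16) | _sat16(src2)


# src1 = 74565 exceeds 2^15 - 1 and clamps to 7FFFh; src2 = -32767 fits as 8001h.
print(f"{spack2(0x00012345, 0xFFFF8001):08X}")
```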
SPACKU4 Saturate and Pack Four Signed 16-Bit Integers Into Four Unsigned 8-Bit Bytes
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 1 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Takes four signed 16-bit values and saturates them to unsigned 8-bit quantities. The
values in src1 and src2 are treated as signed, packed 16-bit quantities. The results are
written into dst in an unsigned, packed 8-bit format.
Each signed 16-bit quantity in src1 and src2 is saturated to an unsigned 8-bit quantity as
described below. The resulting quantities are then packed into an unsigned, packed 8-bit
format. Specifically, the upper halfword of src1 is used to produce the most-significant
byte of dst. The lower halfword of src1 is used to produce the second most-significant
byte (bits 16 to 23) of dst. The upper halfword of src2 is used to produce the third
most-significant byte (bits 8 to 15) of dst. The lower halfword of src2 is used to produce
the least-significant byte of dst.
Saturation is performed on each signed 16-bit input independently, producing separate
unsigned 8-bit results. For each value, the following tests are applied:
• If the value is in the range 0 to 2^8 - 1, inclusive, then no saturation is performed and
the result is truncated to 8 bits.
• If the value is greater than 2^8 - 1, then the result is set to 2^8 - 1.
• If the value is less than 0, the result is cleared to 0.
31 16 15 0
00000000 ABCDEFGH 00001111 IJKLMNOP ← src1
SPACKU4
31 24 23 16 15 8 7 0
ABCDEFGH 11111111 YZ123456 00000000 ← dst
The SPACKU4 instruction is useful in code that manipulates 8-bit data at 16-bit precision
for its intermediate steps, but that requires the final results to be in an 8-bit
representation. The saturate step ensures that any values outside the unsigned 8-bit
range are clamped to the high or low end of the range before being truncated to 8 bits.
Execution
if (cond) {
if (msb16(src1) > 0000 00FFh), FFh → ubyte3(dst) or
if (msb16(src1) < 0), 0 → ubyte3(dst)
else truncate(msb16(src1)) → ubyte3(dst);
if (lsb16(src1) > 0000 00FFh), FFh → ubyte2(dst) or
if (lsb16(src1) < 0), 0 → ubyte2(dst)
else truncate(lsb16(src1)) → ubyte2(dst);
if (msb16(src2) > 0000 00FFh), FFh → ubyte1(dst) or
if (msb16(src2) < 0), 0 → ubyte1(dst)
else truncate(msb16(src2)) → ubyte1(dst);
if (lsb16(src2) > 0000 00FFh), FFh → ubyte0(dst) or
if (lsb16(src2) < 0), 0 → ubyte0(dst)
else truncate(lsb16(src2)) → ubyte0(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SPACKU4 .S1 A2,A8,A9
Example 2
SPACKU4 .S2 B2,B8,B12
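The byte ordering and the two-sided clamp can be checked with a small model. The following Python sketch is illustrative; the names spacku4 and _satu8 are hypothetical.

```python
def _satu8(halfword: int) -> int:
    """Saturate a signed 16-bit pattern to the unsigned range 0..255."""
    halfword &= 0xFFFF
    if halfword & 0x8000:
        halfword -= 0x10000
    return max(0, min(0xFF, halfword))


def spacku4(src1: int, src2: int) -> int:
    """Illustrative model of SPACKU4: four signed halfwords to four unsigned bytes."""
    return (
        (_satu8(src1 >> 16) << 24)       # upper halfword of src1 -> byte 3
        | (_satu8(src1) << 16)           # lower halfword of src1 -> byte 2
        | (_satu8(src2 >> 16) << 8)      # upper halfword of src2 -> byte 1
        | _satu8(src2)                   # lower halfword of src2 -> byte 0
    )


# 256 clamps to FFh, 128 passes through, -1 clamps to 00h, 255 passes through.
print(f"{spacku4(0x01000080, 0xFFFF00FF):08X}")
```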
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 0 0 0 0 1 0 1 0 0 0 s p
3 1 5 5 1 1 1
Description The single-precision value in src2 is converted to a double-precision value and placed in
dst.
NOTE:
1. If src2 is SNaN, NaN_out is placed in dst and the INVAL and NAN2
bits are set.
2. If src2 is QNaN, NaN_out is placed in dst and the NAN2 bit is set.
3. If src2 is a signed denormalized number, signed 0 is placed in dst
and the INEX and DEN2 bits are set.
4. If src2 is signed infinity, the INFO bit is set.
5. No overflow or underflow can occur.
Execution
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst_l dst_h
Unit in use .S
If dst is used as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP,
MPYDP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 1
A1:A0 xxxx xxxxh xxxx xxxxh A1:A0 4021 3333h 4000 0000h 8.6
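The widening conversion can be reproduced on a host using IEEE 754 bit patterns. The Python sketch below is illustrative only: the name spdp is hypothetical, the status bits are not modeled, NaN inputs are outside its scope, and the flush of denormalized inputs to signed zero is implemented from item 3 of the note above.

```python
import struct


def spdp(src2: int):
    """Illustrative model of SPDP on bit patterns. Returns (dst_h, dst_l)."""
    src2 &= 0xFFFF_FFFF
    exponent = (src2 >> 23) & 0xFF
    mantissa = src2 & 0x7F_FFFF
    if exponent == 0 and mantissa != 0:
        return src2 & 0x8000_0000, 0     # denormalized input gives signed zero
    (value,) = struct.unpack(">f", struct.pack(">I", src2))
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    return bits >> 32, bits & 0xFFFF_FFFF


# 4109 999Ah is the single-precision pattern nearest to 8.6.
dst_h, dst_l = spdp(0x4109999A)
print(f"{dst_h:08X} {dst_l:08X}")
```

The conversion is exact for normalized inputs because every single-precision value is representable in double precision, which is why no overflow or underflow can occur.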
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 0 0 0 1 0 1 0 1 1 0 s p
3 1 5 5 1 1 1
Description The single-precision value in src2 is converted to an integer and placed in dst.
NOTE:
1. If src2 is NaN, the maximum signed integer (7FFF FFFFh or
8000 0000h) is placed in dst and the INVAL bit is set.
2. If src2 is signed infinity or if overflow occurs, the maximum signed
integer (7FFF FFFFh or 8000 0000h) is placed in dst and the INEX
and OVER bits are set. Overflow occurs if src2 is greater than
2^31 − 1 or less than −2^31.
3. If src2 is denormalized, 0000 0000h is placed in dst and the INEX
and DEN2 bits are set.
4. If rounding is performed, the INEX bit is set.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2
Written dst
Unit in use .L
Delay Slots 3
Opcode
31 30 29 28 27 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 fstg/fcyc 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 s p
6 1 1
Description The SPKERNEL instruction is placed in parallel with the last execute packet of the
SPLOOP code body indicating there are no more instructions to load into the loop buffer.
The SPKERNEL instruction also controls at what point in the epilog the execution of
post-SPLOOP instructions begins. This point is specified in terms of stage and cycle
counts, and is derived from the fstg/fcyc field.
The stage and cycle values for both the post-SPLOOP fetch and reload cases are
derived from the fstg/fcyc field. The 6-bit field is interpreted as a function of the ii value
from the associated SPLOOP(D) instruction. The number of bits allocated to stage and
cycle vary according to ii. The value for cycle starts from the least-significant end; the
value for stage starts from the most-significant end, and they grow together. The number
of epilog stages and the number of cycles within those stages are shown in Table 3-28.
The exact bit allocation to stage and cycle is shown in Table 3-29.
The following restrictions apply to the use of the SPKERNEL instruction:
• The SPKERNEL instruction must be the first instruction in the execute packet
containing it.
• The SPKERNEL instruction cannot be placed in the same execute packet as any
instruction that initiates multicycle NOPs. This includes BNOP, CALLP, NOP n
(n > 1), and protected loads (see compact instruction discussion in Section 3.10).
• The SPKERNEL instruction cannot be placed in the execute packet immediately
following an execute packet containing any instruction that initiates multicycle NOPs.
This includes BNOP, CALLP, NOP n (n > 1), and protected loads (see compact
instruction discussion in Section 3.10).
• The SPKERNEL instruction cannot be placed in parallel with DINT or RINT
instructions.
• The SPKERNEL instruction cannot be placed in parallel with SPMASK, SPMASKR,
SPLOOP, SPLOOPD, or SPLOOPW instructions.
• When the SPKERNEL instruction is used with the SPLOOPW instruction, fstg and
fcyc should both be zero.
NOTE: The delay specified by the SPKERNEL fstg/fcyc parameters will not
extend beyond the end of the kernel epilog. If the end of the kernel epilog
is reached prior to the end of the delay specified by fstg/fcyc parameters
due to either an excessively large value specified for the parameters or due
to an early exit from the loop, program fetch will begin immediately and
the value specified by the fstg/fcyc will be ignored.
Syntax SPKERNELR
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 s p
1 1
Description The SPKERNELR instruction is placed in parallel with the last execute packet of the
SPLOOP code body indicating there are no more instructions to load into the loop buffer.
The SPKERNELR instruction also indicates that the execution of both post-SPLOOP
instructions and instructions reloaded from the buffer begin in the first cycle of the epilog.
The following restrictions apply to the use of the SPKERNELR instruction:
• The SPKERNELR instruction must be the first instruction in the execute packet
containing it.
• The SPKERNELR instruction cannot be placed in the same execute packet as any
instruction that initiates multicycle NOPs. This includes BNOP, CALLP, NOP n
(n > 1), and protected loads (see compact instruction discussion in Section 3.10).
• The SPKERNELR instruction cannot be placed in the execute packet immediately
following an execute packet containing any instruction that initiates multicycle NOPs.
This includes BNOP, CALLP, NOP n (n > 1), and protected loads (see compact
instruction discussion in Section 3.10).
• The SPKERNELR instruction cannot be placed in parallel with DINT or RINT
instructions.
• The SPKERNELR instruction cannot be placed in parallel with SPMASK,
SPMASKR, SPLOOP, SPLOOPD, or SPLOOPW instructions.
• The SPKERNELR instruction can only be used when the SPLOOP instruction that
began the SPLOOP buffer operation was predicated.
• The SPKERNELR instruction cannot be paired with an SPLOOPW instruction.
This instruction executes unconditionally and cannot be predicated.
Syntax SPLOOP ii
unit = none
Opcode
31 29 28 27 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z ii - 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 s p
3 1 5 1 1
Description The SPLOOP instruction invokes the loop buffer mechanism. See Chapter 7 for more
details.
When the SPLOOP instruction is predicated, it indicates that the loop is a nested loop
using the SPLOOP reload capability. The decision of whether to reload is determined by
the predicate register selected by the creg and z fields.
The following restrictions apply to the use of the SPLOOP instruction:
• The SPLOOP instruction must be the first instruction in the execute packet containing
it.
• The SPLOOP instruction cannot be placed in the same execute packet as any
instruction that initiates multicycle NOPs. This includes BNOP, CALLP, NOP n
(n > 1), and protected loads (see compact instruction discussion in Section 3.10).
• The SPLOOP instruction cannot be placed in parallel with DINT or RINT instructions.
• The SPLOOP instruction cannot be placed in parallel with SPMASK, SPMASKR,
SPKERNEL, or SPKERNELR instructions.
SPLOOPD Software Pipelined Loop (SPLOOP) Buffer Operation With Delayed Testing
Syntax SPLOOPD ii
unit = none
Opcode
31 29 28 27 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z ii - 1 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 s p
3 1 5 1 1
Description The SPLOOPD instruction invokes the loop buffer mechanism. The testing of the
termination condition is delayed for four cycles. See Chapter 7 for more details.
When the SPLOOPD instruction is predicated, it indicates that the loop is a nested loop
using the SPLOOP reload capability. The decision of whether to reload is determined by
the predicate register selected by the creg and z fields.
The following restrictions apply to the use of the SPLOOPD instruction:
• The SPLOOPD instruction must be the first instruction in the execute packet
containing it.
• The SPLOOPD instruction cannot be placed in the same execute packet as any
instruction that initiates multicycle NOPs. This includes BNOP, CALLP, NOP n
(n > 1), and protected loads (see compact instruction discussion in Section 3.10).
• The SPLOOPD instruction cannot be placed in parallel with DINT or RINT
instructions.
• The SPLOOPD instruction cannot be placed in parallel with SPMASK, SPMASKR,
SPKERNEL, or SPKERNELR instructions.
SPLOOPW Software Pipelined Loop (SPLOOP) Buffer Operation With Delayed Testing and
No Epilog
Syntax SPLOOPW ii
unit = none
Opcode
31 29 28 27 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z ii - 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 s p
3 1 5 1 1
Description The SPLOOPW instruction invokes the loop buffer mechanism. The testing of the
termination condition is delayed for four cycles. See Chapter 7 for more details.
The SPLOOPW instruction is always predicated. The termination condition is the value
of the predicate register selected by the creg and z fields.
The following restrictions apply to the use of the SPLOOPW instruction:
• The SPLOOPW instruction must be the first instruction in the execute packet
containing it.
• The SPLOOPW instruction cannot be placed in the same execute packet as any
instruction that initiates multicycle NOPs. This includes BNOP, NOP n (n > 1), and
protected loads (see compact instruction discussion in Section 3.10).
• The SPLOOPW instruction cannot be placed in parallel with DINT or RINT
instructions.
• The SPLOOPW instruction cannot be placed in parallel with SPMASK, SPMASKR,
SPKERNEL, or SPKERNELR instructions.
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 0 0 M2 M1 D2 D1 S2 S1 L2 L1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 s p
1 1 1 1 1 1 1 1 1 1
Description The SPMASK instruction serves two purposes within the SPLOOP mechanism:
1. The SPMASK instruction inhibits the execution of specified instructions from the
buffer within the current execute packet.
2. The SPMASK instruction inhibits the loading of specified instructions into the buffer during
the loading phase, although the instruction will execute normally.
If the SPLOOP is reloading after returning from an interrupt, the SPMASKed instructions
coming from the buffer execute, but the SPMASKed instructions from program memory
do not execute and are not loaded into the buffer.
An SPMASKed instruction encountered outside of the SPLOOP mechanism shall be
treated as a NOP.
The SPMASK instruction must be the first instruction in the execute packet containing
it.
The SPMASK instruction cannot be placed in parallel with SPLOOP, SPLOOPD,
SPKERNEL, or SPKERNELR instructions.
The SPMASK instruction executes unconditionally and cannot be predicated.
There are two ways to specify which instructions within the current execute packet will
be masked:
1. The functional units of the instruction can be specified as the SPMASK argument.
2. The instruction to be masked can be marked with a caret (^) in the instruction code.
The following three examples are equivalent:
SPMASK D2,L1
|| MV .D2 B0,B1
|| MV .L1 A0,A1
SPMASK D2
|| MV .D2 B0,B1
||^ MV .L1 A0,A1
SPMASK
||^ MV .D2 B0,B1
||^ MV .L1 A0,A1
The following two examples mask two MV instructions, but do not mask the MPY
instruction.
SPMASK D1, D2
|| MV .D1 A0,A1 ;This unit is SPMASKed
|| MV .D2 B0,B1 ;This unit is SPMASKed
|| MPY .L1 A0,B1 ;This unit is Not SPMASKed
SPMASK
||^ MV .D1 A0,A1 ;This unit is SPMASKed
||^ MV .D2 B0,B1 ;This unit is SPMASKed
|| MPY .L1 A0,B1 ;This unit is Not SPMASKed
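As a rough illustration of the unit-mask field, the following Python sketch builds an SPMASK opcode word from a list of unit names. The bit positions are read off the SPMASK opcode map (M2 in bit 25 down to L1 in bit 18, with bits 17 and 16 set). The helper name spmask_word and the choice to leave the s and p bits at 0 are assumptions made for illustration; they are not part of any TI tool.

```python
# Hypothetical helper: bit positions follow the SPMASK opcode map
# (bits 25..18 = M2 M1 D2 D1 S2 S1 L2 L1, bits 17..16 = 1 1).
UNIT_BIT = {"L1": 18, "L2": 19, "S1": 20, "S2": 21,
            "D1": 22, "D2": 23, "M1": 24, "M2": 25}

SPMASK_BASE = 0x00030000  # bits 17 and 16 set; s = p = 0 is an assumption


def spmask_word(units):
    """Return a 32-bit SPMASK word that masks the given functional units."""
    word = SPMASK_BASE
    for unit in units:
        word |= 1 << UNIT_BIT[unit.upper()]
    return word


# SPMASK D2,L1 sets the D2 and L1 mask bits.
print(hex(spmask_word(["D2", "L1"])))
```

The sketch covers only the explicit-argument form. Whether the caret (^) forms encode to the same mask bits is not shown in this section, so that equivalence at the encoding level is assumed rather than confirmed.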
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 0 0 0 M2 M1 D2 D1 S2 S1 L2 L1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 s p
1 1 1 1 1 1 1 1 1 1
Description The SPMASKR instruction serves three purposes within the SPLOOP mechanism.
Similar to the SPMASK instruction:
1. The SPMASKR instruction inhibits the execution of specified instructions from the
buffer within the current execute packet.
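2. The SPMASKR instruction inhibits the loading of specified instructions into the buffer
during the loading phase, although the instructions still execute normally.
In addition to the functionality of the SPMASK instruction:
3. The SPMASKR instruction controls the reload point for nested loops.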
There are two ways to specify which instructions within the current execute packet will
be masked:
1. The functional units of the instruction can be specified as the SPMASKR argument.
2. The instruction to be masked can be marked with a caret (^) in the instruction code.
The following three examples are equivalent:
SPMASKR D2,L1
|| MV .D2 B0,B1
|| MV .L1 A0,A1
SPMASKR D2
|| MV .D2 B0,B1
||^ MV .L1 A0,A1
SPMASKR
||^ MV .D2 B0,B1
||^ MV .L1 A0,A1
The following two examples mask two MV instructions, but do not mask the MPY
instruction. The presence of a caret (^) in the instruction code specifies which
instructions are SPMASKed.
SPMASKR D1,D2
|| MV .D1 A0,A1 ;This unit is SPMASKed
|| MV .D2 B0,B1 ;This unit is SPMASKed
|| MPY .L1 A0,B1 ;This unit is Not SPMASKed
SPMASKR
||^ MV .D1 A0,A1 ;This unit is SPMASKed
||^ MV .D2 B0,B1 ;This unit is SPMASKed
|| MPY .L1 A0,B1 ;This unit is Not SPMASKed
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 0 0 x 0 0 0 1 0 1 1 1 1 0 s p
3 1 5 5 1 1 1
Description The single-precision value in src2 is converted to an integer and placed in dst. This
instruction operates like SPINT except that the rounding modes in the floating-point
adder configuration register (FADCR) are ignored, and round toward zero (truncate) is
always used.
NOTE:
1. If src2 is NaN, the maximum signed integer (7FFF FFFFh or
8000 0000h) is placed in dst and the INVAL bit is set.
2. If src2 is signed infinity or if overflow occurs, the maximum signed
integer (7FFF FFFFh or 8000 0000h) is placed in dst and the INEX
and OVER bits are set. Overflow occurs if src2 is greater than
2^31 − 1 or less than −2^31.
3. If src2 is denormalized, 0000 0000h is placed in dst and INEX and
DEN2 bits are set.
4. If rounding is performed, the INEX bit is set.
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src2
Written dst
Unit in use .L
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
creg z dst src2 src1 x op 1 0 0 0 s p
3 1 5 5 5 1 6 1 1
Description The src2 operand is shifted to the left by the src1 operand. The result is placed in dst.
When a register is used to specify the shift, the 5 least-significant bits specify the shift
amount. Valid values are 0 through 31, and the result of the shift is invalid if the shift
amount is greater than 31. The result of the shift is saturated to 32 bits. If a saturate
occurs, the SAT bit in CSR is set one cycle after dst is written.
NOTE: When a register is used to specify the shift, the 6 least-significant bits
specify the shift amount. Valid values are 0 through 63. If the shift count
value is greater than 32, then the result is saturated to 32 bits when src2
is non-zero.
Execution
if (cond) {
if (bit(31) through bit(31 - src1) of src2 are all 1s or all 0s),
dst = src2 << src1;
else if (src2 > 0), saturate dst to 7FFF FFFFh;
else if (src2 < 0), saturate dst to 8000 0000h
}
else nop
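The Execution pseudocode above can be sketched in Python as follows. This is a minimal model under stated assumptions: it accepts only shift amounts 0 through 31, it ignores predication, and it returns a flag where the hardware would set the SAT bit in CSR. The function name sshl is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sshl(src2, shift):
    """Model of SSHL for shift amounts 0..31. Returns (dst, saturated)."""
    if not 0 <= shift <= 31:
        raise ValueError("this sketch only models shift amounts 0..31")
    src2 &= MASK32
    # Bits 31 down to (31 - shift) of src2, i.e. shift + 1 bits.
    top = src2 >> (31 - shift)
    all_ones = (1 << (shift + 1)) - 1
    if top == 0 or top == all_ones:
        # No significant bits are lost, so the plain shift is exact.
        return (src2 << shift) & MASK32, False
    if src2 & 0x80000000:
        return 0x80000000, True   # negative input saturates low
    return 0x7FFFFFFF, True       # positive input saturates high


print(sshl(0x02E3031C, 2))
print(sshl(0x4719865E, 8))
```

The input values in the two calls are illustrative and are not taken from the examples in this section.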
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .S
Delay Slots 0
Examples Example 1
SSHL .S1 A0,2,A1
CSR 0001 0100h (before instruction); 0001 0100h (1 cycle after); 0001 0100h (2 cycles after). Not saturated.
Example 2
SSHL .S1 A0,A1,A2
CSR 0001 0100h (before instruction); 0001 0100h (1 cycle after); 0001 0300h (2 cycles after). Saturated.
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 1 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Shifts the signed 32-bit value in src2 to the left or right by the number of bits specified by
src1, and places the result in dst.
The src1 argument is treated as a 2s-complement shift value which is automatically
limited to the range -31 to 31. If src1 is positive, src2 is shifted to the left. If src1 is
negative, src2 is shifted to the right by the absolute value of the shift amount, with the
sign-extended shifted value being placed in dst. It should also be noted that when src1 is
negative, the bits shifted right past bit 0 are lost.
Saturation is performed when the value is shifted left under the following conditions:
• If the shifted value is in the range -2^31 to 2^31 - 1, inclusive, then no saturation is
performed, and the result is truncated to 32 bits.
• If the shifted value is greater than 2^31 - 1, then the result is saturated to 2^31 - 1.
• If the shifted value is less than -2^31, then the result is saturated to -2^31.
31 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ← src2
SSHVL
31 0
aaaaaaaa abcdefgh ijklmnop qrstuvwx ← dst
(for src1 = -8)
NOTE: If the shifted value is saturated, then the SAT bit is set in CSR one cycle
after the result is written to dst. If the shifted value is not saturated, then
the SAT bit is unaffected.
Execution
if (cond) {
if (0 <= src1 <= 31), sat(src2 << src1) → dst ;
if (-31 <= src1 < 0), (src2 >> abs(src1)) → dst;
if (src1 > 31), sat(src2 << 31) → dst;
if (src1 < -31), (src2 >> 31) → dst
}
else nop
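A minimal Python sketch of the SSHVL Execution rules above, assuming both operands arrive as 32-bit register values and ignoring predication and the SAT bit. The helper names are hypothetical.

```python
def _signed32(x):
    """Interpret the low 32 bits of x as a 2s-complement value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def _sat32(v):
    """Clamp v to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, v))


def sshvl(src2, src1):
    """Model of SSHVL: positive src1 shifts left with saturation,
    negative src1 shifts right arithmetically."""
    value = _signed32(src2)
    shift = max(-31, min(31, _signed32(src1)))  # limit to -31..31
    if shift >= 0:
        result = _sat32(value << shift)
    else:
        result = value >> -shift  # Python >> on a negative int sign-extends
    return result & 0xFFFFFFFF


# src1 = -8, as in the bit diagram above
print(hex(sshvl(0x12345678, 0xFFFFFFF8)))
```

Clamping src1 before shifting covers the two out-of-range cases in the pseudocode (src1 > 31 and src1 < -31) without separate branches.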
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Examples Example 1
SSHVL .M2 B2, B4, B5
Example 2
SSHVL .M1 A2,A4,A5
Example 3
SSHVL .M2 B12, B24, B25
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 1 0 1 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description Shifts the signed 32-bit value in src2 to the left or right by the number of bits specified by
src1, and places the result in dst.
The src1 argument is treated as a 2s-complement shift value that is automatically limited
to the range -31 to 31. If src1 is positive, src2 is shifted to the right by the value specified
with the sign-extended shifted value being placed in dst. It should also be noted that
when src1 is positive, the bits shifted right past bit 0 are lost. If src1 is negative, src2 is
shifted to the left by the absolute value of the shift amount value and the result is placed
in dst.
Saturation is performed when the value is shifted left under the following conditions:
• If the shifted value is in the range -2^31 to 2^31 - 1, inclusive, then no saturation is
performed, and the result is truncated to 32 bits.
• If the shifted value is greater than 2^31 - 1, then the result is saturated to 2^31 - 1.
• If the shifted value is less than -2^31, then the result is saturated to -2^31.
31 0
abcdefgh ijklmnop qrstuvwx yzABCDEF ← src2
SSHVR
31 0
aaaaaaaa bcdefghi jklmnopq rstuvwxy ← dst
(for src1 = 7)
NOTE: If the shifted value is saturated, then the SAT bit is set in CSR one cycle
after the result is written to dst. If the shifted value is not saturated, then
the SAT bit is unaffected.
Execution
if (cond) {
if (0 <= src1 <= 31), (src2 >> src1) → dst;
if (-31 <= src1 < 0), sat(src2 << abs(src1)) → dst;
if (src1 > 31), (src2 >> 31) → dst;
if (src1 < -31), sat(src2 << 31) → dst
}
else nop
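The SSHVR Execution rules above mirror SSHVL with the shift direction reversed. A minimal Python sketch, assuming 32-bit register values for both operands and ignoring predication and the SAT bit (helper names are hypothetical):

```python
def _signed32(x):
    """Interpret the low 32 bits of x as a 2s-complement value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def _sat32(v):
    """Clamp v to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, v))


def sshvr(src2, src1):
    """Model of SSHVR: positive src1 shifts right arithmetically,
    negative src1 shifts left with saturation."""
    value = _signed32(src2)
    shift = max(-31, min(31, _signed32(src1)))  # limit to -31..31
    if shift >= 0:
        result = value >> shift  # sign-extending right shift
    else:
        result = _sat32(value << -shift)
    return result & 0xFFFFFFFF


# src1 = 7, as in the bit diagram above
print(hex(sshvr(0x80000000, 7)))
```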
Pipeline
Pipeline Stage E1 E2
Read src1, src2
Written dst
Unit in use .M
Delay Slots 1
Examples Example 1
SSHVR .M2 B2,B4,B5
Example 2
SSHVR .M1 A2,A4,A5
Example 3
SSHVR .M2 B12, B24, B25
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Description src2 is subtracted from src1 and is saturated to the result size according to the following
rules:
1. If the result is an int and src1 - src2 > 2^31 - 1, then the result is 2^31 - 1.
2. If the result is an int and src1 - src2 < -2^31, then the result is -2^31.
3. If the result is a long and src1 - src2 > 2^39 - 1, then the result is 2^39 - 1.
4. If the result is a long and src1 - src2 < -2^39, then the result is -2^39.
The result is placed in dst. If a saturate occurs, the SAT bit in CSR is set one cycle after
dst is written.
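The saturation rules can be sketched in Python as below. This is an illustrative model only: the function name ssub and the bits parameter are hypothetical, both operands are treated as signed values of the same width (32 for int results, 40 for long results), and the returned flag stands in for the SAT bit.

```python
def ssub(src1, src2, bits=32):
    """Saturating src1 - src2. Returns (dst, saturated)."""
    mask = (1 << bits) - 1

    def signed(x):
        x &= mask
        return x - (1 << bits) if x >> (bits - 1) else x

    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    diff = signed(src1) - signed(src2)
    clamped = max(lowest, min(highest, diff))
    return clamped & mask, clamped != diff


# Operand values taken from SSUB Examples 1 and 2
print(ssub(0x5A2E51A3, 0x802A3FA2))
print(ssub(0x436771F2, 0x5A2E51A3))
```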
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
SSUB .L2 B1,B2,B3
B1 5A2E 51A3h
B2 802A 3FA2h
B3 7FFF FFFFh
Example 2
SSUB .L1 A0,A1,A2
A0 4367 71F2h
A1 5A2E 51A3h
A2 E939 204Fh
SSUB2 Subtract Two Signed 16-Bit Integers on Upper and Lower Register Halves With
Saturation
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 0 1 0 0 1 1 0 s p
3 1 5 5 5 1 1 1
Description Performs 2s-complement subtraction between signed, packed 16-bit quantities in src1
and src2. The results are placed in a signed, packed 16-bit format into dst.
For each pair of 16-bit quantities in src1 and src2, the difference between the signed
16-bit value from src1 and the signed 16-bit value from src2 is calculated and saturated
to produce a signed 16-bit result. The result is placed in the corresponding position in
dst.
Saturation is performed on each 16-bit result independently. For each difference, the
following tests are applied:
• If the difference is in the range -2^15 to 2^15 - 1, inclusive, then no saturation is
performed and the difference is left unchanged.
• If the difference is greater than 2^15 - 1, then the result is set to 2^15 - 1.
• If the difference is less than -2^15, then the result is set to -2^15.
31 16 15 0
a_hi a_lo ← src1
- -
b_hi b_lo ← src2
SSUB2
= =
31 16 15 0
sat(a_hi - b_hi) sat(a_lo - b_lo) ← dst
Execution
if (cond) {
sat(msb16(src1) - msb16(src2)) → msb16(dst);
sat(lsb16(src1) - lsb16(src2)) → lsb16(dst)
}
else nop
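A minimal Python sketch of the SSUB2 Execution rules above, assuming packed 32-bit register values and ignoring predication. The helper names are hypothetical.

```python
def _signed16(x):
    """Interpret the low 16 bits of x as a 2s-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def _sat16(v):
    """Clamp v to the signed 16-bit range."""
    return max(-0x8000, min(0x7FFF, v))


def ssub2(src1, src2):
    """Model of SSUB2: independent saturating subtraction on each halfword."""
    hi = _sat16(_signed16(src1 >> 16) - _signed16(src2 >> 16))
    lo = _sat16(_signed16(src1) - _signed16(src2))
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


# Upper half saturates high, lower half saturates low (illustrative values)
print(hex(ssub2(0x7FFF8000, 0xFFFF0001)))
```

Each halfword is clamped on its own, so one half can saturate while the other does not.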
Delay Slots 0
Examples Example 1
SSUB2 .L1 A0,A1,A2
A1 FFFF FFFFh
Example 2
SSUB2 .L1 A0,A1,A2
A1 8000 FFFFh
STB Store Byte to Memory With a 5-Bit Unsigned Constant Offset or Register Offset
Syntax
Opcode
31 29 28 27 23 22 18 17 13 12 9 8 7 6 5 4 3 2 1 0
creg z src baseR offsetR/ucst5 mode 0 y 0 1 1 0 1 s p
3 1 5 5 5 4 1 1 1
Description Stores a byte to memory from a general-purpose register (src). Table 3-11 describes the
addressing generator options. The memory address is formed from a base address
register (baseR) and an optional offset that is either a register (offsetR) or a 5-bit
unsigned constant (ucst5).
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
offsetR/ucst5 is scaled by a left-shift of 0 bits. After scaling, offsetR/ucst5 is added to or
subtracted from baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is sent to memory.
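The address calculation described above can be sketched in Python. This is a simplified model that assumes linear addressing only (no circular mode). The mode strings and the function name are hypothetical labels for the six address generator options. The scale argument is the left-shift amount: 0 for STB, and larger values for the wider stores.

```python
def store_address(base, offset, mode, scale=0):
    """Return (address_sent_to_memory, new_baseR) in linear mode.

    mode is one of: '+', '-', '++pre', '--pre', 'post++', 'post--'.
    """
    step = offset << scale
    if mode in ("-", "--pre", "post--"):
        step = -step
    target = (base + step) & 0xFFFFFFFF
    if mode in ("+", "-"):
        return target, base      # offset only: baseR is not modified
    if mode in ("++pre", "--pre"):
        return target, target    # baseR is updated, then used
    if mode in ("post++", "post--"):
        return base, target      # old baseR is used, then updated
    raise ValueError("unknown mode: " + mode)


# STB .D1 A8,*++A4[5] and STB .D1 A8,*A4++[5], assuming A4 = 4020h
print([hex(v) for v in store_address(0x4020, 5, "++pre")])
print([hex(v) for v in store_address(0x4020, 5, "post++")])
```

The base value 4020h is inferred from the memory locations shown in the STB examples and is not stated in this section.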
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
For STB, the 8 LSBs of the src register are stored. src can be in either register file,
regardless of the .D unit or baseR or offsetR used. The s bit determines which file src is
read from: s = 0 indicates src will be in the A register file and s = 1 indicates src will be
in the B register file.
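To make the roles of the y and s bits concrete, the following Python sketch extracts the fields of a 32-bit store opcode word using the bit positions in the opcode map for this form. The function name and the dictionary layout are hypothetical, and the sketch does not interpret the mode field, which is defined in Table 3-11.

```python
def decode_store_fields(word):
    """Split a register/ucst5-offset store opcode into its named fields."""
    y = (word >> 7) & 0x1
    s = (word >> 1) & 0x1
    return {
        "creg": (word >> 29) & 0x7,
        "z": (word >> 28) & 0x1,
        "src": (word >> 23) & 0x1F,
        "baseR": (word >> 18) & 0x1F,
        "offset": (word >> 13) & 0x1F,   # offsetR or ucst5
        "mode": (word >> 9) & 0xF,
        "y": y,
        "s": s,
        "p": word & 0x1,
        "unit": ".D2" if y else ".D1",   # y selects the .D unit and address file
        "src_file": "B" if s else "A",   # s selects the register file of src
    }


# Illustrative word: src = 8, baseR = 4, offset = 5, mode = 1001b, y = 0, s = 0
word = (8 << 23) | (4 << 18) | (5 << 13) | (0b1001 << 9) | (0b01101 << 2)
print(decode_store_fields(word))
```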
Increments and decrements default to 1 and offsets default to zero when no bracketed
register or constant is specified. Stores that do no modification to the baseR can use the
syntax *R. Square brackets, [ ], indicate that the ucst5 offset is left-shifted by 0.
Parentheses, ( ), can be used to set a nonscaled, constant offset. You must type either
brackets or parentheses around the specified offset, if you use the optional offset
parameter.
Execution
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D2
Delay Slots 0
For more information on delay slots for a store, see Chapter 4.
Examples Example 1
STB .D1 A1,*A10
Example 2
STB .D1 A8,*++A4[5]
mem 4024:27h xxxx xxxxh mem 4024:27h xxxx xxxxh mem 4024:27h xxxx 67xxh
Example 3
STB .D1 A8,*A4++[5]
mem 4020:23h xxxx xxxxh mem 4020:23h xxxx xxxxh mem 4020:23h xxxx xx67h
Example 4
STB .D1 A8,*++A4[A12]
mem 4024:27h xxxx xxxxh mem 4024:27h xxxx xxxxh mem 4024:27h xx67 xxxxh
Opcode
31 29 28 27 23 22 8 7 6 5 4 3 2 1 0
creg z src ucst15 y 0 1 1 1 1 s p
3 1 5 15 1 1 1
Description Stores a byte to memory from a general-purpose register (src). The memory address is
formed from a base address register B14 (y = 0) or B15 (y = 1) and an offset, which is a
15-bit unsigned constant (ucst15). The assembler selects this format only when the
constant is larger than five bits in magnitude. This instruction executes only on the .D2
unit.
The offset, ucst15, is scaled by a left-shift of 0 bits. After scaling, ucst15 is added to
baseR. The result of the calculation is the address that is sent to memory. The
addressing arithmetic is always performed in linear mode.
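A short Python sketch of the long-offset address calculation, assuming linear 32-bit arithmetic as stated above. The function name and argument order are hypothetical. The scale argument is 0 for STB and would be 1 or 2 for the halfword and word forms.

```python
def long_offset_address(b14, b15, ucst15, y, scale=0):
    """Address for the ucst15 store form: B14 (y = 0) or B15 (y = 1) plus offset."""
    if not 0 <= ucst15 < (1 << 15):
        raise ValueError("ucst15 must fit in 15 unsigned bits")
    base = b15 if y else b14
    return (base + (ucst15 << scale)) & 0xFFFFFFFF


# Illustrative register contents, not taken from the source
print(hex(long_offset_address(0x1000, 0x2000, 0x40, y=0)))
print(hex(long_offset_address(0x1000, 0x2000, 0x40, y=1, scale=2)))
```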
For STB, the 8 LSBs of the src register are stored. src can be in either register file. The
s bit determines which file src is read from: s = 0 indicates src is in the A register file and
s = 1 indicates src is in the B register file.
Square brackets, [ ], indicate that the ucst15 offset is left-shifted by 0. Parentheses, ( ),
can be used to set a nonscaled, constant offset. You must type either brackets or
parentheses around the specified offset, if you use the optional offset parameter.
Execution
Pipeline
Pipeline Stage E1
Read B14/B15, src
Written
Unit in use .D2
Delay Slots 0
Syntax
Opcode
31 29 28 27 23 22 18 17 13 12 9 8 7 6 5 4 3 2 1 0
creg z src baseR offsetR/ucst5 mode 1 y 1 0 0 0 1 s p
3 1 5 5 5 4 1 1 1
Description Stores a 64-bit quantity to memory from a 64-bit register pair, src. Table 3-11 describes the
addressing generator options. Alignment to a 64-bit boundary is required. The memory
address is formed from a base address register (baseR) and an optional offset that is
either a register (offsetR) or a 5-bit unsigned constant (ucst5). If an offset is not given,
the assembler assigns an offset of zero.
Both offsetR and baseR must be in the same register file, and on the same side, as the
.D unit used. The y bit in the opcode determines the .D unit and register file used: y = 0
selects the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the
.D2 unit and baseR and offsetR from the B register file.
The offsetR/ucst5 is scaled by a left shift of 3 bits. After scaling, offsetR/ucst5 is added
to, or subtracted from, baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is the address to be accessed from memory.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
The src pair can be in either register file, regardless of the .D unit or baseR or offsetR
used. The s bit determines which file src is read from: s = 0 indicates src is in
the A register file and s = 1 indicates src is in the B register file.
Assembler Notes When no bracketed register or constant is specified, the assembler defaults increments
and decrements to 1 and offsets to 0. Stores that do no modification to the baseR can
use the assembler syntax *R. Square brackets, [ ], indicate that the ucst5 offset is
left-shifted by 3 for doubleword stores.
Parentheses, ( ), can be used to tell the assembler that the offset is a non-scaled,
constant offset. The assembler right shifts the constant by 3 bits for doubleword stores
before using it for the ucst5 field. After scaling by the STDW instruction, this results in
the same constant offset as the assembler source if the least-significant three bits are
zeros.
For example, STDW (.unit) src, *+baseR (16) represents an offset of 16 bytes (2
doublewords), and the assembler writes out the instruction with ucst5 = 2. STDW (.unit)
src, *+baseR [16] represents an offset of 16 doublewords, or 128 bytes, and the
assembler writes out the instruction with ucst5 = 16.
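The bracket and parenthesis rules can be sketched as a small Python function. This is an illustration of the arithmetic only, not the assembler's implementation. The function name is hypothetical, and rejecting a byte offset whose low three bits are nonzero is a choice made for this sketch; the text above does not say how the assembler reports that case.

```python
def encode_scaled_offset(value, bracketed, scale=3):
    """Return the ucst5 field for a store whose offset is always scaled.

    bracketed=True  models *+baseR[value]  (value counts doublewords)
    bracketed=False models *+baseR(value)  (value counts bytes)
    """
    if bracketed:
        ucst5 = value
    else:
        if value & ((1 << scale) - 1):
            # Assumption of this sketch: treat an unrepresentable offset as an error.
            raise ValueError("byte offset is not a multiple of the access size")
        ucst5 = value >> scale
    if not 0 <= ucst5 <= 31:
        raise ValueError("offset does not fit in ucst5")
    return ucst5


paren = encode_scaled_offset(16, bracketed=False)    # STDW src,*+baseR(16)
bracket = encode_scaled_offset(16, bracketed=True)   # STDW src,*+baseR[16]
# The instruction rescales ucst5 by a left shift of 3 to get the byte offset.
print(paren, paren << 3)
print(bracket, bracket << 3)
```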
Either brackets or parentheses must be typed around the specified offset if the optional
offset parameter is used. The register pair syntax always places the odd-numbered
register first, a colon, followed by the even-numbered register (that is, A1:A0, B1:B0,
A3:A2, B3:B2, etc.).
Execution
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D
Delay Slots 0
Examples Example 1
STDW .D1 A3:A2,*A0++
A3:A2 A176 3B28h 6041 AD65h A3:A2 A176 3B28h 6041 AD65h
Byte Memory Address 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000
Data Value Before Store 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 00 A1 76 3B 28 60 41 AD 65
Example 2
STDW .D1 A3:A2, *A0++
A3:A2 A176 3B28h 6041 AD65h A3:A2 A176 3B28h 6041 AD65h
Byte Memory Address 100D 100C 100B 100A 1009 1008 1007 1006 1005 1004 1003
Data Value Before Store 00 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 00 A1 76 3B 28 60 41 AD 65 00
Example 3
STDW .D1 A9:A8, *++A4[5]
A9:A8 ABCD EF98h 0123 4567h A9:A8 ABCD EF98h 0123 4567h
Byte Memory Address 4051 4050 404F 404E 404D 404C 404B 404A 4049 4048 4047
Data Value Before Store 00 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 00 AB CD EF 98 01 23 45 67 00
Example 4
STDW .D1 A9:A8, *++A4(16)
A9:A8 ABCD EF98h 0123 4567h A9:A8 ABCD EF98h 0123 4567h
Byte Memory Address 4039 4038 4037 4036 4035 4034 4033 4032 4031 4030 402F
Data Value Before Store 00 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 00 AB CD EF 98 01 23 45 67 00
Example 5
STDW .D1 A9:A8, *++A4[A12]
A9:A8 ABCD EF98h 0123 4567h A9:A8 ABCD EF98h 0123 4567h
Byte Memory Address 4059 4058 4057 4056 4055 4054 4053 4052 4051 4050 404F
Data Value Before Store 00 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 00 AB CD EF 98 01 23 45 67 00
Syntax
Opcode
31 29 28 27 23 22 18 17 13 12 9 8 7 6 5 4 3 2 1 0
creg z src baseR offsetR/ucst5 mode 0 y 1 0 1 0 1 s p
3 1 5 5 5 4 1 1 1
Description Stores a halfword to memory from a general-purpose register (src). Table 3-11 describes
the addressing generator options. The memory address is formed from a base address
register (baseR) and an optional offset that is either a register (offsetR) or a 5-bit
unsigned constant (ucst5).
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
offsetR/ucst5 is scaled by a left-shift of 1 bit. After scaling, offsetR/ucst5 is added to or
subtracted from baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is sent to memory.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
For STH, the 16 LSBs of the src register are stored. src can be in either register file,
regardless of the .D unit or baseR or offsetR used. The s bit determines which file src is
read from: s = 0 indicates src will be in the A register file and s = 1 indicates src will be
in the B register file.
Increments and decrements default to 1 and offsets default to zero when no bracketed
register or constant is specified. Stores that do no modification to the baseR can use the
syntax *R. Square brackets, [ ], indicate that the ucst5 offset is left-shifted by 1.
Parentheses, ( ), can be used to set a nonscaled, constant offset. You must type either
brackets or parentheses around the specified offset, if you use the optional offset
parameter.
Halfword addresses must be aligned on halfword (LSB is 0) boundaries.
Execution
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D2
Delay Slots 0
For more information on delay slots for a store, see Chapter 4.
Examples Example 1
STH .D1 A1,*+A10(4)
Example 2
STH .D1 A1,*A10--[A11]
Opcode
31 29 28 27 23 22 8 7 6 5 4 3 2 1 0
creg z src ucst15 y 1 0 1 1 1 s p
3 1 5 15 1 1 1
Description Stores a halfword to memory from a general-purpose register (src). The memory
address is formed from a base address register B14 (y = 0) or B15 (y = 1) and an offset,
which is a 15-bit unsigned constant (ucst15). The assembler selects this format only
when the constant is larger than five bits in magnitude. This instruction executes only on
the .D2 unit.
The offset, ucst15, is scaled by a left-shift of 1 bit. After scaling, ucst15 is added to
baseR. The result of the calculation is the address that is sent to memory. The
addressing arithmetic is always performed in linear mode.
For STH, the 16 LSBs of the src register are stored. src can be in either register file. The
s bit determines which file src is read from: s = 0 indicates src is in the A register file and
s = 1 indicates src is in the B register file.
Square brackets, [ ], indicate that the ucst15 offset is left-shifted by 1. Parentheses, ( ),
can be used to set a nonscaled, constant offset. You must type either brackets or
parentheses around the specified offset, if you use the optional offset parameter.
Halfword addresses must be aligned on halfword (LSB is 0) boundaries.
Execution
Pipeline
Pipeline Stage E1
Read B14/B15, src
Written
Unit in use .D2
Delay Slots 0
STNDW Store Nonaligned Doubleword to Memory With a 5-Bit Unsigned Constant Offset or
Register Offset
Syntax
Opcode
31 29 28 27 24 23 22 18 17 13 12 9 8 7 6 5 4 3 2 1 0
creg z src sc baseR offsetR/ucst5 mode 1 y 1 1 1 0 1 s p
3 1 4 1 5 5 4 1 1 1
Description Stores a 64-bit quantity to memory from a 64-bit register pair, src. Table 3-11 describes
the addressing generator options. The STNDW instruction may write a 64-bit value to
any byte boundary. Thus alignment to a 64-bit boundary is not required. The memory
address is formed from a base address register (baseR) and an optional offset that is
either a register (offsetR) or a 5-bit unsigned constant (ucst5).
Both offsetR and baseR must be in the same register file and on the same side as the .D
unit used. The y bit in the opcode determines the .D unit and register file used: y = 0
selects the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the
.D2 unit and baseR and offsetR from the B register file.
The STNDW instruction supports both scaled offsets and non-scaled offsets. The sc field
is used to indicate whether the offsetR/ucst5 is scaled or not. If sc is 1 (scaled), the
offsetR/ucst5 is shifted left 3 bits before adding or subtracting from the baseR. If sc is 0
(nonscaled), the offsetR/ucst5 is not shifted before adding to or subtracting from the
baseR. For the preincrement, predecrement, positive offset, and negative offset address
generator options, the result of the calculation is the address to be accessed in memory.
For postincrement or post-decrement addressing, the value of baseR before the addition
or subtraction is the address to be accessed from memory.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
The src pair can be in either register file, regardless of the .D unit or baseR or offsetR
used. The s bit determines which file src is read from: s = 0 indicates src is in
the A register file and s = 1 indicates src is in the B register file.
Assembler Notes When no bracketed register or constant is specified, the assembler defaults increments
and decrements to 1, and offsets to 0. Stores that do not modify baseR can
use the assembler syntax *R. Square brackets, [ ], indicate that the ucst5 offset is
left-shifted by 3 for doubleword stores.
Parentheses, ( ), can be used to indicate to the assembler that the offset is a nonscaled
offset.
For example, STNDW (.unit) src, *+baseR (12) represents an offset of 12 bytes and the
assembler writes out the instruction with ucst5 = 12 and sc = 0.
STNDW (.unit) src, *+baseR [16] represents an offset of 16 doublewords, or 128 bytes,
and the assembler writes out the instruction with ucst5 = 16 and sc = 1.
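Unlike the aligned doubleword store, STNDW carries the scaling choice in the sc bit instead of pre-shifting the constant. A minimal Python sketch of that rule follows. The function names are hypothetical and the sketch covers only constant offsets.

```python
def encode_stndw_offset(value, bracketed):
    """Return (offset_field, sc) for *+baseR[value] or *+baseR(value)."""
    if not 0 <= value <= 31:
        raise ValueError("offset does not fit in the 5-bit field")
    return value, 1 if bracketed else 0


def stndw_byte_offset(offset_field, sc):
    """Byte offset applied by the instruction: shifted left 3 only when sc = 1."""
    return offset_field << 3 if sc else offset_field


for text, bracketed in (("(12)", False), ("[16]", True)):
    value = int(text[1:-1])
    field, sc = encode_stndw_offset(value, bracketed)
    print(text, field, sc, stndw_byte_offset(field, sc))
```

Because the nonscaled form stores the byte count directly, a parenthesized offset such as (12) needs no alignment to the doubleword size.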
Either brackets or parentheses must be typed around the specified offset if the optional
offset parameter is used.
Execution
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D
Delay Slots 0
Examples Example 1
STNDW .D1 A3:A2, *A0++
Byte Memory Address 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000
Data Value Before Store 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 A1 76 3B 28 60 41 AD 65 00
Example 2
STNDW .D1 A3:A2, *A0++
A3:A2 A176 3B28h 6041 AD65h A3:A2 A176 3B28h 6041 AD65h
Byte Memory Address 100B 100A 1009 1008 1007 1006 1005 1004 1003 1002 1001 1000
Data Value Before Store 00 00 00 00 00 00 00 00 00 00 00 00
Data Value After Store 00 A1 76 3B 28 60 41 AD 65 00 00 00
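The byte layout in these examples can be reproduced with a short Python sketch. It assumes little-endian byte order, which is what the tables above show (the least-significant byte, 65h, lands at the lowest address), and it models memory as a plain bytearray. The function name is hypothetical.

```python
def store_nonaligned_dword(mem, address, odd_reg, even_reg):
    """Write the 64-bit value odd_reg:even_reg at any byte address."""
    value = ((odd_reg & 0xFFFFFFFF) << 32) | (even_reg & 0xFFFFFFFF)
    mem[address:address + 8] = value.to_bytes(8, "little")


# Index 0 stands in for byte address 1000h; A3:A2 = A176 3B28h 6041 AD65h.
mem = bytearray(12)
store_nonaligned_dword(mem, 1, 0xA1763B28, 0x6041AD65)  # store at 1001h
print(" ".join(f"{b:02X}" for b in reversed(mem)))  # highest address first
```

Printing the buffer from the highest address down matches the column order of the memory tables in Example 1.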
STNW Store Nonaligned Word to Memory With a 5-Bit Unsigned Constant Offset or
Register Offset
Syntax
Opcode
31 29 28 27 23 22 18 17 13 12 9 8 7 6 5 4 3 2 1 0
creg z src baseR offsetR/ucst5 mode 1 y 1 0 1 0 1 s p
3 1 5 5 5 4 1 1 1
Description Stores a 32-bit quantity to memory from a 32-bit register, src. Table 3-11 describes the
addressing generator options. The STNW instruction may write a 32-bit value to any byte
boundary. Thus alignment to a 32-bit boundary is not required. The memory address is
formed from a base address register (baseR) and an optional offset that is either a
register (offsetR) or a 5-bit unsigned constant (ucst5).
Both offsetR and baseR must be in the same register file, and on the same side, as the
.D unit used. The y bit in the opcode determines the .D unit and register file used: y = 0
selects the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the
.D2 unit and baseR and offsetR from the B register file.
The offsetR/ucst5 is scaled by a left shift of 2 bits. After scaling, offsetR/ucst5 is added
to, or subtracted from, baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is the address to be accessed from memory.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
The src can be in either register file, regardless of the .D unit or baseR or offsetR used.
The s bit determines which file src is read from: s = 0 indicates src is in the A
register file and s = 1 indicates src is in the B register file.
Assembler Notes When no bracketed register or constant is specified, the assembler defaults increments
and decrements to 1 and offsets to 0. Stores that do not modify baseR can
use the assembler syntax *R. Square brackets, [ ], indicate that the ucst5 offset is
left-shifted by 2 for word stores.
Parentheses, ( ), can be used to tell the assembler that the offset is a non-scaled,
constant offset. The assembler right shifts the constant by 2 bits for word stores before
using it for the ucst5 field. After scaling by the STNW instruction, this results in the same
constant offset as the assembler source if the least-significant two bits are zeros.
For example, STNW (.unit) src,*+baseR (12) represents an offset of 12 bytes (3 words),
and the assembler writes out the instruction with ucst5 = 3.
STNW (.unit) src,*+baseR [12] represents an offset of 12 words, or 48 bytes, and the
assembler writes out the instruction with ucst5 = 12.
Either brackets or parentheses must be typed around the specified offset if the optional
offset parameter is used.
Execution
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D
Delay Slots 0
Examples Example 1
STNW .D1 A3, *A0++
Byte Memory Address 1007 1006 1005 1004 1003 1002 1001 1000
Data Value Before Store 00 00 00 00 00 00 00 00
Data Value After Store 00 00 00 A1 76 3B 28 00
Example 2
STNW .D1 A3, *A0++
Byte Memory Address 1007 1006 1005 1004 1003 1002 1001 1000
Data Value Before Store 00 00 00 00 00 00 00 00
Data Value After Store 00 A1 76 3B 28 00 00 00
STW Store Word to Memory With a 5-Bit Unsigned Constant Offset or Register Offset
Syntax
Opcode
31 29 28 27 23 22 18 17 13 12 9 8 7 6 5 4 3 2 1 0
creg z src baseR offsetR/ucst5 mode 0 y 1 1 1 0 1 s p
3 1 5 5 5 4 1 1 1
Description Stores a word to memory from a general-purpose register (src). Table 3-11 describes the
addressing generator options. The memory address is formed from a base address
register (baseR) and an optional offset that is either a register (offsetR) or a 5-bit
unsigned constant (ucst5).
offsetR and baseR must be in the same register file and on the same side as the .D unit
used. The y bit in the opcode determines the .D unit and register file used: y = 0 selects
the .D1 unit and baseR and offsetR from the A register file, and y = 1 selects the .D2 unit
and baseR and offsetR from the B register file.
offsetR/ucst5 is scaled by a left-shift of 2 bits. After scaling, offsetR/ucst5 is added to or
subtracted from baseR. For the preincrement, predecrement, positive offset, and
negative offset address generator options, the result of the calculation is the address to
be accessed in memory. For postincrement or postdecrement addressing, the value of
baseR before the addition or subtraction is sent to memory.
The addressing arithmetic that performs the additions and subtractions defaults to linear
mode. However, for A4-A7 and for B4-B7, the mode can be changed to circular mode by
writing the appropriate value to the AMR (see Section 2.8.3).
For STW, all 32 bits of the src register are stored. src can be in either register
file, regardless of the .D unit or baseR or offsetR used. The s bit determines which file
src is read from: s = 0 indicates src will be in the A register file and s = 1 indicates src
will be in the B register file.
Increments and decrements default to 1 and offsets default to zero when no bracketed
register or constant is specified. Stores that do no modification to the baseR can use the
syntax *R. Square brackets, [ ], indicate that the ucst5 offset is left-shifted by 2.
Parentheses, ( ), can be used to set a nonscaled, constant offset. For example,
STW (.unit) src, *+baseR(12) represents an offset of 12 bytes; whereas,
STW (.unit) src, *+baseR[12] represents an offset of 12 words, or 48 bytes. You must
type either brackets or parentheses around the specified offset, if you use the optional
offset parameter.
Word addresses must be aligned on word (two LSBs are 0) boundaries.
Execution
Pipeline
Pipeline Stage E1
Read baseR, offsetR, src
Written baseR
Unit in use .D2
Delay Slots 0
For more information on delay slots for a store, see Chapter 4.
Examples Example 1
STW .D1 A1,*++A10[1]
mem 100h 1111 1134h mem 100h 1111 1134h mem 100h 1111 1134h
mem 104h 0000 1111h mem 104h 0000 1111h mem 104h 9A32 7634h
Example 2
STW .D1 A8,*++A4[5]
mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh
mem 4034h xxxx xxxxh mem 4034h xxxx xxxxh mem 4034h 0123 4567h
Example 3
STW .D1 A8,*++A4(8)
mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh
mem 4028h xxxx xxxxh mem 4028h xxxx xxxxh mem 4028h 0123 4567h
Example 4
STW .D1 A8,*++A4[A12]
mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh mem 4020h xxxx xxxxh
mem 4038h xxxx xxxxh mem 4038h xxxx xxxxh mem 4038h 0123 4567h
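The address generation described for this form of STW can be sketched in C. This is an illustrative model only: it assumes linear addressing (circular mode through the AMR is not modeled), and the names stw_mode, stw_result, and stw_address are hypothetical, not part of any TI tool or library. The offset argument is the bracketed value, that is, a word count; a parenthesized byte offset corresponds to that byte count divided by 4.

```c
#include <stdint.h>

/* Hypothetical model of the STW address generator options (linear mode only). */
typedef enum {
    PRE_INC, PRE_DEC, POS_OFFSET, NEG_OFFSET, POST_INC, POST_DEC
} stw_mode;

typedef struct {
    uint32_t address;   /* address sent to memory */
    uint32_t base_out;  /* value of baseR after the instruction */
} stw_result;

/* offset is the bracketed (word) offset: offsetR or ucst5 before scaling. */
static stw_result stw_address(uint32_t base, uint32_t offset, stw_mode mode)
{
    uint32_t scaled = offset << 2;  /* scale words to bytes */
    int subtract = (mode == PRE_DEC || mode == NEG_OFFSET || mode == POST_DEC);
    uint32_t sum = subtract ? base - scaled : base + scaled;
    stw_result r;

    /* Post-modify options access memory at the unmodified base. */
    r.address = (mode == POST_INC || mode == POST_DEC) ? base : sum;
    /* Plain positive/negative offset options leave baseR unchanged. */
    r.base_out = (mode == POS_OFFSET || mode == NEG_OFFSET) ? base : sum;
    return r;
}
```

Assuming A4 holds 4020h, as the memory dumps in Examples 2 through 4 suggest, stw_address(0x4020, 5, PRE_INC) gives an address of 4034h and writes 4034h back to the base, which agrees with Example 2.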
Opcode
31 29 28 27 23 22 8 7 6 5 4 3 2 1 0
creg z src ucst15 y 1 1 1 1 1 s p
3 1 5 15 1 1 1
Description Stores a word to memory from a general-purpose register (src). The memory address is
formed from a base address register B14 (y = 0) or B15 (y = 1) and an offset, which is a
15-bit unsigned constant (ucst15). The assembler selects this format only when the
constant is larger than five bits in magnitude. This instruction executes only on the .D2
unit.
The offset, ucst15, is scaled by a left-shift of 2 bits. After scaling, ucst15 is added to
baseR. The result of the calculation is the address that is sent to memory. The
addressing arithmetic is always performed in linear mode.
For STW, all 32 bits of the src register are stored. src can be in either register
file. The s bit determines which file src is read from: s = 0 indicates src is in the A
register file and s = 1 indicates src is in the B register file.
Square brackets, [ ], indicate that the ucst15 offset is left-shifted by 2. Parentheses, ( ),
can be used to set a nonscaled, constant offset. For example,
STW (.unit) src, *+B14/B15(60) represents an offset of 60 bytes; whereas,
STW (.unit) src, *+B14/B15[60] represents an offset of 60 words, or 240 bytes. You must
type either brackets or parentheses around the specified offset, if you use the optional
offset parameter.
Word addresses must be aligned on word (two LSBs are 0) boundaries.
Execution
Pipeline
Pipeline Stage E1
Read B14/B15, src
Written
Unit in use .D2
Delay Slots 0
NOTE: Subtraction with a signed constant on the .L and .S units allows either
the first or the second operand to be the signed 5-bit constant.
SUB (.unit) src1, scst5, dst is encoded as ADD (.unit) −scst5, src2, dst
where the src1 register is now src2 and scst5 is now −scst5.
The .D unit, when the cross path form is not used, provides only the
second operand as a constant since it is an unsigned 5-bit constant.
ucst5 allows a greater offset for addressing with the .D unit.
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
creg z dst src2 src1 x op 1 0 0 0 s p
3 1 5 5 5 1 6 1 1
src2 - src1:
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 1 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description for .L1, .L2 and .S1, .S2 Opcodes src2 is subtracted from src1. The result is placed in dst.
31 29 28 27 23 22 18 17 13 12 7 6 5 4 3 2 1 0
creg z dst src2 src1 op 1 0 0 0 0 s p
3 1 5 5 5 6 1 1
Execution
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 1 0 0 1 1 0 0 s p
3 1 5 5 5 1 1 1
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, or .D
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 7 6 5 4 3 2 1 0
creg z dst src2 src1 op 1 0 0 0 0 s p
3 1 5 5 5 6 1 1
Description src1 is subtracted from src2 using the byte addressing mode specified for src2. The
subtraction defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode
can be changed to circular mode by writing the appropriate value to the AMR (see
Section 2.8.3). The result is placed in dst.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Delay Slots 0
SUBABS4 Subtract With Absolute Value, Four 8-Bit Pairs for Four 8-Bit Results
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 1 1 0 1 0 1 1 0 s p
3 1 5 5 5 1 1 1
Description Calculates the absolute value of the differences between the packed 8-bit data contained
in the source registers. The values in src1 and src2 are treated as unsigned, packed
8-bit quantities. The result is written into dst in an unsigned, packed 8-bit format.
For each pair of unsigned 8-bit values in src1 and src2, the absolute value of the
difference is calculated. This result is then placed in the corresponding position in dst.
• The absolute value of the difference between src1 byte0 and src2 byte0 is placed in
byte0 of dst.
• The absolute value of the difference between src1 byte1 and src2 byte1 is placed in
byte1 of dst.
• The absolute value of the difference between src1 byte2 and src2 byte2 is placed in
byte2 of dst.
• The absolute value of the difference between src1 byte3 and src2 byte3 is placed in
byte3 of dst.
The SUBABS4 instruction aids in motion-estimation algorithms, and other algorithms,
that compute the "best match" between two sets of 8-bit quantities.
31 24 23 16 15 8 7 0
ua_3 ua_2 ua_1 ua_0 ← src1
- - - -
SUBABS4
= = = =
31 24 23 16 15 8 7 0
abs(ua_3 - ub_3) abs(ua_2 - ub_2) abs(ua_1 - ub_1) abs(ua_0 - ub_0) ← dst
Execution
if (cond) {
abs(ubyte0(src1) - ubyte0(src2)) → ubyte0(dst);
abs(ubyte1(src1) - ubyte1(src2)) → ubyte1(dst);
abs(ubyte2(src1) - ubyte2(src2)) → ubyte2(dst);
abs(ubyte3(src1) - ubyte3(src2)) → ubyte3(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
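The SUBABS4 execution described above can be modeled in C as a per-byte loop. This is a minimal sketch for checking results on a host, not a statement of how the hardware is built; subabs4 and sad4 are hypothetical helper names. The sad4 helper adds the four byte results, which is the sum-of-absolute-differences step that a "best match" search would typically accumulate.

```c
#include <stdint.h>

/* Model of SUBABS4: absolute difference of four unsigned byte pairs. */
static uint32_t subabs4(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        int a = (int)((src1 >> (8 * lane)) & 0xFF);
        int b = (int)((src2 >> (8 * lane)) & 0xFF);
        int diff = (a > b) ? (a - b) : (b - a);
        dst |= (uint32_t)diff << (8 * lane);
    }
    return dst;
}

/* Sum of the four byte results (hypothetical helper for block matching). */
static unsigned sad4(uint32_t src1, uint32_t src2)
{
    uint32_t d = subabs4(src1, src2);
    return (d & 0xFF) + ((d >> 8) & 0xFF) + ((d >> 16) & 0xFF) + (d >> 24);
}
```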
Opcode
31 29 28 27 23 22 18 17 13 12 7 6 5 4 3 2 1 0
creg z dst src2 src1 op 1 0 0 0 0 s p
3 1 5 5 5 6 1 1
Description src1 is subtracted from src2 using the halfword addressing mode specified for src2. The
subtraction defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode
can be changed to circular mode by writing the appropriate value to the AMR (see
Section 2.8.3). src1 is left shifted by 1. The result is placed in dst.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 7 6 5 4 3 2 1 0
creg z dst src2 src1 op 1 0 0 0 0 s p
3 1 5 5 5 6 1 1
Description src1 is subtracted from src2 using the word addressing mode specified for src2. The
subtraction defaults to linear mode. However, if src2 is one of A4-A7 or B4-B7, the mode
can be changed to circular mode by writing the appropriate value to the AMR (see
Section 2.8.3). src1 is left shifted by 2. The result is placed in dst.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .D
Delay Slots 0
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 1 0 1 1 1 1 0 s p
3 1 5 5 5 1 1 1
Description Subtract src2 from src1. If result is greater than or equal to 0, left shift result by 1, add 1
to it, and place it in dst. If result is less than 0, left shift src1 by 1, and place it in dst. This
step is commonly used in division.
Execution
if (cond) {
if (src1 - src2 ≥ 0), ((src1 - src2) << 1) + 1 → dst
else (src1 << 1) → dst
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
Examples Example 1
SUBC .L1 A0,A1,A0
Example 2
SUBC .L1 A0,A1,A0
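To illustrate why this step is useful in division, the following C sketch models one SUBC step and repeats it 16 times to divide two 16-bit unsigned values. It is a sketch under stated assumptions: the comparison is treated as unsigned, the divisor is nonzero, and the function names subc and udiv16 are hypothetical. The divisor is pre-shifted left by 15 so that, after 16 steps, the quotient sits in the lower halfword of the accumulator and the remainder in the upper halfword.

```c
#include <stdint.h>

/* One conditional-subtract step; assumes an unsigned comparison. */
static uint32_t subc(uint32_t src1, uint32_t src2)
{
    return (src1 >= src2) ? (((src1 - src2) << 1) + 1) : (src1 << 1);
}

/* 16-bit unsigned divide built from 16 SUBC steps (den must be nonzero).
 * Returns remainder in bits 31-16 and quotient in bits 15-0. */
static uint32_t udiv16(uint16_t num, uint16_t den)
{
    uint32_t acc = num;
    uint32_t shifted_den = (uint32_t)den << 15;  /* align divisor */
    for (int i = 0; i < 16; i++)
        acc = subc(acc, shifted_den);
    return acc;
}
```

For example, udiv16(13, 3) returns 0001 0004h: a quotient of 4 with a remainder of 1.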
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
NOTE: The assembly syntax allows a cross-path operand to be used for either
src1 or src2. The assembler selects between the two opcodes based on
which source operand in the assembly instruction requires the cross path.
If src1 requires the cross path, the assembler chooses the second
(reverse) form of the instruction syntax and reverses the order of the
operands in the encoded instruction.
NOTE:
1. This instruction takes the rounding mode from and sets the warning
bits in the floating-point adder configuration register (FADCR), not
the floating-point auxiliary configuration register (FAUCR) as for
other .S unit instructions.
2. The source-specific warning bits set in FADCR are set according to
the register sources in the actual machine instruction and not
according to the order of the sources in the assembly form.
3. If rounding is performed, the INEX bit is set.
4. If one source is SNaN or QNaN, the result is NaN_out. If either
source is SNaN, the INVAL bit is set also.
5. If both sources are +infinity or −infinity, the result is NaN_out and the
INVAL bit is set.
6. If one source is signed infinity and the other source is anything
except NaN or signed infinity of the same sign, the result is signed
infinity and the INFO bit is set.
7. If overflow occurs, the INEX and OVER bits are set and the results
are set as follows (LFPN is the largest floating-point number):
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4 E5 E6 E7
Read src1_l, src1_h,
src2_l src2_h
Written dst_l dst_h
Unit in use .L or .S .L or .S
The low half of the result is written out one cycle earlier than the high half. If dst is used
as the source for the ADDDP, CMPEQDP, CMPLTDP, CMPGTDP, MPYDP, MPYSPDP,
MPYSP2DP, or SUBDP instruction, the number of delay slots can be reduced by one,
because these instructions read the lower word of the DP source one cycle before the
upper word of the DP source.
Delay Slots 6
B1:B0 4021 3333h 3333 3333h B1:B0 4021 3333h 3333 3333h 8.6
A3:A2 C004 0000h 0000 0000h A3:A2 C004 0000h 0000 0000h -2.5
A5:A4 xxxx xxxxh xxxx xxxxh A5:A4 4026 3333h 3333 3333h 11.1
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
NOTE: The assembly syntax allows a cross-path operand to be used for either
src1 or src2. The assembler selects between the two opcodes based on
which source operand in the assembly instruction requires the cross path.
If src1 requires the cross path, the assembler chooses the second
(reverse) form of the instruction syntax and reverses the order of the
operands in the encoded instruction.
NOTE:
1. This instruction takes the rounding mode from and sets the warning
bits in the floating-point adder configuration register (FADCR), not
the floating-point auxiliary configuration register (FAUCR) as for
other .S unit instructions.
2. The source-specific warning bits set in FADCR are set according to
the register sources in the actual machine instruction and not
according to the order of the sources in the assembly form.
3. If rounding is performed, the INEX bit is set.
4. If one source is SNaN or QNaN, the result is NaN_out. If either
source is SNaN, the INVAL bit is set also.
5. If both sources are +infinity or −infinity, the result is NaN_out and the
INVAL bit is set.
6. If one source is signed infinity and the other source is anything
except NaN or signed infinity of the same sign, the result is signed
infinity and the INFO bit is set.
7. If overflow occurs, the INEX and OVER bits are set and the results
are set as follows (LFPN is the largest floating-point number):
Execution
Pipeline
Pipeline Stage E1 E2 E3 E4
Read src1, src2
Written dst
Unit in use .L or .S
Delay Slots 3
Opcode
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
A5:A4 xxxx xxxxh xxxx xxxxh A5:A4 0000 00FFh 0000 3348h -4,294,954,168 (2)
(1)
Unsigned 32-bit integer
(2)
Signed 40-bit (long) integer
SUB2 Subtract Two 16-Bit Integers on Upper and Lower Register Halves
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 0 0 1 0 0 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 0 0 1 1 0 0 0 s p
3 1 5 5 5 1 1 1
Opcode .D unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 0 1 0 1 1 1 0 0 s p
3 1 5 5 5 1 1 1
Description The upper and lower halves of src2 are subtracted from the upper and lower halves of
src1 and the result is placed in dst. Any borrow from the lower-half subtraction does not
affect the upper-half subtraction. Specifically, the upper-half of src2 is subtracted from
the upper-half of src1 and placed in the upper-half of dst. The lower-half of src2 is
subtracted from the lower-half of src1 and placed in the lower-half of dst.
31 16 15 0
a_hi a_lo ← src1
- -
SUB2
= =
31 16 15 0
a_hi - b_hi a_lo - b_lo ← dst
NOTE: Unlike the SUB instruction, the argument ordering on the .D unit form of
SUB2 is consistent with the argument ordering for the .L and .S unit forms.
Execution
if (cond) {
(lsb16(src1) - lsb16(src2)) → lsb16(dst);
(msb16(src1) - msb16(src2)) → msb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, .D
Delay Slots 0
Examples Example 1
SUB2 .S1 A3, A4, A5
Example 2
SUB2 .D2 B2, B8, B15
Example 3
SUB2 .S2X B1,A0,B2
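The independence of the two halfword subtractions in SUB2 can be seen in a short C model. This is an illustrative sketch; sub2 is a hypothetical helper name.

```c
#include <stdint.h>

/* Model of SUB2: two independent 16-bit subtractions, no borrow between halves. */
static uint32_t sub2(uint32_t src1, uint32_t src2)
{
    /* Low 16 bits of the full difference equal the low-half difference. */
    uint32_t lo = (src1 - src2) & 0x0000FFFFu;
    /* Upper halves are subtracted on their own, then moved back into place. */
    uint32_t hi = ((src1 >> 16) - (src2 >> 16)) << 16;
    return hi | lo;
}
```

With src1 = 0001 0000h and src2 = 0000 0001h, sub2 returns 0001 FFFFh, whereas an ordinary 32-bit subtraction would return 0000 FFFFh because the borrow would propagate into the upper half.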
SUB4 Subtract Without Saturation, Four 8-Bit Pairs for Four 8-Bit Results
Opcode
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 1 0 0 1 1 0 1 1 0 s p
3 1 5 5 5 1 1 1
Description Performs 2s-complement subtraction between packed 8-bit quantities. The values in src1
and src2 are treated as packed 8-bit data and the results are written into dst in a packed
8-bit format.
For each pair of 8-bit values in src1 and src2, the difference between the 8-bit value from
src1 and the 8-bit value from src2 is calculated to produce an 8-bit result. No saturation
is performed. The result is placed in the corresponding position in dst:
• The difference between src1 byte0 and src2 byte0 is placed in byte0 of dst.
• The difference between src1 byte1 and src2 byte1 is placed in byte1 of dst.
• The difference between src1 byte2 and src2 byte2 is placed in byte2 of dst.
• The difference between src1 byte3 and src2 byte3 is placed in byte3 of dst.
31 24 23 16 15 8 7 0
a_3 a_2 a_1 a_0 ← src1
- - - -
SUB4
= = = =
31 24 23 16 15 8 7 0
a_3 - b_3 a_2 - b_2 a_1 - b_1 a_0 - b_0 ← dst
Execution
if (cond) {
(byte0(src1) - byte0(src2)) → byte0(dst);
(byte1(src1) - byte1(src2)) → byte1(dst);
(byte2(src1) - byte2(src2)) → byte2(dst);
(byte3(src1) - byte3(src2)) → byte3(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L
Delay Slots 0
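As with the other packed operations, SUB4 can be modeled in C with a per-byte loop. This is a sketch for host-side checking only, and sub4 is a hypothetical helper name. Each byte result wraps modulo 256 because no saturation is performed.

```c
#include <stdint.h>

/* Model of SUB4: four independent 8-bit subtractions without saturation. */
static uint32_t sub4(uint32_t src1, uint32_t src2)
{
    uint32_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (src1 >> (8 * lane)) & 0xFF;
        uint32_t b = (src2 >> (8 * lane)) & 0xFF;
        dst |= ((a - b) & 0xFF) << (8 * lane);  /* wrap to 8 bits */
    }
    return dst;
}
```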
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 0 1 1 0 1 1 1 1 0 s p
3 1 5 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 src1 x 0 1 0 0 0 0 1 0 0 0 s p
3 1 5 5 5 1 1 1
Description The SWAP2 pseudo-operation takes the lower halfword from src2 and places it in the
upper halfword of dst, while the upper halfword from src2 is placed in the lower halfword
of dst. The assembler uses the PACKLH2 (.unit) src1, src2, dst instruction to perform
this operation (see PACKLH2).
31 16 15 0
b_hi b_lo ← src2
SWAP2
31 16 15 0
b_lo b_hi ← dst
The SWAP2 instruction can be used in conjunction with the SWAP4 instruction (see
SWAP4) to change the byte ordering (and therefore, the endianness) of 32-bit data.
Execution
if (cond) {
msb16(src2) → lsb16(dst);
lsb16(src2) → msb16(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
SWAP2 .L1 A2,A9
Example 2
SWAP2 .S2 B2,B12
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src 0 0 0 0 1 x 0 0 1 1 0 1 0 1 1 0 s p
3 1 5 5 1 1 1
Description Exchanges pairs of bytes within each halfword of src2, placing the result in dst. The
values in src2 are treated as unsigned, packed 8-bit values.
Specifically the upper byte in the upper halfword is placed in the lower byte in the upper
halfword, while the lower byte of the upper halfword is placed in the upper byte of the
upper halfword. Also the upper byte in the lower halfword is placed in the lower byte of
the lower halfword, while the lower byte in the lower halfword is placed in the upper byte
of the lower halfword.
31 24 23 16 15 8 7 0
ub_3 ub_2 ub_1 ub_0 ← src2
SWAP4
31 24 23 16 15 8 7 0
ub_2 ub_3 ub_0 ub_1 ← dst
By itself, this instruction changes the ordering of bytes within halfwords. This effectively
changes the endianness of 16-bit data packed in 32-bit words. The endianness of full 32-bit
quantities can be changed by using the SWAP4 instruction in conjunction with the
SWAP2 instruction (see SWAP2).
Execution
if (cond) {
ubyte0(src2) → ubyte1(dst);
ubyte1(src2) → ubyte0(dst);
ubyte2(src2) → ubyte3(dst);
ubyte3(src2) → ubyte2(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L
Delay Slots 0
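The way SWAP2 and SWAP4 combine to reverse the byte order of a 32-bit word can be sketched in C. The functions below are illustrative models with hypothetical names (swap2, swap4, byte_reverse32); they only mirror the halfword and byte movements described in the two instruction entries.

```c
#include <stdint.h>

/* Model of SWAP2: exchange the upper and lower halfwords. */
static uint32_t swap2(uint32_t src2)
{
    return (src2 << 16) | (src2 >> 16);
}

/* Model of SWAP4: exchange the two bytes within each halfword. */
static uint32_t swap4(uint32_t src2)
{
    return ((src2 & 0x00FF00FFu) << 8) | ((src2 & 0xFF00FF00u) >> 8);
}

/* Applying both reverses all four bytes (a 32-bit endianness change). */
static uint32_t byte_reverse32(uint32_t x)
{
    return swap4(swap2(x));
}
```

For 1234 5678h, swap2 gives 5678 1234h, swap4 gives 3412 7856h, and the combination gives 7856 3412h. The order in which the two swaps are applied does not affect the result.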
Syntax SWE
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 p
1
Description Causes an internal exception to be taken. It can be used as a mechanism for User mode
programs to request Supervisor mode services. Execution of the SWE instruction results
in an exception being recognized in the E1 pipeline phase containing the SWE
instruction. The SXF bit in EFR is set to 1. The HWE bit in NTSR is cleared to 0. If
exceptions have been globally enabled, this causes an exception to be recognized
before execution of the next execute packet. The address of that next execute packet is
placed in NRP.
Execution
Delay Slots 0
Syntax SWENR
unit = none
Opcode
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 p
1
Description Causes an internal exception to be taken. It is intended for use in systems supporting a
secure operating mode. It can be used as a mechanism for User mode programs to
request Supervisor mode services. It differs from the SWE instruction in four ways:
1. TSR is not copied into NTSR.
2. No return address is placed in NRP (it remains unmodified).
3. The IB bit in TSR is set to 1. This will be observable only in the case where another
exception is recognized simultaneously.
4. A branch to REP (restricted entry point register) is forced in the context switch rather
than the ISTP-based exception (NMI) vector.
This instruction executes unconditionally.
If another exception (internal or external) is recognized simultaneously with the
SWENR-raised exception then the other exception(s) takes priority and normal exception
behavior occurs; that is, NTSR and NRP are used and execution is directed to the NMI
vector.
Execution
Delay Slots 0
UNPKHU4 Unpack 16 MSB Into Two Lower 8-Bit Halfwords of Upper and Lower Register
Halves
Opcode .L unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 1 1 x 0 0 1 1 0 1 0 1 1 0 s p
3 1 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 1 1 x 1 1 1 1 0 0 1 0 0 0 s p
3 1 5 5 1 1 1
Description Moves the two most-significant bytes of src2 into the two low bytes of the two halfwords
of dst.
Specifically the upper byte in the upper halfword is placed in the lower byte in the upper
halfword, while the lower byte of the upper halfword is placed in the lower byte of the
lower halfword. The src2 bytes are zero-extended when unpacked, filling the two high
bytes of the two halfwords of dst with zeros.
31 24 23 16 15 8 7 0
ub_3 ub_2 ub_1 ub_0 ← src2
UNPKHU4
31 24 23 16 15 8 7 0
00000000 ub_3 00000000 ub_2 ← dst
Execution
if (cond) {
ubyte3(src2) → ubyte2(dst);
0 → ubyte3(dst);
ubyte2(src2) → ubyte0(dst);
0 → ubyte1(dst)
}
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
UNPKHU4 .L1 A1,A2
A1 9E 52 6E 30h A1 9E 52 6E 30h
Example 2
UNPKHU4 .L2 B17,B18
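The byte movement performed by UNPKHU4 can be expressed with two shifts and masks. This is a minimal C model for checking expected results; unpkhu4 is a hypothetical helper name.

```c
#include <stdint.h>

/* Model of UNPKHU4: zero-extend the two upper bytes into two halfwords. */
static uint32_t unpkhu4(uint32_t src2)
{
    uint32_t hi = (src2 >> 8) & 0x00FF0000u;   /* byte3 -> byte2 of dst */
    uint32_t lo = (src2 >> 16) & 0x000000FFu;  /* byte2 -> byte0 of dst */
    return hi | lo;
}
```

For the source value 9E52 6E30h used in Example 1, this model gives 009E 0052h.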
UNPKLU4 Unpack 16 LSB Into Two Lower 8-Bit Halfwords of Upper and Lower Register
Halves
Opcode .L unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 1 0 x 0 0 1 1 0 1 0 1 1 0 s p
3 1 5 5 1 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 0 0 0 1 0 x 1 1 1 1 0 0 1 0 0 0 s p
3 1 5 5 1 1 1
Description Moves the two least-significant bytes of src2 into the two low bytes of the two halfwords
of dst.
Specifically, the upper byte in the lower halfword is placed in the lower byte in the upper
halfword, while the lower byte of the lower halfword is kept in the lower byte of the lower
halfword. The src2 bytes are zero-extended when unpacked, filling the two high bytes of
the two halfwords of dst with zeros.
31 24 23 16 15 8 7 0
ub_3 ub_2 ub_1 ub_0 ← src2
UNPKLU4
31 24 23 16 15 8 7 0
00000000 ub_1 00000000 ub_0 ← dst
Execution
if (cond) {
ubyte0(src2) → ubyte0(dst);
0 → ubyte1(dst);
ubyte1(src2) → ubyte2(dst);
0 → ubyte3(dst);
}
else nop
Pipeline
Pipeline Stage E1
Read src2
Written dst
Unit in use .L, .S
Delay Slots 0
Examples Example 1
UNPKLU4 .L1 A1,A2
A1 9E 52 6E 30h A1 9E 52 6E 30h
Example 2
UNPKLU4 .L2 B17,B18
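UNPKLU4 can be modeled in the same way as its upper-half counterpart. The sketch below is illustrative only, and unpklu4 is a hypothetical helper name.

```c
#include <stdint.h>

/* Model of UNPKLU4: zero-extend the two lower bytes into two halfwords. */
static uint32_t unpklu4(uint32_t src2)
{
    uint32_t hi = (src2 << 8) & 0x00FF0000u;  /* byte1 -> byte2 of dst */
    uint32_t lo = src2 & 0x000000FFu;         /* byte0 stays in byte0 */
    return hi | lo;
}
```

For the source value 9E52 6E30h used in Example 1, this model gives 006E 0030h.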
Opcode .L unit
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1 x op 1 1 0 s p
3 1 5 5 5 1 7 1 1
Opcode .S unit
31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
creg z dst src2 src1 x op 1 0 0 0 s p
3 1 5 5 5 1 6 1 1
Opcode .D unit
31 29 28 27 23 22 18 17 13 12 11 10 9 6 5 4 3 2 1 0
creg z dst src2 src1 x 1 0 op 1 1 0 0 s p
3 1 5 5 5 1 4 1 1
Description Performs a bitwise exclusive-OR (XOR) operation between src1 and src2. The result is
placed in dst. The scst5 operands are sign extended to 32 bits.
Execution
Pipeline
Pipeline Stage E1
Read src1, src2
Written dst
Unit in use .L, .S, or .D
Delay Slots 0
Examples Example 1
XOR .S1 A3, A4, A5
Example 2
XOR .D2 B1, 0Dh, B8
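The effect of sign-extending the scst5 operand before the XOR can be sketched in C. This is an illustrative model; sign_extend5 and xor_scst5 are hypothetical names, and the 5-bit field is passed as its raw encoded value.

```c
#include <stdint.h>

/* Sign-extend a raw 5-bit field (bit 4 is the sign bit) to 32 bits. */
static int32_t sign_extend5(uint32_t field)
{
    field &= 0x1Fu;
    return (int32_t)(field ^ 0x10u) - 16;
}

/* XOR of a sign-extended 5-bit constant with a 32-bit register value. */
static uint32_t xor_scst5(uint32_t field, uint32_t reg)
{
    return (uint32_t)sign_extend5(field) ^ reg;
}
```

A constant of 0Dh, as in Example 2, extends to 0000 000Dh and only affects the low bits. A field value of 1Fh extends to FFFF FFFFh, so the XOR inverts every bit of the register operand.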
Opcode
31 30 29 28 27 23 22 18 17 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 dst src2 src1 x 0 1 1 0 1 1 1 1 0 0 s p
5 5 5 1 1 1
Description Performs a Galois field multiply, where src1 is 32 bits and src2 is limited to 9 bits. This
multiply connects all levels of the gmpy4 together and only extends out by 8 bits. The
XORMPY instruction is identical to a GMPY instruction executed with a zero-value
polynomial.
uword xormpy(uword src1, uword src2)
{
    // the multiply is always between GF(2^9) and GF(2^32)
    // so no size information is needed
    // (uword and uint denote unsigned 32-bit integer types)
    uint pp;
    uint mask;
    uint i;
    pp = 0;
    mask = 0x00000100;      // multiply by computing
                            // partial products.
    for ( i=0; i<8; i++ ){
        if ( src2 & mask ) pp ^= src1;
        mask >>= 1;
        pp <<= 1;
    }
    if ( src2 & 0x1 ) pp ^= src1;
    return pp;
}
Execution
GMPY_poly = 0
(lsb9(src2) gmpy uint(src1)) → uint(dst)
Delay Slots 3
A1 0000 0126h
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 1 1 0 0 1 x 0 0 0 0 1 1 1 1 0 0 s p
3 1 5 5 1 1 1
Description Reads the two least-significant bits of src2 and expands them into two halfword masks
written to dst. Bit 1 of src2 is replicated and placed in the upper halfword of dst. Bit 0 of
src2 is replicated and placed in the lower halfword of dst. Bits 2 through 31 of src2 are
ignored.
31 24 23 16 15 8 7 0
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXX10 ← src2
XPND2
31 24 23 16 15 8 7 0
11111111 11111111 00000000 00000000 ← dst
The XPND2 instruction is useful, when combined with the output of the CMPGT2 or
CMPEQ2 instruction, for generating a mask that corresponds to the individual halfword
positions that were compared. That mask may then be used with ANDN, AND, or OR
instructions to perform other operations like compositing. This is an example:
CMPGT2 .S1 A3, A4, A5 ; Compare two registers, both upper
; and lower halves.
XPND2 .M1 A5, A2 ; Expand the compare results into
; two 16-bit masks.
NOP
AND .D1 A2, A7, A8 ; Apply the mask to a value to create result.
Because the XPND2 instruction only examines the two least-significant bits of src2, it is
possible to store a large bit mask in a single 32-bit word and expand it using multiple
SHR and XPND2 instruction pairs. This can be useful for expanding a packed
1-bit-per-pixel bitmap into full 16-bit pixels in imaging applications.
Execution
if (cond) {
XPND2(src2 & 1) → lsb16(dst);
XPND2(src2 & 2) → msb16(dst)
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst
Unit in use .M
Delay Slots 1
Examples Example 1
XPND2 .M1 A1,A2
Example 2
XPND2 .M2 B1,B2
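The mask expansion and the compositing use described for XPND2 can be sketched in C. This is an illustrative model; xpnd2 and select2 are hypothetical names. The select2 helper shows one way a mask could be combined with AND and ANDN style operations to pick each halfword from one of two sources.

```c
#include <stdint.h>

/* Model of XPND2: bit 1 fills the upper halfword, bit 0 the lower halfword. */
static uint32_t xpnd2(uint32_t src2)
{
    uint32_t lo = (src2 & 1u) ? 0x0000FFFFu : 0u;
    uint32_t hi = (src2 & 2u) ? 0xFFFF0000u : 0u;
    return hi | lo;
}

/* Per-halfword select: take a where the compare bit is set, otherwise b. */
static uint32_t select2(uint32_t cmp_bits, uint32_t a, uint32_t b)
{
    uint32_t mask = xpnd2(cmp_bits);
    return (a & mask) | (b & ~mask);
}
```

An input whose two least-significant bits are 10b expands to FFFF 0000h, matching the diagram in the Description.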
Opcode
31 29 28 27 23 22 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
creg z dst src2 1 1 0 0 0 x 0 0 0 0 1 1 1 1 0 0 s p
3 1 5 5 1 1 1
Description Reads the four least-significant bits of src2 and expands them into four byte masks
written to dst. Bit 0 of src2 is replicated and placed in the least-significant byte of dst. Bit
1 of src2 is replicated and placed in second least-significant byte of dst. Bit 2 of src2 is
replicated and placed in second most-significant byte of dst. Bit 3 of src2 is replicated
and placed in most-significant byte of dst. Bits 4 through 31 of src2 are ignored.
31 24 23 16 15 8 7 0
XXXXXXXX XXXXXXXX XXXXXXXX XXXX1001 ← src2
XPND4
31 24 23 16 15 8 7 0
11111111 00000000 00000000 11111111 ← dst
The XPND4 instruction is useful, when combined with the output of the CMPGT4 or
CMPEQ4 instruction, for generating a mask that corresponds to the individual byte
positions that were compared. That mask may then be used with ANDN, AND, or OR
instructions to perform other operations like compositing.
This is an example:
CMPEQ4 .S1 A3, A4, A5 ; Compare two 32-bit registers all four bytes.
XPND4 .M1 A5, A2 ; Expand the compare results into
; four 8-bit masks.
NOP
AND .D1 A2, A7, A8 ; Apply the mask to a value to create result.
Because the XPND4 instruction only examines the four least-significant bits of src2, it is
possible to store a large bit mask in a single 32-bit word and expand it using multiple
SHR and XPND4 instruction pairs. This can be useful for expanding a packed,
1-bit-per-pixel bitmap into full 8-bit pixels in imaging applications.
Execution
if (cond) {
XPND4(src2 & 1) → byte0(dst);
XPND4(src2 & 2) → byte1(dst);
XPND4(src2 & 4) → byte2(dst);
XPND4(src2 & 8) → byte3(dst)
}
else nop
Pipeline
Pipeline Stage E1 E2
Read src2
Written dst
Unit in use .M
Delay Slots 1
Examples Example 1
XPND4 .M1 A1,A2
Example 2
XPND4 .M2 B1,B2
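The bitmap expansion mentioned in the Description, which pairs a shift with XPND4, can be sketched in C. This is an illustrative model under the assumption that pixel 0 is stored in bit 0 of the packed word; xpnd4 and expand_bitmap are hypothetical names.

```c
#include <stdint.h>

/* Model of XPND4: bits 3-0 each fill one byte of the result. */
static uint32_t xpnd4(uint32_t src2)
{
    uint32_t dst = 0;
    for (int bit = 0; bit < 4; bit++) {
        if (src2 & (1u << bit))
            dst |= 0xFFu << (8 * bit);
    }
    return dst;
}

/* Expand 32 one-bit pixels into 32 eight-bit pixels (eight output words). */
static void expand_bitmap(uint32_t bits, uint32_t out[8])
{
    for (int i = 0; i < 8; i++) {
        out[i] = xpnd4(bits);  /* XPND4 step: uses only the low four bits */
        bits >>= 4;            /* SHR step: expose the next four pixels */
    }
}
```

An input whose four least-significant bits are 1001b expands to FF00 00FFh, matching the diagram in the Description.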
Opcode
Description This is a pseudo-operation used to fill the destination register or register pair with 0s.
When the destination is a single register, the assembler uses the MVK instruction to load
it with zeros: MVK (.unit) 0, dst (see MVK).
When the destination is a register pair, the assembler uses the SUB instruction (see
SUB) to subtract a value from itself and store the result in the destination pair.
Execution
if (cond) 0 → dst
else nop
or
Delay Slots 0
Examples Example 1
ZERO .D1 A1
Example 2
ZERO .L1 A1:A0
Pipeline
The DSP pipeline provides flexibility to simplify programming and improve performance. These two factors
provide this flexibility:
1. Control of the pipeline is simplified by eliminating pipeline interlocks.
2. Increased pipelining eliminates traditional architectural bottlenecks in program fetch, data access, and
multiply operations. This provides single-cycle throughput.
This chapter starts with a description of the pipeline flow. Highlights are:
• The pipeline can dispatch eight parallel instructions every cycle.
• Parallel instructions proceed simultaneously through each pipeline phase.
• Serial instructions proceed through the pipeline with a fixed relative phase difference between
instructions.
• Load and store addresses appear on the CPU boundary during the same pipeline phase, eliminating
read-after-write memory conflicts.
All instructions require the same number of pipeline phases for fetch and decode, but require a varying
number of execute phases. This chapter contains a description of the number of execution phases for
each type of instruction.
Finally, this chapter contains performance considerations for the pipeline. These considerations include
the occurrence of fetch packets that contain multiple execute packets, execute packets that contain
multicycle NOPs, and memory considerations for the pipeline. For more information about fully optimizing
a program and taking full advantage of the pipeline, see the TMS320C6000 Programmer's Guide
(SPRU198).
4.1.1 Fetch
The fetch phases of the pipeline are:
• PG: Program address generate
• PS: Program address send
• PW: Program access ready wait
• PR: Program fetch packet receive
The DSP uses a fetch packet (FP) of eight words. All eight of the words proceed through fetch processing
together, through the PG, PS, PW, and PR phases. Figure 4-2(a) shows the fetch phases in sequential
order from left to right. Figure 4-2(b) is a functional diagram of the flow of instructions through the fetch
phases. During the PG phase, the program address is generated in the CPU. In the PS phase, the
program address is sent to memory. In the PW phase, a memory read occurs. Finally, in the PR phase,
the fetch packet is received at the CPU. Figure 4-2(c) shows fetch packets flowing through the phases of
the fetch stage of the pipeline. In Figure 4-2(c), the first fetch packet (in PR) is made up of four execute
packets, and the second and third fetch packets (in PW and PS) contain two execute packets each. The
last fetch packet (in PG) contains a single execute packet of eight instructions.
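Figure 4-2. Fetch phases of the pipeline: (a) PG, PS, PW, PR in sequence; (b) functional diagram of the fetch path between the CPU and memory (256-bit fetch packet); (c) fetch packets in each fetch phase.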
4.1.2 Decode
The decode phases of the pipeline are:
• DP: Instruction dispatch
• DC: Instruction decode
In the DP phase of the pipeline, the fetch packets are split into execute packets. Execute packets consist
of one instruction or from two to eight parallel instructions. During the DP phase, the instructions in an
execute packet are assigned to the appropriate functional units. In the DC phase, the source registers,
destination registers, and associated paths are decoded for the execution of the instructions in the
functional units.
Figure 4-3(a) shows the decode phases in sequential order from left to right. Figure 4-3(b) shows a fetch
packet that contains two execute packets as they are processed through the decode stage of the pipeline.
The last six instructions of the fetch packet (FP) are parallel and form an execute packet (EP). This EP is
in the dispatch phase (DP) of the decode stage. The arrows indicate each instruction's assigned functional
unit for execution during the same cycle. The NOP instruction in the eighth slot of the FP is not dispatched
to a functional unit because there is no execution associated with it.
The first two slots of the fetch packet (shaded below) represent an execute packet of two parallel
instructions that were dispatched on the previous cycle. This execute packet contains two MPY
instructions that are now in decode (DC) one cycle before execution. There are no instructions decoded
for the .L, .S, and .D functional units for the situation illustrated.
Figure 4-3(b). Decode phases: DP holds the execute packet ADD, ADD, STW, STW, ADDK, NOP; DC holds MPYH, MPYH. Functional units, left to right: .L1, .S1, .M1, .D1, .D2, .M2, .S2, .L2.
4.1.3 Execute
The execute portion of the pipeline is subdivided into five phases (E1-E5). Different types of instructions
require different numbers of these phases to complete their execution. These phases of the pipeline play
an important role in understanding the device state at CPU cycle boundaries. The execution of
different types of instructions in the pipeline is described in Section 4.2. Figure 4-4(a) shows the execute
phases of the pipeline in sequential order from left to right. Figure 4-4(b) shows the portion of the
functional block diagram in which execution occurs.
[Figure 4-4(a): execute phases in order: E1 E2 E3 E4 E5.]
[Figure 4-4(b): execute portion of the functional block diagram. Instructions in E1: SADD, B, SMPY, SMPY, STH, SMPYH, SUB, SADD, one per functional unit. Also shown: register files A and B, store paths ST1 and ST2, load paths LD1 and LD2, the two 32-bit data address paths (Data address 1 and Data address 2), and the L1 data cache control.]
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Figure 4-6 shows an example of the pipeline flow of consecutive fetch packets that each contain eight
parallel instructions. In this case, where the pipeline is full, all instructions in a fetch packet are in
parallel, so each fetch packet forms exactly one execute packet. The fetch packets flow in lockstep fashion
through each phase of the pipeline.
For example, examine cycle 7 in Figure 4-6. When the instructions from FP n reach E1, the instructions in
the execute packet from FP n + 1 are being decoded. FP n + 2 is in dispatch, while FPs n + 3, n + 4, n + 5,
and n + 6 are each in one of the four phases of program fetch. See Section 4.4 for additional detail on code
flowing through the pipeline. Table 4-1 summarizes the pipeline phases and what happens in each phase.
Figure 4-6. Pipeline Operation: One Execute Packet per Fetch Packet
               Clock cycle
Fetch packet   1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17
n              PG   PS   PW   PR   DP   DC   E1   E2   E3   E4   E5   E6   E7   E8   E9   E10
n+1                 PG   PS   PW   PR   DP   DC   E1   E2   E3   E4   E5   E6   E7   E8   E9   E10
n+2                      PG   PS   PW   PR   DP   DC   E1   E2   E3   E4   E5   E6   E7   E8   E9
n+3                           PG   PS   PW   PR   DP   DC   E1   E2   E3   E4   E5   E6   E7   E8
n+4                                PG   PS   PW   PR   DP   DC   E1   E2   E3   E4   E5   E6   E7
n+5                                     PG   PS   PW   PR   DP   DC   E1   E2   E3   E4   E5   E6
n+6                                          PG   PS   PW   PR   DP   DC   E1   E2   E3   E4   E5
n+7                                               PG   PS   PW   PR   DP   DC   E1   E2   E3   E4
n+8                                                    PG   PS   PW   PR   DP   DC   E1   E2   E3
n+9                                                         PG   PS   PW   PR   DP   DC   E1   E2
n+10                                                             PG   PS   PW   PR   DP   DC   E1
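The lockstep flow in Figure 4-6 can be modeled in a few lines of Python. This is only an illustrative sketch, not part of any TI tool: the names PHASES and phase_at are hypothetical, and the model assumes one execute packet per fetch packet and no pipeline stalls.

```python
# Pipeline phases in order: four fetch, two decode, up to ten execute.
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC"] + [f"E{i}" for i in range(1, 11)]


def phase_at(cycle, offset):
    """Return the phase of fetch packet n+offset in the given clock cycle.

    Assumes fetch packet n enters PG in cycle 1 and that every fetch packet
    holds exactly one execute packet (no stalls). Returns None when the
    packet is not in the pipeline during that cycle.
    """
    index = cycle - 1 - offset
    if 0 <= index < len(PHASES):
        return PHASES[index]
    return None


if __name__ == "__main__":
    # Reproduce the cycle 7 column of Figure 4-6 for FP n through FP n+6.
    for offset in range(7):
        print(f"FP n+{offset}: {phase_at(7, offset)}")
```

Calling phase_at(7, 0) through phase_at(7, 6) walks down the cycle 7 column of the figure, from E1 for FP n to PG for FP n + 6.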
Figure 4-7 shows a functional block diagram of the pipeline stages. The pipeline operation is based on
CPU cycles. A CPU cycle is the period during which a particular execute packet is in a particular pipeline
phase. CPU cycle boundaries always occur at clock cycle boundaries.
As code flows through the pipeline phases, it is processed by different parts of the DSP. Figure 4-7 shows
a full pipeline with a fetch packet in every phase of fetch. One execute packet of eight instructions is being
dispatched at the same time that a 7-instruction execute packet is in decode. The arrows between DP and
DC correspond to the functional units identified in the code in Example 4-1.
In the DC phase portion of Figure 4-7, one box is empty because a NOP was the eighth instruction in the
fetch packet in DC and no functional unit is needed for a NOP. Finally, Figure 4-7 shows six functional
units processing code during the same cycle of the pipeline.
Registers used by the instructions in E1 are shaded in Figure 4-7. The multiplexers used for the input
operands to the functional units are also shaded in the figure. The bold cross paths are used by the MPY
instructions.
Most DSP instructions are single-cycle instructions, which means they have only one execution phase
(E1). A small number of instructions require more than one execute phase. The types of instructions, each
of which requires a different number of execute phases, are described in Section 4.2.
[Figure 4-7: pipeline functional block diagram. Fetch: 256 bits. Decode: eight 32-bit slots, with STH, STH, SADD, SADD, SMPYH, SMPY, SUB, B in DP. Execute: register files A and B, store paths ST 1 and ST 2, load paths LD 1 and LD 2, data address paths DA 1 and DA 2, and the data cache control.]
LOOP1:
Table 4-2. Execution Stage Length Description for Each Instruction Type - Part A
Execution phases (1) (2), listed by instruction type:
Single Cycle (delay slots: 0; functional unit latency: 1)
    E1   Compute result and write to register
16 × 16 Single Multiply/.M Unit Nonmultiply (delay slots: 1; functional unit latency: 1)
    E1   Read operands and start computations
    E2   Compute result and write to register
Store (delay slots: 0 (4); functional unit latency: 1)
    E1   Compute address
    E2   Send address and data to memory
    E3   Access memory
Multiply Extensions (delay slots: 3; functional unit latency: 1)
    E1   Read operands and start computations
    E4   Write results to register
Load (delay slots: 4 (4); functional unit latency: 1)
    E1   Compute address
    E2   Send address to memory
    E3   Access memory
    E4   Send data back to CPU
    E5   Write data into register
Branch (delay slots: 5 (3); functional unit latency: 1)
    E1   Target code in PG (3)
(1) This table assumes that the condition for each instruction is evaluated as true. If the condition is evaluated as false, the
instruction does not write any results or have any pipeline operation after E1.
(2) NOP is not shown and has no operation in any of the execution phases.
(3) See Section 4.2.6 for more information on branches.
(4) See Section 4.2.3 and Section 4.2.5 for more information on execution and delay slots for stores and loads.
Table 4-3. Execution Stage Length Description for Each Instruction Type - Part B
Execution phases (1) (2), listed by instruction type:
2-Cycle DP (delay slots: 1; functional unit latency: 1)
    E1   Compute the lower results and write to register
    E2   Compute the upper results and write to register
4-Cycle (delay slots: 3; functional unit latency: 1)
    E1   Read sources and start computation
    E2   Continue computation
    E3   Continue computation
    E4   Complete computation and write results to register
INTDP (delay slots: 4; functional unit latency: 1)
    E1   Read sources and start computation
    E2   Continue computation
    E3   Continue computation
    E4   Continue computation and write lower results to register
    E5   Complete computation and write upper results to register
DP Compare (delay slots: 1; functional unit latency: 2)
    E1   Read lower sources and start computation
    E2   Read upper sources, finish computation, and write results to register
(1) This table assumes that the condition for each instruction is evaluated as true. If the condition is evaluated as false, the
instruction does not write any results or have any pipeline operation after E1.
(2) NOP is not shown and has no operation in any of the execution phases.
Table 4-4. Execution Stage Length Description for Each Instruction Type - Part C
Execution phases (1) (2), listed by instruction type:
ADDDP/SUBDP (delay slots: 6; functional unit latency: 2)
    E1      Read lower sources and start computation
    E2      Read upper sources and continue computation
    E3-E5   Continue computation
    E6      Compute the lower results and write to register
    E7      Compute the upper results and write to register
MPYI (delay slots: 8; functional unit latency: 4)
    E1      Read sources and start computation
    E2-E4   Read sources and continue computation
    E5-E8   Continue computation
    E9      Complete computation and write results to register
MPYID (delay slots: 9; functional unit latency: 4)
    E1      Read sources and start computation
    E2-E4   Read sources and continue computation
    E5-E8   Continue computation
    E9      Continue computation and write lower results to register
    E10     Complete computation and write upper results to register
MPYDP (delay slots: 9; functional unit latency: 4)
    E1      Read lower sources and start computation
    E2      Read lower src1 and upper src2 and continue computation
    E3      Read lower src2 and upper src1 and continue computation
    E4      Read upper sources and continue computation
    E5-E8   Continue computation
    E9      Continue computation and write lower results to register
    E10     Complete computation and write upper results to register
(1) This table assumes that the condition for each instruction is evaluated as true. If the condition is evaluated as false, the
instruction does not write any results or have any pipeline operation after E1.
(2) NOP is not shown and has no operation in any of the execution phases.
Table 4-5. Execution Stage Length Description for Each Instruction Type - Part D
Execution phases (1) (2), listed by instruction type:
MPYSPDP (delay slots: 6; functional unit latency: 3)
    E1      Read src1 and lower src2 and start computation
    E2      Read src1 and upper src2 and continue computation
    E3-E5   Continue computation
    E6      Continue computation and write lower results to register
    E7      Complete computation and write upper results to register
MPYSP2DP (delay slots: 4; functional unit latency: 2)
    E1      Read sources and start computation
    E2-E3   Continue computation
    E4      Continue computation and write lower results to register
    E5      Complete computation and write upper results to register
(1) This table assumes that the condition for each instruction is evaluated as true. If the condition is evaluated as false, the
instruction does not write any results or have any pipeline operation after E1.
(2) NOP is not shown and has no operation in any of the execution phases.
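The delay slot and functional unit latency rows of Table 4-2 through Table 4-5 lend themselves to a small lookup. The following Python sketch is illustrative only: the dictionary keys and the function names result_ready_cycle and next_issue_cycle are hypothetical. It assumes that a result with d delay slots can first be read by an instruction entering E1 d + 1 cycles after the producing instruction's E1, and it ignores the additional per-unit constraints described later in this chapter.

```python
# (delay slots, functional unit latency) copied from Tables 4-2 through 4-5.
# The dictionary keys are shortened labels chosen for this sketch.
TIMING = {
    "single-cycle": (0, 1),
    "16x16 multiply": (1, 1),
    "store": (0, 1),
    "multiply extensions": (3, 1),
    "load": (4, 1),
    "branch": (5, 1),
    "2-cycle DP": (1, 1),
    "4-cycle": (3, 1),
    "INTDP": (4, 1),
    "DP compare": (1, 2),
    "ADDDP/SUBDP": (6, 2),
    "MPYI": (8, 4),
    "MPYID": (9, 4),
    "MPYDP": (9, 4),
    "MPYSPDP": (6, 3),
    "MPYSP2DP": (4, 2),
}


def result_ready_cycle(kind, e1_cycle):
    """First cycle in which a later instruction in E1 sees the final result."""
    delay_slots, _latency = TIMING[kind]
    return e1_cycle + delay_slots + 1


def next_issue_cycle(kind, e1_cycle):
    """First cycle in which the same functional unit can start a new instruction."""
    _delay_slots, latency = TIMING[kind]
    return e1_cycle + latency


if __name__ == "__main__":
    for kind in ("single-cycle", "load", "MPYDP"):
        print(kind, result_ready_cycle(kind, 1), next_issue_cycle(kind, 1))
```

For a load whose E1 falls in cycle 1, the sketch reports cycle 6 as the first cycle in which the loaded register can be used, which corresponds to the four delay slots listed for loads.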
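[Figure: single-cycle instruction phases: PG PS PW PR DP DC E1. A .L, .S, .M, or .D functional unit reads its operands and writes results to the register file in E1.]
[Figure: multiply instruction phases: PG PS PW PR DP DC E1 E2 (1 delay slot). The .M functional unit reads its operands in E1 and writes results to the register file in E2.]
[Figure: store instruction phases: PG PS PW PR DP DC E1 E2 E3. The .D functional unit performs address modification in E1, data and address are sent to the memory controller in E2, and memory is accessed in E3.]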
When you perform a load and a store to the same memory location, these rules apply (i = cycle):
• When a load is executed before a store, the old value is loaded and the new value is stored.
i LDW
i + 1 STW
• When a store is executed before a load, the new value is stored and the new value is loaded.
i STW
i + 1 LDW
• When the instructions are executed in parallel, the old value is loaded first and then the new value is
stored, but both occur in the same phase.
i STW
i || LDW
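The three ordering rules above can be captured in a very small model. The Python function below is an illustrative sketch only; its name and the order labels are hypothetical, and it models a single memory location with no other accesses in flight.

```python
def load_store_same_location(old_value, new_value, order):
    """Return (value loaded, value left in memory) for one load and one store.

    order is "load_first" (LDW in cycle i, STW in cycle i + 1),
    "store_first" (STW in cycle i, LDW in cycle i + 1), or
    "parallel" (STW and LDW in the same execute packet).
    """
    if order not in ("load_first", "store_first", "parallel"):
        raise ValueError(f"unknown order: {order}")
    # Only a store issued in an earlier cycle is visible to the load.
    loaded = new_value if order == "store_first" else old_value
    return loaded, new_value


if __name__ == "__main__":
    for order in ("load_first", "store_first", "parallel"):
        print(order, load_store_same_location(0x11, 0x22, order))
```

In all three cases the memory location ends up holding the new value; only the value returned by the load differs.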
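[Figure: multiply extension instruction phases: PG PS PW PR DP DC E1 E2 E3 E4 (3 delay slots). The .M functional unit reads its operands and writes results to the register file in E4.]
[Figure: load instruction phases: PG PS PW PR DP DC E1 E2 E3 E4 E5 (4 delay slots). The .D functional unit performs address modification in E1, the address is sent to the memory controller in E2, memory is accessed in E3, data returns through the memory controller in E4, and data is written to the register file in E5.]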
In the E4 stage of a load, the data is received at the CPU core boundary. Finally, in the E5 phase, the
data is loaded into a register. Because data is not written to the register until E5, load instructions have
four delay slots. Because pointer results are written to the register in E1, there are no delay slots
associated with the address modification.
In the following code, pointer results are written to the A4 register in the first execute phase of the pipeline
and data is written to the A3 register in the fifth execute phase.
LDW .D1 *A4++,A3
Because a store takes three execute phases to write a value to memory and a load takes three execute
phases to read from memory, a load following a store accesses the value placed in memory by that store
in the cycle after the store is completed. This is why the store is considered to have zero delay slots.
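The timing of the LDW example above can be sketched as a list of register writes tagged with the cycle in which they occur. This Python model is an assumption-laden illustration: the function name, the dictionary-based register file and memory, and the 4-byte post-increment step for a word load are choices made for the sketch, and memory is assumed not to change while the load is in flight.

```python
WORD_BYTES = 4  # assumed step for a word load with *reg++ addressing


def ldw_post_increment(memory, regs, base, dest, e1_cycle):
    """Model LDW *base++,dest as a list of (cycle, register, value) writes.

    memory maps byte addresses to 32-bit words and regs maps register names
    to values; neither is modified. e1_cycle is the cycle in which the load
    is in E1.
    """
    address = regs[base]
    return [
        (e1_cycle, base, address + WORD_BYTES),   # pointer result written in E1
        (e1_cycle + 4, dest, memory[address]),    # loaded data written in E5
    ]


if __name__ == "__main__":
    regs = {"A4": 0x100, "A3": 0}
    memory = {0x100: 0x1234}
    for cycle, reg, value in ldw_post_increment(memory, regs, "A4", "A3", 1):
        print(f"cycle {cycle}: {reg} = {value:#x}")
```

The gap of four cycles between the two writes corresponds to the four delay slots of the load; the pointer update itself has none.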
[Figure: branch instruction phases. Branch: PG PS PW PR DP DC E1. Branch target: PG PS PW PR DP DC E1. 5 delay slots.]
[Figure: pipeline functional block diagram. Fetch: 256 bits. Decode: eight 32-bit slots, with SMPYH, SMPY, SADD, SADD, B, MVK in DP and LDW, LDW in DC. Execute (E1): MVK, SMPY, SMPYH, B. Functional units: .L1 .S1 .M1 .D1 .D2 .M2 .S2 .L2.]
[Figures: instruction phase sequences, in the order printed:]
PG PS PW PR DP DC E1 E2 (1 delay slot)
PG PS PW PR DP DC E1 E2 E3 E4 (3 delay slots)
PG PS PW PR DP DC E1 E2 E3 E4 E5 (4 delay slots)
PG PS PW PR DP DC E1 E2 (1 delay slot)
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 (6 delay slots)
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 (8 delay slots)
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 (9 delay slots)
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 (9 delay slots)
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 (6 delay slots)
PG PS PW PR DP DC E1 E2 E3 E4 E5 (4 delay slots)
LEGEND for the instruction constraint tables in this section:
Shaded text (shaded column in the first table) = E1 phase of the single-cycle instruction
R = Sources read for the instruction
W = Destinations written for the instruction
✓ = Next instruction can enter E1 during cycle
Xr = Next instruction cannot enter E1 during cycle: read/decode constraint
Xw = Next instruction cannot enter E1 during cycle: write constraint
Xrw = Next instruction cannot enter E1 during cycle: read/decode/write constraint
Xu = Next instruction cannot enter E1 during cycle: other resource conflict
Table 4-23 shows the instruction constraints for DP compare instructions executing on the .S unit. (Legend symbols: R, W, ✓, Xr, Xrw)
Table 4-24 shows the instruction constraints for 2-cycle DP instructions executing on the .S unit. (Legend symbols: R, W, ✓, Xw)
Table 4-25 shows the instruction constraints for ADDSP/SUBSP instructions executing on the .S unit. (Legend symbols: R, W, ✓, Xw)
Table 4-26 shows the instruction constraints for ADDDP/SUBDP instructions executing on the .S unit. (Legend symbols: R, W, ✓, Xr, Xw)
Table 4-27 shows the instruction constraints for branch instructions executing on the .S unit. (Legend symbols: R, ✓)
Table 4-29 shows the instruction constraints for 4-cycle instructions executing on the .M unit. (Legend symbols: R, W, ✓, Xw)
Table 4-30 shows the instruction constraints for MPYI instructions executing on the .M unit. (Legend symbols: R, W, ✓, Xr, Xw, Xu)
Table 4-31 shows the instruction constraints for MPYID instructions executing on the .M unit. (Legend symbols: R, W, ✓, Xr, Xw, Xu)
Table 4-32 shows the instruction constraints for MPYDP instructions executing on the .M unit. (Legend symbols: R, W, ✓, Xr, Xw, Xu)
Table 4-33 shows the instruction constraints for MPYSP instructions executing on the .M unit. (Legend symbols: R, W, ✓)
Table 4-34 shows the instruction constraints for MPYSPDP instructions executing on the .M unit. (Legend symbols: R, W, ✓, Xr, Xw, Xu)
Table 4-35 shows the instruction constraints for MPYSP2DP instructions executing on the .M unit. (Legend symbols: R, W, ✓, Xr, Xw, Xu)
Table 4-37 shows the instruction constraints for 4-cycle instructions executing on the .L unit. (Legend symbols: R, W, ✓, Xw)
Table 4-38 shows the instruction constraints for INTDP instructions executing on the .L unit. (Legend symbols: R, W, ✓, Xw)
Table 4-39 shows the instruction constraints for ADDDP/SUBDP instructions executing on the .L unit. (Legend symbols: R, W, ✓, Xr, Xw, Xrw)
Table 4-41 shows the instruction constraints for store instructions executing on the .D unit. (Legend symbols: R, W, ✓)
Table 4-42 shows the instruction constraints for single-cycle instructions executing on the .D unit. (Legend symbols: R, W, ✓)
Table 4-43 shows the instruction constraints for LDDW instructions executing on the .D unit. (Legend symbols: R, W, ✓, Xw)
instruction A ; EP k FP n
|| instruction B ;
instruction C ; EP k + 1 FP n
|| instruction D
|| instruction E
instruction F ; EP k + 2 FP n
|| instruction G
|| instruction H
instruction I ; EP k + 3 FP n + 1
|| instruction J
|| instruction K
|| instruction L
|| instruction M
|| instruction N
|| instruction O
|| instruction P
... continuing with EPs k+4 through k+8, which have eight instructions in parallel,
like k+3.
Figure 4-30. Pipeline Operation: Fetch Packets With Different Numbers of Execute Packets
            Clock cycle
FP    EP    1    2    3    4    5    6    7    8    9    10   11   12   13
n     k     PG   PS   PW   PR   DP   DC   E1   E2   E3   E4   E5
n     k+1                            DP   DC   E1   E2   E3   E4   E5
n     k+2                                 DP   DC   E1   E2   E3   E4   E5
n+1   k+3        PG   PS   PW   PR             DP   DC   E1   E2   E3   E4
n+4   k+6                       PG             PS   PW   PR   DP   DC   E1
n+5   k+7                                      PG   PS   PW   PR   DP   DC
n+6   k+8                                           PG   PS   PW   PR   DP
In Figure 4-30, fetch packet n, which contains three execute packets, is shown followed by six fetch
packets (n + 1 through n + 6), each with one execute packet (containing eight parallel instructions). The
first fetch packet (n) goes through the program fetch phases during cycles 1-4. During these cycles, a
program fetch phase is started for each of the fetch packets that follow.
In cycle 5, the program dispatch (DP) phase, the CPU scans the p-bits and detects that there are three
execute packets (k through k + 2) in fetch packet n. This forces the pipeline to stall, which allows the DP
phase to start for execute packets k + 1 and k + 2 in cycles 6 and 7. Once execute packet k + 2 is ready
to move on to the DC phase (cycle 8), the pipeline stall is released.
The fetch packets n + 1 through n + 4 were all stalled so the CPU could have time to perform the DP
phase for each of the three execute packets (k through k + 2) in fetch packet n. Fetch packet n + 5 was
also stalled in cycles 6 and 7: it was not allowed to enter the PG phase until after the pipeline stall was
released in cycle 8. The pipeline continues operation as shown with fetch packets n + 5 and n + 6 until
another fetch packet containing multiple execute packets enters the DP phase, or an interrupt occurs.
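The dispatch stall described above follows from one rule: only one execute packet can be in DP per cycle. The Python sketch below applies that rule to a list of execute-packet counts per fetch packet. It is a simplified illustration with a hypothetical function name; it assumes the first fetch packet enters PG in cycle 1 and ignores memory stalls, branches, and interrupts.

```python
def dispatch_cycles(eps_per_fetch_packet):
    """Return the DP cycle of every execute packet, in program order.

    eps_per_fetch_packet[j] is the number of execute packets in fetch
    packet n + j. An unstalled fetch packet n + j enters PG in cycle j + 1,
    so its first execute packet cannot reach DP before cycle j + 5. Only
    one execute packet occupies DP in any cycle.
    """
    cycles = []
    next_free = 1
    for j, count in enumerate(eps_per_fetch_packet):
        for _ in range(count):
            cycle = max(next_free, j + 5)
            cycles.append(cycle)
            next_free = cycle + 1
    return cycles


if __name__ == "__main__":
    # Fetch packet n holds three execute packets; n+1 through n+6 hold one each.
    print(dispatch_cycles([3, 1, 1, 1, 1, 1, 1]))
```

With the packet counts of Figure 4-30, execute packets k through k + 2 dispatch in cycles 5 through 7 and k + 3 dispatches in cycle 8, which is the cycle in which the stall is released.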
[Figure: multicycle NOP in an execute packet. Panel (b): the execute packet LD ADD MPY NOP 5 occupies cycle i and is followed by cycles i+1 through i+4.]
Figure 4-32 shows how a multicycle NOP can be affected by a branch. If the delay slots of a branch finish
while a multicycle NOP is still dispatching NOPs into the pipeline, the branch overrides the multicycle NOP
and the branch target begins execution five delay slots after the branch was issued.
Cycle   Execute packet               Branch target phase
1       EP1: B ... (in E1)           PG
2       EP2: EP without branch       PS
3       EP3: EP without branch       PW
4       EP4: EP without branch       PR
5       EP5: EP without branch       DP
6       EP6: LD MPY ADD NOP5         DC
7       Branch will execute here     E1
EP7
10
In one case, execute packet 1 (EP1) does not have a branch. The NOP 5 in EP6 forces the CPU to wait
until cycle 11 to execute EP7.
In the other case, EP1 does have a branch. The delay slots of the branch coincide with cycles 2 through
6. Once the target code reaches E1 in cycle 7, it executes.
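The two cases can be checked with a short calculation. The following Python sketch is illustrative only (the function name and parameters are hypothetical). It assumes a branch in E1 in cycle b places its target in E1 in cycle b + 6 (five delay slots), and that an execute packet containing NOP n that executes in cycle c holds off the next sequential execute packet until cycle c + n.

```python
BRANCH_DELAY_SLOTS = 5


def next_packet_cycle(nop_cycle, nop_count, branch_cycle=None):
    """Cycle in which the next code (sequential EP or branch target) is in E1.

    nop_cycle is the cycle in which the execute packet containing the
    multicycle NOP executes; nop_count is the NOP operand. branch_cycle is
    the E1 cycle of an earlier branch, or None if there is no branch.
    """
    after_nop = nop_cycle + nop_count
    if branch_cycle is None:
        return after_nop
    target = branch_cycle + BRANCH_DELAY_SLOTS + 1
    if target <= nop_cycle:
        raise ValueError("branch completes before the multicycle NOP executes")
    # The branch overrides the NOP if its target arrives first.
    return min(after_nop, target)


if __name__ == "__main__":
    print(next_packet_cycle(6, 5))                  # EP1 has no branch
    print(next_packet_cycle(6, 5, branch_cycle=1))  # EP1 contains a branch
```

With the NOP 5 of EP6 executing in cycle 6, the sketch yields cycle 11 for EP7 when EP1 has no branch and cycle 7 for the branch target when EP1 contains a branch.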
To understand the memory accesses, compare data loads and instruction fetches/dispatches. The
comparison is valid because data loads and program fetches operate on internal memories of the same
speed on the DSP and perform the same types of operations (listed in Table 4-44) to accommodate those
memories. Table 4-44 compares the operation of the program fetch pipeline with the operation of a data
load.
Depending on the type of memory and the time required to complete an access, the pipeline may stall to
ensure proper coordination of data and instructions.
A memory stall occurs when memory is not ready to respond to an access from the CPU. This access
occurs during the PW phase for a program memory access and during the E3 phase for a data memory
access. The memory stall causes all of the pipeline phases to lengthen beyond a single clock cycle,
causing execution to take additional clock cycles to finish. The results of the program execution are
identical whether a stall occurs or not. Figure 4-34 illustrates this point.
[Figure 4-34: pipeline phases (PG through E5) of fetch packets n through n+10 over clock cycles 1 through 16, with the phases lengthened where a program memory stall occurs.]
Interrupts
This chapter describes CPU interrupts, including reset and the nonmaskable interrupt (NMI). It details the
related CPU control registers and their functions in controlling interrupts. It also describes interrupt
processing, the method the CPU uses to detect automatically the presence of interrupts and divert
program execution flow to your interrupt service code. Finally, this chapter describes the programming
implications of interrupts.
5.1 Overview
Typically, DSPs work in an environment that contains multiple external asynchronous events. These
events require tasks to be performed by the DSP when they occur. An interrupt is an event that stops the
current process in the CPU so that the CPU can attend to the task needing completion because of the
event. These interrupt sources can be on chip or off chip, such as timers, analog-to-digital converters, or
other peripherals.
Servicing an interrupt involves saving the context of the current process, completing the interrupt task,
restoring the registers and the process context, and resuming the original process. There are eight
registers that control servicing interrupts.
An appropriate transition on an interrupt pin sets the pending status of the interrupt within the interrupt flag
register (IFR). If the interrupt is properly enabled, the CPU begins processing the interrupt and redirecting
program flow to the interrupt service routine.
NOTE: The nonmaskable interrupt (NMI) is not supported on all C6000 devices; see your
device-specific data manual for more information.
The first three types (reset, nonmaskable, and maskable interrupts) are differentiated by their priorities, as shown in Table 5-1. The reset interrupt has
the highest priority and corresponds to the RESET signal. The nonmaskable interrupt (NMI) has the
second highest priority and corresponds to the NMI signal. The lowest priority interrupts are interrupts
4-15 corresponding to the INT4-INT15 signals. RESET, NMI, and some of the INT4-INT15 signals are
mapped to pins on C6000 devices. Some of the INT4-INT15 interrupt signals are used by internal
peripherals and some may be unavailable or can be used under software control. Check your
device-specific datasheet to see your interrupt specifications.
The CPU supports exceptions as another type of interrupt. When exceptions are enabled, the NMI input
behaves as an exception. This chapter does not deal in depth with exceptions; its discussion of NMI as
an interrupt assumes that exceptions are disabled. Chapter 6 discusses exceptions, including NMI
behavior as an exception.
CAUTION
Code Compatibility
The CPU code compatibility with existing code compiled for the CPU using NMI
as an interrupt is only assured when exceptions are not enabled. Any additional
or modified code requiring the use of NMI as an exception to ensure correct
behavior will likely require changes to the pre-existing code to adjust for the
additional functionality added by enabling exceptions.
NMI is the second-highest priority interrupt and is generally used to alert the CPU of a serious hardware
problem such as imminent power failure.
For NMI processing to occur, the nonmaskable interrupt enable (NMIE) bit in the interrupt enable register
(IER) must be set to 1. If NMIE is set to 1, the only condition that can prevent NMI processing is if the NMI
occurs during the delay slots of a branch (whether the branch is taken or not).
NMIE is cleared to 0 at reset to prevent interruption of the reset. It is also cleared at the occurrence of an NMI
to prevent another NMI from being processed. You cannot manually clear NMIE, but you can set NMIE to
allow nested NMIs. While NMIE is cleared, all maskable interrupts (INT4-INT15) are disabled.
On the CPU, if an NMI is recognized within an SPLOOP operation, the behavior is the same as for an NMI
with exceptions enabled. The SPLOOP operation terminates immediately (loop does not wind down as it
does in case of an interrupt). The SPLX bit in the NMI/exception task state register (NTSR) is set for
status purposes. The NMI service routine must examine this bit as one of the factors in deciding whether a
return to the interrupted code is possible. If the SPLX bit in NTSR is set, then a return to the interrupted
code results in incorrect operation. See Section 7.13 for more information.
NOTE: The ISFP should be exactly 8 words long. To prevent the compiler from using compact
instructions (see Section 3.10), the interrupt service table should be preceded by a .nocmp
directive. See the TMS320C6000 Assembly Language Tools User’s Guide (SPRU186).
If the NOP 5 were not in the routine, the CPU would execute the next five execute packets
(some of which are likely to be associated with the next ISFP) because of the delay slots
associated with the B IRP instruction. See Section 4.2.6 for more information.
If the interrupt service routine for an interrupt is too large to fit in a single fetch packet, a branch to the
location of additional interrupt service routine code is required. Figure 5-3 shows that the interrupt service
routine for INT4 was too large for a single fetch packet, and a branch to memory location 1234h is
required to complete the interrupt service routine.
NOTE: The instruction B LOOP branches into the middle of a fetch packet and processes code
starting at address 1234h. The CPU ignores code from address 1220h−1230h, even if it is in
parallel to code at address 1234h.
Figure 5-3. Interrupt Service Table With Branch to Additional Interrupt Service Code
Located Outside the IST
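[Figure 5-3 residue: IST in program memory, with entries at 1248h (Instr14), 1258h, and 125Ch.]
[Figure residue: IST in program memory, with the RESET ISFP at offset 0.]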
;Assume GIE = 1
MVC CSR,B0 ; (1) Get CSR
AND -2,B0,B0 ; (2) Get ready to clear GIE
MVC B0,CSR ; (3) Clear GIE
ADD A0,A1,A2 ; (4)
ADD A3,A4,A5 ; (5)
In Example 5-2, the CPU may service an interrupt between instructions 1 and 2, between instructions 2
and 3, or between instructions 3 and 4. The CPU will not service an interrupt between instructions 4 and
5.
If the CPU services an interrupt between instructions 1 and 2 or between instructions 2 and 3, the PGIE
bit will hold the value 1 when arriving at the interrupt service routine. If the CPU services an interrupt
between instructions 3 and 4, the PGIE bit will hold the value 0. Thus, when the interrupt service routine
resumes the interrupted code, it will resume with GIE set as the interrupted code intended.
On the CPU, programs must directly manipulate the GIE bit in CSR to disable and enable interrupts.
Example 5-3 and Example 5-4 show code examples for disabling and enabling maskable interrupts
globally, respectively.
The CPU handles this process differently, in a manner that is backward compatible with the techniques
that the CPU requires. When it begins processing of a maskable interrupt, the CPU copies TSR to ITSR,
thereby saving the old value of GIE. It then clears TSR.GIE. (ITSR.GIE is physically the same bit as
CSR.PGIE and TSR.GIE is physically the same bit as CSR.GIE.) When returning from an interrupt with
the B IRP instruction, the CPU restores the TSR state by copying ITSR back to TSR.
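The save and restore of GIE around a maskable interrupt can be illustrated with a two-bit model. The Python class below is a sketch under stated assumptions: the class and method names are hypothetical, and only the GIE bit of TSR and its saved copy in ITSR (CSR.PGIE) are modeled, not the remaining TSR fields.

```python
class InterruptState:
    """Models CSR.GIE (TSR.GIE) and CSR.PGIE (ITSR.GIE) only."""

    def __init__(self, gie=1):
        self.gie = gie   # CSR.GIE, physically the same bit as TSR.GIE
        self.pgie = 0    # CSR.PGIE, physically the same bit as ITSR.GIE

    def take_maskable_interrupt(self):
        """Interrupt entry: TSR is copied to ITSR, then TSR.GIE is cleared."""
        self.pgie = self.gie
        self.gie = 0

    def return_from_interrupt(self):
        """B IRP: ITSR is copied back to TSR."""
        self.gie = self.pgie


if __name__ == "__main__":
    state = InterruptState(gie=1)
    state.take_maskable_interrupt()
    print(state.gie, state.pgie)
    state.return_from_interrupt()
    print(state.gie)
```

Because the return path copies the saved bit back, the interrupted code resumes with whatever GIE value was captured at interrupt entry.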
The CPU provides two new instructions that allow for simpler and safer manipulation of the GIE bit.
• The DINT instruction disables interrupts by:
– Copying the value of CSR.GIE (and TSR.GIE) to TSR.SGIE
– Clearing CSR.GIE and TSR.GIE to 0 (disabling interrupts immediately)
The CPU will not service an interrupt between the execute packet containing DINT and the execute
packet that follows it.
• The RINT instruction restores interrupts to the previous state by:
– Copying the value of TSR.SGIE to CSR.GIE (and TSR.GIE)
– Clearing TSR.SGIE to 0
If the SGIE bit in TSR is 1 when RINT executes, interrupts are enabled immediately and the CPU may service an
interrupt in the cycle immediately following the execute packet containing RINT.
Example 5-5 illustrates the use and timing of the DINT instruction in disabling maskable interrupts globally
and Example 5-6 shows how to enable maskable interrupts globally using the complementary RINT
instruction.
;Assume GIE = 1
ADD B0,1,B0 ; Interrupt possible between ADD and DINT
DINT ; No interrupt between DINT and SUB
SUB B0,1,B0 ;
Example 5-7 shows a code fragment in which a load/modify/store is executed with interrupts disabled so
that the register cannot be modified by an interrupt between the read and write operation. Since the DINT
instruction saves the CSR.GIE bit to the TSR.SGIE bit and the RINT instruction copies the TSR.SGIE bit
back to the CSR.GIE bit, if interrupts were disabled before the DINT instruction, they will still be disabled
after the RINT instruction. If they were enabled before the DINT instruction, they will be enabled after the
RINT instruction.
NOTE: The use of DINT and RINT instructions in a nested manner, like the following code:
DINT
DINT
RINT
RINT
leaves interrupts disabled after the second RINT instruction. The successive use of the DINT
instruction leaves the TSR.SGIE bit cleared to 0, so the RINT instructions copy zero to the
GIE bit.
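The effect of nesting can be checked with a small C model of the two bits involved. This is a sketch only: gie_model and the function names are hypothetical, and the model covers nothing beyond the copy and clear steps listed for DINT and RINT.

```c
#include <stdbool.h>

/* Hypothetical model of the two bits that DINT and RINT touch. */
typedef struct {
    bool gie;   /* CSR.GIE / TSR.GIE (physically the same bit) */
    bool sgie;  /* TSR.SGIE */
} gie_model;

/* DINT: copy GIE to SGIE, then clear GIE. */
void model_dint(gie_model *m) {
    m->sgie = m->gie;
    m->gie = false;
}

/* RINT: copy SGIE to GIE, then clear SGIE. */
void model_rint(gie_model *m) {
    m->gie = m->sgie;
    m->sgie = false;
}

/* Final GIE after a single DINT/RINT pair. */
bool single_pair_final_gie(bool initial_gie) {
    gie_model m = { initial_gie, false };
    model_dint(&m);
    model_rint(&m);
    return m.gie;
}

/* Final GIE after DINT, DINT, RINT, RINT. */
bool nested_pair_final_gie(bool initial_gie) {
    gie_model m = { initial_gie, false };
    model_dint(&m);
    model_dint(&m);  /* GIE is already 0, so SGIE is overwritten with 0 */
    model_rint(&m);
    model_rint(&m);
    return m.gie;
}
```

A single pair returns GIE to its starting value, while the nested sequence ends with GIE = 0 regardless of the starting value, because there is only one SGIE bit to hold the saved state.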
NOTE: Any write to the ISR or ICR (by the MVC instruction) effectively has one delay slot because
the results cannot be read (by the MVC instruction) in IFR until two cycles after the write to
ISR or ICR.
A write to a bit in ICR is ignored if the same bit in ISR is written simultaneously.
Example 5-10 and Example 5-11 show code examples to set and clear individual interrupts.
Example 5-10. Code to Set an Individual Interrupt (INT6) and Read the Flag Register
MVK 40h,B3
MVC B3,ISR
NOP
MVC IFR,B4
Example 5-11. Code to Clear an Individual Interrupt (INT6) and Read the Flag Register
MVK 40h,B3
MVC B3,ICR
NOP
MVC IFR,B4
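The set, clear, and priority rules can be summarized in a short C model of how IFR changes in response to ISR and ICR writes. This is a sketch under assumptions: the function name and the MASKABLE_BITS constant are hypothetical, the mask assumes that only the INT4-INT15 flags respond to ISR and ICR writes, and the model ignores the delay before the result is visible in IFR as well as incoming hardware interrupts.

```c
#include <stdint.h>

/* Assumption: only the INT4-INT15 flags can be set or cleared this way. */
#define MASKABLE_BITS 0xFFF0u

/* Returns the new IFR value after writes of isr_write and icr_write. */
uint32_t model_ifr_update(uint32_t ifr, uint32_t isr_write, uint32_t icr_write) {
    uint32_t set = isr_write & MASKABLE_BITS;
    /* A clear request is ignored for any bit that is also being set. */
    uint32_t clr = icr_write & MASKABLE_BITS & ~set;
    return (ifr & ~clr) | set;
}
```

With this model, writing 40h through ISR sets the INT6 flag, writing 40h through ICR clears it, and requesting both at once leaves the flag set.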
[Pipeline timing diagram not reproduced: detection and processing of interrupt INTm over CPU cycles 0-22. Signals shown: INTm at CPU boundary, IFm, EXC, TSR.GIE, TSR.XEN, TSR.INT, TSR.EXC, and the copy of TSR to ITSR. Execute packets n through n+4 run through E10 (n+2 contains no branch), packets n+5 through n+11 are annulled, and the ISFP is then fetched. During cycles 6-14, nonreset interrupt processing is disabled (A).]
A After this point, interrupts are still disabled. All nonreset interrupts are disabled when NMIE = 0. All maskable
interrupts are disabled when GIE = 0.
Figure 5-5. Return from Interrupt Execution and Processing: Pipeline Operation
[Pipeline timing diagram not reproduced: CPU cycles 0-17. Signals shown: EXC, TSR.GIE, TSR.EXC, and ITSR.SPLX (1 if SPLOOP was interrupted; sampled for return target in SPLOOP state machine). Execute packets shown: n, B IRP, n+2 through n+6, IRP target, and t+1.]
Figure 5-6. CPU Nonmaskable Interrupt Detection and Processing: Pipeline Operation
[Pipeline timing diagram not reproduced: CPU cycles 0-22. Signals shown: NMI at CPU boundary, NMIF, EXC, IER.NMIE, TSR.GEE, TSR.INT, TSR.EXC, and the copy of TSR to NTSR. Execute packets n through n+4 run through E10 (n+2 contains no branch), packets n+5 through n+11 are annulled, and the ISFP is then fetched. During cycles 6-14, nonreset interrupt processing is disabled (A).]
A After this point, interrupts are still disabled. All nonreset interrupts are disabled when NMIE = 0. All maskable
interrupts are disabled when GIE = 0.
Figure 5-7. CPU Return from Nonmaskable Interrupt Execution and Processing: Pipeline Operation
[Pipeline timing diagram not reproduced: CPU cycles 0-17. Signals shown: EXC, TSR.XEN, and TSR.EXC. Execute packets shown: n, B NRP, n+2 through n+6, the return target, and t+1.]
NOTE: Code that starts running after reset must explicitly enable the GIE bit, the NMIE bit, and IER
to allow interrupts to be processed.
[Pipeline timing diagram not reproduced: reset processing over CPU cycles 0-22. Signal shown: IF0 (A). Execute packets n through n+7 are flushed from the pipeline, and the reset ISFP is then fetched. During cycles 15-21, nonreset interrupt processing is disabled (B).]
A IF0 is set on the next CPU cycle boundary after a 4-clock cycle delay after the rising edge of the reset signal.
B After this point, interrupts are still disabled. All nonreset interrupts are disabled when NMIE = 0. All maskable
interrupts are disabled when GIE = 0.
In Example 5-15, the single assignment method is used. The register A1 is assigned only to the ADD
input and not to the result of the LDW. Regardless of the value of A6 with or without an interrupt, A1 does
not change before it is summed with A2. Result A3 is equal to A2.
Another method for preventing problems with nonsingle-assignment programming would be to disable
interrupts before using multiple assignment, then reenable them afterwards. Of course, you must be
careful with the tradeoff between high-speed code that uses multiple-assignment and increasing interrupt
latency. When using multiple assignment within software pipelined code, the SPLOOP buffer on the CPU
can help you deal with the tradeoff between performance and interruptibility. See Chapter 7 for more
information.
Prior to returning from the interrupt service routine, the code must restore the registers saved above as
follows:
1. The GIE bit must first be cleared to 0
2. The PGIE bit saved value must be restored
3. The contents of ITSR must be restored
4. The IRP (or NRP) saved value must be restored
Although steps 2, 3, and 4 above may be performed in any order, it is important that the GIE bit is cleared
first. This means that the GIE and PGIE bits must be restored with separate writes to CSR. If these bits
are not restored separately, then it is possible that the PGIE bit is overwritten by nested interrupt
processing just as interrupts are being disabled.
NOTE: When coding nested interrupts for the CPU, the ITSR should be saved and restored to
prevent corruption by the nested interrupt.
5.6.4 Traps
A trap behaves like an interrupt, but is created and controlled with software. The trap condition can be
stored in any one of the conditional registers: A0, A1, A2, B0, B1, or B2. If the trap condition is valid, a
branch to the trap handler routine processes the trap and the return.
Example 5-17 and Example 5-18 show a trap call and the return code sequence, respectively. In the first
code sequence, the address of the trap handler code is loaded into register B0 and the branch is called. In
the delay slots of the branch, the context is saved in the B0 register, the GIE bit is cleared to disable
maskable interrupts, and the return pointer is stored in the B1 register.
The trap is processed with the code located at the address pointed to by the label TRAP_HANDLER. If
the B0 or B1 registers are needed in the trap handler, their contents must be stored to memory and
restored before returning. The code shown in Example 5-18 should be included at the end of the trap
handler code to restore the context prior to the trap and return to the TRAP_RETURN address.
B B1 ; return
MVC B0,CSR ; restore CSR
NOP 4 ; delay slots
Often traps are used to handle unexpected conditions in the execution of the code. The CPU provides
explicit exception handling support which may be used for this purpose.
Another alternative to using traps as software triggered interrupts is the software interrupt capability (SWI)
provided by the DSP/BIOS real-time kernel.
CPU Exceptions
This chapter describes exceptions on the CPU. It details the related CPU control registers and their
functions in controlling exceptions. It also describes exception processing, the method the CPU uses to
automatically detect the presence of exceptions and divert program execution flow to your exception
service code. Finally, the chapter describes the programming implications of exceptions.
6.1 Overview
The exception mechanism on the CPU is intended to support error detection and program redirection to
error handling service routines. Error signals generated outside of the CPU are consolidated to one
exception input to the CPU. Exceptions generated within the CPU are consolidated to one internal
exception flag with information as to the cause in a register. Fatal errors detected outside of the CPU are
consolidated and incorporated into the NMI input to the CPU.
Figure 6-1. Interrupt Service Table With Branch to Additional Exception Service Code
Located Outside the IST
IST
xxxx 000h   RESET ISFP
xxxx 020h   NMI ISFP
xxxx 040h   Reserved
xxxx 060h   Reserved
xxxx 080h   INT4 ISFP
xxxx 0A0h   INT5 ISFP
xxxx 0C0h   INT6 ISFP
xxxx 0E0h   INT7 ISFP
xxxx 100h   INT8 ISFP
xxxx 120h   INT9 ISFP
xxxx 140h   INT10 ISFP
xxxx 160h   INT11 ISFP
xxxx 180h   INT12 ISFP
xxxx 1A0h   INT13 ISFP
xxxx 1C0h   INT14 ISFP
xxxx 1E0h   INT15 ISFP

ISFP for exceptions
020h   Instr1
024h   Instr2
028h   B 1234h
02Ch   Instr4
030h   Instr5
034h   Instr6
038h   Instr7
03Ch   Instr8

Additional ISFP for NMI
1220h   -
1224h   -
1228h   -
122Ch   -
1230h   -
1234h   Instr9
1238h   B NRP
123Ch   Instr11
1240h   Instr12
1244h   Instr13
1248h   Instr14
124Ch   Instr15

The exception service routine includes this instruction extension of the exception ISFP.
Figure 6-2. External Exception (EXCEP) Detection and Processing: Pipeline Operation
[Pipeline timing diagram not reproduced: CPU cycles 0-22. Signals shown: EXCEP at CPU boundary, EFR.EXF, EXC, IER.NMIE, TSR.XEN, TSR.EXC, and the copy of TSR to NTSR. Execute packets n through n+4 run through E10, packets n+5 through n+11 are annulled, and the ISFP is then fetched. During cycles 6-14, nonreset interrupt processing is disabled.]
[Pipeline timing diagram not reproduced: CPU cycles 0-17. Signals shown: EXC, IER.NMIE, and TSR.EXC. Execute packets shown: n, B NRP, n+2 through n+6, NRP target, and t+1.]
[Pipeline timing diagram not reproduced: CPU cycles 0-22. Signals shown: NMI at CPU boundary, EFR.NXF, EXC, IER.NMIE, TSR.XEN, TSR.EXC, and the copy of TSR to NTSR. Execute packets n through n+4 run through E10, packets n+5 through n+11 are annulled, and the ISFP is then fetched. During cycles 6-14, nonreset interrupt processing is disabled.]
[Pipeline timing diagram not reproduced: CPU cycles 0-22. Signals shown: EXCEP at CPU boundary, EFR.EXF, NMI at CPU boundary, EFR.NXF, EXC, IER.NMIE, TSR.XEN, and TSR.EXC. Labels shown: TSR, NTSR, ITSR, IRP, NRP, &(ISR), and &(n+5). Execute packets shown: n through n+11 and ISR through ISR+5, with annulled instructions marked in both groups.]
This chapter describes the software pipelined loop (SPLOOP) buffer hardware and software mechanisms.
Under normal circumstances, the compiler/assembly optimizer will do a good job coding SPLOOPs and it
will not be necessary for the programmer to hand code usage of the SPLOOP buffer. This chapter is
intended to describe the functioning of the buffer hardware and the instructions that control it.
7.3 Terminology
The following terminology is used in the discussion in this chapter.
• Iteration interval (ii) is the interval (in instruction cycles) between successive iterations of the loop.
• A stage is the code executed in one iteration interval.
• Dynamic length (dynlen) is the length (in instruction cycles) of a single iteration of the loop. It is
therefore equal to the number of stages times the iteration interval.
• The kernel is the period when the loop is executing in a steady state with the maximum number of loop
iterations executing simultaneously. For example, in Figure 7-1 the kernel is the set of instructions
contained in stage 0, stage 1, stage 2, and stage 3.
• The prolog is the period before the loop reaches the kernel in which the loop is winding up. The length
of the prolog is the dynamic length minus the iteration interval (dynlen - ii).
• The epilog is the period after the loop leaves the kernel in which the loop is winding down. The length
of the epilog is the dynamic length minus the iteration interval (dynlen - ii).
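The relationships among these terms reduce to simple arithmetic. The following C sketch restates them; the function names are hypothetical and lengths are in instruction cycles.

```c
#include <assert.h>

/* Dynamic length: number of stages times the iteration interval. */
int sploop_dynlen(int stages, int ii) {
    assert(stages >= 1 && ii >= 1);
    return stages * ii;
}

/* Prolog length: dynlen - ii. */
int sploop_prolog_len(int stages, int ii) {
    return sploop_dynlen(stages, ii) - ii;
}

/* Epilog length: also dynlen - ii. */
int sploop_epilog_len(int stages, int ii) {
    return sploop_dynlen(stages, ii) - ii;
}
```

For example, a loop with 4 stages and ii = 2 has a dynamic length of 8 cycles, and its prolog and epilog each last 6 cycles.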
7.4.5 Task State Register (TSR), Interrupt Task State Register (ITSR), and
NMI/Exception Task State Register (NTSR)
The SPLX bit in the task state register (TSR) indicates whether an SPLOOP is currently executing or not
executing.
When an interrupt occurs, the contents of TSR (including the SPLX bit) are copied to the interrupt task state
register (ITSR).
When an exception or non-maskable interrupt occurs, the contents of TSR (including the SPLX bit) are
copied to the NMI/Exception task state register (NTSR).
See Section 2.9.15 for more information on TSR. See Section 2.9.9 for more information on ITSR. See
Section 2.9.10 for more information on NTSR.
The ii parameter is the iteration interval which specifies the interval (in instruction cycles) between
successive iterations of the loop.
The SPLOOP instruction is used when the number of loop iterations is known in advance. The number of
loop iterations is determined by the value loaded to the inner loop count register (ILC). ILC should be
loaded with an initial value 4 cycles before the SPLOOP instruction is encountered.
The (optional) conditional predication is used to indicate when and if a nested loop should be reloaded.
The contents of the reload inner loop counter (RILC) is copied to ILC when either a SPKERNELR or a
SPMASKR instruction is executed with the predication condition on the SPLOOP instruction true. If the
loop is not nested, then the conditional predication should not be used.
The ii parameter is the iteration interval which specifies the interval (in instruction cycles) between
successive iterations of the loop.
The SPLOOPD instruction is used to initiate a loop buffer operation when the known minimum iteration
count of the loop is great enough that the inner loop count register (ILC) can be loaded in parallel with the
SPLOOPD instruction and the 4 cycle latency will have passed before the last iteration of the loop.
Unlike the SPLOOP instruction, the load of ILC is performed in parallel with the SPLOOPD instruction.
Due to the inherent latency of the load to ILC, the value to ILC should be predecremented to account for
the 4 cycle latency. The amount of the predecrement is given in Table 7-4.
The number of loop iterations is determined by the value loaded to ILC.
The (optional) conditional predication is used to indicate when and if a nested loop should be reloaded.
The contents of the reload inner loop counter (RILC) is copied to ILC when either a SPKERNELR or a
SPMASKR instruction is executed with the predication condition on the SPLOOP instruction true. If the
loop is not nested, then the conditional predication should not be used.
The use of the SPLOOPD instruction can result in reducing the time spent in setting up the loop by
eliminating up to 4 cycles that would otherwise be spent in setting up ILC. The tradeoff is that the
SPLOOPD instruction cannot be used if the loop is not long enough to accommodate the 4 cycle delay.
The ii parameter is the iteration interval which specifies the interval (in instruction cycles) between
successive iterations of the loop.
The SPLOOPW instruction is used to initiate a loop buffer operation when the total number of loops
required is not known in advance. The SPLOOPW instruction must be predicated. The loop terminates if
the predication condition is not true. The value in the inner loop count register (ILC) is not used to
determine the number of loops.
Unlike the SPLOOP and SPLOOPD instructions, predication on the SPLOOPW instruction does not imply
a nested SPLOOP operation. The SPLOOPW instruction cannot be used in a nested SPLOOP operation.
When using the SPLOOPW instruction, the predication condition is used to determine the exit condition
for the loop. The ILC is not used for this purpose when using the SPLOOPW instruction.
When the SPLOOPW instruction is used to initiate a loop buffer operation, the epilog is skipped when the
loop terminates.
The (optional) fstg and fcyc parameters specify the delay interval between the SPKERNEL instruction and
the start of the post epilog code. The fstg specifies the number of complete stages and the fcyc specifies
the number of cycles in the last stage in the delay.
The SPKERNEL instruction has arguments that instruct the SPLOOP hardware to delay the start of
post-SPLOOP instruction execution by a given amount (stages/cycles) after the start of the epilog.
Note that the post-epilog instructions are fetched from program memory and overlaid with the epilog
instructions fetched from the SPLOOP buffer. Functional unit conflicts can be avoided by either coding for
a sufficient delay using the SPKERNEL instruction arguments or by using the SPMASK instruction to
inhibit the operation of instructions from the buffer that might conflict with the instructions from the epilog.
The SPKERNELR instruction is used to support nested SPLOOP execution where a loop needs to be
restarted with perfect overlap of the prolog of the second loop with the epilog of the first loop.
If a reload is required with a delay between the SPKERNEL and the point of reload (that is, nonperfect
overlap) use the SPMASKR instruction with the SPKERNEL (not SPKERNELR) to indicate the point of
reload.
The SPKERNELR instruction has no arguments. The execution of post-SPLOOP instructions commences
simultaneous with the first cycle of epilog. If the predication of the SPLOOP instruction indicates that the
loop is being reloaded, the instructions are fetched from both the SPLOOP buffer and program memory.
The SPKERNELR instruction cannot be used in the same SPLOOP operation as the SPMASKR
instruction.
The unitmask parameter specifies which functional units are masked by the SPMASK or SPMASKR
instruction. The units may alternatively be specified by marking the instructions with a caret (^) symbol.
The following two forms are equivalent and will each mask the .D1 unit. Example 7-1 and Example 7-2
show the two ways of specifying the masked instructions.
Example 7-5 is an alternate implementation of the same loop using the SPLOOPD instruction. The load of
the inner loop count register (ILC) can be made in the same cycle as the SPLOOPD instruction, but due to
the inherent delay between loading the ILC and its use, the value needs to be predecremented to account
for the 4 cycle delay.
Table 7-1. SPLOOP Instruction Flow for Example 7-4 and Example 7-5
       Loop
Cycle  1    2    3    4    5    6    7    8
1      LDW
2      NOP  LDW
3      NOP  NOP  LDW
4      NOP  NOP  NOP  LDW
5      NOP  NOP  NOP  NOP  LDW
6      MV   NOP  NOP  NOP  NOP  LDW
7      STW  MV   NOP  NOP  NOP  NOP  LDW
8           STW  MV   NOP  NOP  NOP  NOP  LDW
9                STW  MV   NOP  NOP  NOP  NOP
10                    STW  MV   NOP  NOP  NOP
11                         STW  MV   NOP  NOP
12                              STW  MV   NOP
13                                   STW  MV
14                                        STW
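The pattern in Table 7-1 follows from each loop iteration starting one cycle (ii = 1) after the previous one. A minimal C sketch of that pattern follows; the slot_t type and both function names are hypothetical, and the cycle count ignores stalls and interrupts.

```c
#include <assert.h>

typedef enum { SLOT_EMPTY, SLOT_LDW, SLOT_NOP, SLOT_MV, SLOT_STW } slot_t;

/* Total cycles for n overlapped iterations of a loop with the given
   dynamic length and iteration interval. */
int pipelined_total_cycles(int dynlen, int ii, int n) {
    assert(dynlen >= 1 && ii >= 1 && n >= 1);
    return dynlen + (n - 1) * ii;
}

/* What loop `loop` (1-based) does in cycle `cycle` (1-based) for the
   copy loop of Table 7-1: LDW, four NOPs, MV, STW, with ii = 1. */
slot_t copy_loop_slot(int cycle, int loop) {
    int offset = cycle - loop;  /* loop k starts in cycle k */
    if (offset < 0 || offset > 6) return SLOT_EMPTY;
    if (offset == 0) return SLOT_LDW;
    if (offset <= 4) return SLOT_NOP;
    return offset == 5 ? SLOT_MV : SLOT_STW;
}
```

With a dynamic length of 7, ii = 1, and 8 iterations, the total is 14 cycles, which matches the number of rows in the table.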
do {
    i--;
    dest[i]=source[i];
} while (i);
Table 7-3. Software Pipeline Instruction Flow Using the Loop Buffer
Execution Flow  CPU Pipeline  SPL Buffer
P0              stage0        -
P1              stage1        stage0
P2              stage2        stage0  stage1
K0              stage3        stage0  stage1  stage2
Kn              -             stage0  stage1  stage2  stage3
E0              -             -       stage1  stage2  stage3
E1              -             -       -       stage2  stage3
E2              -             -       -       -       stage3
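The rows of Table 7-3 can be generated from one rule: in a given stage-long period, stage s belongs to the iteration that started s periods earlier, and only the first iteration is supplied from program memory. The C sketch below illustrates that rule; the type and function names are hypothetical, and the model ignores interrupts, reloads, and early exits.

```c
#include <assert.h>

typedef enum { SRC_IDLE, SRC_PROGRAM_MEMORY, SRC_LOOP_BUFFER } stage_source_t;

/* Where `stage` comes from during `period` (both 0-based) for a loop
   with `stages` stages and `iterations` iterations. */
stage_source_t stage_source(int period, int stage, int stages, int iterations) {
    assert(stages >= 1 && iterations >= 1);
    assert(period >= 0 && stage >= 0 && stage < stages);
    int iteration = period - stage;  /* iteration executing this stage now */
    if (iteration < 0 || iteration >= iterations) return SRC_IDLE;
    return iteration == 0 ? SRC_PROGRAM_MEMORY : SRC_LOOP_BUFFER;
}

/* Number of periods from P0 through the last epilog period. */
int total_periods(int stages, int iterations) {
    assert(stages >= 1 && iterations >= 1);
    return stages + iterations - 1;
}
```

For 4 stages and 5 iterations, periods 0 through 7 correspond to P0, P1, P2, K0, Kn, E0, E1, and E2 in the table.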
There is one case where the SPLX bit is set to 1 when the loop buffer is idle. When executing a B IRP
instruction to return to an interrupted SPLOOP, the ITSR is copied back into TSR in the E1 stage of the
branch. The SPLX bit is set to 1 beginning in the E2 stage of the branch, which is before the loop buffer
has restarted. If the loop buffer state machine is started in the branch delay slots of a B IRP or B NRP
instruction, it uses the SPLX bit to determine if this is a restart of an interrupted SPLOOP. The SPLX bit is
not checked if starting an SPLOOP outside the delay slots of one of these branches.
Instructions in the loop buffer indexed by LBC are marked as invalid if their loading counter value (from
when they were loaded into the loop buffer) is equal to the draining counter value.
When the draining counter is equal to (dynlen - ii), draining is complete. Any remaining valid instructions
for the loop (with a loading counter > (dynlen - ii)) are all marked as invalid.
If the loop is interrupt draining, then program memory fetch remains disabled until the interrupt is taken. If
the loop is normal draining, program memory fetch is enabled after a delay specified by the
SPKERNEL(R) instruction.
[Figures not reproduced: loop buffer operation diagrams showing the Load, Fetch, and Drain phases against the prolog (P0-P2), kernel (K0, Kn), and epilog (E0-E2) periods. The cases shown include a loop with and without a repeated kernel period, loops whose epilog overlaps the prolog or kernel, and two diagrams in which the prolog of a second loop invocation overlaps the epilog of the first.]
7.9.3 Using SPLOOPD for Loops with Known Minimum Iteration Counts
For loops with known iteration counts, the unconditional SPLOOPD instruction is used to compensate for
the 4-cycle latency to the assignment of ILC. The unconditional SPLOOPD instruction differs from the
SPLOOP instruction in the following ways:
• The initial termination condition test is always false and the initial ILC decrement is disabled. The loop
must execute at least one iteration.
• The stage boundary termination condition is forced to false, and ILC decrement is disabled for the first
3 cycles of the loop.
• The loop cannot be interrupted for the first 3 cycles of the loop.
The SPLOOPD will test the SPLX bit in the TSR to determine if it is already set to one (indicating a return
from interrupt). In this case the SPLOOPD instruction executes like an unconditional SPLOOP instruction.
The SPLOOPD instruction is used when the loop is known to execute for a minimum number of loop
iterations. The required minimum number of iterations is a function of ii, as shown in Table 7-4.
When using the SPLOOPD instruction, ILC must be loaded with a value that is biased to compensate for
the required minimum number of loop iterations. As shown in Example 7-10, for a loop with an ii equal to 1
that will execute 100 iterations, ILC is loaded with 96.
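The bias calculation can be sketched in C as follows. The only value confirmed by the text above is the ii = 1 case (100 iterations, ILC = 96). The formula ceil(4 / ii) used for other values of ii is an assumption made for illustration, chosen because it reproduces that case and reflects the 4-cycle ILC latency. Table 7-4 remains the authoritative source, and the function names are hypothetical.

```c
#include <assert.h>

/* Assumed minimum iteration count for SPLOOPD: ceil(4 / ii).
   Verify against Table 7-4 before relying on it. */
int sploopd_min_iterations(int ii) {
    assert(ii >= 1);
    return (4 + ii - 1) / ii;
}

/* Value to load into ILC for SPLOOPD: the iteration count biased by the
   assumed minimum number of iterations. */
int sploopd_ilc_value(int iterations, int ii) {
    int min_iterations = sploopd_min_iterations(ii);
    assert(iterations >= min_iterations);
    return iterations - min_iterations;
}
```

For ii = 1 and 100 iterations, the sketch gives 96, which agrees with the value stated above.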
For loops initiated with a conditional SPLOOP or SPLOOPD instruction, an exception (detected by the
assembler) occurs if:
• There is not a valid outer loop branch instruction after the SPKERNEL(R) instruction.
• A reload has not been initiated by an SPMASKR instruction before the delay slots of the outer branch
have completed.
• There is a branch instruction after the SPKERNEL instruction that may execute when the loop is
reloading that is neither a valid outer loop branch nor a valid post loop branch.
• An SPMASKR is encountered for a loop that uses SPKERNELR.
• An SPMASKR is encountered for an unconditional (nonreload) loop.
Example 7-11 is a nested loop using the reload condition. Figure 7-8 shows the instruction execution flow
for an invocation of the inner loop, the outer loop code, and then another inner loop. Notice that the reload
starts after the first epilog stage of the inner loop as specified by the SPMASKR instruction in the last
cycle of that stage.
;*------------------------------
;* for (j=0; j<32; j++)
;* for (i=0; i<32; i++)
;* y[j] += x[i+j] * h[i]
;*------------------------------
;* x=a4, h=b4, y=a6
MVK .S2 32,B0
MVC .S2 B0,ILC ;Inner loop count
NOP 3
[B0] SPLOOP 2
|| MVC .S2 B0,RILC ;Reload inner loop count
|| SUB .D2 B0,1,B0 ;Outer loop count
|| MVK .S1 62,A5 ;X delta
|| MV .L2 B4,B5 ;Copy h
|| ZERO .D1 A7 ;Sum = 0
;*------------Start of loop-------------------------
LDH .D1T1 *A4++,A2 ;t1 = *x
|| LDH .D2T2 *B4++,B2 ;t2 = *h
NOP 4
MPY .M1X A2,B2,A2 ;p = t1*t2
NOP 1
SPKERNEL 0
|| ADD .L1 A2,A7,A7 ;sum += p
outer:
;*--------start epilog
SUB .D1 A4,A5,A4 ;x -= 62
|| MV .D2 B5,B4 ;h -= 64
SPMASKR
;*------------------------------
;* do {
; sum += *x++ * *y++;
; n -= m;
; } while (n >= 0)
;*------------------------------
[!A1] SPLOOPW 1
|| MVK .S1 0x0,A1 ;C = false
LDH .D1T1 *A5++,A3 ;t1 = *x++
|| LDH .D2T2 *B5++,B6 ;t2 = *y++
NOP 2
SUB .L2 B4,B7,B4 ;n -=m
CMPLT .L2 B4,0,A1 ;c = n < 0 // term_cond = !A1
MPY .M1X B6,A3,A4 ;p = t1 * t2 // delay slot 1
NOP 1 ; // delay slot 2
ADD .L1 A4,A6,A6 ;sum += p; // delay slot 3
SPKERNEL ;if (c) break; // cycle term_cond used
;*------------------------------
;* do {
; t = *src++;
; *dst++ = t;
; } while (t != 0)
;*------------------------------
[A0] SPLOOPW 1
|| MVK .S2 1,B0
|| MVK .S1 1,A0
[A0] LDB .D1 *A4++,A0 ;t = *src++
NOP 4
[B0] MV .L2X A0,B0 ;if (!t) break;
NOP 2 ;Ensure A0 set 4 cycles early
SPKERNEL
|| [B0] STB .D2 B0,*B4++ ;*dst++ = t
STB B0,*B4 ;*dst = '\0'
The initial setup, the post loop operations, and adjusting the setup for the reloaded loop are all overhead
that may be minimized by moving their execution to within the same instruction cycles as the operation of
the SPLOOP.
If some setup code is required to do some initialization that is not used until late in the loop, you can save
instruction cycles by using the SPMASK instruction to overlay the setup code with the first few cycles of
the SPLOOP. The SPMASK will cause the masked instructions to be executed once without being loaded
to the SPLOOP buffer. Example 7-14 shows how this might be done.
If the SPMASK is used in the outer loop code (that is, post epilog code), it will force the substitution of the
SPMASKed instructions in the outer loop code for the instruction using the same functional unit in the
SPLOOP buffer for the first iteration of the reloaded inner loop. For example, if pointers need to be reset
at the point that a loop is reloaded, the instructions that do the reset can be inhibited using the SPMASK
instruction so that the instructions that originally adjusted the pointers are replaced in the execution flow
with instructions in the outer loop that are marked with the SPMASK instruction. Example 7-15 shows how
this might be done.
Example 7-14. Using the SPMASK Instruction to Merge Setup Code with SPLOOPW
;*------------------------------
; dst=&(dst[n])
;* do {
; t = *src++;
; *dst++ = t;
; } while (count--)
;
;A4 = Source address
;B4 = Destination address
;A6 = Number of words to copy
;B6 = Offset into destination to do copy
;*------------------------------
[A1] SPLOOPW 1
|| ADD .L1 A6,1,A1 ;Position loop cnt to valid reg
|| SHL .S2 B6,2,B6 ;Adjust offset for size of WORD
SPMASK
||^ ADD .L2 B6,B4,B4 ;Add offset into buffer to dest
|| LDW .D1 *A4++,A0 ;Load word and inc ptr
NOP 1 ;Wait for portion of delay
[A1] SUB .S1 A1,1,A1 ;Decrement loop count
NOP 2 ;Complete necessary wait
MV .L2X A0,B0 ;Position Word for write
SPKERNEL 0,0
|| STW .D2 B0,*B4++ ;Store word
Table 7-5. SPLOOP Instruction Flow for First Three Cycles of Example 7-14
       Loop
Cycle  1    2    3    Notes
0      ADD            Instructions are in parallel with the SPLOOP, so they execute only once.
       SHL
1      ADD            The ADD is SPMASKed so it executes only once. The LDW is loaded to the SPLOOP buffer.
       LDW
2      NOP  LDW       The ADD was not added to the SPLOOP buffer in cycle 2, so it is not executed here.
3      SUB  NOP  LDW  The SUB is a conditional instruction and may not execute.
4      NOP  SUB  NOP  The SUB is a conditional instruction and may not execute.
5      NOP  NOP  SUB  The SUB is a conditional instruction and may not execute.
6      MV   NOP  NOP
7      STW  MV   NOP
8           STW  MV
9                STW
7.11.2 Some Points About the SPMASK to Merge Setup Code Example
Note the following points about the execution of Example 7-14:
• The ADD and SHL instructions in the same execute packet as the SPLOOPW instruction are only
executed once. They are not loaded to the SPLOOP buffer.
• Because of the SPMASK instruction in the execute packet, the ADD in the same execute packet as
the SPMASK instruction is executed only once and is not loaded to the SPLOOP buffer. Without the
SPMASK, the ADD would conflict with the MV instruction.
• The SHL and the 2nd ADD instructions could have been placed before the start of the SPLOOP, but
by placing the SHL in parallel with the SPLOOP instruction and by using the SPMASK to restrict the
ADD to a single execution, you have saved a couple of instruction cycles.
Example 7-15. Using the SPMASK Instruction to Merge Reset Code with SPLOOP
;*------------------------------
; dst=&(dst[n])
;* do {
; t = *src++;
; *dst++ = t;
; } while (count--)
; adjust buffer pointers
;* do {
; t = *src++;
; *dst++ = t;
; } while (count--)
;
;A4 = 1st source address
;B4 = 1st destination address
;A6 = 2nd source address
;B6 = 2nd destination address
;A8 = number of locations to copy from each buffer
;*------------------------------
MVC A8,ILC ;Setup number of loops
MVC A8,RILC ;Reload count
MVK 1,A1 ;Reload flag
NOP 3 ;Wait for ILC load to complete
[A1] SPLOOP 1 ;Start SPLOOP with ii=1
LDW .D1 *A4++,A0 ;Load value from buffer
NOP 4 ;Wait for it to arrive
MV .L2X A0,B0 ;Move it to other side for xfer
SPKERNELR ;End of SPLOOP, immediate reload
|| STW .D2 B0,*B4++ ;...and store value to buffer
BR_TARGET:
SPMASK D1 ;Mask LDW instruction
|| [A1] B BR_TARGET ;Branch to start if post-epilog
|| [A1] SUB .S1 A1, 1, A1 ;Adjust reload flag
|| [A1] LDW .D1 *A6,A0 ;Load first word of 2nd buffer
|| [A1] ADD .L1 A6,4,A4 ;Select new source buffer
NOP 4 ;Keep in sync with SPLOOP body
OR .S2 B6,0,B4 ;Adjust destination to 2nd buffer
NOP
7.11.4 Some Points About the SPMASK to Merge Reset Code Example
Note the following points about the execution of Example 7-15 (see Table 7-6 for the instruction flow):
• The loop begins reloading from the SPLOOP buffer immediately after the SPKERNELR instruction with
no delay. In Table 7-6, the SPKERNELR is in cycle 7 and the reload happens in cycle 8.
• Because of the SPMASK instruction, the LDW instruction in the post epilog code replaces the LDW
instruction within the loop, so that the first word copied in the reloaded loop is from the new input
buffer. The ADD instruction is used to adjust the source buffer address for subsequent iterations within
the SPLOOP body. In Table 7-6, this happens in loop 8. Note that the D1 operand in the SPMASK
instruction indicates that the SPMASK applies to the .D1 unit. This could have been indicated by
marking the LDW instruction with a caret (^) instead.
• The OR instruction is used to adjust the destination address. It is positioned in the post-epilog code
as the MV instruction is within the SPLOOP body so that it will not corrupt the data from the STW
instructions within the SPLOOP epilog still executing from before the reload. In Table 7-6, this happens
in cycle 13 (loop 8).
• The B instruction is used to reset the program counter to the start of the epilog between executions of
the inner loop.
7.13 Interrupts
When an SPLOOP(D/W) instruction is encountered, the address of the execute packet containing the
SPLOOP(D/W) instruction is recorded. If the loop buffer is interrupted, the address stored in the interrupt
return pointer register (IRP) is the address of the execute packet containing the SPLOOP(D/W)
instruction.
Interrupt service routines must save and restore the ITSR or NTSR, ILC, and RILC registers. A B IRP
instruction copies ITSR to TSR, and a B NRP restores TSR from NTSR. The value of the SPLX bit in
ITSR or NTSR when the return branch is executed is used to alter the behavior of SPLOOP(D/W) when it
is restarted upon returning from the interrupt.
7.13.3 Exceptions
If an internal or external exception occurs while the loop buffer is active, then the following occur:
• The exception is recognized immediately and the loop buffer becomes idle.
• The loop buffer does not execute an epilog to drain the currently executing loop.
• TSR is copied into NTSR with the SPLX bit set to 1 in NTSR and cleared to 0 in TSR.
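The three effects listed above can be sketched as a small state model. This is a minimal illustration only: the class and function names are hypothetical, only the SPLX flag of TSR/NTSR is modeled, and the return path simply follows the statement in Section 7.13 that B NRP restores TSR from NTSR.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StatusReg:
    """Toy stand-in for TSR/NTSR; only the SPLX flag is modeled (assumption)."""
    splx: bool = False


@dataclass
class Cpu:
    """Hypothetical CPU state holding just what the bullets above mention."""
    tsr: StatusReg
    ntsr: StatusReg
    loop_buffer_active: bool


def take_exception(cpu: Cpu) -> None:
    """Apply the effects of an exception taken while the loop buffer is active."""
    if not cpu.loop_buffer_active:
        raise ValueError("this sketch only covers the loop-buffer-active case")
    cpu.loop_buffer_active = False          # buffer goes idle, no epilog is run
    cpu.ntsr = replace(cpu.tsr, splx=True)  # TSR copied to NTSR with SPLX = 1
    cpu.tsr = replace(cpu.tsr, splx=False)  # SPLX cleared to 0 in TSR


def return_b_nrp(cpu: Cpu) -> None:
    """B NRP restores TSR from NTSR, so the saved SPLX value comes back."""
    cpu.tsr = cpu.ntsr
```

After take_exception, the saved SPLX value lives only in NTSR, which is why a handler that overwrites NTSR must save and restore it before the return branch.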
        [!A0]   B       around      ; A0 is zero: branch around the loop
||              MVC     A0,ILC      ; copy the loop count in A0 to ILC
                NOP     3
                SPLOOP  ii
; loop body
. . .
; end of loop body
around:
; code following loop
CPU Privilege
8.1 Overview
The CPU includes support for a form of protected-mode operation with a two-level system of privileged
program execution.
The privilege system is designed to support several objectives:
• Support the emergence of higher capability operating systems on the C6000 family architecture.
• Support more robust end-equipment, especially in conjunction with exceptions.
• Provide protection to support system features such as memory protection.
The support for powerful operating systems is especially important. By dividing operation into privileged
and unprivileged modes, the operating mode for the operating system is differentiated from applications,
allowing the operating system to have special privilege to manage the processor and system. In particular,
privilege allows the operating system to:
• control the operation of unprivileged software
• protect access to critical system resources (that is, interrupts)
• control entry to itself
The privilege system allows two distinct types of operation.
• Supervisor-only execution. This is used for programs that require full access to all control registers,
and have no need to run unprivileged (User mode) programs.
• Two-tiered system. This is where the OS and trusted applications execute in Supervisor mode, and
less trusted applications execute in User mode.
Instruction Compatibility
Table A-1 lists the instructions that are common to the C62x, C64x, C64x+, C67x, C67x+, and C674x
DSPs.
Table A-1. Instruction Compatibility Between C62x, C64x, C64x+, C67x, C67x+, and C674x DSPs
Instruction C62x DSP C64x DSP C64x+ DSP C67x DSP C67x+ DSP C674x DSP
ABS ✓ ✓ ✓ ✓ ✓ ✓
ABS2 ✓ ✓ ✓
ABSDP ✓ ✓ ✓
ABSSP ✓ ✓ ✓
ADD ✓ ✓ ✓ (1) ✓ ✓ ✓
ADDAB ✓ ✓ ✓ ✓ ✓ ✓
ADDAD ✓ ✓ ✓ ✓ ✓
ADDAH ✓ ✓ ✓ ✓ ✓ ✓
ADDAW ✓ ✓ ✓ (1) ✓ ✓ ✓
ADDDP ✓ ✓ ✓
ADDK ✓ ✓ ✓ (1) ✓ ✓ ✓
ADDKPC ✓ ✓ ✓
ADDSP ✓ ✓ ✓
ADDSUB ✓ ✓
ADDSUB2 ✓ ✓
ADDU ✓ ✓ ✓ ✓ ✓ ✓
ADD2 ✓ ✓ ✓ ✓ ✓ ✓
ADD4 ✓ ✓ ✓
AND ✓ ✓ ✓ (1) ✓ ✓ ✓
ANDN ✓ ✓ ✓
AVG2 ✓ ✓ ✓
AVGU4 ✓ ✓ ✓
B displacement ✓ ✓ ✓ ✓ ✓ ✓
B register ✓ ✓ ✓ ✓ ✓ ✓
B IRP ✓ ✓ ✓ ✓ ✓ ✓
B NRP ✓ ✓ ✓ ✓ ✓ ✓
BDEC ✓ ✓ ✓
BITC4 ✓ ✓ ✓
BITR ✓ ✓ ✓
BNOP displacement ✓ ✓ (1) ✓
BNOP register ✓ ✓ ✓
BPOS ✓ ✓ ✓
CALLP ✓ (1) ✓
CLR ✓ ✓ ✓ (1) ✓ ✓ ✓
CMPEQ ✓ ✓ ✓ (1) ✓ ✓ ✓
CMPEQ2 ✓ ✓ ✓
(1) Instruction also available in compact form, see Section 3.10.
CMPEQ4 ✓ ✓ ✓
CMPEQDP ✓ ✓ ✓
CMPEQSP ✓ ✓ ✓
CMPGT ✓ ✓ ✓ (1) ✓ ✓ ✓
CMPGT2 ✓ ✓ ✓
CMPGTDP ✓ ✓ ✓
CMPGTSP ✓ ✓ ✓
CMPGTU ✓ ✓ ✓ (1) ✓ ✓ ✓
CMPGTU4 ✓ ✓ ✓
CMPLT ✓ ✓ ✓ (1) ✓ ✓ ✓
CMPLT2 ✓ ✓ ✓
CMPLTDP ✓ ✓ ✓
CMPLTSP ✓ ✓ ✓
CMPLTU ✓ ✓ ✓ (2) ✓ ✓ ✓
CMPLTU4 ✓ ✓ ✓
CMPY ✓ ✓
CMPYR ✓ ✓
CMPYR1 ✓ ✓
DDOTP4 ✓ ✓
DDOTPH2 ✓ ✓
DDOTPH2R ✓ ✓
DDOTPL2 ✓ ✓
DDOTPL2R ✓ ✓
DEAL ✓ ✓ ✓
DINT ✓ ✓
DMV ✓ ✓
DOTP2 ✓ ✓ ✓
DOTPN2 ✓ ✓ ✓
DOTPNRSU2 ✓ ✓ ✓
DOTPNRUS2 ✓ ✓ ✓
DOTPRSU2 ✓ ✓ ✓
DOTPRUS2 ✓ ✓ ✓
DOTPSU4 ✓ ✓ ✓
DOTPUS4 ✓ ✓ ✓
DOTPU4 ✓ ✓ ✓
DPACK2 ✓ ✓
DPACKX2 ✓ ✓
DPINT ✓ ✓ ✓
DPSP ✓ ✓ ✓
DPTRUNC ✓ ✓ ✓
EXT ✓ ✓ ✓ (2) ✓ ✓ ✓
EXTU ✓ ✓ ✓ (2) ✓ ✓ ✓
GMPY ✓ ✓
GMPY4 ✓ ✓ ✓
IDLE ✓ ✓ ✓ ✓ ✓ ✓
INTDP ✓ ✓ ✓
(2) Instruction also available in compact form, see Section 3.10.
INTDPU ✓ ✓ ✓
INTSP ✓ ✓ ✓
INTSPU ✓ ✓ ✓
LDB ✓ ✓ ✓ (2) ✓ ✓ ✓
LDB (15-bit offset) ✓ ✓ ✓ (2) ✓ ✓ ✓
LDBU ✓ ✓ ✓ (2) ✓ ✓ ✓
LDBU (15-bit offset) ✓ ✓ ✓ ✓ ✓ ✓
LDDW ✓ ✓ (2) ✓ ✓ ✓
LDH ✓ ✓ ✓ (2) ✓ ✓ ✓
LDH (15-bit offset) ✓ ✓ ✓ ✓ ✓ ✓
LDHU ✓ ✓ ✓ (2) ✓ ✓ ✓
LDHU (15-bit offset) ✓ ✓ ✓ ✓ ✓ ✓
LDNDW ✓ ✓ (2) ✓
LDNW ✓ ✓ (2) ✓
LDW ✓ ✓ ✓ (3) ✓ ✓ ✓
LDW (15-bit offset) ✓ ✓ ✓ ✓ ✓ ✓
LMBD ✓ ✓ ✓ ✓ ✓ ✓
MAX2 ✓ ✓ ✓
MAXU4 ✓ ✓ ✓
MIN2 ✓ ✓ ✓
MINU4 ✓ ✓ ✓
MPY ✓ ✓ ✓ (3) ✓ ✓ ✓
MPYDP ✓ ✓ ✓
MPYH ✓ ✓ ✓ (3) ✓ ✓ ✓
MPYHI ✓ ✓ ✓
MPYHIR ✓ ✓ ✓
MPYHL ✓ ✓ ✓ (3) ✓ ✓ ✓
MPYHLU ✓ ✓ ✓ ✓ ✓ ✓
MPYHSLU ✓ ✓ ✓ ✓ ✓ ✓
MPYHSU ✓ ✓ ✓ ✓ ✓ ✓
MPYHU ✓ ✓ ✓ ✓ ✓ ✓
MPYHULS ✓ ✓ ✓ ✓ ✓ ✓
MPYHUS ✓ ✓ ✓ ✓ ✓ ✓
MPYI ✓ ✓ ✓
MPYID ✓ ✓ ✓
MPYIH ✓ ✓ ✓
MPYIHR ✓ ✓ ✓
MPYIL ✓ ✓ ✓
MPYILR ✓ ✓ ✓
MPYLH ✓ ✓ ✓ (3) ✓ ✓ ✓
MPYLHU ✓ ✓ ✓ ✓ ✓ ✓
MPYLI ✓ ✓ ✓
MPYLIR ✓ ✓ ✓
MPYLSHU ✓ ✓ ✓ ✓ ✓ ✓
MPYLUHS ✓ ✓ ✓ ✓ ✓ ✓
MPYSP ✓ ✓ ✓
(3) Instruction also available in compact form, see Section 3.10.
MPYSPDP ✓ ✓ ✓
MPYSP2DP ✓ ✓ ✓
MPYSU ✓ ✓ ✓ ✓ ✓ ✓
MPYSU4 ✓ ✓ ✓
MPYU ✓ ✓ ✓ ✓ ✓ ✓
MPYU4 ✓ ✓ ✓
MPYUS ✓ ✓ ✓ ✓ ✓ ✓
MPYUS4 ✓ ✓ ✓
MPY2 ✓ ✓ ✓
MPY2IR ✓ ✓
MPY32 (32-bit result) ✓ ✓
MPY32 (64-bit result) ✓ ✓
MPY32SU ✓ ✓
MPY32U ✓ ✓
MPY32US ✓ ✓
MV ✓ ✓ ✓ (4) ✓ ✓ ✓
MVC ✓ ✓ ✓ (4) ✓ ✓ ✓
MVD ✓ ✓ ✓
MVK ✓ ✓ ✓ (4) ✓ ✓ ✓
MVKH ✓ ✓ ✓ ✓ ✓ ✓
MVKL ✓ ✓ ✓ ✓ ✓ ✓
MVKLH ✓ ✓ ✓ ✓ ✓ ✓
NEG ✓ ✓ ✓ (4) ✓ ✓ ✓
NOP ✓ ✓ ✓ (4) ✓ ✓ ✓
NORM ✓ ✓ ✓ ✓ ✓ ✓
NOT ✓ ✓ ✓ ✓ ✓ ✓
OR ✓ ✓ ✓ (4) ✓ ✓ ✓
PACK2 ✓ ✓ ✓
PACKH2 ✓ ✓ ✓
PACKH4 ✓ ✓ ✓
PACKHL2 ✓ ✓ ✓
PACKLH2 ✓ ✓ ✓
PACKL4 ✓ ✓ ✓
RCPDP ✓ ✓ ✓
RCPSP ✓ ✓ ✓
RINT ✓ ✓
ROTL ✓ ✓ ✓
RPACK2 ✓ ✓
RSQRDP ✓ ✓ ✓
RSQRSP ✓ ✓ ✓
SADD ✓ ✓ ✓ (4) ✓ ✓ ✓
SADD2 ✓ ✓ ✓
SADDSUB ✓ ✓
SADDSUB2 ✓ ✓
SADDSU2 ✓ ✓ ✓
SADDUS2 ✓ ✓ ✓
(4) Instruction also available in compact form, see Section 3.10.
SADDU4 ✓ ✓ ✓
SAT ✓ ✓ ✓ ✓ ✓ ✓
SET ✓ ✓ ✓ (4) ✓ ✓ ✓
SHFL ✓ ✓ ✓
SHFL3 ✓ ✓
SHL ✓ ✓ ✓ (4) ✓ ✓ ✓
SHLMB ✓ ✓ ✓
SHR ✓ ✓ ✓ (4) ✓ ✓ ✓
SHR2 ✓ ✓ ✓
SHRMB ✓ ✓ ✓
SHRU ✓ ✓ ✓ (4) ✓ ✓ ✓
SHRU2 ✓ ✓ ✓
SMPY ✓ ✓ ✓ (4) ✓ ✓ ✓
SMPYH ✓ ✓ ✓ (4) ✓ ✓ ✓
SMPYHL ✓ ✓ ✓ (5) ✓ ✓ ✓
SMPYLH ✓ ✓ ✓ (5) ✓ ✓ ✓
SMPY2 ✓ ✓ ✓
SMPY32 ✓ ✓
SPACK2 ✓ ✓ ✓
SPACKU4 ✓ ✓ ✓
SPDP ✓ ✓ ✓
SPINT ✓ ✓ ✓
SPKERNEL ✓ (5) ✓
SPKERNELR ✓ ✓
SPLOOP ✓ (5) ✓
SPLOOPD ✓ (5) ✓
SPLOOPW ✓ ✓
SPMASK ✓ (5) ✓
SPMASKR ✓ (5) ✓
SPTRUNC ✓ ✓ ✓
SSHL ✓ ✓ ✓ (5) ✓ ✓ ✓
SSHVL ✓ ✓ ✓
SSHVR ✓ ✓ ✓
SSUB ✓ ✓ ✓ (5) ✓ ✓ ✓
SSUB2 ✓ ✓
STB ✓ ✓ ✓ (5) ✓ ✓ ✓
STB (15-bit offset) ✓ ✓ ✓ ✓ ✓ ✓
STDW ✓ ✓ (5) ✓
STH ✓ ✓ ✓ (5) ✓ ✓ ✓
STH (15-bit offset) ✓ ✓ ✓ ✓ ✓ ✓
STNDW ✓ ✓ (5) ✓
STNW ✓ ✓ (5) ✓
STW ✓ ✓ ✓ (5) ✓ ✓ ✓
STW (15-bit offset) ✓ ✓ ✓ (5) ✓ ✓ ✓
SUB ✓ ✓ ✓ (5) ✓ ✓ ✓
SUBAB ✓ ✓ ✓ ✓ ✓ ✓
(5) Instruction also available in compact form, see Section 3.10.
SUBABS4 ✓ ✓ ✓
SUBAH ✓ ✓ ✓ ✓ ✓ ✓
SUBAW ✓ ✓ ✓ (5) ✓ ✓ ✓
SUBC ✓ ✓ ✓ ✓ ✓ ✓
SUBDP ✓ ✓ ✓
SUBSP ✓ ✓ ✓
SUBU ✓ ✓ ✓ ✓ ✓ ✓
SUB2 ✓ ✓ ✓ ✓ ✓ ✓
SUB4 ✓ ✓ ✓
SWAP2 ✓ ✓ ✓
SWAP4 ✓ ✓ ✓
SWE ✓ ✓
SWENR ✓ ✓
UNPKHU4 ✓ ✓ ✓
UNPKLU4 ✓ ✓ ✓
XOR ✓ ✓ ✓ (6) ✓ ✓ ✓
XORMPY ✓ ✓
XPND2 ✓ ✓ ✓
XPND4 ✓ ✓ ✓
ZERO ✓ ✓ ✓ ✓ ✓ ✓
(6) Instruction also available in compact form, see Section 3.10.
Table B-1 lists the instructions that execute on each functional unit.
(1) S2 only
SPRUFE8B – July 2010 Mapping Between Instruction and Functional Unit 715
(4) S2 only
(5) D2 only
This appendix lists the instructions that execute in the .D functional unit and illustrates the opcode maps
for these instructions.
ld/st Mnemonic
0 STW (.unit) src,*B15[ucst5]
1 LDW (.unit)*B15[ucst5], dst
op Mnemonic
0 ADD (.unit) src1, src2, dst (src1 = dst)
1 SUB (.unit) src1, src2, dst (src1 = dst, dst = src1 - src2)
Mnemonic
ADDAW (.unit)B15, ucst5, dst
op Mnemonic
0 ADDAW (.unit)B15, ucst5, B15
1 SUBAW (.unit)B15, ucst5, B15
op Mnemonic
0 0 0 see LSDx1, Figure G-4
0 0 1 see LSDx1, Figure G-4
0 1 0 Reserved
0 1 1 SUB (.unit) src2, 1, dst (src2 = dst, dst = src2 - 1)
1 0 0 Reserved
1 0 1 see LSDx1, Figure G-4
1 1 0 Reserved
1 1 1 see LSDx1, Figure G-4
dw ld/st Mnemonic
0 0 STW (.unit) src,*B15--[ucst2]
0 1 LDW (.unit)*++B15[ucst2], dst
1 0 STDW (.unit) src,*B15--[ucst2]
1 1 LDDW (.unit)*++B15[ucst2], dst
This appendix lists the instructions that execute in the .L functional unit and illustrates the opcode maps
for these instructions.
op SAT Mnemonic
0 0 ADD (.unit) src1, src2, dst
0 1 SADD (.unit) src1, src2, dst
1 0 SUB (.unit) src1, src2, dst (dst = src1 - src2)
1 1 SSUB (.unit) src1, src2, dst (dst = src1 - src2)
Mnemonic
ADD (.unit) scst5, src2, dst
op Mnemonic
0 0 0 AND (.unit) src1, src2, dst
0 0 1 OR (.unit) src1, src2, dst
0 1 0 XOR (.unit) src1, src2, dst
0 1 1 CMPEQ (.unit) src1, src2, dst
1 0 0 CMPLT (.unit) src1, src2, dst (dst = src1 < src2 , signed compare)
1 0 1 CMPGT (.unit) src1, src2, dst (dst = src1 > src2 , signed compare)
1 1 0 CMPLTU (.unit) src1, src2, dst (dst = src1 < src2 , unsigned compare)
1 1 1 CMPGTU (.unit) src1, src2, dst (dst = src1 > src2 , unsigned compare)
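The op-to-mnemonic table above can be read as a small dispatch table. The sketch below is illustrative only: it assumes 32-bit register operands, assumes that the compare instructions write 1 for true and 0 for false, and uses hypothetical names (L_UNIT_OPS, execute). It mainly shows how the signed and unsigned compares differ on the same bit patterns.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


# 3-bit op field -> (mnemonic, behavior on src1, src2), as listed in the table.
L_UNIT_OPS = {
    0b000: ("AND", lambda a, b: a & b),
    0b001: ("OR", lambda a, b: a | b),
    0b010: ("XOR", lambda a, b: a ^ b),
    0b011: ("CMPEQ", lambda a, b: int(a == b)),
    0b100: ("CMPLT", lambda a, b: int(to_signed32(a) < to_signed32(b))),
    0b101: ("CMPGT", lambda a, b: int(to_signed32(a) > to_signed32(b))),
    0b110: ("CMPLTU", lambda a, b: int(a < b)),
    0b111: ("CMPGTU", lambda a, b: int(a > b)),
}


def execute(op: int, src1: int, src2: int):
    """Return (mnemonic, dst) for a 3-bit op value and two 32-bit operands."""
    name, behavior = L_UNIT_OPS[op]
    return name, behavior(src1 & MASK32, src2 & MASK32) & MASK32
```

For example, with src1 = 0xFFFFFFFF and src2 = 1, op 100 (CMPLT) treats src1 as -1 and yields 1, while op 110 (CMPLTU) treats it as the largest unsigned value and yields 0.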
Mnemonic
MVK (.unit) scst5, dst
Mnemonic
CMPEQ (.unit) ucst3, src2, dst
op Mnemonic
0 0 CMPLT (.unit) ucst1, src2, dst (dst = ucst1 < src2 , signed compare)
0 1 CMPGT (.unit) ucst1, src2, dst (dst = ucst1 > src2 , signed compare)
1 0 CMPLTU (.unit) ucst1, src2, dst (dst = ucst1 < src2 , unsigned compare)
1 1 CMPGTU (.unit) ucst1, src2, dst (dst = ucst1 > src2 , unsigned compare)
op Mnemonic
0 0 0 see LSDx1, Figure G-4
0 0 1 see LSDx1, Figure G-4
0 1 0 SUB (.unit)0, src2, dst (src2 = dst; dst = 0 - src2)
0 1 1 ADD (.unit)-1, src2, dst (src2 = dst)
1 0 0 Reserved
1 0 1 see LSDx1, Figure G-4
1 1 0 Reserved
1 1 1 see LSDx1, Figure G-4
This appendix lists the instructions that execute in the .M functional unit and illustrates the opcode maps
for these instructions.
SAT op Mnemonic
0 0 0 MPY (.unit) src1, src2, dst
0 0 1 MPYH (.unit) src1, src2, dst
0 1 0 MPYLH (.unit) src1, src2, dst
0 1 1 MPYHL (.unit) src1, src2, dst
1 0 0 SMPY (.unit) src1, src2, dst
1 0 1 SMPYH (.unit) src1, src2, dst
1 1 0 SMPYLH (.unit) src1, src2, dst
1 1 1 SMPYHL (.unit) src1, src2, dst
This appendix lists the instructions that execute in the .S functional unit and illustrates the opcode maps
for these instructions.
Figure F-12. Call Nonconditional, Immediate with Implied NOP 5 Instruction Format
Bit:    31 30 29 28 | 27 ... 7 | 6 5 4 3 2 | 1 | 0
Field:   0  0  0  1 |  cst21   | 0 0 1 0 0 | s | p
Width:       4      |    21    |     5     | 1 | 1
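The field layout in Figure F-12 can be packed and unpacked with a few shifts and masks. The following is a minimal sketch with hypothetical function names; it treats cst21 as a raw 21-bit field value and makes no claim about how the constant is scaled or sign-extended, since that is not covered by the figure.

```python
def encode_callp(cst21: int, s: int, p: int) -> int:
    """Pack the Figure F-12 fields into a 32-bit opcode word."""
    if not 0 <= cst21 < (1 << 21):
        raise ValueError("cst21 must fit in 21 bits (raw field value)")
    if s not in (0, 1) or p not in (0, 1):
        raise ValueError("s and p are single-bit fields")
    return (
        (0b0001 << 28)      # bits 31-28: fixed 0 0 0 1
        | (cst21 << 7)      # bits 27-7: cst21
        | (0b00100 << 2)    # bits 6-2: fixed 0 0 1 0 0
        | (s << 1)          # bit 1: s
        | p                 # bit 0: p
    )


def decode_callp(word: int) -> dict:
    """Unpack a word in the Figure F-12 format; reject other bit patterns."""
    if (word >> 28) != 0b0001 or ((word >> 2) & 0x1F) != 0b00100:
        raise ValueError("word does not match the fixed bits of Figure F-12")
    return {"cst21": (word >> 7) & 0x1FFFFF, "s": (word >> 1) & 1, "p": word & 1}
```

With all variable fields zero, only the fixed bits remain, which gives 0x10000010.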
BR Mnemonic
1 BNOP (.unit) scst7, N3
BR Mnemonic
1 BNOP (.unit) ucst8, 5
BR Mnemonic
1 CALLP (.unit) scst10, 5
BR s z Mnemonic
1 0 0 [A0] BNOP .S1 scst7, N3
1 0 1 [!A0] BNOP .S1 scst7, N3
1 1 0 [B0] BNOP .S2 scst7, N3
1 1 1 [!B0] BNOP .S2 scst7, N3
BR s z Mnemonic
1 0 0 [A0] BNOP .S1 ucst8, 5
1 0 1 [!A0] BNOP .S1 ucst8, 5
1 1 0 [B0] BNOP .S2 ucst8, 5
1 1 1 [!B0] BNOP .S2 ucst8, 5
BR SAT op Mnemonic
0 0 0 ADD (.unit) src1, src2, dst
0 1 0 SADD (.unit) src1, src2, dst
0 x 1 SUB (.unit) src1, src2, dst (dst = src1 - src2)
BR op Mnemonic
0 0 SHL (.unit) src2, ucst5, dst
0 1 SHR (.unit) src2, ucst5, dst
Mnemonic
MVK (.unit) ucst8, dst
SAT op Mnemonic
x 0 0 SHL (.unit) src2, ucst5, dst (src2 = dst)
x 0 1 SHR (.unit) src2, ucst5, dst (src2 = dst)
0 1 0 SHRU (.unit) src2, ucst5, dst (src2 = dst)
1 1 0 SSHL (.unit) src2, ucst5, dst (src2 = dst)
x 1 1 see S2sh, Figure F-26
op Mnemonic
0 0 SHL (.unit) src2, src1, dst (src2 = dst, dst = src2 << src1)
0 1 SHR (.unit) src2, src1, dst (src2 = dst, dst = src2 >> src1)
1 0 SHRU (.unit) src2, src1, dst (src2 = dst, dst = src2 >> src1, unsigned)
1 1 SSHL (.unit) src2, src1, dst (src2 = dst, dst = src2 sshl src1)
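The four shift variants in this table differ only in direction, sign handling, and saturation. The sketch below models them on 32-bit values under stated assumptions: SHR is taken to be an arithmetic (sign-extending) right shift, SHRU a logical (zero-filling) right shift, and SSHL a left shift that saturates to the largest positive or most negative 32-bit value on overflow. Shift counts are assumed to lie in the range 0 to 31, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000


def to_signed32(value: int) -> int:
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def shl(src2: int, count: int) -> int:
    """SHL: shift left, discarding bits shifted out of the 32-bit result."""
    return (src2 << count) & MASK32


def shr(src2: int, count: int) -> int:
    """SHR: arithmetic shift right, replicating the sign bit."""
    return (to_signed32(src2) >> count) & MASK32


def shru(src2: int, count: int) -> int:
    """SHRU: logical shift right, filling with zeros."""
    return (src2 & MASK32) >> count


def sshl(src2: int, count: int) -> int:
    """SSHL: shift left with saturation instead of wraparound."""
    exact = to_signed32(src2) << count  # Python integers do not overflow
    clamped = max(INT32_MIN, min(INT32_MAX, exact))
    return clamped & MASK32
```

Shifting 0x80000000 right by 4 shows the difference between the two right shifts: shr yields 0xF8000000 and shru yields 0x08000000.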
op Mnemonic
0 0 EXTU (.unit) src2, ucst5,31, A0/B0
0 1 SET (.unit) src2, ucst5, ucst5, dst (src = dst, ucst5 = ucst5)
1 0 CLR (.unit) src2, ucst5, ucst5, dst (src = dst, ucst5 = ucst5)
1 1 see S2ext, Figure F-28
op Mnemonic
0 0 EXT (.unit) src,16, 16, dst
0 1 EXT (.unit) src,24, 24, dst
1 0 EXTU (.unit) src,16, 16, dst
1 1 EXTU (.unit) src,24, 24, dst
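The four rows above are fixed-constant cases of the general extract operation. As a sketch, assuming the usual definition of EXT/EXTU (shift the source left by the first constant, then right by the second, with sign extension for EXT and zero fill for EXTU), the constants 16,16 extract the low halfword and 24,24 extract the low byte. The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def extu(src: int, csta: int, cstb: int) -> int:
    """EXTU: shift left by csta, then logical shift right by cstb."""
    shifted = (src << csta) & MASK32
    return shifted >> cstb


def ext(src: int, csta: int, cstb: int) -> int:
    """EXT: shift left by csta, then arithmetic shift right by cstb."""
    shifted = (src << csta) & MASK32
    if shifted & 0x80000000:
        shifted -= 1 << 32  # reinterpret as negative before the right shift
    return (shifted >> cstb) & MASK32
```

For a source of 0x12348000, the 16,16 form gives 0xFFFF8000 with EXT (the halfword is sign-extended) and 0x00008000 with EXTU.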
op Mnemonic
0 ADD (.unit) src1, src2, dst (src1 = dst)
1 SUB (.unit) src1, src2, dst (src1 = dst, dst = src1 - src2)
Mnemonic
ADDK (.unit) ucst5, dst
op Mnemonic
0 0 0 see LSDx1, Figure G-4
0 0 1 see LSDx1, Figure G-4
0 1 0 SUB (.unit)0, src2, dst (src2 = dst, dst = 0 - src2)
0 1 1 ADD (.unit)-1, src2, dst (src2 = dst)
1 0 0 Reserved
1 0 1 see LSDx1, Figure G-4
1 1 0 MVC (.unit) src, ILC (s = 1)
1 1 1 see LSDx1, Figure G-4
Mnemonic
BNOP (.unit) src2, N3
This appendix illustrates the opcode maps that execute in the .D, .L, or .S functional units.
For a list of the instructions that execute in the .D functional unit, see Appendix C. For a list of the
instructions that execute in the .L functional unit, see Appendix D. For a list of the instructions that execute
in the .S functional unit, see Appendix F.
Table G-1. .D, .L, and .S Units Opcode Map Symbol Definitions
Symbol Meaning
CC
dst destination
dstms
op opfield; field within opcode that specifies a unique instruction
s side A or B for destination; 0 = side A, 1 = side B
src source
src2 source 2
srcms
ucstn n-bit unsigned constant field
unit unit decode
x cross path for src2; 0 = do not use cross path, 1 = use cross path
unit Mnemonic
0 0 MV (.Ln) src, dst
0 1 MV (.Sn) src, dst
1 0 MV (.Dn) src, dst
unit Mnemonic
0 0 MV (.Ln) src, dst
0 1 MV (.Sn) src, dst
1 0 MV (.Dn) src, dst
CC Mnemonic
0 0 [A0] MVK (.unit) ucst1, dst
0 1 [!A0] MVK (.unit) ucst1, dst
1 0 [B0] MVK (.unit) ucst1, dst
1 1 [!B0] MVK (.unit) ucst1, dst
CC unit Mnemonic
0 0 0 0 [A0] MVK (.Ln) ucst1, dst
0 1 [A0] MVK (.Sn) ucst1, dst
1 0 [A0] MVK (.Dn) ucst1, dst
CC unit Mnemonic
0 1 0 0 [!A0] MVK (.Ln) ucst1, dst
0 1 [!A0] MVK (.Sn) ucst1, dst
1 0 [!A0] MVK (.Dn) ucst1, dst
CC unit Mnemonic
1 0 0 0 [B0] MVK (.Ln) ucst1, dst
0 1 [B0] MVK (.Sn) ucst1, dst
1 0 [B0] MVK (.Dn) ucst1, dst
CC unit Mnemonic
1 1 0 0 [!B0] MVK (.Ln) ucst1, dst
0 1 [!B0] MVK (.Sn) ucst1, dst
1 0 [!B0] MVK (.Dn) ucst1, dst
op Mnemonic
0 0 0 MVK (.unit)0, dst
0 0 1 MVK (.unit)1, dst
0 1 0 See Dx1, Figure C-20; Lx1, Figure D-11; and Sx1, Figure F-31
0 1 1 See Dx1, Figure C-20; Lx1, Figure D-11; and Sx1, Figure F-31
1 0 0 See Dx1, Figure C-20; Lx1, Figure D-11; and Sx1, Figure F-31
1 0 1 ADD (.unit) src, 1, dst (src = dst)
1 1 0 See Dx1, Figure C-20; Lx1, Figure D-11; and Sx1, Figure F-31
1 1 1 XOR (.unit) src, 1, dst (src = dst)
op unit Mnemonic
0 0 0 0 0 MVK (.Ln)0, dst
0 1 MVK (.Sn)0, dst
1 0 MVK (.Dn)0, dst
op unit Mnemonic
0 0 1 0 0 MVK (.Ln)1, dst
0 1 MVK (.Sn)1, dst
1 0 MVK (.Dn)1, dst
op unit Mnemonic
1 0 1 0 0 ADD (.Ln) src, 1, dst
0 1 ADD (.Sn) src, 1, dst
1 0 ADD (.Dn) src, 1, dst
op unit Mnemonic
1 1 1 0 0 XOR (.Ln) src, 1, dst
0 1 XOR (.Sn) src, 1, dst
1 0 XOR (.Dn) src, 1, dst
This appendix lists the instructions that execute with no unit specified and illustrates the opcode maps for
these instructions.
For a list of the instructions that execute in the .D functional unit, see Appendix C. For a list of the
instructions that execute in the .L functional unit, see Appendix D. For a list of the instructions that execute
in the .M functional unit, see Appendix E. For a list of the instructions that execute in the .S functional unit,
see Appendix F.
SPRUFE8B – July 2010 No Unit Specified Instructions and Opcode Maps 763
Figure H-1. DINT and RINT, SWE and SWENR Instruction Format
Bit:    31-28   | 27-24        | 23-17         | 16-13 | 12-1                    | 0
Field:  0 0 0 1 | Reserved (0) | 0 0 0 0 0 0 0 | op    | 0 0 0 0 0 0 0 0 0 0 0 0 | p
Width:  4       | 4            | 7             | 4     | 12                      | 1
op Mnemonic
0 SPLOOP ii (ii = real ii - 1)
1 SPLOOPD ii
op Mnemonic
0 [A0] SPLOOPD ii (ii = real ii - 1)
1 [B0] SPLOOPD ii
Mnemonic
SPKERNEL ii/stage
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
D2 D1 1 0 1 1 S2 S1 L2 1 1 0 0 1 1 L1
1 1 1 1 1 1
NOTE: Supports masking of D1, D2, L1, L2, S1, and S2 instructions (not M1 or M2)
Mnemonic
SPMASK unitmask
b) SPMASKR Instruction
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
D2 D1 1 1 1 1 S2 S1 L2 1 1 0 0 1 1 L1
1 1 1 1 1 1
NOTE: Supports masking of D1, D2, L1, L2, S1, and S2 instructions (not M1 or M2)
Mnemonic
SPMASKR unitmask
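The two 16-bit layouts above differ only in bit 12, so one helper can build either opcode from a set of unit names. This is a minimal sketch taken directly from the bit positions shown in the two layouts; the function and constant names are hypothetical.

```python
# Bit position of each maskable unit, read from the layouts above.
UNIT_BITS = {"L1": 0, "L2": 7, "S1": 8, "S2": 9, "D1": 14, "D2": 15}

# Fixed 1 bits shared by both layouts: bits 13, 11, 10, 6, 5, 2, and 1.
FIXED_BITS = 0x2C66

# Bit 12 is 0 for SPMASK and 1 for SPMASKR.
RELOAD_BIT = 1 << 12


def encode_spmask(units, reload: bool = False) -> int:
    """Build the 16-bit SPMASK (or SPMASKR when reload=True) opcode."""
    word = FIXED_BITS | (RELOAD_BIT if reload else 0)
    for unit in units:
        if unit not in UNIT_BITS:
            # M1 and M2 land here: the format has no mask bit for them.
            raise ValueError(f"unit {unit!r} cannot be masked in this format")
        word |= 1 << UNIT_BITS[unit]
    return word
```

For instance, encode_spmask({"D1"}) sets only bit 14 on top of the fixed pattern and returns 0x6C66, while passing "M1" raises an error, matching the note that M1 and M2 are not supported.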
Mnemonic
NOP N3
Revision History
Table I-1 lists the changes made since the previous version of this document.
Mailing Address: Texas Instruments, Post Office Box 655303, Dallas, Texas 75265
Copyright © 2010, Texas Instruments Incorporated