MA C6000 2DAY Student Guide Rev2.3
Notice
These materials (slides, labs and solutions) are effectively under a Creative Commons license
because they are stored on a public website. However, the current author,
Mindshare Advantage LLC, must be contacted before these materials are used in any
other form for presentations, college course material or for any other purpose. These
materials are updated and kept current by Mindshare Advantage LLC and are
used in association with Texas Instruments, with their permission to update and
maintain them.
Mindshare Advantage reserves the right to update this Student (and Lab) Guide to
reflect the most current product information for the spectrum of users. If there are
any differences between this Guide and a technical reference manual, references
should always be made to the most current reference manual and/or datasheet.
Information contained in this publication is believed to be accurate and reliable.
However, responsibility is assumed neither for its use nor any infringement of patents
or rights of others that may result from its use. No license is granted by implication or
otherwise under any patent or patent right of Texas Instruments or Mindshare
Advantage.
If you have any questions pertaining to this material, please contact Mindshare
Advantage at:
www.MindshareAdvantage.com
Revision History
2.00 March 2016 entire workshop updated to the latest tools (slides, code, labs, etc.)
During the past two days, some specific C6000 architecture items were skipped in favor of
covering all TI EP processors with the same focus. Now it is time to dive deeper into the C6000
specifics.

The first part of this chapter focuses on the C6000 family of devices. The second part dives deeper
into topics already discussed in the previous two days of the TI-RTOS Kernel Workshop. In a way,
this chapter catches all C6000 users up on this target environment specifically.

After this chapter, we dive even deeper into specific parts of the architecture, such as
optimizations, cache and EDMA.
Objectives
Module Topics
C6000 Introduction .................................................................................................................... 11-1
Module Topics ......................................................................................................................... 11-2
TI EP Product Portfolio............................................................................................................ 11-3
DSP Core ................................................................................................................................ 11-4
Devices & Documentation ....................................................................................................... 11-6
Peripherals .............................................................................................................................. 11-7
PRU ..................................................................................................................................... 11-8
SCR / EDMA3 .................................................................................................................... 11-9
Pin Muxing......................................................................................................................... 11-10
Example Device: C6748 DSP ............................................................................................... 11-11
Choosing a Device ................................................................................................................ 11-12
C6000 Arch Catchup .......................................................................................................... 11-13
C64x+ Interrupts................................................................................................................ 11-13
Event Combiner ................................................................................................................ 11-14
Target Config Files ............................................................................................................ 11-15
Creating Custom Platforms ............................................................................................... 11-16
Quiz ....................................................................................................................................... 11-19
Quiz - Answers .................................................................................................................. 11-20
Using Double Buffers ............................................................................................................ 11-21
Lab 11: An Hwi-Based Audio System ................................................................................... 11-23
Lab 11 Procedure ............................................................................................................... 11-24
Import Existing Project ...................................................................................................... 11-24
Application (FIR Audio) Overview ..................................................................................... 11-25
Source Code Overview ..................................................................................................... 11-26
Add Hwi to the Project ....................................................................................................... 11-27
Optional OMAP-L138 LCDK Users ONLY ..................................................................... 11-28
Build, Load, Run. ............................................................................................................... 11-29
Debug Interrupt Problem ................................................................................................... 11-29
Using the Profiler Clock ..................................................................................................... 11-31
TI EP Product Portfolio

TI's Embedded Processor Portfolio (Microcontrollers (MCU) and Application Processors (MPU)):

    MSP430      16-bit      Ultra Low Power
    C2000       32-bit      Real-time & Cost
    Tiva-C      32-bit      All-around MCU
    Hercules    32-bit      Safety
    Sitara      32-bit      Linux / Android
    DSP         16/32-bit   All-around DSP
    Multicore   32-bit      Massive Performance
DSP Core
What Problem Are We Trying To Solve?
(Figure: the basic DSP signal chain - samples x enter through an ADC, the DSP processes them, and results y leave through a DAC.)

(Figure: the C6000 DSP core roadmap. Fixed-point generations: C62x, C621x, C64x, C641x/DM642, then C64x+ (C645x, C647x, DM643x, DM64xx, OMAP35x, DM37x). Floating-point generations: C67x, C671x, C67x+, C672x. Fixed- and floating-point: C674x (C6748, OMAP-L138, C6A8168) with compact instructions, lower power, EDMA3 and PRU, followed by C66x (C667x, C665x) and future devices. The newest generations add L1 RAM/cache, enhanced EDMA and video/imaging features.)
Peripherals
(Figure: an example SoC block diagram combining a C6x DSP, an ARM core, a graphics accelerator and video accelerator(s).)
PRU
Programmable Realtime Unit (PRU)

The PRU consists of:
- Two independent, real-time RISC cores
- Access to pins (GPIO)
- Its own interrupt controller
- Access to memory (master via the SCR)
- Device power management control (ARM/DSP clock gating)
- No C compiler (ASM only)

Use it as a soft peripheral to implement additional on-chip peripherals. Example implementations include:
- Soft UART
- Soft CAN
- Custom peripherals or non-linear DMA moves
- A smart power controller: allows switching off both the ARM and DSP clocks, and maximizes power-down time by evaluating system events before waking up the DSP and/or ARM
SCR / EDMA3
System Architecture: SCR/EDMA

SCR = Switched Central Resource
- Masters (requestors) initiate accesses to/from slaves (resources) via the SCR.
- Most masters and slaves have their own port to the SCR.
- Lower-bandwidth masters (HPI, PCI66, etc.) share a port.

(Figure: masters such as the ARM, the C64x+ DSP, the EDMA3 transfer controllers (TC0, TC1) and the EMAC connect through the SCR to slaves such as internal memory, DDR2, EMIF64, TCP and VCP.)
Pin Muxing
What is Pin Multiplexing?
Pin Mux Example

(Figure: peripherals such as HPI, uPP, SATA, Timers, I2C, SPI, UART, LCD, PWM and eCAP share groups of 128 device pins; the pin mux registers select which function each pin carries.)

Example Device: C6748 DSP
- 13x13 mm nPBGA and 16x16 mm PBGA packages
- Pin-to-pin compatible with OMAP-L138 (+ARM9), 361-pin package
- 32KB L1D cache/SRAM
- Dynamic voltage/frequency scaling
- Total power < 420 mW
Choosing a Device
DSP & ARM MPU Selection Tool
http://focus.ti.com/en/multimedia/flash/selection_tools/dsp/dsp.html
C6000 Arch Catchup

C64x+ Interrupts

The C6748 has 128 possible interrupt sources, but only 12 maskable CPU interrupts (INT4-INT15).

4-Step Programming:
1. Interrupt Selector: choose which of the 128 sources are tied to the 12 CPU interrupts.
2. IER: enable the individual interrupts that you want to listen to (in the BIOS .cfg).
3. GIE: enable global interrupts (turned on automatically if BIOS is used).
4. HWI Dispatcher: performs a smart context save/restore (automatic for a BIOS Hwi).

Note: NMIE must also be enabled. BIOS automatically sets NMIE = 1. If BIOS is NOT used, the user must turn on both GIE and NMIE manually; a minimal sketch of that case follows below.
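The following is a minimal sketch of the manual, no-BIOS case, assuming the TI compiler's c6x.h control-register variables (CSR, IER, ICR); the choice of INT4 is just an example of a source routed by the Interrupt Selector.

    /* Hedged sketch: enabling one CPU interrupt without BIOS. */
    #include <c6x.h>

    void enableInt4NoBios(void)
    {
        ICR  = 0xFFF0;               /* clear any pending maskable interrupts */
        IER |= (1 << 4) | (1 << 1);  /* enable INT4, plus NMIE (IER bit 1)    */
        CSR |= 0x1;                  /* GIE = 1: global interrupt enable      */
    }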
Event Combiner

Event Combiner (ECM)
- Use it only if you need more than 12 interrupt events.
- The ECM combines multiple events (e.g. events 4-31) into one combined event (e.g. EVT0); any of events 4-127 can be combined.
- The EVTx ISR must parse MEVTFLAG to determine which event(s) actually occurred.
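Under SYS/BIOS, the MEVTFLAG parsing is handled by the family-specific EventCombiner module; the sketch below is a hedged illustration of routing one event through it. The event number (61) and handler name are placeholders, not lab code.

    /* Hedged sketch: plug event 61 into the Event Combiner via SYS/BIOS. */
    #include <xdc/std.h>
    #include <ti/sysbios/family/c64p/EventCombiner.h>

    Void myEvent61Isr(UArg arg);           /* hypothetical per-event handler */

    Void setupCombinedEvent(Void)
    {
        /* Events 32-63 are reported on combined event EVT1; dispatchPlug
           registers the handler and unmasks the event in the combiner. */
        EventCombiner_dispatchPlug(61, &myEvent61Isr, 0, TRUE);
    }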
Target Config Files

(Screenshot: in the target configuration (.ccxml) editor, click the Advanced tab to view and change the GEL file assigned to each CPU.)

A GEL file is basically a batch file that sets up the CCS debug environment, including:
- Memory map
- Watchdog
- UART
- Other peripherals
Quiz

Chapter Quiz
1. How many functional units does the C6000 CPU have?

(Figure for question 5: a block diagram of the memory/bus system to label, showing a 256-bit bus and a 128-bit bus into the CPU.)
Quiz - Answers

Chapter Quiz
1. How many functional units does the C6000 CPU have?
   8 functional units (execution units)
2. What is the size of a C6000 instruction word?
   256 bits (8 units x 32-bit instructions per unit)
3. What is the name of the main bus arbiter in the architecture?
   Switched Central Resource (SCR)
4. What is the main difference between a bus master and slave?
   Masters can initiate a memory transfer (e.g. EDMA, CPU)
5. Fill in the names of the following blocks of memory and bus:
   L1P feeds the CPU over a 256-bit bus, L1D over a 128-bit bus; L2 sits behind both and connects to the SCR.
This lab also employs triple buffers (another version of ping/pong, with an extra "pang" buffer); a minimal sketch of the rotation idea follows below. Both the RCV and XMT sides have triple buffers. The concept is that while you are processing one buffer, the NEXT buffer in line is being filled. This lab is based on the C6748 StarterWare audio application from TI, converted to the latest TI-RTOS and an FIR filter.
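As a point of reference, here is a minimal sketch of triple-buffer rotation, assuming 16-bit audio samples; the buffer name, BUFFSIZE value and index variables are placeholders, not the lab's actual symbols.

    /* Hedged sketch: triple-buffer (ping/pong/pang) index rotation. */
    #include <stdint.h>

    #define NUM_BUFS  3
    #define BUFFSIZE  256                    /* placeholder block size        */

    int16_t rcvBuf[NUM_BUFS][BUFFSIZE];      /* one buffer fills via the ISR  */
    volatile int fillIdx = 0;                /* ...while another is processed */
    volatile int procIdx = NUM_BUFS - 1;

    static int nextBuf(int i)                /* ping -> pong -> pang -> ping  */
    {
        return (i + 1) % NUM_BUFS;
    }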
Lab 11 Procedure
If you can't remember how to perform some of these steps, please refer back to the previous labs
for help. Or, if you really get stuck, ask your neighbor. If you AND your neighbor get stuck, then
ask the instructor (who is probably doing absolutely NOTHING important) for help.
This starter file contains all the starting source files for the audio project including the setup
code for the A/D and D/A on the C6748 LCDK (or OMAP-L138 LCDK). It also has UIA
activated, but this won't be used until the next lab.
3. Check the Properties to ensure you are using the latest XDC, BIOS and UIA.
For every imported project in this workshop, ALWAYS check to make sure the latest tools
(XDC, BIOS and UIA) are being used. The author created these projects at time x and you
may have updated the tools on your student PC at x+1 some time later. The author used
the tools available at time x to create the starter projects and solutions which may or may
not match YOUR current set of tools.
Therefore, you may be importing a project that is NOT using the latest versions of the tools
(XDC, BIOS, UIA) or the compiler.
Check ALL settings for the Properties of the project (XDC, BIOS, UIA) and the compiler,
update the imported project to the latest tools before moving on, and save all settings.
coeffs_MA_TIRTOS.c contains the FIR filter coefficients: low pass, high pass and
all pass. To change the values, simply comment out one set and uncomment another set.
ALL_PASS is set by default.
Led_MA_TIRTOS.c contains the Task that toggles the LED on the LCDK. The code
uses the StarterWare library calls.
system_MA_TIRTOS.h is the main header file used by all other files. It contains
the #define statements that control almost everything in the code, along with the function
prototypes.
Then fill in the following dialogue boxes to match what is shown below:
Make sure Enable at startup is NOT checked (this sets the IER bit
on the C6748). This will provide us with something to debug later.
The following information will help users of the OMAP-L138 LCDK get these labs to work
properly.
First, the devices are very similar. For the build, you can target either the C6748 LCDK or the OMAP-
L138 LCDK in (right-click on the project) Properties → General. The author chose to simplify things
and just keep the C6748 LCDK as the target. That works fine.
Second, the OMAP device contains an ARM9 CPU that must be powered up FIRST before the
DSP (C674x). So, OMAP-L138 users must do two things in addition to the C6748 LCDK users:
Refer back to the early steps of Lab 1 (TI-RTOS workshop) if you don't remember how to import
target config files. Also, if you use a different emulator than the Spectrum Digital XDS510, you will
need to update your .ccxml file to reflect that. The board should be set to LCDKOMAPL138.
When you launch a debug session, use this file instead of the C6748 version. It uses a different
GEL file that configures both the ARM9 and the DSP which is also located in the TI_RTOS.zip file
as long as you placed that folder at the root: C:\TI_RTOS. If not, you will have to change the
location of the GEL file used by the target config file.
First, click on the ARM9_0 CPU and select Run → Connect Target, or just click the Connect button:
Then connect to the C674x_0 CPU the same way. Then load the .out file to the C674x CPU. You
should see the GEL file output to the Console window as it runs.
Hint: The StarterWare application has a unique "send zeroes if McASP Xmt underruns"
feature. Normally, the McASP on the C6748 cannot be restarted after a halt, i.e. you
can't just hit Halt, then Run. However, in this application, if a halt occurs and underruns
the XMT side of the McASP, the application continues to send ZEROES to the output to
keep it alive vs. simply dying. This is a nice feature. You may hear static when you halt,
but you can simply click Play again to keep running.
In the bottom right-hand part of the screen, you should see a little CLK symbol that looks like
this:
Run to the first breakpoint, then double-click on the clock symbol to zero it. Run again and
the number of CPU cycles will display.
One place to set breakpoints is just before the FIR filter starts and just after it ends, basically
benchmarking how long the FIR filter takes to run. Like this:
RAISE YOUR HAND and get the instructor's attention when you
have completed this lab.
Objectives
Module Topics
C6000 CPU Architecture ........................................................................................................... 12-1
Module Topics ......................................................................................................................... 12-2
What Does A DSP Do? ........................................................................................................... 12-3
CPU From the Inside Out .............................................................................................. 12-4
Instruction Sets ..................................................................................................................... 12-10
MAC Instructions ................................................................................................................ 12-12
C66x MAC Instructions .................................................................................................... 12-14
Hardware Pipeline ................................................................................................................. 12-15
Software Pipelining ............................................................................................................... 12-16
Chapter Quiz ......................................................................................................................... 12-19
Quiz - Answers .................................................................................................................. 12-20
What Does A DSP Do?

(Figure: the basic DSP signal chain - samples x come in through an ADC, the DSP computes, and results y go out through a DAC.)

The core operation is a sum of products:

    y = sum of c(n) * x(n),  for n = 1 to 40

CPU - From the Inside Out

The C6000 is designed to handle a DSP's math-intensive calculations. The multiply runs on the .M unit (multiplier) and the accumulate runs on the .L unit (ALU):

    MPY  .M  c, x, prod
    ADD  .L  y, prod, y

Note: You don't have to specify functional units (.M or .L); the tools can assign them for you.

The operands (c, x, prod, y) live in Register File A: 16 or 32 registers, each 32 bits wide.
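For comparison, here is the same sum of products in C, which is the form the compiler actually sees in the labs; the array names and the 40-term count simply mirror the slide.

    /* Hedged sketch: the dot product from the slide, written in C. The
       compiler assigns the multiply to .M and the add to .L on its own. */
    int dotp40(const short *c, const short *x)
    {
        int n, y = 0;
        for (n = 0; n < 40; n++)
            y += c[n] * x[n];
        return y;
    }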
Making Loops

1. Program flow: the branch instruction

    B  loop
The loop body adds a counter (cnt) and branches back to the top; the branch runs on the .S unit:

    loop:   MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
            B    .S  loop

(Register File A now also holds cnt.)
Any C6000 instruction can be made conditional on a register, written as [condition] B loop. Making the branch conditional on the counter ends the loop when cnt reaches zero:

    loop:   MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
      [cnt] B    .S  loop
Operands are loaded from data memory, which holds x(40), a(40) and y, using the .D unit (LDH = load halfword):

    loop:   LDH  .D  *cp, c
            LDH  .D  *xp, x
            MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
      [cnt] B    .S  loop

How do we increment through the arrays?
Auto-Increment of Pointers

    loop:   LDH  .D  *cp++, c
            LDH  .D  *xp++, x
            MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
      [cnt] B    .S  loop

How do we store results back to memory?
The result is stored back to memory with STW on the .D unit:

    loop:   LDH  .D  *cp++, c
            LDH  .D  *xp++, x
            MPY  .M  c, x, prod
            ADD  .L  y, prod, y
            SUB  .L  cnt, 1, cnt
      [cnt] B    .S  loop
            STW  .D  y, *yp

But wait - that's only half the story...
The same loop written with real register names (Register File A has 16 or 32 registers, A0-A15 or A0-A31, each 32 bits wide). Here A0 = c(n), A1 = x(n), A2 = cnt, A3 = prod, A4 = sum, A5 = *c, A6 = *x, A7 = *y:

            MVK  .S1  40, A2
    loop:   LDH  .D1  *A5++, A0
            LDH  .D1  *A6++, A1
            MPY  .M1  A0, A1, A3
            ADD  .L1  A4, A3, A4
            SUB  .S1  A2, 1, A2
      [A2]  B    .S1  loop
            STW  .D1  A4, *A7

It's easier to use symbols rather than register names, but you can use either method.
Instruction Sets
C62x RISC-like instruction set

.S Unit:  ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO

.L Unit:  ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO

.M Unit:  MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH

.D Unit:  ADD, ADDAB (B/H/W), LDB (B/H/W), MV, NEG, STB (B/H/W), SUB, SUBAB (B/H/W), ZERO

No Unit Used:  NOP, IDLE
C64x+ Additions

.S Unit:  CALLP, DMV, RPACK2

.L Unit:  ADDSUB, ADDSUB2, DPACK2, DPACKX2, SADDSUB, SADDSUB2, SHFL3, SSUB2

.M Unit:  CMPY, CMPYR, CMPYR1, DDOTP4, DDOTPH2, DDOTPH2R, DDOTPL2, DDOTPL2R, GMPY, MPY2IR, MPY32 (32-bit result), MPY32 (64-bit result), MPY32SU, MPY32U, MPY32US, SMPY32, XORMPY

.D Unit:  None

No Unit Used:  DINT, RINT, SPKERNEL, SPKERNELR, SPLOOP, SPLOOPD, SPLOOPW, SPMASK, SPMASKR, SWE, SWENR
MAC Instructions

DOTP2 with LDDW

DOTP2 does two 16x16 multiplies and adds the products in a single .M operation, while LDDW loads four 16-bit terms (a3 a2 : a1 a0) at once:

    LDDW   .D1  *A4++, A1:A0        ; load a3 a2 : a1 a0
    DOTP2  A0, B0, A2               ; a1*x1 + a0*x0
 || DOTP2  A1, B1, B2               ; a3*x3 + a2*x2
    ADD    A2, A3, A3               ; add to the intermediate sums
 || ADD    B2, B3, B3
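From C, this is normally expressed with compiler intrinsics rather than hand-written assembly. The sketch below is a hedged illustration; _dotp2() and _amem4_const() are standard C6000 compiler intrinsics, while the function and array names are placeholders.

    /* Hedged sketch: a dual-MAC dot product using C6000 intrinsics.
       _amem4_const() does an aligned 32-bit load of a pair of shorts;
       _dotp2() multiplies both halves and adds the two products. */
    #include <c6x.h>

    int dotp2_c(const short *restrict a, const short *restrict x, int n)
    {
        int i, sum = 0;
        for (i = 0; i < n; i += 2)
            sum += _dotp2(_amem4_const(&a[i]), _amem4_const(&x[i]));
        return sum;
    }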
(Figure: successive loop iterations [i, j] overlapped so that products such as d0*c0, d1*c1, d2*c2 and d3*c3 are issued continuously.)

Four 16x16 multiplies in each .M unit every cycle adds up to 8 MACs/cycle, or 8000 MMACS. Bottom line: two loop iterations for the price of one.
C66x MAC Instructions

(Figure: C66x complex-multiply instructions operate on packed complex operands, e.g. src1 = r1 i1 : r2 i2, within a single .M unit.)
Hardware Pipeline

Pipeline Phases

The pipeline has three stages: Program Fetch (phases PG, PS, PW, PR), Decode (DP, DC) and Execute (E1, ...). A new fetch begins every cycle, so successive instructions march through PG PS PW PR DP DC E1 one cycle apart until the pipeline is full and every phase is busy at once.
Software Pipelining
Instruction Delays

All 'C64x instructions require only one cycle to execute, but some results are delayed:

    Branch (B)    5 delay slots

(Loads and multiplies also have delayed results: 4 and 1 delay slots respectively.)
The same loop again with real register names, now considering those delays (A0 = c(n), A1 = x(n), A2 = cnt, A3 = prod, A4 = sum, A5 = *c, A6 = *x, A7 = *y):

            MVK  .S1  40, A2
    loop:   LDH  .D1  *A5++, A0
            LDH  .D1  *A6++, A1
            MPY  .M1  A0, A1, A3
            ADD  .L1  A4, A3, A4
            SUB  .S1  A2, 1, A2
      [A2]  B    .S1  loop
            STW  .D1  A4, *A7

Need to add NOPs to get this code to work properly (NOP = "Not Optimized Properly").

How many instructions can this CPU execute every cycle?
Chapter Quiz
Chapter Quiz
1. Name the four functional units and types of instructions they execute:
2. How many 16x16 MACs can a C674x CPU perform in 1 cycle? C66x ?
3. Where are CPU operands stored and how do they get there?
5. What is the purpose of s/w pipelining, and which tool does this for you?
Quiz - Answers

Chapter Quiz
1. Name the four functional units and types of instructions they execute:
   .M unit: multiplies (fixed and floating point)
   .L unit: ALU arithmetic and logical operations
   .S unit: branches and shifts
   .D unit: data loads and stores
2. How many 16x16 MACs can a C674x CPU perform in 1 cycle? C66x?
   C674x: 8 MACs/cycle; C66x: 32 MACs/cycle
3. Where are CPU operands stored and how do they get there?
   In the Register Files (A and B); data is loaded (LDx) from memory
5. What is the purpose of s/w pipelining, and which tool does this for you?
   Maximize performance by using as many functional units as possible in every cycle; the COMPILER/OPTIMIZER performs the s/w pipelining
Outline
Objectives
Module Topics
C and System Optimizations ................................................................................................... 13-1
Module Topics ......................................................................................................................... 13-2
Introduction Optimal and Optimization ............................................................................ 13-3
C Compiler and Optimizer ....................................................................................................... 13-5
Debug vs. Optimized ...................................................................................................... 13-5
Levels of Optimization ......................................................................................................... 13-6
Build Configurations ............................................................................................................ 13-7
Code Space Optimization (-ms) ......................................................................................... 13-8
File and Function Specific Options ..................................................................................... 13-9
Coding Guidelines ............................................................................................................. 13-10
Data Types and Alignment .................................................................................................... 13-11
Data Types ........................................................................................................................ 13-11
Data Alignment .................................................................................................................. 13-12
Forcing Data Alignment..................................................................................................... 13-13
Restricting Memory Dependencies (Aliasing) ....................................................................... 13-14
Access Hardware Features Using Intrinsics ...................................................................... 13-16
Give Compiler MORE Information ........................................................................................ 13-17
Pragma Unroll() .............................................................................................................. 13-17
Pragma MUST_ITERATE() ............................................................................................ 13-18
Keyword - Volatile ............................................................................................................. 13-18
Setting MAX interrupt Latency (-mi option) ....................................................................... 13-19
Compiler Directive - _nassert() ......................................................................................... 13-20
Using Optimized Libraries ..................................................................................................... 13-21
Libraries Download and Support .................................................................................... 13-23
System Optimizations ........................................................................................................... 13-24
Custom Sections ............................................................................................................... 13-24
Use EDMA......................................................................................................................... 13-25
Use Cache......................................................................................................................... 13-26
System Architecture SCR .............................................................................................. 13-26
Chapter Quiz ......................................................................................................................... 13-27
Quiz - Answers .................................................................................................................. 13-28
Lab 13 C Optimizations ...................................................................................................... 13-29
Lab 13 C Optimizations Procedure ................................................................................. 13-30
PART A Goals and Using Compiler Options.................................................................. 13-30
Determine Goals and CPU Min ..................................................................................... 13-30
Using Release Configuration (-o2, -g) ......................................................................... 13-33
Using Opt Configuration ............................................................................................. 13-36
Part B Code Tuning ........................................................................................................ 13-39
Part C Minimizing Code Size (-ms) ............................................................................... 13-40
Part D Using DSPLib ...................................................................................................... 13-41
Conclusion......................................................................................................................... 13-42
Goals:
- A typical goal of any system's algorithm is to meet real-time.
- You might also want to approach or achieve CPU Min in order to maximize the number of channels processed.
Optimization Intro

Optimization is:
A continuous process of refinement, in which the code being optimized executes faster and takes fewer cycles, until a specific objective is achieved (e.g. real-time execution).

Bottom Line:
Learn as many optimization techniques as possible and try them all (if necessary). This is the GOAL of this chapter.

Benchmarks (cycles):

    Optimization level      FIR (256, 64)    DOTP (256-term)
    Debug (no opt, -g)      817K             4109
    Opt (-o3, no -g)        18K              42
    Add'l pragmas           7K               42
    (DSPLib)                7K               42
    CPU Min                 4096             42

Debug: get your code LOGICALLY correct first (no optimization).
Opt: increase performance using compiler options (easier).
CPU Min: it depends; could require extensive time.
Levels of Optimization

Levels of Optimization

    Option      Scope       Optimizes across
    -o0, -o1    LOCAL       a single block
    -o2         FUNCTION    across blocks
    -o3         FILE        across functions
    -pm -o3     PROGRAM     across files
Build Configurations
Two Default Configurations
For new projects, CCS always creates two default build configurations:
Note: these are simply sets or containers for build options. If you set a path in one,
it does NOT copy itself to the other (e.g. include paths). Also, you can make your own!
Coding Guidelines
Programming the C6000

    Source                   Efficiency*   Effort
    C compiler + optimizer   80 - 100%     Low
Data Alignment

Data Alignment in Memory

DataType.C:

    char   z = 1;
    short  x = 7;
    int    y;
    double w;

    void main(void)
    {
        y = child(x, 5);
    }

(Figure: the variables placed in data memory at byte (LDB) boundaries, addresses 0 through 9.)

Hint: all single data items are aligned on type boundaries.
Alignment of Structures

2. Use unions:

    typedef union algn_t {
        short a2[80];
        long long a8[10];
    } algn_t;

    typedef struct ex2_t {
        short b;
        algn_t a3;
    } ex2;
Forcing Alignment

    #pragma DATA_ALIGN(x, 4)
    short z;
    short x;
Aliasing?

    void fcn(*in, *out)
    {
        LDW  *in++,  A0
        ADD  A0, 4, A1
        STW  A1, *out++
    }

Intent: no aliasing. *in and *out point to different memory locations (the figure shows in walking through one buffer, a, b, ..., while out walks through another, out0, out1, out2, ...). But what does the generated ASM code assume?
- Reads are not the problem; WRITES are. *out COULD point anywhere.
- The compiler is paranoid: it assumes aliasing unless told otherwise.
- The ASM code is the key (software pipelining).
- Use the restrict keyword (more soon).
Aliasing?

What happens if the function is called like this?

    fcn(*myVector, *myVector+1)

    void fcn(*in, *out)
    {
        LDW  *in++,  A0
        ADD  A0, 4, A1
        STW  A1, *out++
    }

(Now in and out overlap: each STW writes a location the next LDW is about to read.)
Alias Solutions

1. The compiler solves most aliasing on its own. If in doubt, the result will still be correct, even if the most optimal method won't be used. (A sketch of the restrict-keyword solution appears below.)
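A minimal sketch of the restrict solution, assuming the same copy-style loop as the fcn() example above; the function signature here is illustrative, not the lab's cfir() prototype.

    /* Hedged sketch: 'restrict' promises the compiler that 'out' is the
       only pointer used to write this memory, so loads and stores can be
       overlapped (software pipelined) safely. */
    void fcn_restrict(const int *restrict in, int *restrict out, int count)
    {
        int i;
        for (i = 0; i < count; i++)
            out[i] = in[i] + 4;
    }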
Intrinsics - Examples

Think of intrinsics as a specialized function library written by TI:

    _add2()    _sadd()    _mpy()     _smpy()
    _clr()     _set()     _mpyh()    _smpyh()
    _ext/u()   _sshl()    _mpylh()   _ssub()
    _lmbd()    _subc()    _mpyhl()   _sub2()
    _nassert() _sat()     _norm()

- #include <c6x.h> has prototypes for all the intrinsic functions.
- Intrinsics are great for accessing the hardware functionality which is unsupported by the C language.
- To run your C code on another compiler, download the intrinsic C source: spra616.zip.

Example:

    int x, y, z;
    z = _lmbd(x, y);

Refer to the C Compiler User's Guide for more information.
Pragma Unroll()

3. UNROLL(# of times to unroll)

    #pragma UNROLL(2)
    for (i = 0; i < count; i++) {
        sum += a[i] * x[i];
    }
Pragma MUST_ITERATE()

4. MUST_ITERATE(min, max, %factor)

    #pragma UNROLL(2)
    #pragma MUST_ITERATE(10, 100, 2)
    for (i = 0; i < count; i++) {
        sum += a[i] * x[i];
    }
Keyword - Volatile

5. Use the volatile keyword

If a variable changes OUTSIDE the optimizer's scope, the optimizer will remove/delete the variable and any associated code. For example, let's say *ctrl points to an EMIF address:

    int *ctrl;
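A hedged sketch of the scenario: the EMIF register address (0x68000000) and the ready-bit polling loop below are placeholders for illustration, not actual lab code.

    /* Without 'volatile', the optimizer may delete or hoist the read of
       *ctrl, since nothing in this scope ever writes it. Declaring the
       pointer volatile forces a fresh read on every loop pass. */
    volatile unsigned int *ctrl = (volatile unsigned int *)0x68000000;

    void waitForReady(void)
    {
        while ((*ctrl & 0x1) == 0)
            ;   /* spin until the hardware sets the ready bit */
    }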
-mi Details

-mi0
- The compiler's code is not interruptible.
- The user must guarantee no interrupts will occur.

-mi1
- The compiler uses single assignment and never produces a loop of less than 6 cycles.

-mi1000 (or any number > 1)
- Tells the compiler your system must be able to see interrupts every 1000 cycles.

When not using -mi (the compiler's default):
- The compiler will software pipeline (when using -o2 or -o3).
- Interrupts are disabled for s/w pipelined loops.

Notes:
- Be aware that the compiler is unaware of issues such as memory wait-states, etc.
- Using -mi, the compiler only counts instruction cycles.
FastRTS (C67x)

- Optimized floating-point math function library for C programmers using TMS320C67x devices.
- Includes all floating-point math routines currently in existing C6000 run-time-support libraries.

The FastRTS library features:
- C-callable
- Hand-coded, assembly-optimized
- Tested against the C model and existing run-time-support functions

FastRTS must be installed per the directions in its User's Guide (SPRU100a.PDF).

    Single Precision    Double Precision
    atanf               atan
    atan2f              atan2
    cosf                cos
    expf                exp
    exp2f               exp2
    exp10f              exp10
    logf                log
    log2f               log2
    log10f              log10
    powf                pow
    recipf              recip
    rsqrtf              rsqrt
    sinf                sin
FastRTS (C62x/C64x)
Optimized floating-point math function library for C programmers that enhances floating-point performance on C62x and C64x fixed-point devices.
System Optimizations
Custom Sections

Custom Placement of Data and Code

Problem #1: you have three arrays; two have to be linked into L1D and one can be linked to DDR2. How do you split the .far section? (The figure shows rcvPing, rcvPong and SlowBuf all landing in .far; the goal is rcvPing and rcvPong in L1D with SlowBuf in DDR2.)

Problem #2: you have two functions; one has to be linked into L1P and the other can be linked to DDR2. How do you split the .text section? (The figure shows filter and SlowCode both in .text; the goal is filter in L1P with SlowCode in DDR2.)

The answer is to create custom (sub)sections and place them from a user linker command file (userlinker.cmd) that is linked into app.out:

    SECTIONS
    {
        .far:rcvBuff:  > FAST_RAM
        .text:_filter: > FAST_RAM
    }
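The companion step on the C side is to place the arrays and the function into those custom (sub)sections. Below is a hedged sketch using the TI DATA_SECTION / CODE_SECTION pragmas; the symbol names and sizes are placeholders taken from the figures above.

    /* Hedged sketch: creating the subsections that userlinker.cmd places. */
    #pragma DATA_SECTION(rcvPing, ".far:rcvBuff")
    short rcvPing[256];

    #pragma DATA_SECTION(rcvPong, ".far:rcvBuff")
    short rcvPong[256];

    #pragma CODE_SECTION(filter, ".text:_filter")
    void filter(short *in, short *out, int n);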
Use EDMA
Using EDMA

(Figure: program code, func1, func2, func3, stored in external memory at 0x8000 is copied into internal RAM by the EDMA before it runs, instead of being executed directly from slow external memory.)
Use Cache
Using Cache Memory

(Figure: the same program, func1, func2, func3, at 0x8000 in DDR2; with cache enabled, the cache hardware automatically copies lines into cache as the CPU fetches them. A second figure shows the cache hardware sitting between the CPU and slower memories such as mDDR, with other masters such as the EMAC on the SCR.)
Chapter Quiz
Chapter Quiz
1. How do you turn ON the optimizer ?
Quiz - Answers
Chapter Quiz
1. How do you turn ON the optimizer ?
Project → Properties; use -o2 or -o3 for best performance
Lab 13 C Optimizations
In the following lab, you will gain some experience benchmarking the use of optimizations using
the C optimizer switches. While your own mileage may vary greatly, you will gain an
understanding of how the optimizer works, where the switches are located, and their possible
effects on speed and size.
Procedure

1. Import the existing project (Lab13).
2. Part A: Determine goals & CPU Min; apply compiler options.
3. Part B: Code tuning (using pragmas).
4. Part C: Optimize for space (-ms).
5. Part D: Use the DSPLIB FIR filter.

Time = 75 min

(Figure: in the lab application, the EDMA3CCComplIsr() Hwi handles the Rx & Tx transfers and posts the Rx semaphore; Clk1 ticks every 500 ms.)
Then click the Filter button above and view the Live Session tab. It should look something
like this:
Write down your actual benchmark below. The author, at the time of this writing, calculated
about 639K cycles as you can see.
The author saw about 83%. Goodness, a high-powered DSP performing a simple FIR filter
and it is almost out of steam. Whoa. Maybe we need to OPTIMIZE this thing.
What were your results? Write them down below:
Yes, we met the real-time goal because the audio sounds fine.
But hey, it's using the Debug Configuration. And if we wanted to single-step our code, we
can. It is a very nice debug-friendly environment, although the performance is abysmal. This
is to be expected.
Nope. It is off. So this is the standard Debug configuration. Ok, a nice fluffy debug environment
to make sure we're getting the right answers, but not very high performance. Let's kick it up
a notch...
Normally, the author NEVER uses the Release build configuration at all. Why? Because it
doesn't contain all of the build paths that now work perfectly in the Debug configuration. Yes,
we could simply copy each one over manually, but that is a pain. The author uses Debug first,
gets the code logically correct, then creates a new configuration (OPT), copies over the
Debug settings (paths) and then begins adding optimizations one by one.
However, in this lab, we just want to test what Release does and the author already copied
over the settings for you to the Release configuration to make it easy on you.
And then select Load Program, the following dialogue pops up:
The .out file shown above was the LAST file that was loaded. Now that we have switched
configurations (or maybe even switched projects), if we just select OK, we will get the
WRONG file loaded. Always, always, ALWAYS click the Browse project button and
specifically choose the file you want:
Once built and loaded, your audio should sound fine now; that is, if you like to hear music
with no treble. Remember, just run it for 5-10 seconds.
Ok, now we're talking: it went from 639K to 27K just by switching to the Release
configuration. So, the bottom line is TURN ON THE OPTIMIZER!!
Wow, from 83% down to about 5%. What a difference, and we aren't even CLOSE to the
best benchmark yet. This begins to show you the real difference between NO optimization
and simply using -o2.
13. Study release configuration build properties.
Find these locations in Properties:
The biggie is that -o2 is selected. But we still have -g turned on, which is fine.
Can we improve on this benchmark a little? Maybe
Click New and when the following dialogue pops up, name your new configuration Opt,
change the Copy settings from option to use Existing Configuration and make sure you
choose the Release configuration as shown:
Rebuild your code and benchmark as before. Also look at the CPU Load.
The author's benchmark was: And the CPU Load was about 3%:
5263 cycles. Is that incredible or what? Just about 4 years ago (in 2012), at this point in the
lab, the benchmark was 18K cycles. This means that the compiler team continues to work
hard on interpreting your code and finding ways to cut cycles. My hat is tipped to the TI
compiler team.
So, in just 30 minutes of work, we have reduced our benchmark from 639K cycles to just
about 5K cycles.
The down side is that there isn't much else we can do to optimize our code. We WILL do
some more optimizations in the next part, but they won't have much effect. But remember,
everyone's mileage will vary, so that is why we go through each step anyway. You will need
all the tools possible for your own application.
Just for kicks and grins, try single-stepping your code and/or adding breakpoints in the
middle of a function (like cfir). Is this more difficult with -g turned OFF and -o3 applied? Yep.
Note: With -g turned OFF, you still get symbol capability, i.e. you can enter symbol
names into the watch and memory windows. However, it is nearly impossible to single-
step C code, hence the suggestion to create test vectors at function boundaries to
check the LOGICAL part of your code when you build with the Debug Configuration.
When you turn off -g, you need to look at the answers on function boundaries to make
sure it is working properly.
16. Turn on verbose and interlist and then see what the .asm file looks like for fir.asm.
As noted in the discussion material, to see it all, you need to turn on three switches. Turn
them on now, then build, then peruse the fir.asm file. You will see some interesting
information about software pipelining for the loops in mcaspPlayBk_MA_TIRTOS.c.
Turn on:
Runtime Model Options → Verbose pipeline info (-mw)
Advanced Optimizations:
This is the information you will need in order to check to see if SPLOOP was disqualified and
why. If SPLOOP is being used, you know that the loops are small enough to fit in the buffer
and that you are getting maximum performance.
You can re-check the ASM files as you do each step in the next part
KEEP this benchmark in mind as you do the next cache lab. We will compare the results.
The author's results were:
Ok, so the benchmark is similar, if not identical. That's ok. Your mileage may vary in terms of
your own system. Also, if you were paying attention to the generated ASM files, after using
MUST_ITERATE the tools only created ONE loop instead of two, because we told them what the
min/max trip counts were. We helped the compiler become even more efficient.
18. Use restrict keyword on the results array.
You actually have a few options to tell the compiler there is NO ALIASING. The first method
is to tell the compiler that your entire project contains no aliasing (using the -mt compiler
option). However, it is best to narrow the scope and simply tell the compiler that the results
array has no aliasing (because the WRITES are destructive, we RESTRICT the output array).
Comment out the old cfir() declaration and uncomment the new one that contains the
restrict keyword as shown below:
Build, then run again. Now benchmark your code again. Did it improve?
Opt + MUST_ITERATE + restrict (-o3, no -g) cfir()? __________ cycles
Because aliasing was already figured out by the tools earlier, there was not much
improvement. The author saw 5720 cycles (a slight increase).
Open the .map file generated by the linker. Hmmm. Where is it located?
Try to find it yourself without asking anyone else. Hint: which build config did you use
when you hit build ?
20. Add -ms3 to the Opt Config.
Open the build properties and add -ms3 to the compiler options (under Optimization). We
will just put the pedal to the metal for code size optimizations and go all the way to -ms3
first. Note that we also have -o3 set (which is required for the -ms option).
In this scenario, the compiler may choose to keep the slower version of any redundant loops
(fast or slow) due to the presence of -ms.
Rebuild and run.
Did your benchmark get worse with -ms3? How much code size did you save? What
conclusions would you draw from this?
____________________________________________________________________
____________________________________________________________________
Keep in mind that you can also apply -ms3 (or most of the basic options) to a specific
function using #pragma FUNCTION_OPTIONS(); a hedged sketch follows below.
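For illustration only; the function name below is a placeholder, and the option string simply mirrors the -ms3 setting discussed in this step.

    /* Hedged sketch: per-function build options via the TI pragma. */
    #pragma FUNCTION_OPTIONS(init_peripherals, "-ms3")
    void init_peripherals(void);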
FYI the author saved about 3.3K bytes total out of the .text section and the benchmark was
about 24K.
Also remember that you can apply -ms3 on a FILE-BY-FILE basis. So, a smart way to apply
this is to use it on init routines and keep it far away from your algorithms that require the best
performance.
25. Build, load, verify and BENCHMARK the new FIR routine in DSPLib.
26. What are the best-case benchmarks?
Wow, for what we wanted in THIS system (a fast, simple FIR routine), we would have been
better off just using DSPLib. Yep. But, in the process, you've learned a great deal about
optimization techniques across the board that may or may not help your specific system.
Remember, your mileage may vary.
Conclusion
Hopefully this exercise gave you a feel for how to use some of the basic compiler/optimizer
switches for your own application. Everyone's mileage may vary, and there just might be a
magic switch that helps your code and doesn't help someone else's. That's the beauty of trial
and error.
Conclusion? TURN ON THE OPTIMIZER! Was that loud enough?
Here's what the author came up with; how did your results compare?
    Optimizations                       Benchmark (cycles)
    Debug Build Config, no opt          639K
    Release (-o2, -g)                   27K
    Opt (-o3, no -g)                    5260
    Opt + MUST_ITERATE                  5260
    Opt + MUST_ITERATE + restrict       5720 (slight increase)
    DSPLib (FIR)                        4384
Regarding -ms3, use it wisely. It is more useful to add this option to functions that are large
but not time critical, like IDLE functions, init code and maintenance-type items. You can save
some code space (important) and lose some performance (probably a don't-care). For your
time-critical functions, do not use -ms ANYTHING. This is just a suggestion; again, your
mileage may vary.
CPU Min was 4K cycles. We got close, but didn't quite reach it. The author believes that it is
possible to get closer to the 4K benchmark by using intrinsics and the DDOTP instruction.
However, the DSPLIB function did quite a nice job.
Keep in mind that these benchmarks are not exactly perfect. Why? Because we never
subtracted out the number of cycles it takes to perform a Timestamp_get32(). The author thinks that
would lower the benchmarks by ~100-150 more cycles. But what you were really keeping
track of is how the numbers compare relative to each other. (The basic measurement pattern is sketched below.)
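For reference, a hedged sketch of the benchmarking pattern used throughout these labs; Timestamp_get32() and Log_info1() are the XDC runtime calls named in the labs, while the cfir() prototype, buffer names and sizes below are placeholders.

    /* Hedged sketch: bracketing the code under test with timestamps. */
    #include <xdc/std.h>
    #include <xdc/runtime/Timestamp.h>
    #include <xdc/runtime/Log.h>

    extern void cfir(const short *x, const short *h, short *r, int nh, int nr);
    extern short in[], coeffs[], out[];

    void benchmarkFir(void)
    {
        UInt32 t0, t1;

        t0 = Timestamp_get32();
        cfir(in, coeffs, out, 64, 256);             /* code under test   */
        t1 = Timestamp_get32();

        Log_info1("cfir took %u cycles", t1 - t0);  /* view via UIA logs */
    }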
The biggest limiting factor in optimizing the cfir routine is the sliding window. The processor
is only allowed ONE non-aligned load each cycle. This would happen 75% of the time. So,
the compiler is already playing some games and optimizing extremely well given the
circumstances. It would require hand-tweaking via intrinsics and intimate knowledge of the
architecture to achieve much better.
27. Terminate the Debug session, close the project and close CCS. Power-cycle the board.
Throw something at the instructor to let him know that you're done with the
lab. Hard, sharp objects are most welcome.
Most systems will have more code and data than the internal memory can hold. Placing
everything off-chip is another option, and can be implemented easily, but most users will find the
performance degradation to be significant. The ability to enable caching to accelerate the
use of off-chip resources is therefore desirable.

For optimal performance, some systems may benefit from a mix of on-chip memory and cache.
Fine-tuning code for use with the cache can also improve performance and assure reliability in
complex systems. Each of these constructs will be considered in this chapter.
Objectives
Module Topics
Cache & Internal Memory ......................................................................................................... 14-1
Module Topics ......................................................................................................................... 14-2
Why Cache? ............................................................................................................................ 14-3
Cache Basics Terminology .................................................................................................. 14-4
Cache Example ....................................................................................................................... 14-7
L1P Program Cache........................................................................................................... 14-10
L1D Data Cache................................................................................................................. 14-14
L2 RAM or Cache ?............................................................................................................ 14-16
Cache Coherency (or Incoherency?) .................................................................................... 14-18
Coherency Example .......................................................................................................... 14-18
Cache Functions Summary ............................................................................................ 14-22
Coherency Summary ..................................................................................................... 14-23
Cache Alignment ............................................................................................................... 14-23
MAR Bits Turn On/Off Cacheability ................................................................................... 14-24
Additional Topics ................................................................................................................... 14-26
Chapter Quiz ......................................................................................................................... 14-29
Quiz Answers ................................................................................................................. 14-30
Lab 14 Using Cache........................................................................................................... 14-31
Lab 14 Using Cache Procedure ...................................................................................... 14-32
A. Run System From Internal RAM .................................................................................. 14-32
B. Run System From External DDR2 (no cache)............................................................. 14-34
C. Run System From DDR2 (cache ON) .......................................................................... 14-37
Notes ..................................................................................................................................... 14-40
Why Cache?

Why Cache?

Parking Dilemma
Parking choices at the sports arena:
- 0 minute walk @ $100 for close-in parking, or
- 10 minute walk @ $5 for distant parking
- or use a valet (the best of both)

Why Cache?

    Cache Memory                  Bulk Memory
    Fast                          Slower
    Small                         Larger
    Works like big, fast memory   Cheaper

Memory Choices:
- Small, fast memory, or
- Large, slow memory
- or use cache: it combines the advantages of both, and like the valet, data movement is automatic
Cache Basics - Terminology

(Figure: the cache hardware sits between the CPU and the EMIF; external addresses such as 0x8000-0x8010 map onto the cache's lines, indexed 0 through 0xF.)

- Conceptually, a cache divides the entire memory into blocks equal to its size.
- A cache is divided into smaller storage locations called lines.
- The term Index (or Line Number) is used to specify a specific cache line.
Cache Basics - Terminology

Cache Tags
(Figure: each cache line holds a Tag, the upper address bits of the external-memory block it currently contains, e.g. tags 800 and 801 at indexes 0 and 1 for external memory starting at 0x8000.)

Valid Bits
(Figure: each line also carries a Valid bit; a line whose valid bit is 0 contains no usable data, regardless of its tag.)

Direct-Mapped Cache
(Figure: in a direct-mapped cache, each external address maps to exactly one index; e.g. addresses 0x8000-0x8010 map onto indexes 0-0xF.)
Cache Example

Cache Example

Direct-Mapped Cache Example
(Figure: a 16-line direct-mapped cache with valid bits and tags, filled from external memory starting at 0x8000.)

Arbitrary Direct-Mapped Cache Example

The following example uses:
- a 16-line cache,
- 16-bit addresses, and
- one 32-bit instruction stored per line.

C6000 caches have different cache and line sizes than this example; it is only intended as a simple example to reinforce cache concepts.
Cache Example

(Figure: a short loop at addresses 0026h (L2: ADD), 0027h (SUB cnt) and 0028h ([!cnt] B L1), and how each 16-bit address splits into a Tag (bits 15-4) and an Index (bits 3-0).)
Cache Example

Types of Misses

Compulsory
- Miss when first accessing a new address.

Conflict
- A line is evicted upon access of an address whose index is already cached.
- Solutions: change the memory layout, or allow more lines for each index.

Capacity (we didn't see this in our example)
- A line is evicted before it can be re-used because the capacity of the cache is exhausted.
- Solution: increase the cache size.
L1P Program Cache

(Figure: the L1P cache sits between the CPU and L2, which in turn connects to the EMIF.)

L1P Size

    Device             Scheme           Size        Linesize
    C62x/C67x          Direct Mapped    4K bytes    64 bytes (16 instr)
    C64x               Direct Mapped    16K bytes   32 bytes (8 instr)
    C64x+/C674x/C66x   Direct Mapped    32K bytes   32 bytes (8 instr)

All L1P memories provide zero wait-state access.

(Figure: an L1P example in which external program addresses around 0x8010 map to line index 0xF.)
L1P Program Cache

    Device             Scheme           Size        Linesize              New Features
    C62x/C67x          Direct Mapped    4K bytes    64 bytes (16 instr)   N/A
    C64x               Direct Mapped    16K bytes   32 bytes (8 instr)    N/A
    C64x+/C674x/C66x   Direct Mapped    32K bytes   32 bytes (8 instr)    Cache/RAM configurable, Cache Freeze, Memory Protection
L1D Data Cache

One instruction may access multiple data elements:

    for (i = 0; i < 4; i++) {
        sum += x[i] * y[i];
    }

What would happen if x and y ended up at the following DDR2 addresses (with a 32K direct-mapped cache)?

    x = 0x0000
    y = 0x8000

They would end up overwriting each other in the cache, which is called thrashing. Increasing the associativity of the cache will reduce this problem. How do you increase associativity?

Increased Associativity
- Split a direct-mapped cache in half; each half is called a cache "way" (e.g. Way 0 and Way 1 of 16K each).
- Multiple ways make data caches more efficient.
(Figure: a two-way cache mapping DDR2 blocks at 0x00000, 0x08000, 0x10000 and 0x18000.)

What is a set?
L1D Data Cache

What is a Set?

The lines from each way that map to the same index form a set.
(Figure: index 0 of Way 0 plus index 0 of Way 1 form Set 0; index 1 of each way forms Set 1; DDR2 addresses 0x8000, 0x8008, 0x8010 and 0x8018 fall into these sets.)

L1D Summary

    Device             Scheme             Size                       Linesize    New Features
    C62x/C67x          2-Way Set Assoc.   4K bytes                   32 bytes    N/A
    C64x               2-Way Set Assoc.   16K bytes                  64 bytes    N/A
    C64x+/C674x/C66x   2-Way Set Assoc.   C6455: 32K; DM64xx: 80K    64 bytes    Cache/RAM configurable, Cache Freeze, Memory Protection
L2 RAM or Cache?

L2 RAM or Cache?

Internal Memory (L2)
(Figure: the CPU's L1 Program (L1P) and L1 Data (L1D) memories sit in front of a unified L2 that can hold program and data.)

    Device    L2 Size        L2 Features
    C671x     64KB - 128K    Unified (code or data); config as cache or RAM; none, or 1- to 4-way cache
    C64x      64KB - 1MB     Unified (code or data); config as cache or RAM; cache is always 4-way
    C64x+     64KB - 2MB     Unified (code or data); config as cache or RAM; cache is always 4-way; Cache Freeze; Memory Protection

- L2 linesize for all devices is 128 bytes.
- L2 caches are Read/Write Allocate memories.

Performance
- L2 to L1P: 1-8 cycles
- L2 to L1D: L2 SRAM hit 12.5 cycles, L2 cache hit 14.5 cycles; pipelined accesses: 4 cycles
- When required, minimize latency by using L1D RAM.

(Figure: L2 can be partitioned between RAM and cache, e.g. 0, 32K, 64K, 128K or 256K of cache.)

The cache sizes can be set with the platform/config tool, or via the BIOS .cfg file (which is what we will do in the lab).
Cache Coherency (or Incoherency?)

(Figure: the CPU works on a cached copy of XmtBuf while the EDMA moves the buffer between L2 and external memory; the cached copy and the external copy can get out of sync, i.e. become incoherent.)
Cache Coherency (or Incoherency?)

When the CPU is finished with the data (and has written it to XmtBuf in L2), it can be sent to external memory with a cache writeback. A writeback is a copy operation from cache to memory, writing back the modified (i.e. dirty) memory locations; all writebacks operate on full cache lines.

Use the BIOS Cache APIs to force a writeback:

    BIOS: Cache_wb(XmtBuf, BUFFSIZE, L2, CACHE_NOWAIT);

What happens with the "next" RCV buffer?
Cache Coherency (or Incoherency?)

To get the new data, you must first invalidate the old data before trying to read the new data (this clears the cache lines' valid bits). Again, cache operations (writeback, invalidate) operate on whole cache lines.

BIOS provides an invalidate call as well:

    BIOS: Cache_inv(RcvBuf, BUFFSIZE, L2, CACHE_WAIT);
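Below is a hedged sketch of where these two calls land in a ping/pong style processing loop. The buffer and function names are placeholders, and the argument style follows the SYS/BIOS ti/sysbios/hal/Cache API (Cache_Type_L2, TRUE/FALSE wait flag), which differs slightly from the shorthand on the slides.

    /* Hedged sketch: coherency calls around one EDMA'd block of audio. */
    #include <ti/sysbios/hal/Cache.h>

    #define BUFFSIZE 256                               /* placeholder size */
    extern short RcvBuf[BUFFSIZE], XmtBuf[BUFFSIZE];
    extern void fir(short *in, short *out, int n);     /* placeholder work */

    void processBlock(void)
    {
        /* EDMA just filled RcvBuf in external memory: invalidate so the
           CPU reads fresh data instead of stale cached lines. */
        Cache_inv(RcvBuf, BUFFSIZE * sizeof(short), Cache_Type_L2, TRUE);

        fir(RcvBuf, XmtBuf, BUFFSIZE);

        /* Push the CPU's results out of cache before EDMA transmits them. */
        Cache_wb(XmtBuf, BUFFSIZE * sizeof(short), Cache_Type_L2, FALSE);
    }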
Cache Coherency (or Incoherency?)

Coherency Summary

Internal (L1/L2) cache coherency is maintained:
- Coherence between L1D and L2 is maintained by the cache controller.
- No Cache_ function operations are needed for data stored in L1D or L2 RAM.
- L2 coherence operations implicitly operate upon L1 as well.

DEBUG NOTE: An easy way to identify cache coherency problems is to allocate your buffers in L2. Problem goes away? It's probably a cache coherency issue.

What about "cache alignment"?
Cache Alignment

Cache Alignment

(Figure: a buffer that is not aligned to cache-line boundaries shares its first and last cache lines with neighboring "false addresses"; writeback or invalidate operations on the buffer then also affect those neighbors.)
MAR Bits - Turn On/Off Cacheability

(Figure: each MAR (Memory Attribute Register) bit controls whether a 16MB region of external memory, such as the region holding XmtBuf, is cacheable.)
Additional Topics

Additional Topics

L1D: DATA_MEM_BANK Example
- Only one L1D access per bank per cycle.
- Use the DATA_MEM_BANK pragma to begin paired arrays in different banks (a hedged sketch follows below).
- Note: sequential data are not placed down a bank; instead they run along a horizontal line across the banks, then continue onto the next horizontal line.
- Only even banks (0, 2, 4, 6) can be specified.

(Figure: byte addresses 0x00-0x3F laid out horizontally across the L1D memory banks.)
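A hedged sketch of the pragma described above; the array names and sizes are placeholders.

    /* Placing two arrays that are read together in different L1D banks
       lets both loads complete in the same cycle. */
    #pragma DATA_MEM_BANK(a, 0)
    short a[256];

    #pragma DATA_MEM_BANK(x, 4)
    short x[256];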
Cache Optimization

Optimize for Level 1
- Multiple ways and wider lines maximize efficiency; TI did this for you!
- Main goal: maximize line reuse before eviction.
- Algorithms can be optimized for cache.

"Touch" loops can help with compulsory misses:
- Run once through the loop in init code, touching the buffers to pre-load the data cache.
- Up to 4 write misses can happen sequentially, but the next read or write will stall (the bus has a 4-deep buffer between the CPU/L1 and beyond).
- Be smart about data output by one function and then read by another (touch it first). When data is output by the first function, where does it go? If you touch the output buffer first, then where will the output data go?
Additional Topics

- Read-allocate cache: only allocates space in the cache during a read miss. The C64x+ L1 cache is read-allocate only.
- Write-allocate cache: only allocates space in the cache during a write miss.
- Read-write-allocate cache: allocates space in the cache for a read miss or a write miss. The C64x+ L2 cache is read-write allocate.
Chapter Quiz

Chapter Quiz
1. How do you turn ON the cache?
3. All cache operations affect an aligned cache line. How big is a line?
4. Which bit(s) turn on/off cacheability, and where do you set these?
5. How do you fix coherency when two bus masters access external memory?
Chapter Quiz
Quiz Answers
Chapter Quiz
1. How do you turn ON the cache?
Set size > 0 in the platform package (or via Cache_setSize() at runtime; see the sketch below)
3. All cache operations affect an aligned cache line. How big is a line?
L1P 32 bytes (256 bits), L1D 64 bytes, L2 128 bytes
4. Which bit(s) turn on/off cacheability and where do you set these?
MAR (Memory Attribute Register) bits; each affects a 16MB ext'l data space; set in the .cfg
5. How do you fix coherency when two bus masters access ext'l mem?
Invalidate before a read, writeback after a write (or use L2 mem)
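If you prefer to size the caches at runtime rather than in the platform/.cfg file, the SYS/BIOS C64x+ Cache module provides Cache_setSize(). A minimal sketch, assuming the 32K L1D/L1P plus 0K L2-cache split used in this lab; the struct and enum names come from ti.sysbios.family.c64p.Cache.

    #include <ti/sysbios/family/c64p/Cache.h>

    void configureCaches(void)
    {
        Cache_Size size;

        size.l1pSize = Cache_L1Size_32K;   /* all of L1P as cache           */
        size.l1dSize = Cache_L1Size_32K;   /* all of L1D as cache           */
        size.l2Size  = Cache_L2Size_0K;    /* keep all of L2 as mapped SRAM */

        Cache_setSize(&size);
    }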
Lab 14 Using Cache
This will provide a decent understanding of what you can expect when using cache in your own
application.
Lab 14 Using Cache Procedure
Note: For all benchmarks throughout this lab, use the Opt build configuration when you build.
Do NOT use the Debug or Release config.
See the names of the regions? IRAM and DDR are the ones we will use in the lab. IRAM points
to the L2 memory region.
Open the file RxTxBuf_MA_TIRTOS.cmd.
This is where the user-defined section names are allocated to the memory areas. Notice that
the buffers are allocated in L2. This is exactly where we want them for this part of the lab.
Later, we will change the region to DDR to move the buffers off chip in order to test the
cache performance.
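For reference, a user linker command file of this kind does its allocation with a SECTIONS directive along these lines. The section names below are illustrative (the actual names in RxTxBuf_MA_TIRTOS.cmd may differ); the point is simply that each user-defined section is directed at a memory region defined by the platform.

    /* RxTxBuf_MA_TIRTOS.cmd - style sketch (section names illustrative) */
    SECTIONS
    {
        .rxBufs  > IRAM    /* receive buffers  - IRAM maps to on-chip L2 SRAM */
        .txBufs  > IRAM    /* transmit buffers - change IRAM to DDR later in  */
                           /* the lab to move the buffers off-chip            */
    }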
The benchmark from the Log_info should be around 5260 cycles. If not, clean the project,
delete the Opt folder and rebuild/load/run/benchmark.
We'll compare this "buffers in L2, cache ON" benchmark to the "all external" and "all external
with cache ON" numbers as we proceed through the lab. You just might be surprised...
5. Check the size of the caches that TI-RTOS set for you using ROV.
As your code is halted at the moment, open up ROV and locate the cache sizes via Cache:
So, the sizes are exactly what we predicted. These were set by the platform file evmc6748.
6. Place the buffers in external DDR2 memory and turn OFF the cache.
So you have a choice: you can either write code (which is commented out at the bottom of
the hardwareInitTaskFxn() routine) OR you can use the .cfg file. The author recommends
you use the .cfg file for two reasons: (1) you don't have to write code, you simply work with a
GUI; (2) the tools will take the sizes into consideration as they create the .cmd file vs. you
having to do this yourself.
So how do you use the .cfg file to specify sizes? Oh, and where are the MAR bits set? Or are
they set? Thankfully, because the tools know (via the platform file) that the evmC6748 is
being used, the MAR bits are set automatically. But in your own application, you'll need to
know how to modify all of the MAR bits to match your application.
Let's go see the magical place where the cache sizes and MAR bits are set in the .cfg file...
Open the .cfg file so you can see the Outline view and Available Products.
There are multiple ways to view the same thing, so the author has chosen the most direct
path to the information.
In the Available Products window, drag Cache over to the Outline view.
This is using the target specific (C6748) cache settings and placing them into the Outline
view so you can edit them.
Once you have this module in your Outline view, click on it to configure it. Notice the cache
size settings below:
These settings MATCH the platform file defaults: L1D/P are maxed at 32K and L2 cache is off.
We need to turn L1D/P OFF, so change the top two settings to L1D/P = 0K:
If we want all cache turned OFF, these are the proper settings, so just leave them this way.
We will come back to this later to turn the caches back ON. Note that the L1P setting will have an
effect on performance because program memory (.text) is allocated in DDR according to the
linker .cmd file. We don't care about this for the moment. We just want to break the whole
thing and then turn on all the caches in the next section. The key performance problem will
be the buffers in DDR with no cache on.
Let's look at the MAR bits:
The bits in question are MAR 192-223, if you remember from the discussion material. MAR
bits 192-199 are set to 1, which covers the 128MB of external DDR memory starting at
address C000_0000h. Great. We don't have to touch those, but now you know where they
are located so you can change them for your own application.
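In script form (rather than the GUI), those same settings correspond to a few lines in the .cfg file. This is a minimal sketch, assuming the ti.sysbios.family.c64p Cache module and its initSize/MAR192_223 parameter names; the lab's generated .cfg may express them slightly differently.

    /* .cfg sketch: cache sizes and MAR bits for the C6748 EVM */
    var Cache = xdc.useModule('ti.sysbios.family.c64p.Cache');

    /* Cache sizes (this step of the lab: L1D/P off, L2 cache off) */
    Cache.initSize.l1pSize = Cache.L1Size_0K;
    Cache.initSize.l1dSize = Cache.L1Size_0K;
    Cache.initSize.l2Size  = Cache.L2Size_0K;

    /* MAR192-199 = 1: make the 128MB of DDR at 0xC000_0000 cacheable
       (one register covers MAR192-223; bit 0 = MAR192, bit 7 = MAR199) */
    Cache.MAR192_223 = 0x000000FF;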
Save your .cfg file. You will get a warning that says cache settings override platform
settings. That's great. Just ignore it.
Now that the cache is off, we need to allocate the buffers into the DDR memory area using
our user linker command file...
Open RxTxBuf_MA_TIRTOS.cmd. Change IRAM to DDR:
If you look at the CPU load, it shows nothing. Why? The CPU is loaded more than 100%, so
the Idle thread never runs to report the CPU load. Ok, this sounds reasonable.
The author saw the following cycle count for cfir():
Almost like the old Debug build configuration cycles. Our application is NOT meeting real
time, but that is to be expected. If you have important stuff in DDR2 memory and you don't
turn the cache on, you're in trouble.
Ok, about the same. And this is to be expected. The cfir() routine is actually reading/writing
internal SRAM because the cache is on. The read buffers are cached in L1 (ONCE) and the
transmit buffers are written to L2 (ONCE) because the invalidate/writeback commands have
not been added. So this is really not a fair benchmark. In real systems, you need to add the
invalidate/writeback commands so they force the CPU to read from DDR vs. internal SRAM.
That takes care of the READ; now let's take care of the WRITE...
Add the following line of code as indicated (around line 652 just AFTER the interleave of
the Tx data):
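The line in question is a writeback of the transmit buffer so the freshly interleaved Tx data actually lands in DDR before the EDMA reads it out. A minimal sketch of what such a call looks like; the buffer name, size symbol and exact arguments are assumptions based on earlier parts of this chapter, so use the exact line given in the lab source.

    #include <ti/sysbios/hal/Cache.h>

    /* Write the freshly interleaved Tx data back from cache to DDR
       so the EDMA (the other bus master) reads current data. */
    Cache_wb(XmtBuf, BUFFSIZE, Cache_Type_ALL, TRUE);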
The author's audio sounded fine now and the benchmarks were:
Not consistent, but this has to do with cache snooping (L1 to L2), pipelined reads and re-use
of data. The average also went up, as expected, due to having to access external DDR memory
for the first read, followed by re-use in the cache. But the cache performance is extraordinary
given the fact that these buffers are in external memory. So if you need to use DDR, do so,
but TURN THE CACHE ON.
Again, your mileage may vary, but now you know the ins, outs and dollar signs associated
with cache.
You're finished with this lab. Congrats. This is the last lab in the workshop.
Notes
Using EDMA3
Introduction
In this chapter, you will learn the basics of the EDMA3 peripheral. This transfer engine in the
C64x+ architecture can perform a wide variety of tasks within your system, from memory-to-
memory transfers to event synchronization with a peripheral and auto-sorting data into separate
channels or buffers in memory. No programming is covered. For programming concepts, see
ACPY3/DMAN3, the LLD (Low Level Driver, covered in the Appendix) or CSL (Chip Support
Library). Heck, you could even program it in assembly, but don't call ME for help.
Objectives
Module Topics
Using EDMA3 ............................................................................................................................. 15-1
Module Topics ......................................................................................................................... 15-2
Overview ................................................................................................................................. 15-3
What is a DMA ? ............................................................................................................... 15-3
Multiple DMAs ................................................................................................................... 15-4
EDMA3 in C64x+ Device .................................................................................................... 15-5
Terminology ............................................................................................................................ 15-6
Overview ............................................................................................................................. 15-6
Element, Frame, Block: ACNT, BCNT, CCNT .................................................. 15-7
Simple Example .................................................................................................................. 15-7
Channels and PARAM Sets ................................................................................................ 15-8
Examples ................................................................................................................................ 15-9
Synchronization ..................................................................................................................... 15-12
Indexing ................................................................................................................................. 15-13
Events, Transfers, Actions ................................................................................ 15-15
Overview ........................................................................................................................... 15-15
Triggers ............................................................................................................................. 15-16
Actions: Transfer Complete Code ................................................................... 15-16
EDMA Interrupt Generation .................................................................................................. 15-17
Linking ................................................................................................................................... 15-18
Chaining ................................................................................................................................ 15-19
Channel Sorting .................................................................................................................... 15-21
Architecture & Optimization .................................................................................................. 15-22
Programming EDMA3 Using Low Level Driver (LLD) ........................................................ 15-23
Chapter Quiz ......................................................................................................................... 15-25
Quiz Answers ................................................................................................................. 15-26
Additional Information ........................................................................................................... 15-27
Notes ..................................................................................................................................... 15-28
Overview
What is a DMA?
When we say DMA, what do we mean? Well, there are MANY forms of DMA (Direct Memory Access) on this device:
EDMA3 (Enhanced DMA): handles 64 DMA CHs and 4 QDMA CHs
DMA: 64 channels that can be triggered manually or by events/chaining
QDMA: 8 channels of Quick DMA triggered by writing to a trigger word
EDMA3
[Diagram: DMA channel triggers (EVTx events, chaining, manual) and QDMA trigger-word writes feed event queues Q0-Q3, which are serviced by transfer controllers TC0-TC3 connected to the Switched Central Resource (SCR)]
Multiple DMAs
Multiple DMAs: EDMA3 and QDMA
[Diagram: some devices (e.g. those with a VPSS) also contain a separate system DMA and master-peripheral DMAs alongside the EDMA3 and the C64x+ DSP's L1P/L1D/L2]
DMA (Enhanced DMA, version 3), sync:
  DMA to/from peripherals
  Can be sync'd to peripheral events
  Handles up to 64 events
QDMA (Quick DMA), async:
  DMA between memory
  Must be started by the CPU
  4-16 channels available
SCR & EDMA3
SCR = Switched Central Resource
[Diagram: the EDMA3 CC and its TCs, the C64x+ MegaModule (L1P/L1D/L2, IDMA), master peripherals (EMAC, HPI, PCI), the external memory interfaces (DDR2/3, EMIF) and the other peripherals all connect as masters (M) and slaves (S) on the DATA and CFG SCRs]
EDMA3 is a master on the DATA SCR: it can initiate data transfers
EDMA3's configuration registers are accessed via the CFG SCR (by the CPU)
Each TC has its own connection (and priority) to the DATA SCR. Refer to the connection matrix to determine valid connections.
Terminology
Overview
[Diagram: a transfer needs a Source, a Destination and a Length (BCNT x ACNT), captured in a Transfer Configuration]
The Transfer Configuration (PARAM set) holds these fields:

  31                    16   15                     0
  Options
  Source
  B Count (# Elements)       A Count (Element Size)
  Destination
  Dst Index                  Src Index
  BCNT Reload                Link Addr
  Dst Index                  Src Index
  Rsvd                       C Count (# Frames)

Let's look at a simple example...
Simple Example
Example: How do you VIEW the transfer?
Let's start with a simple example... or is it simple?
We need to transfer 12 bytes from here to there.
Examples
EDMA Example: Simple (Horizontal Line)
[Diagram: a 6-byte-wide, 8-bit source array; loc_8 points at the element whose value is 8. Goal: transfer 4 elements from loc_8 to myDest]
Configuration 1 (one array of four bytes):
  Source = &loc_8, ACNT = 4, BCNT = 1, Destination = &myDest
Configuration 2 (the same four bytes as four one-byte arrays):
  Source = &loc_8, ACNT = 1, BCNT = 4, Destination = &myDest
  SRCBIDX = 1, DSTBIDX = 1, SRCCIDX = 0, DSTCIDX = 0, CCNT = 1
  Why is this a less efficient version?
Configuration 3:
  Source = &loc_8, ACNT = 1, BCNT = 4, Destination = &myDest
  SRCBIDX = 6, DSTBIDX = 2, SRCCIDX = 0, DSTCIDX = 0, CCNT = 1
Configuration 4 (16-bit elements):
  Source = &loc_8, ACNT = 2, BCNT = 4, Destination = &myDest
  SRCBIDX = 2 (2 bytes going from block 8 to 9), DSTBIDX = 2
Configuration 5:
  Source = &loc_8, ACNT = 8, BCNT = 5, Destination = &myDest
  SRCBIDX = 12, i.e. 6*2 (from block 8 to 14); DSTBIDX = 8, i.e. 4*2
  SRCCIDX = 0, DSTCIDX = 0, CCNT = 1
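Although this chapter stays away from programming, it can help to see how one of these configurations maps onto the eight 32-bit words of a PARAM set. The struct and function below are an illustrative sketch (field names follow the EDMA3 register names; this is not the CSL or LLD definition), filled in with the values from Configuration 1 above.

    #include <stdint.h>

    /* Illustrative layout of one EDMA3 PARAM set (8 x 32-bit words). */
    typedef struct {
        uint32_t opt;                        /* options: TCC, TCINTEN, TCCHEN, sync... */
        uint32_t src;                        /* source address                         */
        uint16_t acnt;    uint16_t bcnt;     /* bytes per array, arrays per frame      */
        uint32_t dst;                        /* destination address                    */
        int16_t  srcBIdx; int16_t  dstBIdx;  /* byte offset between arrays             */
        uint16_t link;    uint16_t bcntRld;  /* link address, BCNT reload              */
        int16_t  srcCIdx; int16_t  dstCIdx;  /* byte offset between frames             */
        uint16_t ccnt;    uint16_t rsvd;     /* frames per block                       */
    } EdmaParam;

    /* Configuration 1 from above: one array of 4 bytes, loc_8 -> myDest */
    void setupCfg1(EdmaParam *p, void *src, void *dst)
    {
        p->opt     = 0;              /* TCC, interrupt/chain enables omitted here */
        p->src     = (uint32_t)src;
        p->dst     = (uint32_t)dst;
        p->acnt    = 4;              /* 4 bytes per array  */
        p->bcnt    = 1;              /* 1 array per frame  */
        p->ccnt    = 1;              /* 1 frame            */
        p->srcBIdx = 0;  p->dstBIdx = 0;
        p->srcCIdx = 0;  p->dstCIdx = 0;
        p->link    = 0xFFFF;         /* NULL link: no auto-reload */
        p->bcntRld = 0;
    }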
Synchronization
A Synchronization
An event (like the McBSP receive register full) triggers the transfer of exactly 1 array of ACNT bytes (2 bytes in this example).
[Diagram: a block of CCNT frames, each holding BCNT arrays of ACNT bytes; with A-sync, each event moves one array]
AB Synchronization
An event (EVTx) triggers a two-dimensional transfer of BCNT arrays of ACNT bytes (A*B).
[Diagram: with AB-sync, each event moves one complete frame (BCNT arrays of ACNT bytes); CCNT events empty the block]
Indexing
Indexing: BIDX, CIDX
EDMA3 has two types of indexing: BIDX and CIDX
Each index can be set separately for SRC and DST (next slide)
BIDX = index in bytes between ACNT arrays (same for A-sync and AB-sync)
CIDX = index in bytes between BCNT frames (different for A-sync vs. AB-sync)
BIDX/CIDX: signed 16-bit, -32768 to +32767
[Diagram: in A-sync mode each event (EVTx) moves one array, with BIDX applied between arrays and CIDX(A) applied after BCNT events; in AB-sync mode each event moves a whole frame, with CIDX(AB) applied between frames]
Indexed Transfers
EDMA3 has 4 indexes allowing higher flexibility for
complex transfers:
SRCBIDX = # bytes between arrays (Ex: SRCBIDX = 2)
SRCCIDX = # bytes between frames (Ex: SRCCIDXA = 2, SRCCIDXAB = 4)
Note: CIDX depends on the synchronization used: A or AB
DSTBIDX = # bytes between arrays (Ex: DSTBIDX = 3)
DSTCIDX = # bytes between frames (Ex: DSTCIDXA = 5, DSTCIDXAB = 8)
[Figure: indexed transfer example showing SRCBIDX/DSTBIDX between arrays and SRCCIDX/DSTCIDX between frames for an 8-bit, contiguous source and destination]
A transfer configuration holds three things:
  Counts: the A, B, and C counts
  Addresses: the source & destination addresses
  Index: how far to increment the src/dst after each transfer
Events, Transfers, Actions
Overview
[Diagram: an Event (E) triggers a Transfer (T) described by the transfer configuration (Options, Source, Destination, A/B/C counts, indexes, link); when the transfer is Done, an Action (interrupt and/or chain) can follow]
Triggers
How to TRIGGER a Transfer
There are 3 ways to trigger an EDMA transfer: manually (by writing to the ESR), by a peripheral event (ER/EER), or by chaining from another channel (CER).
[Diagram: the CC submits transfer requests (TRs) to the TC, which acknowledges each as it completes]
EDMA Interrupt Generation
64 channels and ONE interrupt? How do you determine WHICH channel completed?
[Diagram: the EDMA3CC_INT completion interrupt (event #24) is routed to a CPU interrupt, e.g. HWI_INT5; an ISR function table selects the per-channel handler]
How does the ISR Fxn Table (in #4 above) get loaded with the proper handler Fxn names?
Use the EDMA3 LLD to program the proper callback fxn for this HWI.
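Conceptually, the dispatcher works by reading the CC's Interrupt Pending Register (IPR): each completed transfer sets the bit selected by its TCC. The fragment below is only a sketch of that idea, with assumed register pointers and handler table, and it only covers TCCs 0-31 (IPRH covers the rest); in practice the EDMA3 LLD (or CSL) does this for you.

    #include <stdint.h>

    /* Sketch only: poll IPR, call the handler for each completed TCC, then
       clear the pending bit via ICR. Register pointers/names are assumed. */
    extern volatile uint32_t *EDMA3CC_IPR;   /* Interrupt Pending Register */
    extern volatile uint32_t *EDMA3CC_ICR;   /* Interrupt Clear Register   */
    extern void (*isrFxnTable[32])(void);    /* per-TCC callback functions */

    void edmaCompletionIsr(void)
    {
        uint32_t pending = *EDMA3CC_IPR;
        int tcc;

        for (tcc = 0; tcc < 32; tcc++) {
            if (pending & (1u << tcc)) {
                *EDMA3CC_ICR = (1u << tcc);  /* clear the pending bit       */
                isrFxnTable[tcc]();          /* run that channel's callback */
            }
        }
    }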
Linking
Linking: Action Overview
[Diagram: when a transfer completes (Done), the EDMA can auto-initialize (re-load) the channel's transfer configuration from the PARAM set named in its Link Addr field]
Need: auto-reload the channel with a new config
  Ex1: do the same transfer again
  Ex2: ping/pong system (covered later)
Solution: use linking to reload the Ch config
Concept:
  Linking two or more channels together allows the EDMA to auto-reload a new configuration when the current transfer is complete.
  Linking still requires a trigger to start the transfer (manual, chain, event).
  You can link as many PSETs as you like; it is only limited by the # of PSETs on a device.
How does linking work?
  The user must specify the LINK field in the config to link to another PSET. When the current xfr (Config 0) is complete, the EDMA auto-reloads the new config (Config 1) from the linked PSET.
  Note: the reload does NOT start the xfr!
Chaining
Reminder: Triggering Transfers
There are 3 ways to trigger an EDMA transfer: manually, by event, or by chaining.
Need: when one transfer completes, trigger another transfer to run
  Ex: ChX completes, kicks off ChY (i.e. chained)
Solution: use chaining to kick off the next xfr
Concept:
  Chaining actually refers to both an action and an event: the completed action from the 1st channel is the event for the next channel.
  You can chain as many Chans as you like; it is only limited by the # of Chs on a device.
  Chaining does NOT reload the current Chan config; that can only be accomplished by linking. It simply triggers another channel to run.
How does chaining work?
  Set the TCC field to match the next channel # and turn ON chaining (TCCHEN). When the current xfr (X) is complete, it triggers the next Ch (Y) to run.
[Diagram: Channel #5 is triggered via the ESR; its completion code (TCC = 7) sets a bit in the CER (Chain Event Register), triggering Channel #7; Channel #7's completion (TCC = 6) sets IPR bit 6 and raises EDMA3CC_GINT to the CPU]
Channel #5:
  Triggered manually by ESR (Event Set Register)
  Chains to Ch #7 (Ch #5's TCC = 7, with OPT.TCCHEN enabled)
Channel #7:
  Triggered by chaining from Ch #5
  Interrupts the CPU when finished (sets TCC = 6)
  The ISR checks IPR (TCC = 6) to determine which channel generated the interrupt
Notes:
  Any Ch can chain to any other Ch by enabling OPT.TCCHEN and specifying the next TCC
  Any Ch can interrupt the CPU by enabling its OPT.TCINTEN option (and specifying the TCC)
  TCCHEN = the final TCC will chain to the next channel; TCINTEN = the final TCC will interrupt the CPU
  Which IPR bit gets set depends on the previous Ch's TCC setting
Channel Sorting
Channel Sort Transfer Config Overview
Need: de-interleave (sort) two (or more) channels
  Ex: stereo audio (LRLR) into L & R buffers
Solution: use DMA indexing to perform the sorting automatically
Concept:
  In many applications, data comes from the peripheral as interleaved data (LRLR, etc.)
  Most algos that run on the data require these channels to be de-interleaved
  Indexing, built into the EDMA3, can auto-sort these channels with no time penalty
How does channel sorting work?
  The user can specify the BIDX and CIDX values to accomplish the auto-sorting
  [Diagram: interleaved peripheral samples (L0 R0 L1 R1 L2 R2 ...) are sorted by the EDMA into a contiguous L buffer followed by a contiguous R buffer in memory]
EDMA3 consists of two parts: the Channel Controller (CC) and the Transfer Controllers (TC).
An event (from a peripheral via ER/EER, manually via ESR, or via chaining via CER) sends the transfer
to 1 of 4 queues (Q0 is mapped to TC0, Q1 to TC1, etc. Note: the McBSP can use TC1 only).
The transfer is mapped to 1 of 256 PSETs and submitted to the TC (1 TR (transfer request) per ACNT
bytes or A*B CNT bytes, based on the sync mode). Note: the Dst FIFO allows buffering of writes while more reads occur.
The TC performs the transfer (read/write) and then sends back a transfer completion code (TCC).
The EDMA can then interrupt the CPU and/or trigger another transfer (chaining).
Manage Priorities
Can adjust TC0-3 priority to the SCR (MSTPRI register)
In general, place small transfers at higher priorities
References: Programming EDMA3 using the LLD (wiki) + examples (see next slide)
  TC Optimization Rules (SPRUE23)
  EDMA3 User Guide (SPRU966)
  EDMA3 Controller (SPRU234)
  EDMA3 Migration Guide (SPRAAB9)
  EDMA Performance (SPRAAG8)
Chapter Quiz
1. Name the 4 ways to trigger a transfer?
3. Fill out the following values for this channel sorting example (5 min):
16-bit stereo audio (interleaved). Use the EDMA to auto channel sort to memory.
[Diagram: the peripheral delivers L0 R0 L1 R1 L2 R2 L3 R3; memory holds an L buffer (L0-L3) followed by an R buffer (R0-R3), each BUFSIZE deep]
  ACNT: _____
  BCNT: _____
  CCNT: _____
  BIDX: _____
  CIDX: _____
  Could you calculate these?
Quiz Answers
1. Name the 4 ways to trigger a transfer?
Manual start, event sync, chaining and (QDMA trigger word)
3. Fill out the following values for this channel sorting example:
  ACNT: 2
  BCNT: 2
  CCNT: 4 (BUFSIZE)
  BIDX: 8
  CIDX: -6
Could you calculate these?
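For reference, here is one way to arrive at those numbers, assuming an AB-synchronized transfer and looking at the destination-side indexes with BUFSIZE = 4 stereo pairs (these assumptions are the author of this note's, not the slide's):
  ACNT = 2: one 16-bit sample is 2 bytes
  BCNT = 2: each event moves an L sample and then an R sample (2 arrays per frame)
  CCNT = 4: BUFSIZE = 4 stereo pairs, i.e. 4 frames
  DSTBIDX = 8: the R buffer starts 4 samples x 2 bytes = 8 bytes after the L buffer, so after writing L[n] the destination jumps +8 to R[n]
  DSTCIDX = -6: from the start of R[n] back to L[n+1] is 2(n+1) - (8 + 2n) = -6 bytes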
Additional Information
Notes