pg109-xfft - Copie
Transform v9.1
Chapter 1: Overview
Navigating Content by Design Process
Core Overview
Licensing and Ordering
Chapter 5: C Model
Features
Overview
Unpacking and Model Contents
Installation
FFT C Model Interface
C Model Example Code
Compiling with the FFT C Model
Appendix A: Upgrading
Migrating to the Vivado Design Suite
Upgrading in the Vivado Design Suite
Appendix B: Debugging
Finding Help on Xilinx.com
Debug Tools
Simulation Debug
AXI4-Stream Interface Debug
Features
• Forward and inverse complex FFT, run time configurable
Resources: Performance and Resource Utilization web page
Provided with Core
Design Files: Encrypted RTL
Overview
• Hardware, IP, and Platform Development: Creating the PL IP blocks for the hardware
platform, creating PL kernels, subsystem functional simulation, and evaluating the
Vivado® timing, resource and power closure. Also involves developing the hardware
platform for system integration. Topics in this document that apply to this design
process include:
° Clocking in Chapter 3
° Resets in Chapter 3
Core Overview
The FFT core computes an N-point forward DFT or inverse DFT (IDFT) where N can be 2^m,
m = 3–16.
For fixed-point inputs, the input data is a vector of N complex values represented as dual
b_x-bit twos-complement numbers, that is, b_x bits for each of the real and imaginary
components of the data sample, where b_x is in the range 8 to 34 bits inclusive. Similarly, the
phase factors b_w can be 8 to 34 bits wide.
For single-precision floating-point inputs, the input data is a vector of N complex values
represented as dual 32-bit floating-point numbers with the phase factors represented as
24- or 25-bit fixed-point numbers.
All memory is on-chip using either block RAM or distributed RAM. The N element output
vector is represented using b_y bits for each of the real and imaginary components of the
output data. Input data is presented in natural order and the output data can be in either
natural or bit/digit reversed order. The complex nature of data input and output is intrinsic
to the FFT algorithm, not the implementation.
The point size N, the choice of forward or inverse transform, the scaling schedule and the
cyclic prefix length are run time configurable. Transform type (forward or inverse), scaling
schedule and cyclic prefix length can be changed on a frame-by-frame basis. Changing the
point size resets the core.
Four architecture options are available: Pipelined Streaming I/O, Radix-4 Burst I/O, Radix-2
Burst I/O, and Radix-2 Lite Burst I/O. For detailed information about each architecture, see
Architecture Options.
The FFT is a computationally efficient algorithm for computing a Discrete Fourier Transform
(DFT) of sample sizes that are a positive integer power of 2. The DFT X(k), k = 0, ..., N − 1 of a
sequence x(n), n = 0, ..., N − 1 is defined as
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-jnk2\pi/N}, \quad k = 0, \ldots, N-1$$ (Equation 1-1)
where N is the transform size and j = √(−1). The inverse DFT (IDFT) is given by
$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{jnk2\pi/N}, \quad n = 0, \ldots, N-1$$ (Equation 1-2)
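As a cross-check of Equation 1-1 and Equation 1-2, the two sums can be evaluated directly. The following is a minimal sketch, not the core's algorithm: it uses O(N²) summation in double-precision floating point, and the function names `dft` and `idft` are illustrative only.

```python
import cmath

def dft(x):
    """Direct evaluation of Equation 1-1."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

def idft(X):
    """Direct evaluation of Equation 1-2, including the 1/N factor."""
    n_points = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * n * k / n_points)
            for k in range(n_points)) / n_points
        for n in range(n_points)
    ]

x = [1, 2, 3, 4, 0, 0, 0, 0]   # N = 8, the smallest point size (m = 3)
X = dft(x)
x_back = idft(X)               # recovers x up to floating-point error
```

X(0) is the sum of the input samples (10 here), which is a quick sanity check on any implementation.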
Algorithm
The FFT core uses the Radix-4 and Radix-2 decompositions for computing the DFT. For Burst
I/O architectures, the decimation-in-time (DIT) method is used, while the
decimation-in-frequency (DIF) method is used for the Pipelined Streaming I/O architecture.
When using Radix-4 decomposition, the N-point FFT consists of log4(N) stages, with each
stage containing N/4 Radix-4 butterflies. Point sizes that are not a power of 4 need an extra
Radix-2 stage for combining data. An N-point FFT using Radix-2 decomposition has log2(N)
stages, with each stage containing N/2 Radix-2 butterflies.
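The stage and butterfly counts described above can be tabulated with a few lines of code. This is a sketch under stated assumptions: the helper name `stage_plan` is hypothetical, and the extra Radix-2 stage for point sizes that are not a power of 4 is assumed to contain N/2 butterflies, like any other Radix-2 stage.

```python
def stage_plan(n_points, radix):
    """Return a list of (stage_radix, butterflies_per_stage) tuples."""
    m = n_points.bit_length() - 1          # N = 2^m
    if (1 << m) != n_points or not 3 <= m <= 16:
        raise ValueError("N must be 2^m with m = 3..16")
    if radix == 2:
        return [(2, n_points // 2)] * m
    if radix == 4:
        plan = [(4, n_points // 4)] * (m // 2)
        if m % 2:                          # N is not a power of 4
            plan.append((2, n_points // 2))
        return plan
    raise ValueError("radix must be 2 or 4")

plan_1024_r4 = stage_plan(1024, 4)   # five Radix-4 stages of 256 butterflies
plan_512_r4 = stage_plan(512, 4)     # four Radix-4 stages plus one Radix-2 stage
plan_1024_r2 = stage_plan(1024, 2)   # ten Radix-2 stages of 512 butterflies
```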
The inverse FFT (IFFT) is computed by conjugating the phase factors of the corresponding
forward FFT. The FFT core does not implement the 1/N scaling for inverse FFT. The scaling is
therefore as per forward FFT, simply with conjugated phase factors (twiddle factors).
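The consequence of omitting the 1/N factor can be illustrated numerically: a forward transform followed by an unscaled inverse transform returns N times the original sequence. The sketch below uses a direct floating-point sum rather than the core's fixed-point datapath, and the function name `transform` is illustrative only.

```python
import cmath

def transform(x, inverse=False):
    """Direct DFT sum. inverse=True conjugates the phase factors only;
    no 1/N scaling is applied, mirroring the behavior described above."""
    n_points = len(x)
    sign = 1 if inverse else -1
    return [
        sum(x[n] * cmath.exp(sign * 2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

x = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 2 + 0j]
y = transform(transform(x), inverse=True)   # equals N * x
recovered = [v / len(x) for v in y]         # 1/N applied outside the transform
```

Any 1/N factor therefore has to come from the scaling schedule or from downstream processing.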
Information about this and other Xilinx LogiCORE IP modules is available at the Xilinx
Intellectual Property page. For information about pricing and availability of other Xilinx
LogiCORE IP modules and tools, contact your local Xilinx sales representative.
Product Specification
Resource Utilization
For details about resource utilization, visit Performance and Resource Utilization.
Port Descriptions
This section describes the core ports as shown in Figure 2-1 and described in Table 2-1.
Figure 2-1: Core ports (schematic not reproduced). The ports shown are:
• Configuration channel: s_axis_config_tvalid, s_axis_config_tready, s_axis_config_tdata
• Data Input channel: s_axis_data_tvalid, s_axis_data_tready, s_axis_data_tdata,
s_axis_data_tlast
• Data Output channel: m_axis_data_tvalid, m_axis_data_tready, m_axis_data_tdata,
m_axis_data_tuser, m_axis_data_tlast
• Status channel: m_axis_status_tvalid, m_axis_status_tready, m_axis_status_tdata
• Clock, reset, and clock enable: aclk, aresetn, aclken
• Events: event_frame_started, event_tlast_unexpected, event_tlast_missing,
event_fft_overflow, event_data_in_channel_halt, event_data_out_channel_halt,
event_status_channel_halt
Notes:
1. All AXI4-Stream port names are lowercase, but for ease of visualization, uppercase is used in this document when referring
to port name suffixes, such as TDATA or TLAST.
Clocking
The core uses a single clock, called aclk. All input and output interfaces and internal state
are subject to this single clock.
Resets
aresetn (Synchronous Clear)
If the aresetn pin is present on the core, driving the pin Low causes all output pins,
internal counters, and state variables to be reset to their initial values. The initial values
described in Table 3-1 are also the default values that the circuit adopts on power-on,
regardless of whether the core is configured for aresetn or not. All pending load
processes, transform calculations, and unload processes stop and are re-initialized. NFFT is
set to the largest FFT point size permitted (the Transform Length value set in the Vivado®
Integrated Design Environment (IDE)). The scaling schedule is set to 1/N. For the Radix-4
Burst I/O and Pipelined Streaming I/O architectures with a non-power-of-four point size,
the last stage has a scaling of 1, and the rest have a scaling of 2. See Table 3-1.
The aresetn pin takes priority over aclken. If aresetn is asserted, reset occurs
regardless of the value of aclken. A minimum aresetn active pulse of two cycles is
required, because the signal is internally registered for performance. A pulse of one cycle
resets the core, but the response to the pulse is not in the cycle immediately following.
Event Signals
The core provides some real-time non-AXI signals to report information about the core
status. These event signals are updated on a clock cycle by clock cycle basis, and are
intended for use by reactive components such as interrupt controllers. These signals are not
optionally configurable from the IDE, but are removed by synthesis tools if left
unconnected.
event_frame_started
This event signal is asserted for a single clock cycle when the core starts to process a new
frame. This signal is provided to allow you to count frames and to synchronize the
configuration of the core to a particular frame if required.
event_tlast_missing
This event signal is asserted for a single clock cycle when s_axis_data_tlast is Low on
the last incoming data sample of a frame. This shows a configuration mismatch between the
core and the upstream data source with regard to the frame size, and indicates that the
upstream data source is configured to a larger point size than the core.
This is only calculated when the core starts processing a frame, so the event can lag the
missing s_axis_data_tlast by a large number of clock cycles.
event_tlast_unexpected
This event signal is asserted for a single clock cycle when the core sees
s_axis_data_tlast High on any incoming data sample that is not the last one in a
frame. This shows a configuration mismatch between the core and the upstream data
source with regard to the frame size, and indicates that the upstream data source is
configured to a smaller point size than the core. This is only calculated when the core starts
processing a frame, so the event can lag the unexpected High on s_axis_data_tlast by
a large number of clock cycles.
If there are multiple unexpected highs on s_axis_data_tlast for a frame, then this is
asserted for each of them.
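The two TLAST checks can be expressed as a small behavioral model. This is a sketch only: it classifies each incoming sample against the frame size the core expects and ignores timing, so it does not model the lag between the offending sample and the event pulse. The function name `tlast_events` is hypothetical.

```python
def tlast_events(tlast_bits, frame_size):
    """Return (sample_index, event_name) pairs for a stream of TLAST values.

    frame_size is the point size the core is configured for.
    """
    events = []
    for i, tlast in enumerate(tlast_bits):
        is_last_in_frame = (i % frame_size) == frame_size - 1
        if is_last_in_frame and not tlast:
            events.append((i, "event_tlast_missing"))
        elif tlast and not is_last_in_frame:
            events.append((i, "event_tlast_unexpected"))
    return events

# Core expects 8-point frames; the upstream source sends 4-point frames.
short_frames = tlast_events([0, 0, 0, 1, 0, 0, 0, 1], frame_size=8)
# Core expects 8-point frames; the upstream source sends a 16-point frame.
long_frames = tlast_events([0] * 15 + [1], frame_size=8)
```

In the first case the TLAST at sample 3 is unexpected; in the second the TLAST at sample 7 is missing, matching the smaller and larger upstream point sizes described above.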
event_fft_overflow
This event signal is asserted on every clock cycle when an overflow is seen in the data
samples being transferred on m_axis_data_tdata.
event_data_in_channel_halt
This event is asserted on every cycle where the core needs data from the Data Input channel
and no data is available.
• In Realtime Mode the core continues processing the frame even though it is
unrecoverably corrupted.
• In Non-Realtime Mode, core processing halts and only continues when data is written
to the Data Input channel. The frame is not corrupted.
In both modes the event remains asserted until data is available in the Data Input channel.
event_data_out_channel_halt
This event is asserted on every cycle where the core needs to write data to the Data Output
channel but cannot because the buffers in the channel are full. When this occurs, the core
processing is halted and all activity stops until space is available in the channel buffers. The
frame is not corrupted.
event_status_channel_halt
This event is asserted on every cycle where the core needs to write data to the Status
channel but cannot because the buffers on the channel are full. When this occurs, the core
processing is halted, and all activity stops until space is available in the channel buffers. The
frame is not corrupted. The event pin is only available in Non-Realtime mode.
AXI4-Stream Considerations
The conversion to AXI4-Stream interfaces brings standardization and enhances interoperability
of Xilinx® IP LogiCORE solutions. Other than general control signals such as aclk, aclken
and aresetn, and event signals, all inputs and outputs to the core are conveyed on
AXI4-Stream channels. A channel always consists of TVALID and TDATA, plus additional ports
(such as TREADY, TUSER and TLAST) and optional fields when required. Together, TVALID
and TREADY perform a handshake to transfer a message, where the payload is TDATA,
TUSER and TLAST. The core operates on the operands contained in the TDATA fields and
outputs the result in the TDATA field of the output channel.
For further details on AXI4-Stream Interfaces see the AMBA® AXI4-Stream Protocol
Specification (ARM IHI 0051A) [Ref 1] and the Xilinx Vivado AXI Reference Guide (UG1037)
[Ref 2].
Basic Handshake
Figure 3-1 shows the transfer of data in an AXI4-Stream channel. TVALID is driven by the
source (master) side of the channel and TREADY is driven by the receiver (slave). TVALID
indicates that the value in the payload fields (TDATA, TUSER and TLAST) is valid. TREADY
indicates that the slave is ready to receive data. When both TVALID and TREADY are TRUE in
a cycle, a transfer occurs. The master and slave set TVALID and TREADY respectively for the
next transfer appropriately.
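The handshake rule can be modeled cycle by cycle: a payload is transferred only in cycles where TVALID and TREADY are both High. The following is a minimal sketch with one list entry per clock cycle; the function name and the sample waveform are illustrative and are not taken from Figure 3-1.

```python
def axi_stream_transfers(tvalid, tready, tdata):
    """Return the payloads transferred, one per cycle where both
    TVALID and TREADY are High."""
    return [d for v, r, d in zip(tvalid, tready, tdata) if v and r]

# The master holds D2 stable for two cycles while the slave is not ready.
tvalid = [0, 1, 1, 1, 1, 0, 1]
tready = [1, 1, 0, 0, 1, 1, 1]
tdata = [None, "D1", "D2", "D2", "D2", None, "D3"]
transfers = axi_stream_transfers(tvalid, tready, tdata)
```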
Figure 3-1: Data transfer in an AXI4-Stream channel (waveform not reproduced). Signals
shown: ACLK, TVALID, TREADY, TDATA (D1 to D4), TLAST (L1 to L4), and TUSER (U1 to U4).
• All TDATA and TUSER fields are packed in little endian format. That is, bit 0 of a
sub-field is aligned to the same side as bit 0 of TDATA or TUSER.
• Fields are not included in TDATA or TUSER unless the core is configured in such a way
that it needs the fields to be present. For example, if the core is configured to have a
fixed point size (not run time configurable), no bits are allocated to the NFFT field that
specifies the point size.
• All TDATA and TUSER vectors are multiples of 8 bits. When all fields in a TDATA or
TUSER vector have been concatenated, the overall vector is padded to bring it up to an
8-bit boundary.
Configuration Channel
Table 3-2 shows the Configuration channel pinout.
TDATA Fields
The Configuration channel (s_axis_config) is an AXI channel that carries the fields in
Table 3-3 in its TDATA vector.
All fields with padding should be extended to the next 8-bit boundary if they do not
already finish on an 8-bit boundary. The core ignores the value of the padding bits, so they
can be driven to any value. Connecting them to constant values might help reduce device
resource usage.
TDATA Format
The configuration fields are packed into the s_axis_config_tdata vector in the
following order (starting from the LSB):
[Figure: field layout of s_axis_config_tdata[MSB downto 0]; image not reproduced]
TDATA Example
A core has a configurable transform size with a maximum size of 128 points, cyclic prefix
insertion and 3 FFT channels. The core needs to be configured to do an 8 point transform,
with an inverse transform performed on channels 0 and 1, and a forward transform
performed on channel 2. A 4 point cyclic prefix is required. The fields take on the values in
Table 3-4.
This gives a vector length of 19 bits. As all AXI channels must be aligned to byte boundaries,
5 padding bits are required, giving an s_axis_config_tdata length of 24 bits.
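The 19-bit and 24-bit figures can be reproduced by laying the fields out from the LSB and rounding up to byte boundaries. The sketch below is illustrative only. The field widths are assumptions chosen to match the 19-bit total: a 5-bit NFFT, a 7-bit CP_LEN (log2 of the 128-point maximum), and one FWD_INV bit per channel. The helper name `field_layout` is hypothetical, and field values are deliberately not encoded here.

```python
def field_layout(fields):
    """Compute bit positions for fields packed LSB first.

    fields: list of (name, width, pad_to_byte). A padded field is followed
    by enough padding bits to reach the next 8-bit boundary.
    Returns (positions, used_bits, total_bits); positions maps a field
    name to its (msb, lsb) pair.
    """
    positions = {}
    offset = 0
    used_bits = 0
    for name, width, pad_to_byte in fields:
        positions[name] = (offset + width - 1, offset)
        offset += width
        used_bits = offset
        if pad_to_byte:
            offset = -(-offset // 8) * 8   # round up to a multiple of 8
    total_bits = -(-used_bits // 8) * 8    # whole vector is byte aligned
    return positions, used_bits, total_bits

# Assumed widths for the example: 128-point maximum, 3 channels.
config_fields = [("NFFT", 5, True), ("CP_LEN", 7, True), ("FWD_INV", 3, False)]
positions, used_bits, total_bits = field_layout(config_fields)
padding_bits = total_bits - used_bits
```

Under these assumptions NFFT occupies bits 4 downto 0, CP_LEN bits 14 downto 8, and FWD_INV bits 18 downto 16, with the five padding bits above them.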
Figure 3-3: s_axis_config_tdata[23 downto 0] (image not reproduced)
Pinout
Table 3-5: Data Input Channel Pinout
Port Name | Port Width | I/O | Description
s_axis_data_tdata | Variable. See the Vivado IDE when configuring the core. | I | Carries the sample data: XN_RE and XN_IM
s_axis_data_tvalid | 1 | I | Asserted by the upstream master to signal that it is able to provide data
s_axis_data_tlast | 1 | I | Asserted by the upstream master on the last sample of the frame. This is not used by the core except to generate the event_tlast_unexpected and event_tlast_missing events
s_axis_data_tready | 1 | O | Used by the core to signal that it is ready to accept data
TDATA Fields
The Data Input channel (s_axis_data) is an AXI channel that carries the fields in Table 3-6
in its TDATA vector.
All fields with padding should be extended to the next 8-bit boundary if they do not already
finish on an 8-bit boundary. The core ignores the value of the padding bits, so they can be
driven to any value. Connecting them to constant values can help reduce device resource
usage.
These fields are then repeated for each channel in the design.
TDATA Format
The data fields are packed into the s_axis_data_tdata vector in the following order
(starting from the LSB):
[Figure: field layout of s_axis_data_tdata[MSB downto 0]; image not reproduced. Only the
fields for channel 0 are mandatory. Fields for remaining channels continue after channel 0
if required.]
TDATA Example
The core has been configured to have two FFT data channels with 12-bit data. Channel 0 has
the following sample value:
s_axis_data_tdata[63 downto 0]:
Channel 1 (bits 63 downto 32): 0000 0000 0000 0000 0000 0111 0000 0000
Channel 0 (bits 31 downto 0): 0000 0011 1110 0110 0000 0010 1101 1001
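The 64-bit pattern above can be rebuilt by packing each 12-bit component into its own 16-bit field. This is a sketch under stated assumptions: within each channel the real component (XN_RE) is assumed to sit below the imaginary component (XN_IM), the padding bits are driven to zero, and the decimal sample values are read back from the bit pattern rather than quoted from the source. The function name `pack_data_tdata` is hypothetical.

```python
def pack_data_tdata(samples, data_width):
    """Pack (re, im) pairs, channel 0 first, into one TDATA word.

    Each component is masked to data_width bits (twos complement) and
    placed in a field rounded up to a multiple of 8 bits.
    Returns (word, total_bits).
    """
    field_width = -(-data_width // 8) * 8
    mask = (1 << data_width) - 1
    word = 0
    offset = 0
    for re, im in samples:
        for component in (re, im):        # assumed order: XN_RE, then XN_IM
            word |= (component & mask) << offset
            offset += field_width
    return word, offset

# Channel 0 = 729 + 998j, channel 1 = 1792 + 0j, 12-bit data.
word, total_bits = pack_data_tdata([(729, 998), (1792, 0)], data_width=12)
```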
1. XK_INDEX
Pinout
TDATA Fields
The Data Output channel (m_axis_data) is an AXI channel that carries the fields in
Table 3-9 in its TDATA vector.
All fields are sign extended to the next 8-bit boundary if they do not already finish on an
8-bit boundary.
These fields are then repeated for each FFT channel in the design.
TDATA Format
The data fields are packed into the m_axis_data_tdata vector in the following order
(starting from the LSB):
[Figure: field layout of m_axis_data_tdata[MSB downto 0]; image not reproduced. Only the
fields for channel 0 are mandatory. Fields for remaining channels continue after channel 0
if required.]
TDATA Example
The core has been configured to have two FFT data channels with 12-bit output data. The
FFT produces the following sample result for channel 0:
m_axis_data_tdata[63 downto 0]:
Channel 1 (bits 63 downto 32): 1111 1000 0000 0000 0000 0111 0000 0000
Channel 0 (bits 31 downto 0): 1111 1011 1110 0110 0000 0010 1101 1001
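Reading the output word back is the reverse operation: each 16-bit field is masked to 12 bits and interpreted as twos complement. Because the fields are sign extended, the upper four bits of each field simply repeat the sign bit. The sketch assumes the real component occupies the lower field of each channel; the function name `unpack_data_tdata` is hypothetical.

```python
def unpack_data_tdata(word, n_channels, data_width):
    """Split a TDATA word into signed (re, im) pairs, channel 0 first."""
    field_width = -(-data_width // 8) * 8
    mask = (1 << data_width) - 1
    sign_bit = 1 << (data_width - 1)
    samples = []
    offset = 0
    for _ in range(n_channels):
        pair = []
        for _ in range(2):                # assumed order: real, then imaginary
            raw = (word >> offset) & mask
            pair.append(raw - (1 << data_width) if raw & sign_bit else raw)
            offset += field_width
        samples.append(tuple(pair))
    return samples

samples = unpack_data_tdata(0xF8000700FBE602D9, n_channels=2, data_width=12)
# samples == [(729, -1050), (1792, -2048)]
```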
TUSER Fields
The Data Output channel carries the fields in Table 3-11 in its TUSER vector.
All fields with padding should be 0 extended to the next 8-bit boundary if they do not
already finish on an 8-bit boundary.
TUSER Format
The data fields are packed into the m_axis_data_tuser vector in the following order
(starting from the LSB):
Note that the core cannot be configured to have both BLK_EXP and OVFLO.
Figure 3-8: m_axis_data_tuser[MSB downto 0] (image not reproduced)
Optional fields are shown as dotted. As all fields are optional, it is possible to configure the
core such that TUSER would have no fields. In this case it is automatically removed from the
core interface.
TUSER Examples
Example 1
The core has been configured to have two FFT data channels, a 128 point transform size,
overflow, and XK_INDEX. The third sample (XK_INDEX = 3) has an overflow on channel 0
but not on channel 1. XK_INDEX is 7 bits long.
This gives a vector length of 10 bits. As all AXI channels must be aligned to byte boundaries,
6 padding bits are required, giving an m_axis_data_tuser length of 16 bits.
Figure 3-9: m_axis_data_tuser[15 downto 0] (image not reproduced)
Example 2
The core has been configured to have two FFT data channels, block exponent, but no
XK_INDEX. The output sample for channel 0 has a block exponent of 4, and the output
sample for channel 1 has a block exponent of 31.
m_axis_data_tuser[15 downto 0] (image not reproduced): the BLK_EXP field for channel 1 is
placed above the BLK_EXP field for channel 0.
Status Channel
The Status channel contains per-frame status information, that is, information that relates
to an entire frame of data. This is intended for downstream slaves that do not operate on
the data directly but might need to know the information to control another part of the
system. The exact position in the frame where the status is sent depends on the nature of
the status information. The following information is classed as per-frame status:
Note that the core cannot be configured to have both BLK_EXP and OVFLO.
BLK_EXP status information is sent at the start of the frame and OVFLO status information
is sent at the end of the frame.
Pinout
TDATA Fields
The Status channel carries the fields in Table 3-15 in its TDATA vector.
All fields with padding should be 0 extended to the next 8-bit boundary if they do not
already finish on an 8-bit boundary.
TDATA Format
The data fields are packed into the m_axis_status_tdata vector in the following order
(starting from the LSB):
1. (optional) BLK_EXP plus padding for channel 0
2. (optional) BLK_EXP plus padding for channel 1 etc.
3. (optional) OVFLO for channel 0
4. (optional) OVFLO for channel 1 etc.
5. Padding to make TDATA 8-bit aligned. Only needed when OVFLO is present
Note that the core cannot be configured to have both BLK_EXP and OVFLO.
[Figure: field layout of m_axis_status_tdata[MSB downto 0]; image not reproduced]
TDATA Example
Example 1
The core has been configured to have four FFT data channels and overflow. The current
frame contains an overflow in channels 2 and 3.
This gives a vector length of 4 bits. As all AXI channels must be aligned to byte boundaries,
4 padding bits are required, giving an m_axis_status_tdata length of 8 bits.
Figure 3-12: m_axis_status_tdata[7 downto 0] = 0000 1100 (four padding bits, followed by
OVFLO for channels 3, 2, 1, and 0)
Example 2
The core has been configured to have one FFT data channel and overflow. The current frame
contains no overflow.
This gives a vector length of 1 bit. As all AXI channels must be aligned to byte boundaries,
7 padding bits are required, giving an m_axis_status_tdata length of 8 bits.
Figure 3-13: m_axis_status_tdata[7 downto 0] = 0000 0000 (seven padding bits, followed by
OVFLO for channel 0)
Theory of Operation
Finite Word Length Considerations
The Burst I/O architectures process an array of data by successive passes over the input data
array. On each pass, the algorithm performs Radix-4 or Radix-2 butterflies, where each
butterfly picks up four or two complex numbers, respectively, and returns four or two complex
numbers to the same memory. The numbers returned to memory by the core are potentially
larger than the numbers picked up from memory. A strategy must be employed to
accommodate this dynamic range expansion. A full explanation of scaling strategies and
their implications is beyond the scope of this document. For more information about this
topic, see A Simple Fixed-Point Error Bound for the Fast Fourier Transform [Ref 3] and Theory
and Application of Digital Signal Processing [Ref 4].
For a Radix-4 DIT FFT, the values computed in a butterfly stage can experience growth by a
factor of up to 1 + 3√2 ≈ 5.242. This implies a bit growth of up to 3 bits.
For Radix-2, the growth is by a factor of up to 1 + √2 ≈ 2.414. This implies a bit growth of up
to 2 bits. This bit growth can be handled in three ways:
• Performing the calculations with no scaling and carrying all significant integer bits to
the end of the computation
• Scaling at each stage using a fixed-scaling schedule
• Scaling automatically using block floating-point
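The bit-growth figures quoted for the Radix-4 and Radix-2 butterflies follow from rounding the base-2 logarithm of each growth factor up to a whole number of bits. A short check, with illustrative variable names:

```python
import math

radix4_growth = 1 + 3 * math.sqrt(2)   # about 5.243
radix2_growth = 1 + math.sqrt(2)       # about 2.414

# Whole bits needed to hold the worst-case growth of one butterfly stage.
radix4_bits = math.ceil(math.log2(radix4_growth))   # 3
radix2_bits = math.ceil(math.log2(radix2_growth))   # 2
```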
All significant integer bits are retained when using full-precision unscaled arithmetic. The
width of the datapath increases to accommodate the bit growth through the butterfly. The
fractional bits created by the multiplication are truncated (or rounded)
after the multiplication. The width of the output is (input width + log2(transform length) + 1).
This accommodates the worst case scenario for bit growth.
Consider an unscaled Radix-2 DIT FFT: the datapath in each stage must grow by 1 bit as the
adder and subtracter in the butterfly might add/subtract two full-scale values and produce
a sample which has grown in width by 1 bit. This yields the log2(transform length) part of
the increase in the output width relative to the input width. The complex multiplier
preserves the magnitude of an input (as it applies a rotation on the complex plane), but can
theoretically produce bit-growth when the magnitude of the input is greater than 1 (for
example, 1+j has a magnitude of 1.414). This means that the complex multiplier bit growth
must only be considered once in the entire FFT process, yielding the additional +1 increase
in the output width relative to the input width. For example, a 1024-point transform with an
input of 16 bits consisting of 1 integer bit and 15 fractional bits has an output of 27 bits with
12 integer bits and 15 fractional bits. Note that the core does not have a specific location
for the binary point. The output maintains the same binary point location as the input. For
the preceding example, a 16-bit input with 3 integer bits and 13 fractional bits would have
an unscaled output of 27 bits with 14 integer bits and 13 fractional bits.
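The width rule (input width + log2(transform length) + 1) and the two worked examples can be checked with a short helper. This is a sketch; the function name `unscaled_output_format` is hypothetical, and the split into integer and fractional bits is only a bookkeeping convention, because the core itself has no fixed binary point.

```python
def unscaled_output_format(input_width, integer_bits, n_points):
    """Return (output_width, output_integer_bits, fractional_bits)
    for unscaled arithmetic."""
    growth = (n_points.bit_length() - 1) + 1   # log2(N) + 1
    fractional_bits = input_width - integer_bits
    return input_width + growth, integer_bits + growth, fractional_bits

fmt_a = unscaled_output_format(16, 1, 1024)   # (27, 12, 15)
fmt_b = unscaled_output_format(16, 3, 1024)   # (27, 14, 13)
```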
$$s = 2^{\sum_{i=0}^{\log N - 1} b_i}$$ (Equation 3-1)
The scaling results in the final output sequence being modified by the factor 1/s. For the
forward FFT, the output sequence X’ (k), k = 0,...,N - 1 computed by the core is defined as
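$$X'(k) = \frac{1}{s} X(k) = \frac{1}{s} \sum_{n=0}^{N-1} x(n)\, e^{-jnk2\pi/N}, \quad k = 0, \ldots, N-1$$ (Equation 3-2)
For the inverse FFT, the output sequence is
$$x(n) = \frac{1}{s} \sum_{k=0}^{N-1} X(k)\, e^{jnk2\pi/N}, \quad n = 0, \ldots, N-1$$ (Equation 3-3)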
If a Radix-4 algorithm scales by a factor of 4 in each stage, the factor of 1/s is equal to the
factor of 1/N in the inverse FFT equation (Equation 1-2). For Radix-2, scaling by a factor of
2 in each stage provides the factor of 1/N.
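The relationship between the per-stage scaling and the overall factor s in Equation 3-1 can be checked numerically. The sketch assumes that each b_i is the right shift, in bits, applied in stage i, so that scaling by 4 corresponds to b_i = 2 and scaling by 2 to b_i = 1. The function name `scaling_factor` is hypothetical.

```python
def scaling_factor(schedule_bits):
    """Equation 3-1: s = 2 ** (sum of the per-stage scaling values b_i)."""
    return 2 ** sum(schedule_bits)

# 1024-point Radix-4: five stages, each scaling by 4 (b_i = 2).
s_radix4 = scaling_factor([2] * 5)
# 1024-point Radix-2: ten stages, each scaling by 2 (b_i = 1).
s_radix2 = scaling_factor([1] * 10)
```

Both schedules give s = 1024 = N, so 1/s matches the 1/N factor in Equation 1-2.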
With block floating-point, each stage applies sufficient scaling to keep numbers in range,
and the scaling is tracked by a block exponent.
As with unscaled arithmetic, for scaled and block floating-point arithmetic, the core does
not have a specific location for the binary point. The location of the binary point in the
output data is inherited from the input data and then shifted by the scaling applied.
Floating-Point Considerations
The FFT core optionally accepts data in IEEE-754 single-precision format with 32-bit words
consisting of a 1-bit sign, 8-bit exponent, and 23-bit fraction. The construction of the word
matches that of the Xilinx Floating-Point Operator core.
When comparing results against third party models, for example, MATLAB, it should be
noted that a scaling factor is usually required to ensure that the results match. The scaling
factor is data-dependent because the input data dictates the level of normalization required
prior to the internal fixed-point core. Because the core does not provide this scaling factor
in floating-point mode, you can apply scaling after the output of the core, if necessary.
RECOMMENDED: Xilinx recommends using the FFT C model and MEX function when
evaluating floating-point datasets.
All optimization options (memory types and DSP slice optimization) remain available when
floating-point input data is selected, allowing you to trade off resources with transform
time.
Transform time for Burst I/O architectures is increased by approximately N, the number of
points in the transform, due to the input normalization requirements. For the Pipelined
Streaming I/O architecture, the initial latency to fill the pipeline is increased, but data still
streams through the core with no gaps.
Denormalized Numbers
The floating-point interface to the FFT core does not support denormalized numbers. To
match the behavior of the Xilinx Floating-Point Operator core, the core treats denormalized
operands as zero, with a sign taken from the denormalized number.
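The flush-to-zero treatment of denormalized operands can be modeled on the IEEE-754 single-precision bit pattern. The sketch below is a software model of the described behavior, not the core's implementation; the function name `flush_denormal` is hypothetical.

```python
import struct

def flush_denormal(value):
    """Return value as a single-precision number, with denormalized
    inputs replaced by a zero that keeps the original sign."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0 and fraction != 0:   # denormalized number
        bits &= 0x80000000                # keep only the sign bit
    return struct.unpack("<f", struct.pack("<I", bits))[0]

tiny_negative = flush_denormal(-1e-40)   # below the smallest normal float32
normal_value = flush_denormal(1.5)       # passes through unchanged
```

Here `tiny_negative` is negative zero, which compares equal to 0.0 but retains its sign bit.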
Due to the finite wordlength effects described previously, noise is introduced during the
transform, resulting in the output data not being perfectly symmetric. The DIT and DIF FFT
algorithms have different noise effects due to the different calculation order.
For a thorough treatment of this topic, see Limited Dynamic Range of Spectrum Analysis Due
To Round off Errors Of The FFT [Ref 5] and Influence of Digital Signal Processing on Precision
of Power Quality Parameters Measurement [Ref 6].
The asymmetry between the two halves of the result is more noticeable at larger point sizes.
In addition, the noise is more prominent in the lower frequency bins. Therefore, Xilinx
recommends that the upper half (N/2+1 to N points) of the output data is used when
performing a real-valued FFT.
Rounding Implementation
An option is available, in all architectures, to apply convergent rounding to the data after
the butterfly stage. However, selecting this option does not apply convergent rounding to
all points in the datapath where wordlength reduction occurs.
In particular, the outputs of all complex multipliers in the FFT datapath are truncated to
reduce datapath width (while still maintaining adequate precision) and a simple rounding
constant added to the fractional bits. This constant implements non-symmetric,
round-towards-minus-infinity rounding, and can introduce a small bias to the FFT results
over a large number of samples.
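The bias of a non-symmetric scheme can be seen by comparing it with convergent rounding over a balanced set of values. The sketch below is illustrative only: it uses plain truncation (rounding toward minus infinity) as the non-symmetric scheme, because the exact rounding constant used inside the core is not given here. The function names are hypothetical.

```python
def truncate(x, k):
    """Drop k LSBs of a twos-complement integer (rounds toward minus infinity)."""
    return x >> k

def convergent_round(x, k):
    """Drop k LSBs, rounding to nearest with ties going to the even result."""
    half = 1 << (k - 1)
    quotient = x >> k
    fraction = x & ((1 << k) - 1)
    if fraction > half or (fraction == half and quotient & 1):
        quotient += 1
    return quotient

def mean_error(round_fn, k, values):
    """Average of (rounded result - exact value), in output LSBs."""
    values = list(values)
    return sum(round_fn(v, k) - v / (1 << k) for v in values) / len(values)

values = range(-4096, 4096)
bias_truncate = mean_error(truncate, 4, values)            # -0.46875
bias_convergent = mean_error(convergent_round, 4, values)  # 0.0
```

Truncation shifts the average result by almost half an LSB, while convergent rounding leaves no net bias over this input set.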
This slot noise input data frame is fed to the FFT core to see how shallow the slot becomes
due to the finite precision arithmetic. The depth of the slot shows the dynamic range of the
FFT.
Figure 3-15 through Figure 3-24 show the effect of input data width on the dynamic range.
All FFTs have the same bit width for both data and phase factors. Block floating-point
arithmetic is used with rounding after the butterfly. The figures show the input data slot and
the output data slot for bit widths of 24, 20, 16, 12, and 8.
Figure 3-15 through Figure 3-24: Input data slot and output data slot for bit widths of 24,
20, 16, 12, and 8 (plots not reproduced). Horizontal axis: FFT Bin Number (100 to 1000).
Vertical axis: dB (10 to -140).
Figure 3-25, Figure 3-26, and Figure 3-27 display the results of using unscaled, scaled
(scaling of 1/1024), and block floating-point. All three FFTs are 1024 point, Radix-4 Burst
I/O transforms with 16-bit input, 16-bit phase factors, and convergent rounding.
Figure 3-25 through Figure 3-27: Dynamic range for unscaled, scaled, and block
floating-point arithmetic (plots not reproduced). Horizontal axis: FFT Bin Number (100 to
1000). Vertical axis: dB (10 to -140).
After the butterfly computation, the LSBs of the datapath can be truncated or rounded. The
effects of these options are shown in Figure 3-28 and Figure 3-29. Both transforms are 1024
points with 16-bit data and phase factors using block floating-point arithmetic.
Figure 3-28 and Figure 3-29: Dynamic range with truncation and with rounding after the
butterfly (plots not reproduced). Horizontal axis: FFT Bin Number (100 to 1000). Vertical
axis: dB (10 to -140).
For illustration purposes, the effect of point size on dynamic range is displayed in Figure 3-30
through Figure 3-32. The FFTs in these figures use 16-bit input and phase factors along with
convergent rounding and block floating-point arithmetic.
Figure 3-30 through Figure 3-32: Effect of point size on dynamic range (plots not
reproduced). Horizontal axis: FFT Bin Number (10 to 60 in the first plot, 200 to 2000 in the
second; the axis of the third plot is not recoverable). Vertical axis: dB (10 to -140).
All of the preceding dynamic range plots show the results for the Radix-4 Burst I/O
architecture. Figure 3-33 and Figure 3-34 show two plots for the Radix-2 Burst I/O
architecture. Both use 16-bit input and phase factors along with convergent rounding and
block floating-point.
[Figure 3-33 and Figure 3-34: plots of magnitude in dB (10 to -140) versus FFT Bin Number; the horizontal axis runs to 60 in the first plot and to 1000 in the second]
Architecture Options
The FFT core provides four architecture options to offer a trade-off between core size and
transform time.
• Radix-2 Lite Burst I/O – Based on the Radix-2 architecture, this variant uses a
time-multiplexed approach to the butterfly for an even smaller core, at the cost of
longer transform time.
Figure 3-35 illustrates the trade-off of throughput versus resource use for the four
architectures. As a rule of thumb, resource use differs by a factor of 2 from one architecture
to the next. The example is for a point size that is an even power of 2, so the Radix-4
architecture does not need an additional Radix-2 stage.
All four architectures can be configured to use a fixed-point interface with one of three
fixed-point arithmetic methods (unscaled, scaled, or block floating-point) or can instead
use a floating-point interface.
In the Radix-2 Burst I/O, Radix-2 Lite Burst I/O, and Pipelined Streaming I/O architectures,
the Bit Reverse order is simple to calculate by taking the index of the data point, written in
binary, and reversing the order of the digits. Hence, 0000, 0001, 0010, 0011, 0100,...(0, 1, 2,
3, 4,...) becomes 0000, 1000, 0100, 1100, 0010,...(0, 8, 4, 12, 2,...).
In the case of the Radix-4 Burst I/O architecture, the reversal applies to digits and, therefore,
is called Digit Reversal. A digit in Radix-4 is two bits. Hence, 0000, 0001, 0010, 0011,
0100,...(0, 1, 2, 3, 4,...) becomes 0000, 0100, 1000, 1100, 0001,...(0, 4, 8, 12, 1,...), as the pairs
of digits are reversed. Where the transform size requires an odd number of index bits, the
odd digit in the least significant place is moved to the most significant place, so 00000,
00001, 00010, 00011, 00100,... (0, 1, 2, 3, 4,...) becomes 00000, 10000, 00100, 10100,
01000,...(0, 16, 4, 20, 8,...).
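The index mappings described above can be sketched in Python as follows. The function names are hypothetical, and the handling of an odd number of index bits is an interpretation of the example sequence given in the text rather than a statement of how the core is built.

```python
def bit_reverse(index, nbits):
    """Reverse the order of the nbits binary digits of index."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def digit_reverse_radix4(index, nbits):
    """Reverse the order of the 2-bit (Radix-4) digits of index.

    When nbits is odd, the least significant bit is moved to the most
    significant place and the remaining 2-bit digits are reversed.
    """
    odd_bit = 0
    if nbits % 2:
        odd_bit = index & 1
        index >>= 1
    result = 0
    for _ in range(nbits // 2):
        result = (result << 2) | (index & 0b11)
        index >>= 2
    if nbits % 2:
        result |= odd_bit << (nbits - 1)
    return result


print([bit_reverse(i, 4) for i in range(5)])           # [0, 8, 4, 12, 2]
print([digit_reverse_radix4(i, 4) for i in range(5)])  # [0, 4, 8, 12, 1]
print([digit_reverse_radix4(i, 5) for i in range(5)])  # [0, 16, 4, 20, 8]
```

Both mappings are their own inverse when the number of index bits is even, so applying the same function to a reordered index recovers the natural-order index in that case.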
Note: The core can optionally output a data point index along with the data. See XK Index for more
information.
IMPORTANT: Continually streaming data does not imply that AXI4-Stream waitstates from the FFT
core can be ignored. There are situations where the FFT core might have to insert waitstates to pause
the incoming sample data.
In the scaled fixed-point mode, the data is scaled after every pair of Radix-2 stages. The
block floating-point mode might use significantly more resources than the scaled mode, as
it must maintain extra bits of precision to allow dynamic scaling without impacting
performance. Therefore, if the input data is well understood and is unlikely to exhibit large
amplitude fluctuation, using scaled arithmetic (with a suitable scaling schedule to avoid
overflow in the known worst case) is sufficient, and resources might be saved.
The input data is presented in natural order. The unloaded output data can either be in bit
reversed order or in natural order. When natural order output data is selected, additional
memory resource is utilized.
This architecture covers point sizes from 8 to 65536. You have the flexibility to select the
number of stages to use block RAM for data and phase factor storage. The remaining stages
use distributed memory.
This architecture has lower resource usage than the Pipelined Streaming I/O architecture,
but a longer transform time, and supports point sizes from 64 to 65536. Data and phase
factors can be stored in block RAM or in distributed RAM (the latter for point sizes less than
or equal to 1024).
[Figures: architecture block diagrams showing Input Data, Output Data, ROM for Twiddles, and switches around a Radix-4 dragonfly with Data RAM 0 to Data RAM 3; a Radix-2 butterfly with Data RAM 0 and Data RAM 1; and a Radix-2 butterfly with Data DPM 0 and Data DPM 1, annotated "Multiply real one cycle, imaginary the next" and "Generate one output each cycle"]
Transform Size
The transform point size can be set through the NFFT field in the Configuration channel if
the run time configurable transform length option is selected. Valid settings and the
corresponding transform sizes are provided in Table 3-18. If the NFFT value entered is too
large, the core sets itself to the largest available point size (selected in the IDE). If the value
is too small, the core sets itself to the smallest available point size: 64 for the Radix-4 Burst
I/O architecture and 8 for the other architectures.
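The clamping behavior described above can be sketched as follows. Table 3-18 is not reproduced here, so the sketch assumes that NFFT encodes log2(point size); the function and parameter names are hypothetical.

```python
def effective_point_size(nfft, max_point_size, radix4_burst):
    """Point size the core would use for a requested NFFT value.

    Assumption: NFFT holds log2(point size). max_point_size is the largest
    point size selected when the core was customized.
    """
    min_point_size = 64 if radix4_burst else 8
    requested = 1 << nfft
    # Too large: clamp to the configured maximum. Too small: clamp to the
    # smallest size the architecture supports.
    return max(min_point_size, min(requested, max_point_size))


print(effective_point_size(12, 1024, radix4_burst=False))  # 1024
print(effective_point_size(3, 1024, radix4_burst=True))    # 64
```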
A scaling schedule is not required (SCALE_SCH is ignored) when the FFT core is configured to
process floating-point data. Normalization and scaling are handled internally for
floating-point data.
The scaling performed during successive stages can be set using the appropriate
SCALE_SCH field in the Configuration channel. For the Radix-4 Burst I/O and Radix-2
architectures, the value of the SCALE_SCH field is used as pairs of bits [... N4, N3, N2, N1,
N0], each pair representing the scaling value for the corresponding stage. Stages are
computed starting with stage 0 as the two LSBs. There are log4(point size) stages for
Radix-4 and log2(point size) stages for Radix-2. In each stage, the data can be shifted right
by 0, 1, 2, or 3 bits, which corresponds to SCALE_SCH values of 00, 01, 10, and 11. For
example, for Radix-4, when N = 1024, [01 10 00 11 10] translates to a right shift by 2 for
stage 0, a shift by 3 for stage 1, no shift for stage 2, a shift by 2 for stage 3, and a shift by 1
for stage 4 (there are log4(1024) = 5 Radix-4 stages). This scaling schedule scales by a total
of 8 bits, which gives a scaling factor of 1/256. The conservative schedule SCALE_SCH =
[10 10 10 10 11] completely avoids overflows in the Radix-4 Burst I/O architecture. For the
Radix-2 Burst I/O and Radix-2 Lite Burst I/O architectures, the conservative scaling
schedule of [01 01 01 01 01 01 01 01 01 10] prevents overflow for N = 1024 (there are
log2(1024) = 10 Radix-2 stages).
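To make the bit layout concrete, the following Python sketch decodes a SCALE_SCH value into per-stage right shifts and sums them. The helper names are hypothetical; only the field layout (two bits per stage, stage 0 in the two LSBs) is taken from the text above.

```python
def decode_scale_schedule(scale_sch, num_stages):
    """Return the right shift (0 to 3 bits) for each stage, stage 0 first."""
    return [(scale_sch >> (2 * stage)) & 0b11 for stage in range(num_stages)]


def total_scaling_bits(scale_sch, num_stages):
    """Total right shift in bits; the overall scaling factor is 1 / 2**total."""
    return sum(decode_scale_schedule(scale_sch, num_stages))


# Radix-4, N = 1024: five stages, schedule written with the MSB pair first.
example = 0b01_10_00_11_10
print(decode_scale_schedule(example, 5))  # [2, 3, 0, 2, 1]
print(total_scaling_bits(example, 5))     # 8
```

Applied to the conservative schedules quoted above, the same helpers give a total shift of 11 bits for both the Radix-4 schedule (five stages) and the Radix-2 schedule (ten stages) at N = 1024.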
For the Pipelined Streaming I/O architecture, consider every pair of adjacent Radix-2 stages
as a group. That is, group 0 contains stage 0 and 1, group 1 contains stage 2 and 3, and so
on. The value of the SCALE_SCH field is also used as pairs of bits [... N4, N3, N2, N1, N0].
Each pair represents the scaling value for the corresponding group of two stages. Groups
are computed starting with group 0 as the two LSBs. In each group, the data can be shifted
by 0, 1, 2, or 3 bits which corresponds to SCALE_SCH values of 00, 01, 10, and 11. For
example, when N = 1024, [10 10 00 01 11] translates to a right shift by 3 for group 0 (stages
0 and 1), a shift by 1 for group 1 (stages 2 and 3), no shift for group 2 (stages 4 and 5), a shift
by 2 for group 3 (stages 6 and 7), and a shift by 2 for group 4 (stages 8 and 9). The
conservative schedule SCALE_SCH = [10 10 10 10 11] completely avoids overflows in the
Pipelined Streaming I/O architecture. When the point size is not a power of 4, the last group
only contains one stage, and the maximum bit growth for the last group is one bit.
Therefore, the two MSBs of the scaling schedule can only be 00 or 01. A conservative scaling
schedule for N = 512 is SCALE_SCH = [01 10 10 10 11].
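The grouping of Radix-2 stages for the Pipelined Streaming I/O architecture can be sketched as below. The function name is hypothetical, and raising an error for an out-of-range final group is a choice made for this sketch; the text does not say how the core reacts to such a value.

```python
def pipelined_group_shifts(scale_sch, log2_point_size):
    """Map each 2-bit SCALE_SCH group to the Radix-2 stages it covers.

    Returns a list of (stages, shift) tuples, group 0 first.
    """
    num_groups = (log2_point_size + 1) // 2
    groups = []
    for group in range(num_groups):
        shift = (scale_sch >> (2 * group)) & 0b11
        first = 2 * group
        last = min(first + 1, log2_point_size - 1)
        stages = tuple(range(first, last + 1))
        # A final group holding a single stage grows by at most one bit,
        # so only the values 00 and 01 are meaningful there.
        if len(stages) == 1 and shift > 1:
            raise ValueError("last group has one stage; shift must be 0 or 1")
        groups.append((stages, shift))
    return groups


for stages, shift in pipelined_group_shifts(0b01_10_10_10_11, 9):  # N = 512
    print(stages, shift)
```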
The initial value and reset value of the FWD_INV field is 1 (forward). The initial and reset
scaling schedule scales by 1/N, which translates to [10 10 10 10... 10] for the Radix-4 Burst
I/O and Pipelined Streaming I/O architectures, and [01 01... 01] for the Radix-2 architectures. The core uses
the (2*number of stages) LSBs for the scaling schedule. So, when the point size decreases,
the leftover MSBs are ignored. However, all bits are programmed into the core and are used
in later transforms if the point size increases.
the end of the output data) followed by the complete output data, all in natural order. Cyclic
prefix insertion is only available when output ordering is Natural Order.
When cyclic prefix insertion is used, the length of the cyclic prefix can be set
frame-by-frame without interrupting frame processing. The cyclic prefix length can be any
number of samples from zero to one less than the point size. The cyclic prefix length is set
by the CP_LEN field in the Configuration channel. For example, when N = 1024, the cyclic
prefix length can be from 0 to 1023 samples, and a CP_LEN value of 0010010110 produces
a cyclic prefix consisting of the last 150 samples of the output data.
The initial value and reset value of CP_LEN is 0 (no cyclic prefix). The core uses the
log2(point size) MSBs of CP_LEN for the cyclic prefix length. So, when the point size
decreases, the leftover LSBs are ignored. This effectively scales the cyclic prefix length with
the point size, keeping them in approximately constant proportion. However, all bits of
CP_LEN are programmed into the core and are used in later transforms if the point size
increases.
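The way the prefix length scales with point size can be illustrated with a short Python sketch. It assumes that the CP_LEN field is log2(maximum point size) bits wide, which is not stated in this section, and the function name is hypothetical.

```python
def effective_cp_len(cp_len, log2_max_point_size, log2_point_size):
    """Cyclic prefix length used for the current point size.

    Assumption: CP_LEN is log2_max_point_size bits wide and the core keeps
    only its log2_point_size most significant bits.
    """
    if log2_point_size > log2_max_point_size:
        raise ValueError("point size exceeds the configured maximum")
    return cp_len >> (log2_max_point_size - log2_point_size)


cp_len = 0b0010010110                      # 150 samples at N = 1024
print(effective_cp_len(cp_len, 10, 10))    # 150
print(effective_cp_len(cp_len, 10, 8))     # 37
```

Here 150/1024 and 37/256 are both close to 0.145, which is the approximately constant proportion the text refers to.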
Transform Status
Overflow
Fixed-Point Data
The Overflow (OVFLO) field in the Data Output and Status channels is only available when
Scaled arithmetic is used. OVFLO is driven High during unloading if any point in the data
frame overflowed. For a multichannel core, there is a separate OVFLO field for each channel.
When an overflow occurs in the core, the data is wrapped rather than saturated, resulting in
the transformed data becoming unusable for most applications.
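The difference between wrapping and saturating can be seen in a small two's complement sketch. The function names and the 16-bit width are illustrative assumptions; the sketch models the arithmetic only, not the core's datapath.

```python
def wrap(value, width):
    """Keep the low `width` bits and reinterpret them as two's complement."""
    value &= (1 << width) - 1
    if value >> (width - 1):          # sign bit set
        value -= 1 << width
    return value


def saturate(value, width):
    """Clamp to the representable two's complement range."""
    highest = (1 << (width - 1)) - 1
    lowest = -(1 << (width - 1))
    return max(lowest, min(highest, value))


overflowed = 40000                    # does not fit in 16 signed bits
print(wrap(overflowed, 16))           # -25536
print(saturate(overflowed, 16))       # 32767
```

A wrapped result changes sign and bears no simple relation to the true value, which is why an overflowed frame is generally unusable.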
Floating-Point Data
The Overflow field is used to indicate an exponent overflow when the FFT is processing
floating-point data. The output sample which overflowed is set to ± ∞ , depending on the
sign of the internal result. The Overflow field is not asserted when a NaN value is present on
the output. NaN values can only occur at the FFT output when the input data frame contains
NaN or ± ∞ samples.
Block Exponent
The Block Exponent (BLK_EXP) field in the Data Output and the Status channels (used only
with the block floating-point option) contains the block exponent. For a multichannel core,
there is a separate BLK_EXP field for each channel. The value present in the field represents
the total number of bits the data was scaled during the transform. For example, if BLK_EXP
has a value of 00101 = 5, this means the associated output data (XK_RE, XK_IM) was scaled
by 5 bits (shifted right by 5 bits), or in other words, was divided by 32, to fully use the
available dynamic range of the output datapath without overflowing. Because block scaling
is performed based on the maximum value at each stage of processing, the BLK_EXP value
may differ from one architecture to another, even with identical input data, due to the
different inherent scaling performed per stage of processing in each architecture.
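As a sketch of how downstream logic might use the block exponent, the following Python function undoes the scaling so that frames with different BLK_EXP values can be compared on a common scale. The function name is hypothetical.

```python
def restore_block_scaled(xk_re, xk_im, blk_exp):
    """Undo the block floating-point scaling of one output sample.

    The core shifted the data right by blk_exp bits, so multiplying by
    2**blk_exp returns the sample to the unscaled magnitude (minus the
    precision lost in the discarded LSBs).
    """
    scale = 1 << blk_exp
    return xk_re * scale, xk_im * scale


blk_exp = 0b00101                                # 5: data was divided by 32
print(restore_block_scaled(100, -3, blk_exp))    # (3200, -96)
```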
XK Index
The XK_INDEX field (if present in the Data Output channel) gives the sample number of the
XK_RE/XK_IM data being presented at the same time. In the case of natural order outputs,
XK_INDEX increments from 0 to (point size) - 1. When bit reversed outputs are used,
XK_INDEX covers the same range of numbers, but in a bit (or digit) reversed manner.
For example, when you have an 8 point FFT, XK_INDEX takes on the values in Table 3-19.
If cyclic prefix insertion is used, the cyclic prefix is unloaded first and XK_INDEX counts
from (point size) - (cyclic prefix length) up to (point size) - 1. After the cyclic prefix has been
unloaded, or if the cyclic prefix length is zero, the whole frame of output data is unloaded.
XK_INDEX counts from 0 up to (point size) - 1 as before. Cyclic Prefix Insertion is only
possible with natural order outputs.
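The XK_INDEX sequence for natural order output with a cyclic prefix, as described above, can be generated with a short Python sketch. The function name is hypothetical.

```python
def xk_index_sequence(point_size, cp_len=0):
    """XK_INDEX values for one unloaded frame in natural order."""
    if not 0 <= cp_len < point_size:
        raise ValueError("cyclic prefix length must be 0 to point_size - 1")
    # The prefix repeats the last cp_len samples of the frame ...
    prefix = list(range(point_size - cp_len, point_size))
    # ... and is followed by the complete frame.
    return prefix + list(range(point_size))


print(xk_index_sequence(8, cp_len=3))  # [5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
```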
TVALID is driven by the Master component to show that it has data to transfer, and TREADY
is driven by the Slave component to show that it is ready to accept data. When both TVALID
and TREADY are High, a transfer takes place. Points A in the diagram show clock cycles
where no data is transferred because neither the Master nor the Slave is ready. Point B shows
two clock cycles where data is not transferred because the Master does not have any data
to transfer. This is known as a Master Waitstate. Point C shows a clock cycle where no data
is transferred because the Slave is not ready to accept data. This is known as a Slave
Waitstate. Master and Slave waitstates can extend for any number of clock cycles.
[Figure: AXI4-Stream handshake waveform showing ACLK, TVALID, TREADY, and TDATA (D1 to D8), with points A, B, and C marked]
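The TVALID/TREADY rule described above can be expressed as a small per-cycle classifier. This is a behavioral sketch in Python with hypothetical names and made-up signal values, not a model of the core.

```python
def classify_cycles(tvalid, tready):
    """Label each clock cycle of an AXI4-Stream channel."""
    labels = []
    for valid, ready in zip(tvalid, tready):
        if valid and ready:
            labels.append("transfer")
        elif valid:
            labels.append("slave waitstate")    # data offered, slave not ready
        elif ready:
            labels.append("master waitstate")   # slave ready, no data offered
        else:
            labels.append("both waiting")       # neither side is ready
    return labels


tvalid = [0, 1, 1, 0, 1]
tready = [0, 1, 0, 1, 1]
for cycle, label in enumerate(classify_cycles(tvalid, tready)):
    print(cycle, label)
```

Only the cycles labeled "transfer" move a sample, so the number of such cycles must reach the point size before a frame is fully loaded or unloaded.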
To load a frame into the core, the upstream master supplying the XN_RE and XN_IM data
has to send it when it is ready. If the core can accept it (which is when
s_axis_data_tready = 1) then it is buffered by the core until it can be processed. If the
core cannot accept it (which is when s_axis_data_tready = 0), a slave waitstate exists in
the AXI channel and the master is stalled. Figure 3-40 shows the loading of the sample data
for an 8 point FFT. The upstream master drives TVALID and the core drives TREADY. In this
case, both the master and the core insert waitstates.
Unloading a frame works in a similar manner, except that the core is the master in this case.
When it has XK_RE and XK_IM data to unload, it asserts its TVALID signal
(m_axis_data_tvalid = 1). The downstream slave that consumes the processed sample
data can then accept the data (m_axis_data_tready = 1) or not
(m_axis_data_tready = 0). Figure 3-40 also shows the unloading of the sample data for
an 8 point FFT (with no cyclic prefix). The core drives TVALID and the downstream slave
drives TREADY. In this case, both the core and the slave insert waitstates.
The previous description only applies when the core is configured to use Non-Realtime
mode. The situation is different in Realtime mode, which is used to create a smaller and
faster design at the expense of flexibility in loading and unloading data. When the core is
configured to use Realtime mode, the following occurs:
The first two points mean that neither the downstream slave that consumes processed data,
nor the downstream slave that consumes status information, can insert waitstates using
TREADY (m_axis_data_tready and m_axis_status_tready, respectively) as the pins
are not present on the core. Both slaves must be able to respond immediately on every
clock cycle where the core is producing data (m_axis_data_tvalid asserted High or
m_axis_status_tvalid asserted High). If the slave cannot respond immediately, then
data is lost.
Figure 3-41 shows the upstream master inserting waitstates while loading an 8 point frame
in Realtime mode. At point A, the master has sent one sample to the Data Input channel.
The core then inserts a waitstate while it waits for the FFT processing core to start the
transform. This is shown as one cycle here, but it could be longer in certain cases. At point
B, the master inserts two waitstates using TVALID. However, the core ignores them and uses
the previous data (D3) for the missing data. It is likely that the processed frame will be
corrupted.
At point C, the master starts supplying the last samples of the frame (D7 and later D8) but
the core has already started processing the frame and inserts a waitstate. The Master and
the core are now out of synchronization. When the core finishes processing the frame and
is ready for a new frame, it sees D7 as the first symbol of the new frame and starts to
consume another 8 samples.
[Figure 3-41: waveform showing ACLK, TVALID, TREADY, and TDATA (D1 to D7), with points A, B, and C marked]
IMPORTANT: Select Realtime mode only when the external masters and slaves can meet the
timing requirements on supplying and consuming data.
Transform Timing
The core starts to process a frame as soon as (a) the upstream master supplies data to
process and (b) the core is able to do so. The chosen architecture and cyclic prefix insertion
are the major configuration options that affect when the core is able to process a new
frame.
The following timing diagrams are generalizations of actual behavior used to show the
broad phases the core moves through when processing frames, and how these phases can
(or cannot) overlap. The lengths of the various phases are not to scale, and the processing
time might be much longer than the time required to input or output a frame.
In particular, the behavior of TREADY on the input data channel is not fully accurate because
the Data Input channel buffers the data (16 symbols in Non-Realtime mode and 1 symbol in
Realtime mode). However, this data waits in the buffer until the FFT processing core is ready
for it. The Data Input channel TREADY in these diagrams is used as an indication of when
the FFT processing core wants data rather than when the AXI channel (with its buffer) wants
data.
Figure 3-42 shows the general timing for back-to-back frames in the Pipelined Streaming
architecture.
[Waveform: s_axis_data_tvalid, s_axis_data_tready, m_axis_data_tvalid, m_axis_data_tready]
Figure 3-42: Transform Timing for Entire Frames in Pipelined Streaming I/O with no Cyclic Prefix Insertion
Note that there is a latency between a frame being loaded and the processed data for that
frame being available. This latency depends on the options chosen in the Vivado IDE to
parameterize the core. However, when that latency has passed, processed frames appear
back-to-back.
[Waveform: s_axis_data_tvalid, s_axis_data_tready, m_axis_data_tvalid, m_axis_data_tready]
Figure 3-43: Transform Timing for Entire Frames in Pipelined Streaming I/O with Cyclic Prefix Insertion
Note: This refers to the FFT processing core. As the Data In channel has a 16 element deep buffer on
its input, it can start to pre-buffer a frame while a frame is still being processed. In the case of 8 and
16 point FFTs, it can pre-buffer entire frames. However, this buffered data waits in the buffer until the
FFT engine has finished dealing with the current frame.
When bit-reversed outputs are used, the core only unloads data when a new frame is
loaded. This means that the loading of frame N+1 overlaps with (and actually causes) the
unloading of frame N. However, if the upstream master does not supply data to the core
when the core is ready to start unloading a frame, the core flushes the frame out by itself. If
this occurs, the loading and unloading phases do not overlap.
Figure 3-44 shows the general transform timing for a Burst I/O architecture with natural
ordered outputs. This requires distinct load, process and unload phases. The upstream
master is constantly attempting to stream data as is the downstream slave. These examples
do not show the effect of a cyclic prefix, which is to extend the unloading phase.
The Upstream Master loads all of the data for Frame A into the Data Input channel of the
FFT. As the FFT is loading this data to process it, the buffer in the channel never fills.
However, the master immediately starts sending data for Frame B. At point A in the
waveform, the buffer in the Data Input channel fills, because the FFT is processing frame A
and no longer draining the buffer. This can be seen externally as s_axis_data_tready
going Low. The Data Input channel remains in a slave waitstate situation, where the FFT
cannot accept data from the upstream Master, until point B. Now the FFT has unloaded
frame A and started loading Frame B into the processing core. This drains the buffer in the
Data Input channel, which unblocks the Upstream Master and allows it to send the
remaining data for Frame B. The situation then repeats itself with Frame C.
1. Activity on the AXI interface to the Data Input channel does not necessarily correlate to
the activity inside the FFT. For example, just before point A, the channel loads sample
data for frame B yet the FFT is internally processing Frame A.
2. The Upstream Master cannot always stream frame data without reference to
s_axis_data_tready.
3. The FFT unloads a frame before loading the subsequent frame.
Figure 3-45 is similar to Figure 3-44, except that the FFT is configured to have bit reversed
outputs. As the upstream master is always supplying data, the loading and unloading of
frames can overlap.
Figure 3-46 is similar to Figure 3-45, except that the upstream master does not supply data
for Frame B until the core has started flushing out Frame A. As the core has already started
flushing Frame A, it completes this before loading Frame B. The loading and unloading of
frames do not overlap.
In this example, s_axis_data_tready remains High at Point A. Loading Frame A into the
core drained the buffer in the Data Input channel, and because the Upstream Master did not
send any new data, the buffer is empty. The core is ready to accept new frame data at point
A although it is not able to do anything with it at this point. At point B the Upstream Master
starts to send data from Frame B. This fills the buffer in the Data Input channel, but because
the core is committed to flushing Frame A, the buffer fills and the core stalls the Upstream
Master with waitstates. At point C, the core has started loading Frame B to process it, so the
buffer drains and more data can be accepted to finish off Frame B.
The key difference between the situation in Figure 3-45 and Figure 3-46 is that the master
in Figure 3-45 has provided new frame data during the processing phase of the previous
frame. As a result, the core knows there is a new frame coming so when processing finishes,
it starts to load the new frame as this flushes the old frame out. In Figure 3-46, the master
did not provide data (and therefore did not tell the core that there would be a new frame)
during the processing phase, so when the core finishes processing the frame, it moves to a
flushing phase where it is no longer possible to load a new frame. Even if the master
provides a sample for the new frame a cycle after unloading has begun, that sample is not
loaded until the core is finished unloading the old frame.
[Waveform: s_axis_data_tvalid, s_axis_data_tready (points A, B, A, with "Waiting on FFT to accept more data" intervals), m_axis_data_tvalid, m_axis_data_tready. FFT phases: Load Frame A, Process Frame A, Unload Frame A, Load Frame B, Process Frame B, Unload Frame B]
Figure 3-44: Transform Timing for Entire Frames in Burst I/O Mode with Natural Ordered Outputs
[Waveform: s_axis_data_tvalid, s_axis_data_tready (points A, B, A, with "Waiting on FFT to accept more data" intervals), m_axis_data_tvalid, m_axis_data_tready. FFT phases: Load Frame A, Process Frame A, Unload A ~ Load B, Process Frame B, Unload B ~ Load C]
Figure 3-45: Transform Timing for Entire Frames in Burst I/O Mode with Bit-Reversed Outputs
[Waveform: s_axis_data_tvalid, s_axis_data_tready (points A, B, C), m_axis_data_tvalid, m_axis_data_tready. FFT phases: Load Frame A, Process Frame A, Flush Frame A, Load Frame B, Process Frame B, Flush Frame B]
Figure 3-46: Transform Timing for Entire Frames in Burst I/O Mode with Bit-Reversed Outputs (Core Flushes Frame)
The process of applying configuration data to a particular frame depends on the current
status of the core:
1. To apply a configuration to the very first frame after power on or after an idle period
2. To apply the configuration to the next frame in a sequence of frames
To ensure that the configuration data is applied before the frame is processed, write the
configuration information to the Configuration channel so that the write completes at least
one clock cycle before the first write to the Data Input channel. Failure to do so can result
in the frame being processed with the previous configuration options in use.
Perhaps the easiest way to satisfy this in a system context is to configure the core before
enabling the upstream data master.
This signal is asserted High when the core starts to load data for a frame into the FFT
processing core. This is a known safe point to send configuration information for the next
frame. Configuration data sent after this might or might not be applied to the subsequent
frame, depending on the frame size and the latency between event_frame_started
asserting and the configuration write occurring.
1. A Pipelined Streaming FFT is processing frames and the transform size (NFFT) is
changed.
2. A Burst I/O core with bit reversed outputs is processing a frame, and the master supplies
frame data in time to avoid the core automatically flushing the frame, and the transform
size (NFFT) is changed.
Both the Pipelined Streaming architecture and the Burst I/O architectures (when bit
reversed outputs are used) implement pipelining to achieve better throughput. In the case
of the Pipelined Streaming architecture, it pipelines the loading, processing and unloading
of entire frames (see Figure 3-42). In Burst I/O architectures when bit reversed outputs are
used, the core implements a partial pipeline to overlap the loading of one frame with the
unloading of another (see Figure 3-45).
However, a change to the transform size can only be applied when the pipeline is empty.
Changing the transform size when the pipeline is not empty would result in data loss, so the
core prevents this. When new configuration information is sent to the Configuration
channel, and that information contains a change in transform size, the core does not load
more frames until all frames already in the pipeline are processed and unloaded.
This is all handled automatically by the core, allowing you to send the configuration
information at any time. However, throughput drops until the pipeline is fully flushed. This
behavior only occurs if the transform size is to change. All other configuration options can
be applied without waiting for the core pipeline to empty.
• Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
[Ref 7]
• Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 8]
• Vivado Design Suite User Guide: Getting Started (UG910) [Ref 9]
• Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 10]
If you are customizing and generating the core in the Vivado IP integrator, see the Vivado
Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994) [Ref 7] for
detailed information. IP integrator might auto-compute certain configuration values when
validating or generating the design. To check whether the values do change, see the
description of the parameter in this chapter. To view the parameter value you can run the
validate_bd_design command in the Tcl Console.
You can customize the IP for use in your design by specifying values for the various
parameters associated with the IP core using the following steps:
For details, see the Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 8] and
the Vivado Design Suite User Guide: Getting Started (UG910) [Ref 9].
The Vivado Integrated Design Environment (IDE) provides several FFT core customization
screens with fields to set the parameter values for the particular instantiation required. A
description of each field follows:
• Component Name: The name of the core component to be instantiated. The name
must begin with a letter and be composed of the following characters: a to z, A to Z, 0
to 9, and “_”.
Configuration Tab
• Channels: Select the number of channels from 1 to 12. Multichannel operation is
available for the three Burst I/O architectures.
• Transform Length: Select the desired point size. All powers of two from 8 to 65536 are
available.
• Implementation Options: Select an implementation option, as described in
Architecture Options.
° The Pipelined Streaming I/O, Radix-2 Burst I/O, and Radix-2 Lite Burst I/O
architectures support point sizes 8 to 65536.
° Check Automatically Select to choose the smallest implementation that meets the
specified Target Data Throughput, provided the specified Target Clock Frequency is
achieved when the FFT core is implemented on an FPGA.
° Target Clock Frequency and Target Data Throughput are only used to automatically
select an implementation and to calculate latency. The core is not guaranteed to run
at the specified target clock frequency or target data throughput.
• Transform Length Options: Select the transform length to be run time configurable or
not. The core uses fewer logic resources and has a faster maximum clock speed when
the transform length is not run time configurable.
Implementation Tab
• Data Format: Select whether the input and output data samples are in Fixed-Point
format, or in IEEE-754 single precision (32-bit) Floating-Point format. Floating-Point
format is not available when the core is in a multichannel configuration.
• Precision Options: Input data and phase factors can be independently configured to
widths from 8 to 34 bits, inclusive. When the Data Format is Floating-Point, the input
data width is fixed at 32 bits and the phase factor width can be set to 24 or 25 bits
depending on the noise performance required and available resources.
° Unscaled
- All integer bit growth is carried to the output. This can use more FPGA
resources.
° Scaled
- A user-defined scaling schedule determines how data is scaled between FFT
stages.
° Block Floating-Point
- The core determines how much scaling is necessary to make best use of
available dynamic range, and reports the scaling factor as a block exponent.
• Control Signals: Clock Enable (aclken) and Synchronous Clear (aresetn) are
optional pins. Synchronous Clear overrides Clock Enable if both are selected. If an
option is not selected, some logic resources can be saved and a higher clock frequency
might be attainable.
• Optional Output Fields: XK_INDEX is an optional field in the Data Output Channel.
OVFLO is an optional field in both the Data Output channel and Status Channel.
• Throttle Schemes: Select trade-off between performance and data timing
requirements. Realtime mode typically gives a smaller and faster design, but has strict
constraints on when data must be provided and consumed. Non-Realtime mode has no
such constraints, but the design might be larger and slower. See Controlling the FFT
Core for more details.
• Rounding Modes: At the output of the butterfly, the LSBs in the datapath need to be
trimmed. These bits can be truncated or rounded using convergent rounding, which is
an unbiased rounding scheme. When the fractional part of a number is equal to exactly
one-half, convergent rounding rounds up if the number is odd, and rounds down if the
number is even. Convergent rounding can be used to avoid the DC bias that would
otherwise be introduced by truncation after the butterfly stages. Selecting this option
increases slice usage and yields a small increase in transform time due to additional
latency.
• Output Ordering: Output data selections are either Bit/Digit Reversed Order or Natural
Order. The Radix-2 based architectures (Pipelined Streaming I/O, Radix-2 Burst I/O and
Radix-2 Lite Burst I/O) offer bit-reversed ordering, and the Radix-4 based architecture
(Radix-4 Burst I/O) offers digit-reversed ordering. For the Pipelined Streaming I/O
architecture, selecting natural order output ordering results in an increase in memory
used by the core. For Burst I/O architectures, selecting natural order output increases
the overall transform time because a separate unloading phase is required.
° Cyclic Prefix Insertion can be selected if the output ordering is Natural Order. Cyclic
Prefix Insertion is available for all architectures, and is typically used in OFDM
wireless communications systems.
° Data And Phase Factors (Burst I/O architectures): For Burst I/O architectures,
either block RAM or distributed RAM can be used for data and phase factor storage.
Data and phase factor storage can be in distributed RAM for all point sizes up to
and including 1024 points.
° Data And Phase Factors (Pipelined Streaming I/O): In the Pipelined Streaming I/O
solution, the data can be stored partially in block RAM and partially in distributed
RAM. Each pipeline stage, counting from the input side, uses smaller data and
phase factor memories than preceding stages. You can select the number of
pipeline stages that use block RAM for data and phase factor storage. Later stages
use distributed RAM. The default displayed on the IDE offers a good balance
between both. If output ordering is Natural Order, the memory used for the reorder
buffer can be either block RAM or distributed RAM. The reorder buffer can use
distributed RAM for point sizes less than or equal to 1024.
- When block floating-point is selected for the Pipelined Streaming I/O
architecture, a RAM buffer is required for natural order and bit reversed order
output data. In this case, the reorder buffer options remain available and
distributed RAM can be selected for all point sizes below 2048.
° Hybrid Memories: Where data, phase factor, or reorder buffer memories are stored
in block RAM, if the size of the memory is greater than one block RAM, the memory
can be constructed from a hybrid of block RAMs and distributed RAM, where the
majority of the data is stored in block RAMs and a few bits that are left over are
stored in distributed RAM. This Hybrid Memory is an alternative to constructing the
memory entirely from multiple block RAMs. It provides a reduction in the block
RAM count, at the cost of an increase in the number of slices used. Hybrid
Memories are only available when block RAM is used for one or more memories and
the number of slices required for a Hybrid Memory implementation is below an
internal threshold of 256 LUTs per memory. If these conditions are met, Hybrid
Memories are made available and can be selected.
• Optimize Options:
° Complex Multipliers: Three options are available for customization of the complex
multiplier implementation:
- Use CLB logic: All complex multipliers are constructed using slice logic. This is
appropriate for target applications that have low performance requirements, or
target devices that have few DSP slices.
- Use 3-multiplier structure (resource optimization): All complex multipliers
use a three real multiply, five add/subtract structure, where the multipliers use
DSP slices. This reduces the DSP slice count, but uses some slice logic. This
structure can make use of the DSP slice pre-adder to reduce or remove the need
for extra slice logic, and improve performance.
- Use 4-multiplier structure (performance optimization): All complex multipliers
use a four real multiply, two add/subtract structure, where the multipliers use
DSP slices. This gives the best clock performance at the cost of more DSP slices.
° Butterfly Arithmetic: Two options are available for customization of the butterfly
implementation:
- Use CLB logic: All butterfly stages are constructed using slice logic.
- Use XtremeDSP Slices: This option forces all butterfly stages to be
implemented using the adder/subtracters in DSP slices.
Information Tabs
• Implementation Details:
° Resource Estimates: Based on the options selected, this field displays the DSP slice
count and 18K block RAM numbers. The resource numbers are just an estimate. For
exact resource usage and slice/LUT-FlipFlop pair information, a
post-implementation utilization report should be consulted.
° AXI4-Stream Port Structure: This section shows how the FFT fields are mapped to
the AXI channels.
• Latency:
° This tab shows the latency of the FFT core in clock cycles and microseconds (μs) for
each point size supported. The latency is from the Upstream Master supplying the
first sample of a frame to the last sample of output data coming out of the core,
assuming that the FFT core was idle and neither the Upstream Master nor the
Downstream Slave inserted wait states. This is not the minimum number of cycles
between starting consecutive frames, as frames might overlap in some cases. The
latency in microseconds is based on the target clock frequency.
User Parameters
Table 4-1 shows the relationship between the fields in the Vivado IDE and the User
Parameters (which can be viewed in the Tcl Console).
Notes:
1. Parameter values are listed in the table where the IDE parameter value differs from the user parameter value. Such
values are shown in this table as indented below the associated parameter.
Output Generation
For details, see the Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 8].
Tab 1: Basic
The Basic tab is used to specify the transform configuration and architecture in a similar
way to page 1 of the Vivado IDE.
System Generator supports only single-channel implementation of the FFT and, hence,
Channels is not available as a GUI option.
Tab 2: Advanced
The Advanced tab is used to specify phase factor precision, scaling, rounding, optional
output fields, throttle scheme, and optional port options in a similar way to page 2 of the
Vivado IDE.
System Generator can optionally shorten the AXI4-Stream signal names on the symbol by
removing the m_axis_ or s_axis_ prefixes.
System Generator automatically sets the Input Data Width parameter based on the signal
properties of the XN_RE and XN_IM ports.
Tab 3: Implementation
The Implementation tab is used to specify memory and optimization options in a similar
way to page 3 of the Vivado IDE.
• Number of stages using block RAM: Specifies the number of stages for the Pipelined
Streaming I/O architecture that uses block RAM for data and phase factor storage. As
dynamic list boxes are not offered with the System Generator GUI, this option displays
the full range (0 to 11) selection, but allows you to select only valid values as visible in
the Vivado IDE.
• FPGA Area Estimation: See the System Generator for DSP User Guide (UG640) [Ref 11]
for detailed information about this option.
Required Constraints
This section is not applicable for this IP core.
Clock Frequencies
This section is not applicable for this IP core.
Clock Management
This section is not applicable for this IP core.
Clock Placement
This section is not applicable for this IP core.
Banking
This section is not applicable for this IP core.
Transceiver Placement
This section is not applicable for this IP core.
Simulation
For comprehensive information about Vivado simulation components, as well as
information about using supported third-party tools, see the Vivado Design Suite User
Guide: Logic Simulation (UG900) [Ref 10].
IMPORTANT: For cores targeting 7 series or Zynq-7000 devices, UNIFAST libraries are not supported.
Xilinx IP is tested and qualified with UNISIM libraries only.
C Model
The Xilinx® LogiCORE™ IP Fast Fourier Transform (FFT) core has a bit-accurate C model
designed for system modeling. A MATLAB® software MEX function for seamless MATLAB
software integration is also available.
Features
• Bit accurate with FFT core
• Dynamic link library
• Available for 64-bit Linux and 64-bit Windows platforms
• MATLAB software MEX function
• Supports all features of the FFT core that affect numerical results
• Designed for rapid integration into a larger system model
• Example C++ and M code showing how to use the function is provided
Overview
The Xilinx LogiCORE IP FFT has a bit-accurate C model for 64-bit Linux and 64-bit Windows
platforms. The model has an interface consisting of a set of C functions, which resides in a
dynamic link library (shared library). Full details of the interface are given in FFT C Model
Interface. An example piece of C++ code showing how to call the model is provided. The
model is also available as a MATLAB software MEX function for seamless MATLAB software
integration.
The model is bit accurate but not cycle accurate, so it produces exactly the same output
data as the core on a frame-by-frame basis. However, it does not model the core latency or
its interface signals. The C model is an optional output of the Vivado® Design Suite. For
information about generating IP source outputs, see the Vivado Design Suite User Guide:
Designing with IP (UG896) [Ref 8].
For example, for the FFT IP (v9.1) in a project called myproj with the default configuration,
the lin64 C model can be found in
myproj/myproj.gen/sources_1/ip/xfft_0/cmodel/xfft_v9_1_bitacc_cmodel_lin64.zip
The C model is available if the IP is generated when added as a design source. If the IP is
added as part of a block design, the C model is not available.
Unzip the FFT C model zip file. This produces the directory structure and files shown in
Table 5-1.
Installation
On Linux, ensure that the directory in which the files
libIp_xfft_v9_1_bitacc_cmodel.so and libgmp.so.11 are located is in your
$LD_LIBRARY_PATH environment variable.
The C model is used through three functions, declared in the header file
xfft_v9_1_bitacc_cmodel.h:
struct xilinx_ip_xfft_v9_1_state* xilinx_ip_xfft_v9_1_create_state
(
  struct xilinx_ip_xfft_v9_1_generics generics
);
int xilinx_ip_xfft_v9_1_bitacc_simulate
(
  struct xilinx_ip_xfft_v9_1_state* state,
  struct xilinx_ip_xfft_v9_1_inputs inputs,
  struct xilinx_ip_xfft_v9_1_outputs* outputs
);
void xilinx_ip_xfft_v9_1_destroy_state
(
  struct xilinx_ip_xfft_v9_1_state* state
);
To use the model, first create a state structure using the first function,
xilinx_ip_xfft_v9_1_create_state. Then run the model using the second function,
xilinx_ip_xfft_v9_1_bitacc_simulate, passing the state structure, an inputs
structure, and an outputs structure to the function. Finally, free up memory allocated for the
state structure using the third function, xilinx_ip_xfft_v9_1_destroy_state. Each
of these functions is described fully in the following sections.
Note: C_CHANNELS is not a generic used in the C model. The model is always single channel. To
model multiple channels in a multichannel FFT, see Modeling Multichannel FFTs.
The notes under the following headings apply to the inputs structure.
• General
• FFTs with Fixed-Point Interface
• FFTs with Floating-Point Interface
General
1. You are responsible for allocating memory for arrays in the inputs structure.
2. nfft input is only used with run time configurable transform length (that is,
C_HAS_NFFT = 1). If the transform length is fixed (C_HAS_NFFT = 0), C_NFFT_MAX is
used for nfft. In this case, nfft should be equal to C_NFFT_MAX, and a warning is
printed if it is not (but the model continues, using C_NFFT_MAX for nfft and ignoring
the nfft value in the inputs structure).
3. xn_re and xn_im must have 2^nfft elements. xn_re_size and xn_im_size must be
set to 2^nfft.
4. xn_re and xn_im can be in natural or bit/digit-reversed sample index order. The
C model produces samples in the same ordering format as they were input.
scaling_sch must have a number of elements equal to the number of stages in the FFT.
The number of stages depends on the architecture and on nfft (log2 of the transform
point size):
a. Radix-4, Burst I/O (C_ARCH = 1) or Pipelined, Streaming I/O (C_ARCH = 3):
stages = ceil(nfft/2)
b. Radix-2, Burst I/O (C_ARCH = 2) or Radix-2 Lite, Burst I/O (C_ARCH = 4):
stages = nfft
5. If C_HAS_NFFT = 0, C_NFFT_MAX is used for nfft. The scaling in each stage is an
integer in the range 0-3, which indicates the number of bits the intermediate result is
shifted right. So 0 indicates no scaling, 1 indicates a division by 2, 2 indicates a division
by 4, and 3 indicates a division by 8. Again, scaling_sch[0] is the scaling in the first
stage, scaling_sch[1] the scaling in the second stage, and so on. Insufficiently large
scaling results in overflow, indicated by the overflow output.
The notes under the following headings apply to the outputs structure.
• General
General
1. You are responsible for allocating memory for the outputs structure and for arrays in the
outputs structure.
2. xk_re and xk_im must have at least 2^nfft elements. You must set xk_re_size and
xk_im_size to indicate the number of elements in xk_re and xk_im before calling
the FFT function. On exit, xk_re_size and xk_im_size are set to the number of
elements that contain valid output data in xk_re and xk_im.
3. The C model produces data in the same ordering format as the input data. Hence, if
xn_re and xn_im were provided in natural sample index order (0,1,2,3...), xk_re and
xk_im samples will also be in natural sample index order.
4. If overflow occurred with the Pipelined, Streaming I/O architecture (C_ARCH = 3) due to
differences between the FFT core and the model in the order of operations within the
processing stage, the data in xk_re and xk_im might not match the XK_RE and XK_IM
outputs of the FFT core. The xk_re and xk_im data must be ignored if the overflow
output is 1. This is the only case where the model is not entirely bit accurate to the core.
If the generics of the core need to be changed, destroy the existing state structure and
create a new state structure using the new generics. There is no way to change the generics
of an existing state structure.
The example code can be used to test your compilation process. See Compiling with the FFT
C Model.
Linux
To compile the example code, run_bitacc_cmodel.c, first ensure that the directory in
which the files libIp_xfft_v9_1_bitacc_cmodel.so and libgmp.so.11 are
located is present on your $LD_LIBRARY_PATH environment variable. These shared libraries
are referenced during the compilation and linking process.
Place the header file and C++ source file in a single directory. Then in that directory,
compile using the GNU C++ Compiler:
Windows
When compiling on Windows, the symbol NT must be defined either by a compiler option
or in user source code before the xfft_v9_1_bitacc_cmodel.h header file is included.
The FFT C model does not support the LCC compiler shipped with MATLAB software.
Xilinx has verified that GCC version 4.1.1 can successfully be used to build the MEX function
on 64-bit Linux.
The FFT MEX function is called xfft_v9_1_bitacc_mex. Enter this function name without
arguments at the MATLAB software command line to see usage information. The FFT MEX
function syntax is:
1. nfft input is only used for run time configurable transform length (that is,
generics.C_HAS_NFFT = 1). It is ignored otherwise and generics.C_NFFT_MAX is used
instead.
2. For fixed-point input FFTs (that is, generics.C_USE_FLT_PT = 0), to ensure identical
numerical behavior to the hardware, pre-quantize the input_data values to have
precision determined by C_INPUT_WIDTH. This is achieved using the MATLAB software
built-in quantize function.
3. scaling_sch input is only used for a fixed-point input, scaled FFT (that is,
generics.C_USE_FLT_PT = 0, generics.C_HAS_SCALING = 1, and
generics.C_HAS_BFP = 0). It is ignored otherwise.
4. input_data can be in natural or bit/digit-reversed sample index order. The MEX
function produces samples in the same ordering format as they were input.
The notes under the following headings apply to the MEX function outputs.
• General
• FFTs with Fixed-Point Interface
• FFTs with Floating-Point Interface
General
1. There is no need to create and destroy state, as must be done with the C model; this is
handled internally by the FFT MEX function.
2. The FFT MEX function performs extensive checking of its inputs. Any invalid input results
in a message reporting the error and the function terminates.
3. The MEX function produces data in the same order as the input data. Hence, if
input_data was provided in natural sample index order (0,1,2,3...), output_data samples
will also be in natural sample index order.
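4. If overflow occurred with the Pipelined, Streaming I/O architecture (C_ARCH = 3) due to
differences between the FFT core and the model in the order of operations within the
processing stage, the output data might not match the XK_RE and XK_IM outputs of the
FFT core. The output data must be ignored if the overflow output is 1. This is the only
case where the model is not entirely bit accurate to the core.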
The example code can be used to test your MEX function compilation process. See Building
the MEX Function.
For the FFT C model, the example C++ code provided, run_bitacc_cmodel.c,
demonstrates how to model a multichannel FFT. This example code creates the FFT state
structure, then uses a loop to run the model on each channel's input data in turn, then
finally destroys the state structure. For the FFT MEX function, call the function on the input
data of each channel in turn.
Dependent Libraries
The C model uses MPIR libraries. Pre-compiled MPIR libraries are provided with the C
model, using the following versions of the libraries:
• MPIR 2.6.0
Because MPIR is a compatible alternative to GMP, the GMP library can be used in place of
MPIR. It is possible to use GMP or MPIR libraries from other sources, for example, compiled
from source code.
GMP and MPIR in particular contain many low level optimizations for specific processors.
The libraries provided are compiled for a generic processor on each platform, not using
optimized processor-specific code. These libraries work on any processor, but run more
slowly than libraries compiled to use optimized processor-specific code. For the fastest
performance, compile libraries from source on the machine on which you run the
executables.
Source code and compilation scripts are provided for the version of MPIR that was used to
compile the provided libraries. Source code and compilation scripts for any version of the
libraries can be obtained from the GMP [Ref 12] and MPIR [Ref 13] web sites.
Note: If compiling MPIR using its configure script (for example, on Linux platforms), use the
--enable-gmpcompat option when running the configure script. This generates a libgmp.so
library and a gmp.h header file that provide full compatibility with the GMP library.
Test Bench
This chapter contains information about the test bench provided in the Vivado® Design
Suite.
The demonstration test bench source code is one VHDL file:
demo_tb/tb_<component_name>.vhd in the Vivado output directory. The source code is
comprehensively commented.
The demonstration test bench drives the core input signals to demonstrate the features and
modes of operation of the core. This includes performing an FFT on a pre-generated input
data frame. The input data frame consists of a complex sinusoid with a frequency of 2.6
cycles per frame. The FFT of this input frame is a peak centered between output samples
2 and 3. For FFTs with a maximum point size of 64 or greater, the input data is modified by
adding a second complex sinusoid with a frequency of 23.2 cycles per frame and a
quarter of the magnitude of the first sinusoid. This modifies the FFT by adding a smaller
peak centered between output samples 23 and 24. The test bench captures this output frame
and uses it as the input frame for an inverse transform. The output of this inverse transform
is therefore the same as the original input frame (modified by the scaling and finite
precision effects of the FFT core).
The operations performed by the demonstration test bench are appropriate for the
configuration of the generated core, and are a subset of the following operations:
Input data is pre-generated in the create_ip_table function and stored in the IP_DATA
constant. New input data frames can be added by defining new functions and constants.
Make sure that each input data frame is of the T_IP_TABLE array type.
All operations performed by the demonstration test bench to drive the core's inputs are
done in the data_stimuli process. This process also contains procedures to simplify
driving a frame of input data. Configuration is requested in this process by setting cfg_*
signals to the desired configuration and setting the do_config shared variable to either
IMMEDIATE or AFTER_START. The configuration signals are actually driven by the
config_stimuli process.
The clock frequency of the core can be modified by changing the CLOCK_PERIOD constant.
Upgrading
This appendix contains information about migrating a design from the ISE® Design Suite to
the Vivado® Design Suite, and for upgrading to a more recent version of the IP core. For
customers upgrading in the Vivado Design Suite, important details (where applicable)
about any port changes and other impact to user logic are included.
Parameter Changes
There are no parameter changes between versions 9.0 and 9.1.
Port Changes
There are no port changes between versions 9.0 and 9.1.
Functionality Changes
Latency Changes
The majority of configurations in version 9.1 have unchanged latency compared to version
9.0.
The exception is the Pipelined Streaming I/O architecture when configured in Block
Floating-Point scaling mode. In this case, the latency will increase in version 9.1 by 1 cycle
compared to version 9.0.
Bit accuracy is also affected for the Pipelined Streaming I/O architecture when configured
in Block Floating-Point scaling mode. In this case, the output data may no longer be bit
accurate with previous versions due to the introduction of a rounding stage. The IP remains
bit accurate with the C model; however, user test vectors generated from earlier versions of
the C model or IP may require updating to match the new behavior.
Debugging
This appendix includes details about resources available on the Xilinx Support website and
debugging tools.
Documentation
This product guide is the main document associated with the Fast Fourier Transform core.
This guide, along with documentation related to all products that aid in the design process,
can be found on the Xilinx Support web page or by using the Xilinx® Documentation
Navigator.
Download the Xilinx Documentation Navigator from the Downloads page. For more
information about this tool and the features available, open the online help after
installation.
Answer Records
Answer Records include information about commonly encountered problems, helpful
information on how to resolve these problems, and any known issues with a Xilinx product.
Answer Records are created and maintained daily ensuring that users have access to the
most accurate information available.
Answer Records for this core can be located by using the Search Support box on the main
Xilinx support web page. To maximize your search results, use keywords such as:
• Product name
• Tool message(s)
• Summary of the issue encountered
A filter search is available after results are returned to further target the results.
AR: 54501
Technical Support
Xilinx provides technical support at the Xilinx Support web page for this LogiCORE™ IP
product when used as described in the product documentation. Xilinx cannot guarantee
timing, functionality, or support if you do any of the following:
• Implement the solution in devices that are not defined in the documentation.
• Customize the solution beyond that allowed in the product documentation.
• Change any section of the design labeled DO NOT MODIFY.
To contact Xilinx Technical Support, navigate to the Xilinx Support web page.
Debug Tools
There are tools available to address Fast Fourier Transform design issues. It is important to
know which tools are useful for debugging various situations.
The Vivado logic analyzer is used with the logic debug IP cores, including:
See the Vivado Design Suite User Guide: Programming and Debugging (UG908) [Ref 15].
Reference Boards
Various Xilinx development boards support the Fast Fourier Transform core. These boards
can be used to prototype designs and establish that the core can communicate with the
system.
C Model Reference
See Chapter 5, C Model in this guide for tips and instructions for using the C Model files
provided to debug your design.
Simulation Debug
The simulation debug flow for Mentor Graphics Questa Advanced Simulator is illustrated in
Figure B-1. A similar approach can be used with other simulators.
Figure B-1: Questa Advanced Simulator simulation debug flow (figure not reproduced)
Xilinx Resources
For support resources such as Answers, Documentation, Downloads, and Forums, see
Xilinx Support.
• From the Vivado® IDE, select Help > Documentation and Tutorials.
• On Windows, select Start > All Programs > Xilinx Design Tools > DocNav.
• At the Linux command prompt, enter docnav.
Xilinx Design Hubs provide links to documentation organized by design tasks and other
topics, which you can use to learn key concepts and address frequently asked questions. To
access the Design Hubs:
• In the Xilinx Documentation Navigator, click the Design Hubs View tab.
• On the Xilinx website, see the Design Hubs page.
Note: For more information on Documentation Navigator, see the Documentation Navigator page
on the Xilinx website.
References
These documents provide supplemental material useful with this product guide:
Revision History
The following table shows the revision history for this document.