BIST-Based Test and Diagnosis of FPGA Logic Blocks¹
Miron Abramovici
Bell Labs - Lucent Technologies
Murray Hill, NJ
Charles Stroud²
Dept. of Electrical and Computer Engineering
University of North Carolina at Charlotte
Keywords: Built-In Self-Test, FPGA testing, FPGA diagnosis, fault-tolerance, reconfigurable
systems
Abstract: We present a Built-In Self-Test (BIST) approach able to detect and accurately diagnose all
single and practically all multiple faulty programmable logic blocks (PLBs) in Field Programmable Gate
Arrays (FPGAs) with maximum diagnostic resolution. Unlike conventional BIST, FPGA BIST does not
involve any area overhead or performance degradation. We also identify and solve the problem of testing
configuration multiplexers, which was either ignored or incorrectly solved in most previous work. We introduce the first diagnosis method for multiple faulty PLBs; for any faulty PLB, we also identify its internal
faulty modules or modes of operation. Our accurate diagnosis provides the basis both for failure analysis
used for yield improvement and for any repair strategy used for fault tolerance in reconfigurable systems.
We present experimental results showing detection and identification of faulty PLBs in actual defective
FPGAs. Our BIST architecture is easily scalable.
1. Introduction
An FPGA consists of an N×N array of programmable logic blocks (PLBs) and programmable I/O
blocks, connected by a programmable interconnect network. In this paper, we consider RAM-based
FPGAs, which are programmed by writing an on-chip configuration memory. FPGA manufacturing tests
should detect all the faults affecting every possible mode of operation of its PLBs, and also detect all the
faults affecting its interconnect network. Usually the logic and the interconnect are separately tested. The
goal of “detecting a faulty PLB” implicitly assumes that a defective PLB may have multiple internal faults.
The usual assumption in logic test is that the FPGA under test has at most one faulty PLB. However, multiple faulty PLBs may exist in newly manufactured FPGAs or may appear in FPGAs used in long missions
in harsh environments. Interconnect testing targets faults such as shorts, opens, and programmable
switches stuck-on and stuck-off.
¹ This material is based upon work supported in part by the National Science Foundation under Grant No. MIP-9409682, by
the DARPA ACS program under contract F33615-98-C-1318, by the Univ. of Kentucky Center for Robotics and Manufacturing Systems, and by the Microelectronics Group of Lucent Technologies.
² Formerly with the Dept. of Electrical Engineering, University of Kentucky.
Most previous methods for FPGA testing, both for PLBs [22][17][21][18][19][32][33][20] and for
interconnect [25][30][31][9], rely on externally applied vectors, hence they are applicable only for device-level testing. Therefore, while these tests can be used for manufacturing test, they are not applicable to in-system test or to fault-tolerant system applications. In contrast, BIST-based methods [37][38][39][7][40]
[41][42][28][13][15][4] can also be reused for board and system-level testing; their reuse reduces the effort
involved in developing system diagnostic routines to test FPGAs in their system mode of operation. Since
BIST deals with every FPGA in isolation, it also provides a simple solution to the problem of locating
faulty FPGAs in the system. Of course, BIST is more difficult to implement, because, unlike in external
testing, we cannot rely on a fault-free tester to provide vectors and analyze results. Following the approach
introduced in [37][38], BIST methods configure one part of the FPGA to be under test, and the other part
to generate vectors for, and to analyze the results from, the subcircuits under test; then the resources of the
FPGA change roles so that the entire FPGA is eventually tested. BIST techniques have also been applied
to on-line FPGA testing [34][2][3], but in this paper we discuss only off-line testing. Other on-line FPGA
test methods rely on redundant design techniques [6][10].
Both external-test and BIST methods involve multiple configurations of the FPGA. An FPGA that is
tested in-system is configured via its boundary-scan interface [35]. The data required to reconfigure the
FPGA under test are maintained within the test environment - automatic test equipment (ATE) for device
test, or CPU (or maintenance processor) for in-system test. To minimize the memory requirements as well
as the test time (which is dominated by the device programming time), the number of test configurations
should be kept to a minimum. BIST methods may require more configurations than ATE-based FPGA
tests.
Conventional BIST approaches introduce both area overhead and delay penalties; the latter may result
in speed degradation unacceptable in high-performance systems. In contrast, BIST for FPGAs - first introduced for testing PLBs [37][38] and then extended to testing programmable interconnect [41] - exploits
the reprogrammability of an FPGA to configure it exclusively with BIST logic during off-line testing. In
this way, for manufacturing test, testability is achieved without any cost, since the BIST logic “disappears”
when the FPGA is no longer under test. When BIST is complete, an FPGA tested in-system needs to be
reconfigured for its normal operation. For in-system test, the only cost is the additional memory required
for the BIST configurations. The results of our implementation will show that this cost is negligible.
Diagnosis consists of mapping an incorrect response from the circuit under test into the fault(s) that
can explain the obtained response. The required diagnostic resolution depends on the goal of the testing
process. Usually in system-level testing, the objective is to locate a replaceable defective component.
Thus, in-system identification of a faulty FPGA would be sufficient in this context. However, one can take
advantage of the reprogrammability and the regular structure of an FPGA to achieve fault tolerance by
repairing the FPGA in place. Here the goal is to assure that the defective chip will still correctly execute
its intended function. This is much more economical than replacing defective FPGAs, and it is an essential
feature in environments where device replacement is not feasible or practical, such as unmanned missions
or remote stations. It is interesting to note that in the Teramac custom computer [7], about 75% of the 864
FPGAs used in the system are defective and have been repaired by fault-tolerant reconfiguration. This process requires the accurate identification of faulty PLBs, which are bypassed and replaced with fault-free
unused cells by reprogramming the FPGA (if such cells are still available) [23][29][7][11]. The same resolution (to the level of a faulty PLB) is required for the “node-covering” fault-tolerant technique, where
the repair occurs after manufacturing testing and is application-independent and invisible to the user [14].
Another yield-enhancement technique [16] replaces an entire faulty row (or column) by a spare one, and
hence its resolution requirement is only to identify a faulty row (or column). When the goal of testing is
the improvement of the manufacturing process, then the most accurate resolution - locating faults inside a
PLB - is required to support subsequent failure analysis. The ability to locate defective modules inside a
PLB enables a new form of fault-tolerance that reuses the fault-free modules or fault-free modes of operation of partially defective PLBs [2][11]. Other fault-tolerance techniques [24] also rely on identifying
faults within defective PLBs (referred to as “clusters”) with the goal of reusing them.
Many FPGA test methods configure the FPGA to create iterative logic arrays (ILAs), composed of
identical cells connected serially as one-dimensional horizontal or vertical arrays [23][17][39][40][18][21]
[19][28][32][33][20] or 2-D arrays [22]. Sometimes separate ILAs are used to propagate errors from the
logic under test [39][40][20]. Constructing ILAs to test the RAM operation of PLBs is described in
[39][18][32][33]. ILAs are especially useful when they are C-testable, since then they can be completely
tested with a number of tests that does not depend on the number of cells in the ILA. An external-test ILA-based method has been used to detect multiple faulty PLBs [20]. ILA structures are also useful for diagnosing single faulty PLBs; typically, testing the horizontal (vertical) ILAs identifies a faulty row (column),
and the faulty PLB is located at the intersection of the faulty row and the faulty column [23][40][21][28].
However, such procedures may not be reliable in the presence of multiple faults, since a fault-identifying
signal may subsequently propagate through a defective block.
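The row/column intersection step can be illustrated with a short Python sketch. It is not taken from any of the cited methods; the function names and the assumption of ideal error propagation (every faulty PLB makes exactly its own row and column fail) are ours. Even under that idealized assumption, two faulty PLBs in different rows and columns produce four candidate locations, which is a second source of ambiguity in addition to the propagation problem noted above.

```python
from itertools import product

def failing_lines(faulty_plbs):
    """Rows and columns whose ILA test fails (assumes ideal error propagation)."""
    rows = {r for r, _ in faulty_plbs}
    cols = {c for _, c in faulty_plbs}
    return rows, cols

def intersect_candidates(faulty_rows, faulty_cols):
    """Candidate faulty PLBs: every intersection of a failing row and column."""
    return sorted(product(sorted(faulty_rows), sorted(faulty_cols)))

# Single faulty PLB: the intersection is unique.
print(intersect_candidates(*failing_lines({(2, 5)})))

# Two faulty PLBs: four candidates, two of which are fault-free.
print(intersect_candidates(*failing_lines({(1, 2), (3, 4)})))
```

With a single fault the candidate list contains only the faulty PLB; with the two faults shown it also contains (1, 4) and (3, 2).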
The method used to check the (non-commercial) FPGAs used in the Teramac custom computer [7]
configures each row as a pseudo-random sequence generator and checks the final register contents after
applying a given number of clock cycles against an expected signature. The same procedure is then
repeated using columns instead of rows, and the faulty cells are located at the intersection of the faulty
rows with the faulty columns. However, the test provided to the blocks that form the pseudo-random
sequence generator does not achieve complete fault coverage and hence it cannot guarantee accurate diagnosis (in general, fault detection is a necessary condition for fault location). In addition, applying the test
requires a fault-free finite-state machine in the FPGA, the total test time is very large, and developing the
diagnostic tests is an expensive manual process.
The BIST method of [42] provides hierarchical and adaptive diagnosis; it first identifies multiple
faulty groups of PLBs, and then the faulty PLBs in the faulty groups. For 2t+1 groups of PLBs, this
approach can diagnose only up to t faulty groups. To identify faulty PLBs, the PLBs in faulty groups are
compared with PLBs from fault-free groups. This approach requires a large number of test configurations,
which makes the total test time prohibitive for manufacturing testing. Scalability appears to be a problem,
as the layout is not regular, and the number of configurations increases with the size of the FPGA.
In this paper, we present the first complete test and diagnosis method for FPGAs. Our BIST technique
detects any single faulty PLB and any combination of multiple faulty PLBs, without requiring any fault-free core in the FPGA; the tests are complete for almost any fault model. We have identified and solved
the problem of testing configuration multiplexers, which was either ignored or incorrectly solved in most previous work. Our new diagnosis algorithm locates any single faulty PLB and, except for a few pathological
cases, identifies any possible combination of multiple faulty PLBs. Moreover, it also determines the faulty
modules or faulty modes of operation of any defective PLB. Our approach is easily scalable. The same
BIST approach can perform fault detection and identification at any level of testing - device level for manufacturing
testing and yield enhancement, and system level for repair strategies in fault-tolerant
applications. We used the Lucent Optimized Reconfigurable Cell Array (ORCA) [26] for the initial design
and implementation of the BIST-based diagnostic approach, but we emphasize that our technique can be
applied to other RAM-based FPGAs, including Xilinx [43] and Altera [5].
Since our BIST is independent of the system function implemented in the FPGA, our approach could
be considered an “overkill” for in-system testing: why don’t we test the FPGA just in its normal mode of
operation? The main reason is that during the normal operation of an adaptive system, of a custom-computing machine, or of a fault-tolerant application, the same FPGA will be used with different
configurations at different times. Hence, our strategy makes certain that no assigned function will be incorrectly performed because of dormant faults that affect currently unused logic cells or unused modes of
operation of the active cells. This is important because testing of dormant faults is essential in achieving
high reliability in safety-critical applications [36].
The remainder of this paper is organized as follows. Section 2 outlines our BIST architecture.
Section 3 reviews the method used to access the BIST architecture via the FPGA Boundary Scan interface.
Section 4 presents the fault detection capabilities of our BIST approach, while Section 5 analyzes its diagnostic features. Section 6 discusses the results from the successful use of this approach in locating faulty
PLBs in manufactured FPGAs. Finally, Section 7 presents our conclusions.
2. The BIST Architecture
2.1 Testing PLBs
The strategy of our FPGA BIST approach is to configure groups of PLBs as test pattern generators
(TPGs) and output response analyzers (ORAs), and another group as blocks under test (BUTs), as illustrated in Figure 1a. The BUTs are then repeatedly reconfigured to test them in all their modes of operation.
We refer to the test process that occurs for one configuration as a test phase. A test session is a sequence
of test phases that completely test the BUTs in all of their modes of operation. Once the BUTs have been
tested, the roles of the PLBs are reversed so that in the next test session the previous BUTs become TPGs
or ORAs, and vice versa. Since half of the PLBs are BUTs during each test session, we need only two test
sessions to test all PLBs in the FPGA. Figures 1b and 1c show the floorplans for the two test sessions - called NS and SN - that completely test every PLB in an 8×8 FPGA. Figure 1a corresponds to the first four
rows in Figure 1b. The name of a session denotes the direction of the flow of test patterns during that session. The floorplan for SN is obtained by flipping the floorplan for NS around the horizontal axis shown
as a dotted line in the middle of the array. This changes the roles of the PLBs so that every PLB is under
test in one of the two test sessions. Note that all BUTs are tested in parallel and that patterns may be applied
at-speed.
Figure 1. The BIST architecture: a) TPG, BUT, and ORA connection (BIST Start/Reset input; BIST Done and Pass/Fail outputs); b) floorplan for first test session (NS), rows from top to bottom: TPGs, BUTs, ORAs, BUTs, ORAs, BUTs, ORAs, BUTs; c) floorplan for second test session (SN), rows from top to bottom: BUTs, ORAs, BUTs, ORAs, BUTs, ORAs, BUTs, TPGs.
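The two floorplans can be generated from the array dimension alone. The Python sketch below is a minimal illustration, assuming an even N of at least 4 and one role per row as in Figures 1b and 1c; the function name `floorplan` is hypothetical. It checks that every row is under test in exactly one of the two sessions and counts the ORA cells in one session.

```python
def floorplan(n, session="NS"):
    """Return the role of each of the n rows ('TPG', 'BUT', 'ORA'), top to bottom."""
    if n < 4 or n % 2:
        raise ValueError("this sketch assumes an even N >= 4")
    # NS session: one TPG row on top, then alternating BUT and ORA rows.
    rows = ["TPG"] + ["BUT" if i % 2 else "ORA" for i in range(1, n)]
    if session == "NS":
        return rows
    if session == "SN":
        return rows[::-1]  # flip around the horizontal axis
    raise ValueError("session must be 'NS' or 'SN'")

n = 8
ns, sn = floorplan(n, "NS"), floorplan(n, "SN")
print(ns)
print(sn)

# Every row is a BUT row in exactly one of the two sessions.
covered = all((a == "BUT") != (b == "BUT") for a, b in zip(ns, sn))
print(covered)

# ORA cells in one session: (N/2 - 1) rows of N cells, i.e. N^2/2 - N.
print(ns.count("ORA") * n)
```

For N=8 the sketch reproduces the row order of Figure 1b and gives 24 ORA cells per session.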
Each test phase consists of the following steps: 1) reconfigure the FPGA, 2) initiate the test sequence,
3) generate test patterns, 4) analyze output responses, and 5) read the test results. In step 1, the test controller (ATE for wafer/package testing; CPU or maintenance processor for board/system testing) interacts
with the FPGA(s) under test to reconfigure the logic by retrieving a BIST configuration from the configuration storage (ATE memory; disk) and loading it into the FPGA(s). The test controller also initializes the
TPGs, BUTs, and ORAs and initiates the BIST sequence (via the BIST Start/Reset input in step 2) and
reads the subsequent Pass/Fail results (step 5). Steps 3 and 4 are concurrently performed by the BIST logic
within the device. After the board or system-level BIST is complete, the test controller must reconfigure
the FPGA for its normal system function; hence the normal device configuration must be stored along with
the BIST configurations. The test application time is dominated by the FPGA reconfiguration time. Since
the total test and diagnosis time is a major factor in the system down time (system availability) and cost,
an important goal of our BIST approach is to minimize the number of configurations used for test and
diagnosis.
Figure 2 illustrates the typical structure of a PLB, consisting of a memory block that can function as
a look-up table (LUT) or RAM, several flip-flops (FFs), and multiplexing output logic. The LUT/RAM
block may also contain special-purpose logic for arithmetic functions (counters, adders, multipliers, etc.).
The RAM may be configured in various modes of operation (synchronous, asynchronous, single-port,
dual-port, etc.). The FFs can also be configured as latches, and may have programmable clock-enable, preset/clear, and data selector functions. Our strategy relies on pseudoexhaustive testing [27], which in this
context means that every subcircuit of a PLB is tested with exhaustive patterns in each one of its modes of
operation [1]. The memory block is checked with RAM test sequences which are exhaustive for faults specific to RAMs [12]. Note that all three subcircuits of a PLB are easily controllable and observable from
the PLB’s I/O pins, and that exhaustive testing of every module is feasible since the number of inputs is
reasonably small. This results in practically complete fault coverage without explicit fault model assumptions and without fault simulation. For example, all single and multiple stuck-at faults, as well as all faults
(of any type) that do not increase the number of states, are guaranteed to be detected; the overwhelming
majority of faults that do increase the number of states are also detected. Thus for any practical purpose
the PLB logic test is complete. (Applying a complete logic test at normal operating frequency is likely to
create a good delay-fault test, but in this paper we deal only with logic faults.)
Figure 2. Typical PLB structure: a LUT/RAM block, FFs, and output logic.
In every test phase, a BUT is configured in a different mode of operation; hence its pseudoexhaustive
test may also change from phase to phase. For example, the test sequence for combinational logic followed
by FFs is different from the test sequence for a RAM. Thus a TPG may have different structures depending
on the sequences needed in different phases. All BUTs are configured to have the same logical function
and receive the same input test patterns from the two identical TPG blocks. Since all fault-free BUTs must
produce the same output patterns, the ORAs simply compare corresponding outputs from different BUTs.
Unlike the signature-based compression circuits found in most BIST applications, comparator-based
ORAs do not suffer from the aliasing problem that occurs when a faulty circuit produces the good circuit
signature.
Figure 3 shows the structure of an ORA comparing four pairs of BUT outputs. The signal $BuO_i$ ($BdO_i$)
is the i-th output from the BUT up above (down below) the ORA. The FF stores the result of the comparison, and the feedback loop latches the first mismatch in the FF. This represents a compression of the
results, since any number of mismatches in the same phase translate into one error. The result FFs of all
ORAs are connected to form a scan chain (indicated by the dotted lines in Figure 1a). In every test phase,
the scan chain is also tested, to assure the integrity of the test results. Our previous BIST approach [40]
recorded only one Pass/Fail result for every row of ORAs. However, storing the test results of each ORA
cell significantly improves the diagnosis resolution achievable by this architecture, as we will show in
Section 5. For an N×N FPGA, the number of ORA cells is $N_{ORA} = N^2/2 - N$.
Figure 3. Integrated ORA/scan cell (inputs: BuO1 to BuO4 and BdO1 to BdO4, Scan In from previous ORA, TDI, and TCK; output: Scan Out).
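The behavior of the comparator-based ORA can be sketched in a few lines of Python. This is a behavioral model only, not the PLB-level implementation; the class name `ORA` and its `clock` method are hypothetical. It shows the sticky result FF: once any pair of corresponding BUT outputs disagrees, the FF stays at 1 for the rest of the phase.

```python
class ORA:
    """Behavioral model of a comparator-based ORA with a sticky result FF."""

    def __init__(self):
        self.fail = 0  # result flip-flop: 0 means no mismatch seen so far

    def clock(self, bu_outputs, bd_outputs):
        """Compare the outputs of the BUT above and the BUT below for one cycle."""
        mismatch = any(u != d for u, d in zip(bu_outputs, bd_outputs))
        # Feedback loop: the FF latches the first mismatch and keeps it.
        self.fail = self.fail | int(mismatch)
        return self.fail

ora = ORA()
upper = [(0, 1, 1, 0), (1, 1, 0, 0), (0, 0, 0, 1)]
lower = [(0, 1, 1, 0), (1, 0, 0, 0), (0, 0, 0, 1)]  # differs in the second cycle only
for up, down in zip(upper, lower):
    ora.clock(up, down)
print(ora.fail)  # 1: the single mismatch is retained until the results are read
```

Because the outputs are compared directly rather than compressed into a signature, a mismatch in any cycle is recorded and cannot be cancelled by later responses.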
Two important features of this architecture help both testing and diagnosis of the FPGA. First, every
BUT (except those in the first two and in the last two rows) is simultaneously compared with two other
BUTs by two different ORAs (one above and one below). Second, the pair of BUTs being compared by
each ORA are fed by two different TPGs. Section 4 and Section 5 will show how these features help
achieve complete testing and maximum diagnostic resolution.
2.2 Testing Configuration Multiplexers
A configuration multiplexer (MUX) is a commonly used hardware mechanism that selects subcircuits
for various modes of operation. A configuration MUX is controlled by configuration memory bits to select
one input to be connected to its output. In Figure 4a, assume that we set the configuration bit S to 0 to connect V0 to X. Then the subcircuit producing V1 disappears from the circuit model seen by the user. This is
correct from a design viewpoint, because the value V1 can no longer affect X in the current configuration.
But from a testing viewpoint, in any test for the MUX, we need to set V0 and V1 to complementary values.
In general, for a MUX with k inputs, if V is the value of the selected input, all the other k-1 inputs should
be set to the complementary value $\overline{V}$.
Figure 4. Configuration multiplexer: a) MUX with S=0 selecting V0; b) gate-level model with an s-a-1 fault.
The problem arises because FPGA CAD tools generate the configuration bitstream based on the user
model, which will never include the functionally inactive subcircuits (called “invisible logic” in [38]).
Thus in Figure 4a, when S=0, V0 will be set to both 0 and 1, but V1 cannot change. Similarly, the user logic
cannot control V0 in any configuration where S=1. The result is that the testing of the MUX may not be
complete. For example, the s-a-1 fault in the gate-level MUX model in Figure 4b is detected only when
S=0, V0=0, and V1=1. But this pattern may never be applied if V1 cannot be controlled when S=0.
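The following Python sketch enumerates all input patterns of a gate-level 2-to-1 MUX to show why the inactive input matters. Since the figure is not reproduced here, the fault location is an assumption: we place the s-a-1 fault on the branch of S that feeds the AND gate of V1, which is one fault consistent with the detection condition stated above. The function name `mux` and its parameter are hypothetical.

```python
from itertools import product

def mux(s, v0, v1, s_branch_stuck_at_1=False):
    """Gate-level 2:1 MUX: X = (V0 AND NOT S) OR (V1 AND S).

    The optional fault forces the S branch feeding the V1 AND gate to 1
    (assumed fault site, chosen for illustration).
    """
    s_to_v1_gate = 1 if s_branch_stuck_at_1 else s
    return (v0 & (1 - s)) | (v1 & s_to_v1_gate)

# Patterns (S, V0, V1) for which the faulty MUX differs from the fault-free one.
detecting = [
    (s, v0, v1)
    for s, v0, v1 in product((0, 1), repeat=3)
    if mux(s, v0, v1) != mux(s, v0, v1, s_branch_stuck_at_1=True)
]
print(detecting)  # [(0, 0, 1)]

# If the invisible logic leaves V1 at 0 (assumed default), no pattern detects it.
applied_by_user_model = [p for p in detecting if p[2] == 0]
print(applied_by_user_model)  # []
```

The only detecting pattern requires V1=1 while S=0, which is exactly the value that the user-level model cannot control.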
Our solution relies on separately configuring the invisible logic so that it will generate the proper values needed for the inactive MUX inputs. Then we “overlay” the resulting configuration files over the main
configuration file with the active logic, and we “merge” them without changing any MUX setting done in
the main configuration. This process is conceptually simple, but its implementation requires knowledge of
the FPGA configuration stream structure.
Note that in most previous work dealing with testing FPGAs, the problem of testing a configuration
MUX is either not addressed or it is “solved” functionally, by connecting every input in turn to the output,
and providing both 0 and 1 values to the selected input. However, the invisible logic driving the inactive
inputs is completely ignored. Hence prior claims of “complete testing” may not be valid since the testing
of every configuration MUX in the FPGA is likely to be incomplete. Methods that did provide complete
tests for configuration multiplexers, such as [21][19], used models that did not remove the invisible logic.
But such models cannot be used with the existing FPGA CAD tools to generate configuration files.
2.3 Testing Memory Blocks
For the LUT/RAM block of a PLB, we first test its RAM mode of operation. We configure the TPGs
to apply a march test [12], which detects, among other faults, all the stuck faults in memory cells, as well
as all faults in the address and read/write circuitry of the RAM. We rely on this RAM-mode test as the major
test of the memory block, so that subsequent tests for different modes of operation of the same block do
not need to retarget the already detected faults. When we test the LUT operation, we configure alternating
XOR/XNOR functions for the LUT outputs during one test phase, and alternating XNOR/XOR functions
during a second test phase. For any combinational function of n inputs, the TPG will apply all $2^n$ vectors.
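To make the idea of a march test concrete, the Python sketch below runs March C-, a widely used march algorithm, on a small behavioral RAM model. The text above does not state which march test is applied, so the choice of March C- is an assumption made for illustration; the `Ram` class and its stuck-cell fault map are likewise hypothetical.

```python
class Ram:
    """Behavioral RAM model with optional stuck-at cells (hypothetical fault map)."""

    def __init__(self, size, stuck=None):
        self.cells = [0] * size
        self.stuck = stuck or {}  # address -> value the cell is stuck at

    def write(self, addr, value):
        self.cells[addr] = self.stuck.get(addr, value)

    def read(self, addr):
        return self.stuck.get(addr, self.cells[addr])

def march_c_minus(ram, size):
    """Run March C- on ram; return True if every read returns the expected value."""
    up, down = range(size), range(size - 1, -1, -1)
    elements = [
        (up, [("w", 0)]),
        (up, [("r", 0), ("w", 1)]),
        (up, [("r", 1), ("w", 0)]),
        (down, [("r", 0), ("w", 1)]),
        (down, [("r", 1), ("w", 0)]),
        (up, [("r", 0)]),
    ]
    passed = True
    for order, operations in elements:
        for addr in order:
            for op, value in operations:
                if op == "w":
                    ram.write(addr, value)
                elif ram.read(addr) != value:
                    passed = False
    return passed

print(march_c_minus(Ram(16), 16))                # fault-free RAM passes
print(march_c_minus(Ram(16, stuck={5: 1}), 16))  # cell 5 stuck-at-1 fails
```

Each cell is read in both states, in both address orders, so a cell stuck at either value causes at least one read to fail.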
This strategy using a RAM test sequence to detect most faults in the LUT/RAM block is not applicable
to FPGAs whose PLB memory block cannot be operated as RAM, and as a result, functions only as a
ROM. To test a ROM-only type of LUT with n address bits, we can use the 2n configurations proposed in
[18].
Some currently available FPGAs contain large embedded RAM arrays (much larger than the RAMs
used within PLBs). To test such RAMs we need an additional session, in which the PLBs surrounding a
RAM implement the same BIST logic that would be added to test the RAM had it been embedded on a
system chip. The difference is that in an FPGA the BIST logic would disappear after the RAM test session.
All embedded RAMs may be tested in parallel, possibly sharing the BIST controller. If the FPGA has several identical RAM modules, we can feed them with the same patterns, and use comparators to check
mismatches between corresponding outputs.
2.4 Scalability of the BIST Approach
Our BIST architecture has a very regular, easily scalable structure, automatically generated by a simple procedure (that also does algorithmic placement and routing), based on the dimensions (N×N) of the
FPGA array. Since an ORA compares the outputs of its two neighbor BUTs, all signals from BUTs to
ORAs use only local routing resources which are oblivious to the size of the FPGA. Global routing is used
to distribute the patterns generated by TPGs to BUTs. Ignoring fanout load limitations, adding rows and
columns to an array of PLBs will just extend the length of the vertical and horizontal global lines fed by
TPG outputs, hence the usage of the global routing resources required for distributing the TPG patterns
does not change with the FPGA size. In the ORCA FPGA, we use the bidirectional drivers available in the
local routing surrounding a PLB to redistribute incoming TPG signals, thus avoiding fanout overload. The
following analysis is for FPGAs where such drivers are not available. If k is the number of cells used by
one TPG (k=4 in our implementation), a TPG row has $N_{TPG} = N/k$ TPGs. (For simplicity we assume
that N is always a multiple of k.) We divide the BUTs into $N_{TPG}$ subsets of k columns and we feed each
such subset from its closest two TPGs. Then each TPG output drives $N^2/(2 \cdot N_{TPG}) = k \cdot N/2$ BUTs.
This shows that the loading grows only linearly with N, which is the square root of the size of the FPGA.
Nevertheless, if the loading may not grow over a given limit $L_{max}$, then the largest N for which the BIST
architecture is feasible is $N_{max} = 2 \cdot L_{max}/k$ (note that $L_{max}$ depends on the BIST clock frequency).
When N grows over this limit, we divide the FPGA into four quadrants, such that it is feasible to implement
the BIST architecture separately in each quadrant. For example, if $N_{max}=8$, then a 16×16 FPGA will be
divided into four 8×8 quadrants. All quadrants are tested concurrently. However, the scan chains of each
quadrant will be connected into a single scan chain, so that the result retrieval time grows linearly with the
size of the FPGA.
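The loading analysis can be checked numerically with a short Python sketch. The function names are hypothetical, and the load limit used in the example (16 BUTs per TPG output) is an assumed value chosen so that the result matches the 8×8 example in the text.

```python
def tpg_output_load(n, k=4):
    """BUTs driven by each TPG output: N^2 / (2 * N_TPG), which equals k * N / 2."""
    if n % k:
        raise ValueError("N is assumed to be a multiple of k")
    n_tpg = n // k  # number of TPGs in a TPG row
    return (n * n) // (2 * n_tpg)

def max_array_size(l_max, k=4):
    """Largest N for which the load per TPG output stays within l_max."""
    return (2 * l_max) // k

for n in (8, 16, 32):
    print(n, tpg_output_load(n))  # load grows linearly with N

l_max = 16  # assumed load limit, for illustration only
n_max = max_array_size(l_max)
print(n_max)

n = 16
quadrant_side = n // 2 if n > n_max else n
print(quadrant_side)  # side of each separately tested block
```

With k=4 the load is 16 for N=8 and 32 for N=16, so under the assumed limit a 16×16 array is split into four 8×8 quadrants.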
We emphasize that increasing N does not affect the number of test sessions, which is always two. The
number of test configurations (phases) depends only on the structure and the modes of operation of the
PLB, and it is independent of N (the dependence shown in Figure 6 of [20] for the BIST method is a result
of incorrect assumptions). Since all BUTs are tested in parallel, the BIST execution time is also independent of the size of the FPGA. In addition to the time for scanning out the results, the only test-time
dependence on the size of the array is the reconfiguration time for each test phase. This is inherent in all
the currently available FPGAs, for which the configuration loading is a serial process. The time of the
method described in [21] does not depend on N, but this is based on a parallel loading mechanism that is
not featured in existing FPGAs.
3. Boundary Scan Access
Practically all recently developed FPGAs, such as [5][26][43], feature a boundary-scan interface controlled
by a Test Access Port (TAP) [35], that can also be used for reconfiguration. This allows FPGAs to
be reconfigured and tested in-system without requiring additional I/O pins. The Lucent [26] and the Xilinx
[43] architectures also provide user-defined access to the FPGA core, which we use to control the ORA
scan chain. Altera FPGAs [5] do not feature user-defined TAP instructions, but the equivalent functionality
may be implemented in test mode at the cost of several test-dedicated I/O pins.
Reconfiguring the FPGA with the BIST test phases, initiating the BIST sequence, and reading the
BIST results (steps 1, 2, and 5 of each test phase) are performed using the TAP. Access to our BIST architecture requires the ability to control the BIST Start/Reset to initialize the TPGs and ORAs and start the
BIST sequence, as well as the ability to read the BIST Done and ORA Pass/Fail results from the BIST
circuitry. This access must be made within the confines of the typical FPGA boundary-scan circuitry architecture. For example, unlike boundary-scan cells for external interconnect testing, the Shift/Capture and
Update control signals are not made available to the core of the FPGA by the TAP circuitry in most
FPGAs. Instead, the only internal signals provided by the TAP for user-defined internal scan chains are
TCK, TDI, a port to send data out on TDO, and an internal enable signal (TEN). TEN is active when the
user-defined scan-chain instruction is decoded by the TAP controller and remains active until a different
instruction is loaded into the TAP instruction register. Additional considerations in selecting a boundary-scan access method for final implementation include total test time, PLB overhead, routability, and diagnostic resolution. A detailed analysis of four methods for access to the FPGA BIST architecture via the
boundary-scan interface was given in [13]. In this section, we describe the method selected for the final
implementation of our BIST approach.
By integrating an internal scan chain in the PLBs containing the ORAs, we obtain an architecture that
allows us to observe the results of the comparisons done by every ORA, without additional logic resources
and with only local routing resources for the scan chain. By using the data-select MUX that is part of the
flip-flop circuitry in most PLBs, we can alternately use the flip-flop to latch mismatches in the ORA
and to shift out the Pass/Fail results at the completion of the BIST sequence. As a result, we can create
individual, independent ORAs with integrated scan registers in each ORA as illustrated in Figure 5. This
provides a significant enhancement to diagnostic resolution as will be discussed in Section 5.
Figure 5. Results scan-chain and timing diagram (scan chain: Pass/Fail logic for ORA 1 through ORA $N_{ORA}$ and BIST Done, with TEN, TDI, and TCK inputs and an output to TDO; timing: initialization, BIST, BIST Done, then $N_{ORA}+1$ clock cycles of failing-ORA indication).
Working within the confines of the typical FPGA boundary scan architecture, we use the TEN signal
for the BIST Start/Reset functions. As a result, the FFs in the TPGs, ORAs, and BUTs are held in reset until
we load the user-defined scan register instruction, at which time the BIST sequence begins. This is accomplished by having TEN reset all FFs in the FPGA. The TDI input is used as the Shift/Capture control to the
internal scan chain containing the BIST outputs (BIST Done and Pass/Fail indications) as illustrated in
Figure 5. By connecting the BIST Done output to the last register in the internal scan chain, this signal is
immediately observable on TDO by holding TDI at the Capture logic value (logic 1 in our implementation). As soon as BIST Done goes active, we begin shifting out the Pass/Fail indications by setting TDI to
the Shift logic value (logic 0 in this case). $N_{ORA}+1$ clock cycles are needed to retrieve all the BIST results.
By supplying a logic 1 (via the active TEN signal) to the scan input of the first scan FF in the chain, we are
able to test the integrity of the internal scan chain as well. For example, a scan chain FF s-a-0 will be
detected by the absence of the logic 1 at the end of the BIST results-shifting sequence. (Since a value of 1
represents an error, a FF s-a-1 will be immediately detected.)
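The result-retrieval sequence can be simulated with a small Python model. This is a sketch under stated assumptions: the ORA result FFs are placed in chain order starting from the scan input, the last register holds BIST Done, and a stuck-at-0 FF is modeled by forcing its value to 0 after every clock. The function `shift_out` and its parameters are hypothetical.

```python
def shift_out(ora_results, stuck_at_0=None):
    """Return the TDO bits observed during N_ORA + 1 shift clocks.

    ora_results lists the Pass/Fail bits in chain order (first FF first);
    stuck_at_0 is the index of a scan FF assumed stuck at 0, or None.
    """
    chain = list(ora_results) + [1]  # last register captured BIST Done = 1
    if stuck_at_0 is not None:
        chain[stuck_at_0] = 0
    observed = []
    for _ in range(len(ora_results) + 1):
        chain = [1] + chain[:-1]     # a logic 1 enters the first scan FF
        if stuck_at_0 is not None:
            chain[stuck_at_0] = 0    # the faulty FF cannot hold a 1
        observed.append(chain[-1])   # TDO is driven by the last register
    return observed

good = shift_out([0, 1, 0])
print(good)       # results in reverse chain order, then the integrity bit
print(good[-1])   # 1: the scan chain passed the logic 1 through

faulty = shift_out([0, 1, 0], stuck_at_0=1)
print(faulty[-1])  # 0: the missing 1 reveals the stuck-at-0 scan FF
```

In this model the Pass/Fail bits appear on TDO starting with the ORA closest to the output, and the final bit is the logic 1 injected at the scan input, so its absence indicates a break in the chain.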
Since all BIST operations involve the TAP controller, this is the first FPGA subcircuit that is tested.
For this we use the “Tapdance” test sequence introduced in [8] or a subset of it. If the FPGA is tested with
ATE, then we can use the entire sequence, which is a comprehensive functional test. For in-system test,
we exclude any subsequence that involves FPGA I/O pins other than the four boundary-scan pins. For
example, we skip testing the instructions that capture data in the boundary scan register (these instructions
are usually tested together with the board interconnect by a separate board-level test). In this way, we
obtain a complete test for all TAP operations that will be used by our BIST sequence. Although this subset
is essentially a short external test, it relies only on the same four pins used by BIST, and can be regarded
as an intrinsic part of the BIST sequence.
To summarize the sequencing of the FPGA test via the boundary scan interface, after the TAP controller test described above has passed, the test controller sends an instruction to the TAP controller to
access the configuration memory and then downloads the configuration bits for the first BIST test phase.
After configuration, the controller sends an instruction to access the user-defined scan register, which initiates the BIST sequence. The TDI input is held high until the BIST Done signal goes active indicating that
the ORA Pass/Fail results are valid. Then TDI is set low and the Pass/Fail results are shifted out as illustrated in Figure 5. If a logic 1 appears at the end of the shift sequence (indicating the scan chain is fault-free), the test controller moves to the next BIST phase and repeats this sequence of operations. This process
continues until all BIST test sessions and phases have been executed. All failures are recorded and subsequently analyzed for diagnosis. If the FPGA has been tested in system and found to be fault-free, then it is
reconfigured to its normal mode of operation. If the diagnosis locates defective PLBs, the FPGA can be
repaired by reconfiguration.
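The sequencing summarized above can be sketched as a simple control loop. This is only an illustration: the four callables (configure, start_bist, bist_done, shift_results) are hypothetical hooks standing in for the corresponding TAP operations, not an existing API, and the reported positions are indices in shift order.

```python
def run_bist_phases(phases, configure, start_bist, bist_done, shift_results):
    """Run every BIST phase and record the failing positions per phase.

    All four callables are hypothetical stand-ins for TAP operations.
    """
    failures = {}
    for phase in phases:
        configure(phase)        # download the configuration bits of this phase
        start_bist()            # select the user-defined scan register
        while not bist_done():  # TDI held high; BIST Done is visible on TDO
            pass
        bits = shift_results()  # TDI low; N_ORA+1 bits are shifted out
        if bits[-1] != 1:
            raise RuntimeError(f"scan chain integrity check failed in phase {phase}")
        failing = [pos for pos, bit in enumerate(bits[:-1]) if bit == 1]
        if failing:
            failures[phase] = failing
    return failures


# Stubbed usage: phase 2 reports one failing ORA indication.
canned = {1: [0, 0, 0, 1], 2: [0, 1, 0, 1]}
state = {"phase": None}
result = run_bist_phases(
    phases=[1, 2],
    configure=lambda phase: state.update(phase=phase),
    start_bist=lambda: None,
    bist_done=lambda: True,
    shift_results=lambda: canned[state["phase"]],
)
```

The recorded failures would then be passed to the diagnosis step once all sessions and phases have run.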
4. BIST-Based Fault Detection
In this section we show that our BIST approach achieves complete fault detection for single faulty
PLBs, and practically complete fault detection for multiple faulty PLBs. A faulty PLB in a TPG or in an
ORA may not produce an error if its fault does not affect the operation of the TPG or ORA. Thus we will
rely on detection of PLB faults only when a faulty PLB is under test (configured as a BUT).
Claim 1: Any single faulty PLB is guaranteed to be detected.
Proof: When the faulty PLB is a BUT, it receives the correct pseudo-exhaustive patterns from a fault-free TPG in every one of its modes of operation. The outputs of the faulty BUT are compared with a fault-free BUT fed by a fault-free TPG, and the comparison and error latching are done by a fault-free ORA.
(Since the scanning mechanism of the ORA result register is tested as part of every test phase, we can
assume that errors propagate through a fault-free scan chain.) Thus the faulty PLB is detected.
The more difficult question is whether we can have multiple faulty PLBs that mask each other, so that
together they escape detection. Although, in general, we cannot claim that any possible combination of
faulty PLBs will be detected, we can identify many large classes of multiple faulty PLBs whose detection
can be guaranteed. In the following, when we analyze the detectability of a group G of faulty PLBs, we
implicitly assume that all the other PLBs are fault-free (in other words, G is not a subset of a larger group
of faulty PLBs). G is detected when any of its PLBs is detected.
Claim 2: Any group of faulty PLBs in the same row is guaranteed to be detected.
Proof: Except the blocks in a TPG, the only interaction among PLBs located in different columns is
provided by the scan register connecting the ORA result flip-flops. However, the scan operation of the
ORA result register is also tested by the technique described in Section 3. Hence, faults in BUT and ORA
PLBs in the same row cannot mask each other. Faulty PLBs in rows 1 or N may not be detected when they
are part of a TPG (for example, when two faulty TPGs feeding pairs of BUTs compared by the same ORAs
generate identical patterns), but they will be detected when configured as BUTs.
The next result deals with the detection of faulty PLBs residing in the same column, called the faulty
column. A middle row refers to any row except 1 and N.
Claim 3: Any group of faulty PLBs in the middle rows of the same column that has at least two adjacent fault-free PLBs, is guaranteed to be detected.
Proof: Consider two adjacent fault-free PLBs that have a faulty neighbor PLB. For illustration, assume that P and Q in Figure 6a are fault-free, and that R is faulty (denoted by a black
cell). Consider the test session in which Q and R are BUTs and P is an ORA. The fault(s) in
R must be detected, because all test patterns are applied by two fault-free TPGs (in row 1 or
N) to R and Q, and R is compared with a fault-free BUT (Q) by a fault-free ORA (P). Thus
the presence of the two adjacent fault-free PLBs does not allow their faulty neighbor to be
masked, and therefore any group of faulty PLBs in the middle rows of the same column is
detected.
[Figure 6. a) Fault-free PLBs P and Q with a faulty neighbor R; b) faulty column with no two adjacent fault-free PLBs.]
Note that to escape detection, there must be at least N/2 faulty PLBs in the group, because otherwise
the faulty column will have at least two adjacent fault-free PLBs. This is a very restrictive condition: for
example, in a 20×20 FPGA, any faulty column with fewer than 10 faulty PLBs is guaranteed to be detected.
Furthermore, the faulty PLBs configured as BUTs must produce identical output responses during every
test phase. The reason Claim 3 restricts faulty PLBs to the middle rows of the faulty column is to guarantee
that the TPGs in rows 1 and N are fault-free and thus generate all test patterns. Otherwise, it is theoretically
possible that the group of faulty PLBs escapes detection, even if the faulty column has two adjacent fault-free PLBs; for this to happen, all of the following conditions must be satisfied: 1) the faulty PLB in row 1
or N must change the patterns produced by one TPG, and, 2) the faulty TPG must skip all the patterns that
detect every faulty BUT bordering two adjacent fault-free PLBs, and, 3) the fault-free BUTs must have
identical responses for every pair of different vectors produced by the two TPGs (otherwise errors will
appear at all the ORAs in the fault-free columns). Clearly, this set of conditions is so restrictive that it is
unlikely to ever occur in practice. Hence in practice, columns that include faulty PLBs in rows 1 or N will
also be detected. (The above analysis shows that the claim made in [20] that BIST-based approaches “do
not detect all double faulty PLBs” is incorrect.)
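The N/2 bound used in this argument is purely combinatorial and can be checked by brute force for small N. The sketch below is only an illustration; it counts every row of the column and ignores the special role of rows 1 and N, and the function names are ours.

```python
from itertools import combinations


def has_adjacent_fault_free_pair(n, faulty):
    """True if some rows r and r+1 (1-based) of an n-row column are both fault-free."""
    return any(r not in faulty and r + 1 not in faulty for r in range(1, n))


def min_faulty_without_pair(n):
    """Smallest number of faulty PLBs that leaves no two adjacent fault-free PLBs."""
    rows = range(1, n + 1)
    for k in range(n + 1):
        for faulty in combinations(rows, k):
            if not has_adjacent_fault_free_pair(n, set(faulty)):
                return k


# For N = 8 the minimum is 4, as in the column of Figure 6b.
minimum_for_8 = min_faulty_without_pair(8)
```

For every N checked this way the minimum equals N/2 rounded down, which agrees with the statement that a 20-row column needs at least 10 faulty PLBs before masking becomes possible.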
Figure 6b shows an example of a faulty column with N=8 where no pair of fault-free PLBs are adjacent. Consider the test session in which all 4 faulty PLBs are BUTs. In addition, assume that their faults
are functionally equivalent, so that no mismatches will be produced. If these faults are not activated in the
other test session when the faulty PLBs in the middle rows serve as ORAs, and the PLB in row 8 is part
of a TPG, then this multiple fault will escape detection. However, this situation can be characterized as
“pathological,” since it has an extremely low probability of occurrence: 1) half of the PLBs in the same
column must be faulty, and, 2) these PLBs must reside in every other row, and, 3) their faults must be functionally equivalent, and, 4) these faults must not be detected when these PLBs are configured as ORAs or as a
TPG cell. The probability of occurrence is very low, because it requires the conjunction of four conditions, each of which
is very unlikely by itself.
Because masking cannot occur among faults in the middle rows of different columns, we can derive
from Claim 3 the following more general result.
Claim 4: Any group of faulty PLBs in the middle rows of the FPGA, such that every faulty column has
at least two adjacent fault-free PLBs, is guaranteed to be detected.
Like Claim 3, Claim 4 can also be extended in practice to cover faulty PLBs in rows 1 and N. Figure 7
illustrates an interesting group of 4 faulty PLBs that can escape detection if the following conditions are
all satisfied: 1) PLBs X and Y have the same position in the first and the second TPG in row 1, and, 2) PLBs
V and Z have the same position in the first and the second TPG in row 8, and, 3) the faults in X and Y are
equivalent, and, 4) the faults in V and Z are equivalent, and, 5) the TPGs in row 1 skip all the patterns that
detect faults in V and Z, and, 6) the TPGs in row 8 skip all the patterns that detect faults in X and Y. The
faulty PLBs escape detection because in every test session, the patterns generated by the two TPGs are
identical (so no mismatches are detected at any ORAs), and they miss the faults in the faulty BUTs.
Clearly, this would be another pathological situation.
[Figure 7. Pathological case that escapes fault detection. Both test sessions of an 8-row array are shown, with faulty PLBs X and Y in row 1 and V and Z in row 8.]
Although we cannot guarantee that the two test sessions illustrated in Figure 1 will detect any possible
combination of faulty PLBs, it appears that the conditions that allow a group of faulty PLBs to escape
detection are so restrictive, that they are very unlikely to occur in practice. Therefore we can conclude that,
in practice, any combination of faulty PLBs will be detected.
But if needed, we can enhance the BIST sequence to guarantee the detection of any combination of
faulty PLBs by adding the two test sessions shown in Figure 8. These are obtained by rotating the test sessions of Figure 1
by 90°, so that the flow of test patterns is horizontal instead of vertical. If a group of faulty
PLBs did escape detection in the first two test sessions, this was caused by masking among faulty PLBs in
the same column. But the rotation is in effect interchanging the rows and the columns, so that it destroys
the masking relations because the faulty PLBs interact now as if they are in the same row. Hence a group
that escapes detection in the first two sessions is guaranteed to be detected in at least one of the additional
two sessions.
[Figure 8. Additional (“horizontal”) test sessions. a) Test session WE; b) Test session EW.]
5. BIST-Based Diagnosis
In this section, we present the use of the BIST approach in diagnosing an FPGA that failed the tests
provided by the two test sessions shown in Figure 1. In any fault location procedure, maximum diagnostic
resolution is achieved when faults are isolated within an equivalence class containing all the faults that produce the observed response. If every equivalence class has only one fault, we say that the fault is uniquely
diagnosed, that is, there is no other fault which can produce the same response. In our case a fault is one
faulty PLB that may have any number of internal faults, and the response is obtained at the outputs of the
ORAs. We begin by assuming a single faulty PLB in the FPGA, then we analyze the case of multiple faulty
PLBs, and we also discuss locating faults inside a defective PLB.
5.1 Locating A Single Faulty PLB
Claim 5: Any single faulty PLB is guaranteed to be uniquely diagnosed.
Proof: First we analyze the case when a faulty PLB is detected only in the session when it is configured as a BUT. Note that most faulty BUTs will produce errors at two ORAs, except a faulty BUT in the first
two or the last two rows, which will produce an error in only one ORA. The column of the failing BUT is
identified by the scan-chain position of the ORAs where errors are detected. Looking only at the faulty
column, let Oi denote the ORA located in row i and Bj the BUT in row j. For example, a defective B4 will
cause errors at O3 and O5 in session NS, while a defective B1 will be detected only at O2 in session SN.
The results of this analysis for an 8×8 FPGA are given in Table 1 under the heading “only BUT failures.”
Errors at ORA outputs are marked by X. We can observe that the error pattern of every faulty row is
unique.
Table 1: Errors caused by a single faulty PLB
(X = error, (X) = potential error, (X X X) = potential group of three errors, - = no error)

                            only BUT failures            with potential ORA/TPG failures
Faulty  Function  Function  Session NS    Session SN     Session NS        Session SN
Row     in NS     in SN     O3  O5  O7    O2  O4  O6     O3   O5   O7      O2   O4   O6
1       TPG       BUT       -   -   -     X   -   -      (X   X    X)      X    -    -
2       BUT       ORA       X   -   -     -   -   -      X    -    -       (X)  -    -
3       ORA       BUT       -   -   -     X   X   -      (X)  -    -       X    X    -
4       BUT       ORA       X   X   -     -   -   -      X    X    -       -    (X)  -
5       ORA       BUT       -   -   -     -   X   X      -    (X)  -       -    X    X
6       BUT       ORA       -   X   X     -   -   -      -    X    X       -    -    (X)
7       ORA       BUT       -   -   -     -   -   X      -    -    (X)     -    -    X
8       BUT       TPG       -   -   X     -   -   -      -    -    X       (X   X    X)

Now we analyze the case when some faults in a PLB may also be detected when that PLB is configured as an ORA or a TPG. The results of this analysis are given in Table 1 under the heading “with
potential ORA/TPG failures.” A fault in an ORA may cause an error only in that ORA when it reports a
mismatch even though the compared pairs of output values agree. Thus in addition to the error at O3 in
session NS, a faulty B2 may also cause an error at O2 in session SN. This error is marked by “(X)” to denote
a potential error. A fault in a TPG cell may cause the TPG to produce patterns different from those of a
fault-free TPG, thus possibly generating mismatches in every comparator that observes the BUTs fed by the
faulty TPG; in rows 1 and 8, we use “(X X X)” to denote a potential group of three errors. Thus, if all ORAs
fed by the same TPG have errors in one session, we have an indication of faults in a TPG PLB, but we can
ignore these errors and use the failures of the other test session to determine which PLB is faulty. Although
every row in Table 1 now contains one potential error or one potential group of errors, it is easy to observe
that the pattern of every faulty row is still different from all others. Therefore we can conclude that after
the two BIST sessions we can accurately locate the row in which the faulty PLB resides. Combined with
the column of the ORAs where errors are reported, this uniquely identifies the position of the faulty PLB.
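The “only BUT failures” part of this analysis can be re-derived mechanically. The following Python sketch is an illustration under the role assignment listed in Table 1 for an 8×8 array (session NS: TPG in row 1, BUTs in even rows; session SN: TPG in row 8, BUTs in odd rows); the function names are ours.

```python
def roles(session, n=8):
    """Map each row to its role in a vertical test session of an n-row array."""
    role = {}
    for row in range(1, n + 1):
        if session == "NS":
            role[row] = "TPG" if row == 1 else ("BUT" if row % 2 == 0 else "ORA")
        else:  # session SN
            role[row] = "TPG" if row == n else ("BUT" if row % 2 == 1 else "ORA")
    return role


def signature(faulty_row, n=8):
    """Set of (session, ORA row) errors when the fault shows only while the PLB is a BUT."""
    errors = set()
    for session in ("NS", "SN"):
        role = roles(session, n)
        if role[faulty_row] != "BUT":
            continue
        for neighbor in (faulty_row - 1, faulty_row + 1):
            if role.get(neighbor) == "ORA":
                errors.add((session, neighbor))
    return errors


signatures = {row: signature(row) for row in range(1, 9)}
# Every faulty row yields a different error pattern, so the row is identified.
all_unique = len({frozenset(s) for s in signatures.values()}) == 8
```

For example, signature(4) gives errors at O3 and O5 in session NS, and signature(1) gives a single error at O2 in session SN, matching the examples in the proof.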
This analysis is similar to the one done in [40], with the main difference being that the architecture of
[40] allowed us only to locate the row containing the faulty PLB, while the new architecture shown in
Figure 1 also provides us with the column of the faulty PLB. In the old architecture, the two additional test
sessions illustrated in Figure 8 were required to locate the faulty column. These test sessions are no longer
needed to locate a single faulty PLB in the new architecture.
5.2 Locating Multiple Faulty PLBs
The following results deal with the location of a group G of faulty PLBs that is detected in at least one
test session. To uniquely diagnose G means to identify all its faulty blocks such that no other group (including subsets of G) can produce the same result. Although unique diagnosis is not always possible, there are
many situations when it can be guaranteed.
Claim 6: Any group of faulty PLBs in the same row is guaranteed to be uniquely diagnosed.
Proof: In the session when all PLBs in the faulty row are under test, each BUT is observed at one or
two ORAs (depending on the location of the faulty row) in the same column. Thus the ORAs where failures are observed are in different columns and do not interact.
Two PLBs in the same column are said to be disjoint if their faults cannot be observed at the same
ORA. A group of PLBs in the same column is disjoint if every pair of PLBs in the group is disjoint. For
example, PLBs in rows 1, 4, and 8 of the same column form a disjoint group, but PLBs in rows 3 and 5 are
not disjoint, since both of them are observed at O4. Since faults in disjoint PLBs do not interact, we
have the following results.
Claim 7: Any disjoint group of faulty PLBs in the same column is guaranteed to be uniquely
diagnosed.
Claim 8: Any group of faulty PLBs in the FPGA such that the faulty PLBs in every column are disjoint,
is guaranteed to be uniquely diagnosed.
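The disjointness condition is easy to test mechanically. The sketch below assumes the vertical sessions described earlier for an n-row column (even rows are BUTs in session NS and are observed by ORAs in odd rows 3 to n-1; odd rows are BUTs in session SN and are observed by ORAs in even rows 2 to n-2). The function names are hypothetical.

```python
def observing_oras(row, n=8):
    """Rows of the ORAs that observe the PLB in `row` when it is a BUT."""
    neighbors = (row - 1, row + 1)
    if row % 2 == 0:
        # BUT in session NS; ORAs occupy odd rows 3..n-1.
        return {r for r in neighbors if 3 <= r <= n - 1}
    # BUT in session SN; ORAs occupy even rows 2..n-2.
    return {r for r in neighbors if 2 <= r <= n - 2}


def is_disjoint_group(rows, n=8):
    """True if no two PLBs of the group are observed at the same ORA."""
    rows = list(rows)
    return all(
        not (observing_oras(a, n) & observing_oras(b, n))
        for i, a in enumerate(rows)
        for b in rows[i + 1:]
    )


group_1_4_8 = is_disjoint_group([1, 4, 8])   # disjoint
group_3_5 = is_disjoint_group([3, 5])        # both observed at O4
```

Because ORAs sit in odd rows in one session and even rows in the other, the ORA row alone identifies the session in this model.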
Thus far for diagnosis of disjoint faulty PLBs, it has been sufficient to know the ORAs where errors
are observed. But it is easy to see that this knowledge is not enough to diagnose non-disjoint faulty PLBs.
For example, if two PLBs may be faulty and we obtain errors at O3 and O5 in session NS (see Table 1), we
do not know whether B2 is also faulty in addition to B4. However, in each test session we also record all
failing phases, and we can use these data to achieve greater diagnostic resolution. If the sets of failing test
phases obtained at O3 and O5 are different, this difference can be explained only by the faults in B2 in addition to those obvious faults in B4.
Assuming that a faulty PLB is detected only when it is configured as a BUT, the set of
failing phases obtained at $O_i$ is given by:
$$FO_i = (FB_{i-1} \cup FB_{i+1}) - Feq_i \tag{1}$$
where $FB_k$ is the set of failing phases of $B_k$, and $Feq_i$ is the set of failing phases of both
$B_{i-1}$ and $B_{i+1}$ that have identical responses (and thus do not cause mismatches at $O_i$). In
Figure 9, the area of $FO_i$ is marked by diagonal lines. Note that $FO_i$ is empty ($\emptyset$) when
both $B_{i-1}$ and $B_{i+1}$ are fault-free, or when the faults in $B_{i-1}$ and $B_{i+1}$ are equivalent (since then
$FB_{i-1} = FB_{i+1} = Feq_i$). It is interesting to observe that there exists one situation when $FO_i$ is the same,
no matter if only one or both of the BUTs observed at $O_i$ are faulty; this occurs if the two BUTs have the
same sets of failing phases ($FB_{i-1} = FB_{i+1}$), but their faulty responses are never the same ($Feq_i = \emptyset$).
[Figure 9. Sets $FB_{i-1}$, $FB_{i+1}$, and $Feq_i$; the area of $FO_i$ is marked by diagonal lines.]
Knowing the set of failing phases observed at $O_i$ and the complete set of failing phases of one of the
two faulty BUTs, we can determine the set of failing phases where the two BUTs have identical responses
by
$$Feq_i = FB_{i-1} - FO_i = FB_{i+1} - FO_i \tag{2}$$
Based on $FO_i$ and one of the two sets, we can also compute a lower bound on the other set by
$$FB_{i+1} \supseteq (FO_i - FB_{i-1}) \cup Feq_i \tag{3}$$
or by
$$FB_{i-1} \supseteq (FO_i - FB_{i+1}) \cup Feq_i \tag{4}$$
Note that (3) becomes an equality when $FO_i$ and $FB_{i-1}$ are disjoint:
$$FB_{i+1} = FO_i \cup Feq_i \tag{5}$$
This occurs when $FB_{i-1} \subseteq FB_{i+1}$ and $Feq_i = FB_{i-1}$.
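Relations (1) through (5) are plain set identities and can be checked on a small example. In the Python sketch below the phase sets are illustrative values chosen by us, and ora_failing_phases is a hypothetical helper that evaluates equation (1).

```python
def ora_failing_phases(fb_up, fb_down, feq):
    """Equation (1): phases reported at an ORA observing the two BUTs.

    feq must contain only phases in which both BUTs fail (identically).
    """
    if not feq <= (fb_up & fb_down):
        raise ValueError("Feq must be a subset of both failing-phase sets")
    return (fb_up | fb_down) - feq


fb_up = {1, 5, 7}        # FB_{i-1}, illustrative values
fb_down = {1, 3, 5, 7}   # FB_{i+1}
feq = {1, 5, 7}          # phases with identical faulty responses

fo = ora_failing_phases(fb_up, fb_down, feq)

eq2_holds = (fb_up - fo == feq) and (fb_down - fo == feq)   # equation (2)
eq3_holds = fb_down >= ((fo - fb_up) | feq)                 # bound (3)
eq4_holds = fb_up >= ((fo - fb_down) | feq)                 # bound (4)
# FO and FB_{i-1} are disjoint here, so (3) tightens to equality (5).
eq5_holds = fo.isdisjoint(fb_up) and fb_down == (fo | feq)
```

In this example the ORA reports only phase 3, although the upper BUT fails in three phases; none of its failures is visible at this ORA because the lower BUT fails identically in the same phases.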
Next we outline a diagnostic procedure that, whenever possible, uniquely locates a group of faulty
PLBs in the same column and recognizes situations when unique diagnosis cannot be achieved. The procedure, called MULTICELLO (Multiple Faulty Cell Locator), relies on two assumptions which are valid
in most practical situations:
A1: There are at most two interacting (non-disjoint) faulty BUTs having identical responses in the
same failing phase.
A2: A faulty PLB works correctly when configured as an ORA.
Later we will discuss what happens when these assumptions are not true. The following results will
be used by our diagnosis procedure.
Lemma 1: A BUT observed by two ORAs that do not report failures in phase p does
not fail in phase p.
Proof: We denote “does not fail in phase p” by $\bar{p}$ (see Figure 10). Assume, by contradiction, that B fails phase p. Let B1 be the BUT above O1 and B2 the BUT below O2.
Since B fails p, but p is not reported by either O1 or O2, from equation (2) we conclude
that both B1 and B2 must fail p with faulty responses identical to that of B. But then we would have three
faulty BUTs with identical responses in p, and this would contradict assumption A1. Therefore B does not
fail in phase p.
[Figure 10. O1 and O2 marked $\bar{p}$ imply that B is marked $\bar{p}$.]
Lemma 2: A BUT observed by two non-failing ORAs is fault-free.
Proof: From Lemma 1, a BUT observed by non-failing ORAs does not fail in any
phase.
Lemma 3: Let O be an ORA observing BUTs B1 and B2. If B1 does not fail in phase
p and O does not report a failure in p, then B2 does not fail in phase p either.
Proof: From equation (1) (see Figure 11).
[Figure 11. B1 and O marked $\bar{p}$ imply that B2 is marked $\bar{p}$.]
Lemma 4: Let O be an ORA observing BUTs B1 and B2. If B1 does not fail in phase
p and O reports a failure in p, then B2 fails in phase p.
Proof: From equation (1) and assumption A2 (see Figure 12).
[Figure 12. B1 marked $\bar{p}$ and O marked p imply that B2 is marked p.]
Lemma 5: Let O be an ORA observing BUTs B1 and B2. If B1 fails in phase p and
O does not report a failure in p, then B2 fails in phase p, and B1 and B2 have identical
responses in phase p.
Proof: From equation (2) (see Figure 13).
[Figure 13. B1 marked p and O marked $\bar{p}$ imply that B2 is marked p.]
Next, we will present the MULTICELLO algorithm and at the same time we will
illustrate its execution analyzing the responses obtained at the ORAs in one column in the SN test session
of a 20×20 FPGA, where rows are numbered 1 to 20 with the TPGs in row 20 (see Figure 14). The goal is
to determine the set of failures for every BUT. The algorithm first identifies only non-failing phases for
BUTs, then proceeds to determine the failing ones.
Procedure MULTICELLO:
1) Record ORA results and initialize the failures of every BUT in each phase as unknown.
This initial state is shown in Figure 14 step 1, where 0 and 1 entries for an ORA indicate, respectively,
a passing and a failing result in the corresponding phase, and the empty cells denote unknown BUT
failures. For example, O8 reports failures in phases 1, 5, and 7.
2) In each column p, for every two consecutive ORAs with a 0 mark, enter a 0 for the BUT between them.
Figure 14. Locating multiple faulty PLBs
(Each entry lists phases 1 to 8 from left to right; 0 = passing, 1 = failing, . = unknown.)

        step 1     step 2     step 3     step 4
B1      ........   ........   00000...   00000.11
O2      00000111   00000111   00000111   00000111
B3      ........   00000...   00000.00   00000100
O4      00000100   00000100   00000100   00000100
B5      ........   00000.00   00000000   00000000
O6      00000000   00000000   00000000   00000000
B7      ........   .000.0.0   00000000   00000000
O8      10001010   10001010   10001010   10001010
B9      ........   .0.0.0.0   .000.0.0   10001010
O10     00100000   00100000   00100000   00100000
B11     ........   ...0.0.0   .0.0.0.0   10101010
O12     11101010   11101010   11101010   11101010
B13     ........   .....0.0   0.000000   01000000
O14     01010000   01010000   01010000   01010000
B15     ........   0.0.0000   000.0000   00010000
O16     00010000   00010000   00010000   00010000
B17     ........   000.0000   000.0000   000.0000
O18     00000000   00000000   00000000   00000000
B19     ........   ........   000.0000   000.0000
This step applies Lemma 1 and its results are shown in Figure 14 step 2 (new entries are those that were
unknown in the previous step). We use the same 1 and 0 notation to respectively denote a BUT failure and a passing test.
3) In each column p, for every two adjacent 0 marks followed by an empty cell, enter a 0 in the empty cell.
This step applies Lemma 3; the two adjacent 0 marks belong to a BUT and an ORA, and the empty
cell is the other BUT observed by the same ORA. The results are shown in Figure 14 step 3. Note that
B5 and B7 have already been identified as fault-free.
4) In each column p, for every adjacent 0 and 1 marks followed by an empty cell, enter a 1 in the empty
cell.
This step applies both Lemma 4 (when the 0 mark is for a BUT) and Lemma 5 (when the 0 mark is
for an ORA). The results are shown in Figure 14 step 4. We have identified 6 faulty PLBs: B1, B3,
B9, B11, B13, and B15. Note that none of the failures of B9 is observed at O10, because B11 has identical responses in phases 1, 5, and 7. Also note that we have identified every failing phase for 5 out of
the 6 faulty PLBs, but we cannot determine whether B1 also fails in phase 6.
5) Consistency check: If there is an ORA reporting a failure in phase p, while neither of the two BUTs
observed by the ORA fails in p, then report inconsistency and exit.
6) If every PLB has been identified as fault-free or faulty, the group of faulty PLBs has been uniquely
diagnosed.
This is not the case in our example, since B17 and B19 may have equivalent faults failing phase 4. To
achieve unique diagnosis, we need to apply the horizontal sessions shown in Figure 8.
End
The above example dealt with the results of only one test session. Under the assumption that a faulty
PLB is detected only when configured as a BUT, the two sessions can be independently analyzed by the
same procedure, and a different group of faulty BUTs may be identified in each session.
It is interesting to see how MULTICELLO handles the results produced by a single faulty PLB. For
example, assume that we obtain errors in several phases both at O4 and O6; then in step 2 and step 3 all
BUTs except B5 are identified as fault-free, and then B5 is located as the source of the failing phases. Similarly, if errors are obtained only at O2, MULTICELLO will uniquely diagnose B1 as faulty.
However, there are two single faulty PLBs that MULTICELLO cannot uniquely diagnose. For example, if we obtain the same errors at O2 and O4, then all the BUTs except B1 and B3 are identified as fault-free, and B3 is diagnosed as faulty. But we cannot determine whether B1 is fault-free or faulty with the
same set of failing phases as B3. (In the previous example, we could not determine whether B1 had a failure
in phase 6.) Thus MULTICELLO correctly finds B3 to be faulty, but {B3} and {B1,B3} are indistinguishable. The source of this problem is that the first and the last BUT in a column are observed only at one
ORA, while all the other BUTs are observed at two ORAs.
If we cannot achieve unique diagnosis, we can apply the horizontal test sessions shown in Figure 8,
so that the non-disjoint cells in the same column whose interaction made diagnosis difficult are no longer
interacting. For the example above, {B3} and {B1,B3} will be distinguished because B1 and B3 will be
checked in separate rows. This represents an adaptive diagnosis strategy, where the tests to be applied next
are determined based on the results obtained so far: here the additional test sessions are applied only if the
initial two test sessions do not uniquely diagnose the faulty PLBs in the FPGA.
Figure 15 illustrates an interesting situation where even the addition of the horizontal
sessions will not help in obtaining a unique diagnosis. Let us assume that in column
1, we obtain the set of failing phases FO at O2 and O4; as shown before, MULTICELLO
identifies B3 as faulty and B1 as potentially faulty with the same set FO. If next we apply
the horizontal test sessions, and for row 1, we obtain the same errors FO at O2 and O4,
then the same PLB in the corner (row 1 and column 1, denoted by R1C1) is again potentially
faulty. So the groups {R3C1, R1C3} and {R3C1, R1C3, R1C1} are
indistinguishable. However, this is another pathological situation, since it requires these three specific
PLBs to have the same sets of failing phases. There are only four such cases in the entire FPGA, one for
each corner.
[Figure 15. Corner of the array (rows 1 to 5, columns 1 to 5), showing the BUT and ORA positions in column 1 and row 1 and the ORAs where FO is observed.]
If MULTICELLO ends up detecting an inconsistency, this indicates either that the actual fault in the
circuit is possibly an interconnect fault, or that some of the assumptions used are not valid. The most likely
to be invalidated is the assumption about detecting a PLB only when configured as a BUT. Note that MULTICELLO will not do anything useful in a session where all ORAs in the same column have errors, since
it needs two ORAs without errors to execute step 2. But these results are characteristic of a fault in a TPG
cell, and diagnosis can still be successful in the other session where the TPG cells are BUTs. Although we
can have a different diagnosis procedure to work under the assumption that faults in an ORA may modify
its set of failures, instead we apply the horizontal test sessions and use MULTICELLO to process their
results first.
In summary, we apply the two test sessions in Figure 1, and we use MULTICELLO to diagnose the
results of every column. If unique diagnosis is not achieved, we repeat the process with the horizontal test
sessions in Figure 8. This adaptive strategy will achieve unique diagnosis for any group of faults encountered in practice.
Compared with the diagnosis algorithm for multiple faulty PLBs presented in [4], MULTICELLO is
simpler and relies on less restrictive assumptions.
5.3 Diagnosis Within A Faulty PLB
During each test session, the errors (failing ORA indications) are actually recorded for every test
phase. This allows us to identify the failing mode(s) of operation of the faulty PLB and its faulty internal
module(s). For example, consider the test phases developed for the ORCA 2C and 2CA series FPGA given
in Table 2. The first 9 phases are used to test the various modes of operations of the PLBs in the ORCA
2C series, while the complete set of 14 test phases are used to test the ORCA 2CA series. Phases 1-4 test
the LUT portion of the PLB, phases 5-9 test the FFs, and phases 10-14 test additional LUT modes of operation present only in ORCA 2CA series. For example, if only phase 10 fails, then we know that only the
logic used exclusively to implement the multiplier is faulty. If only phases 5 and 7 fail, then the most likely
cause is faulty connection(s) between the LUT and the flip-flops. Such accurate diagnosis is extremely
useful in failure-mode analysis and yield improvement. This accuracy also provides the basis for allowing
the reuse of partially defective PLBs in fault-tolerant applications or adaptive computing applications
[2]. For example, if only phases 5 through 9 fail, then the LUT is fault-free (since it passed exhaustive tests
in each of its modes of operation), and this PLB with defective flip-flops can be safely used to implement
any combinational logic function.
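The mapping from failing phases to suspect modules described above can be written down directly. The sketch below only encodes the phase groups stated in the text (phases 1-4, 5-9, and 10-14); the group labels and function names are ours.

```python
PHASE_GROUPS = {
    "LUT (phases 1-4)": range(1, 5),
    "flip-flops/latches (phases 5-9)": range(5, 10),
    "2CA-only LUT modes (phases 10-14)": range(10, 15),
}


def suspect_modules(failing_phases):
    """Names of the phase groups that contain at least one failing phase."""
    return [name for name, phases in PHASE_GROUPS.items()
            if any(p in phases for p in failing_phases)]


def usable_as_combinational(failing_phases):
    """True when every failing phase is a flip-flop phase (5-9), so all LUT phases passed."""
    return bool(failing_phases) and all(5 <= p <= 9 for p in failing_phases)


only_multiplier = suspect_modules({10})
ff_only = usable_as_combinational({5, 6, 7, 8, 9})
```

A PLB for which usable_as_combinational returns True corresponds to the case discussed above, where the defective flip-flops do not prevent the block from implementing combinational logic.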
Additional diagnostic resolution can be obtained by constructing independent ORAs that compare the
fewest possible BUT outputs per PLB flip-flop. As a result, a Pass/Fail indication is obtained for each pair
of outputs being compared to give diagnostic information regarding which portion of the BUT is faulty.
This ORA implementation is illustrated in Figure 16, where each ORA flip-flop stores the result of only
one comparison. In some FPGAs (such as the Xilinx 4000 and Virtex series [43] and ORCA 3C series
[26]), independent ORAs can be implemented in one PLB to compare a single output from each of the two
BUTs for maximum diagnostic resolution. Other FPGAs, such as the ORCA 2C series [26], are limited to
comparing two pairs of BUT outputs due to the LUT architecture. When the test phases for the various
modes of operation are coupled with the ‘fine-grain’ diagnostic resolution of the independent ORAs illustrated in Figure 16, the faulty portion of a PLB module can be determined as well (for example, which one
of the PLB flip-flops or LUTs is faulty). This accuracy in diagnosis was never achieved in any previous
work.

Table 2: Summary of BIST Phases for ORCA 2C and 2CA series Logic Blocks

Phase  Flip-Flop/Latch Modes & Options                             LUT                     No.
No.    FF/Latch   Clock         Clk Enable   Flip-Flop Data In     Mode                    Outs
1      -          -             -            -                     Asynchronous RAM        4
2      -          -             -            -                     Adder/subtracter        5
3      -          -             -            -                     5-variable MUX          4
4      -          -             -            -                     5-variable XOR          4
5      Flip-Flop  falling edge  active low   LUT output            Count up                5
6      Flip-Flop  falling edge  enabled      PLB input             Count up/down           5
7      Latch      active low    active high  LUT output            Count down              5
8      Flip-Flop  rising edge   active low   PLB input             4-variable              4
9      Latch      active high   active low   dynamic select        4-variable              4
10     -          -             -            -                     Multiplier              5
11     -          -             -            -                     Greater/equal to Comp   5
12     -          -             -            -                     Not equal to Comp       5
13     -          -             -            -                     Synchronous RAM         4
14     -          -             -            -                     Dual port RAM           4
Set/Reset options used in the flip-flop/latch phases: async. reset, async. set, sync. set, sync. reset.
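[Figure 16. ORA with greater diagnostic resolution. Each pair of BUT outputs (BuO 1 and BdO 1 through BuO 4 and BdO 4) is compared separately; the ORA has a scan input from the previous ORA, a scan output, and TDI and TCK inputs.]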
6. Experimental Results for ORCA FPGAs
In this section we present the results of the implementation of our BIST-based diagnostic approach
using ORCA 2C15A FPGAs, and discuss our experience with testing and diagnosis of known defective
FPGAs. The ORCA 2C15A has 400 PLBs in a 20×20 array and it requires the full set of 14 test phases
summarized in Table 2 for each one of the two test sessions. Each test phase for the 2C15A requires
220,800 bits of memory to store the configuration for that test phase for a total storage requirement of 0.77
Mbytes for test configurations for both test sessions. The total FPGA test time, including all reconfigurations, is less than one second, using a 10 MHz boundary-scan clock.
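The storage and download figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text. It assumes that one configuration bit is shifted in per boundary-scan clock cycle and that 1 Mbyte is 10^6 bytes; both are assumptions of this sketch, and the estimate covers configuration download only, not test execution or retrieval of the ORA results.

```python
# Figures quoted in the text for the ORCA 2C15A.
BITS_PER_PHASE = 220_800       # configuration bits per test phase
PHASES_PER_SESSION = 14        # test phases summarized in Table 2
SESSIONS = 2                   # two test sessions
CLOCK_HZ = 10_000_000          # 10 MHz boundary-scan clock

# Total configuration storage for both sessions.
total_bits = BITS_PER_PHASE * PHASES_PER_SESSION * SESSIONS
total_mbytes = total_bits / 8 / 1_000_000   # assumption: 1 Mbyte = 10**6 bytes

# Assumption: one configuration bit is shifted in per clock cycle.
download_seconds = total_bits / CLOCK_HZ

print(f"{total_bits} bits, {total_mbytes:.2f} Mbytes, {download_seconds:.2f} s")
```

Under these assumptions the download of all 28 configurations accounts for roughly 0.6 s, which is consistent with the total test time of less than one second reported above.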
We were provided with five 2C15A devices by Lucent Technologies Microelectronics Group in
Allentown, PA. From manufacturing test results, two FPGAs were known to be fault-free and three were
known to be defective. All three defective devices failed our test, and the two fault-free devices passed.
Analyzing the results of the defective FPGAs, our PLB diagnosis procedure MULTICELLO reported
inconsistencies for Chip1 and Chip2, and identified one faulty PLB in Chip3 (in row 3 and column 18).
Table 3: Summary of testing faulty FPGAs
Test applied                      Chip1   Chip2   Chip3
Off-line PLB test (this paper)    FAIL    FAIL    FAIL
Off-line routing test [41]        FAIL    FAIL    PASS
On-line PLB test [2][3]           P&F     FAIL    FAIL
To validate these results, we retested the defective chips with the off-line BIST for programmable
interconnect described in [41], and with the on-line BIST for PLBs (the Roving STARs approach) presented in [2][3]. No application was programmed in the FPGAs for the on-line test. The on-line BIST tests
each PLB twice, once in a vertical STAR and once in a horizontal STAR. Table 3 summarizes the results
of all tests. The failures of the routing BIST in Chip1 and Chip2 show that the inconsistencies reported by
MULTICELLO trying to locate defective PLBs in these devices are caused by the presence of interconnect
faults. For Chip1, the on-line PLB test passed all vertical roving tests, but failed some of the horizontal
roving tests. Since these tests involve the same PLBs but different routing resources, we concluded that
Chip1 has only interconnect faults that also cause its off-line PLB test to fail. Chip2 fails every test and
has both faulty PLBs and faulty interconnect. Chip3 passed the interconnect test, and the on-line diagnosis
method described in [3] identified the same faulty PLB as MULTICELLO.
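The cross-checking argument of this paragraph can be summarized as a small decision rule over the outcomes in Table 3. The Python sketch below is a simplified reading of that argument, not the MULTICELLO procedure or the on-line diagnosis of [3]. The function name `fault_classes` is hypothetical, and treating a failure in both STAR orientations as evidence of a faulty PLB is an assumption made for the sketch.

```python
def fault_classes(routing_test: str, online_plb_test: str) -> set:
    """Infer which resource classes are faulty from two test outcomes.

    routing_test:    "PASS" or "FAIL" (off-line routing BIST)
    online_plb_test: "PASS", "FAIL", or "P&F" (passes in one STAR
                     orientation and fails in the other)
    """
    classes = set()
    # A failing routing test, or a PLB test whose outcome depends on the
    # routing resources used, points to interconnect faults.
    if routing_test == "FAIL" or online_plb_test == "P&F":
        classes.add("interconnect")
    # Assumption of this sketch: a PLB that fails in both orientations
    # is taken to be faulty itself.
    if online_plb_test == "FAIL":
        classes.add("PLB")
    return classes


# Outcomes from Table 3: (off-line PLB test, off-line routing test, on-line PLB test)
TABLE_3 = {
    "Chip1": ("FAIL", "FAIL", "P&F"),
    "Chip2": ("FAIL", "FAIL", "FAIL"),
    "Chip3": ("FAIL", "PASS", "FAIL"),
}

diagnosis = {
    chip: fault_classes(routing, online)
    for chip, (_offline, routing, online) in TABLE_3.items()
}
for chip, classes in diagnosis.items():
    print(chip, sorted(classes))
```

Applied to Table 3, the rule reproduces the conclusions stated in the text: interconnect faults only for Chip1, both kinds of faults for Chip2, and a faulty PLB for Chip3.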
The faulty PLB in Chip3 failed phases 5-9, and by using one additional configuration that separated
the FFs from the LUT, we were able to determine that the fault affects only the FFs and that the LUT is fault-free [3]. This defective PLB has been successfully reused to implement combinational logic functions in
a fault-tolerant application [11].
Since faulty FPGAs are difficult to obtain, we have also performed verification of our BIST approach
using a fault emulator that injects faults in the FPGA by changing configuration bits just prior to download.
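The idea of emulating a fault by altering the configuration before download can be illustrated as follows. This is only a sketch under stated assumptions, not the emulator used in our experiments: the function name `inject_bit_flip` is hypothetical, the configuration is treated as a flat byte string with MSB-first bit numbering, and device-specific details such as framing, checksums, and the mapping of bits to PLB resources are not modeled.

```python
def inject_bit_flip(bitstream: bytes, bit_index: int) -> bytes:
    """Return a copy of the configuration with one bit inverted.

    bit_index counts bits from the start of the bitstream, MSB first
    within each byte (an assumption of this sketch).
    """
    if not 0 <= bit_index < len(bitstream) * 8:
        raise IndexError("bit_index outside the configuration bitstream")
    faulty = bytearray(bitstream)          # leave the golden copy untouched
    byte_pos, bit_pos = divmod(bit_index, 8)
    faulty[byte_pos] ^= 0x80 >> bit_pos    # invert the selected bit
    return bytes(faulty)


# Illustrative two-byte "configuration"; real bitstreams are far larger.
golden = bytes([0x00, 0xFF])
faulty = inject_bit_flip(golden, 9)        # second bit of the second byte
print(faulty.hex())
```

Keeping the fault-free configuration unchanged and producing a modified copy allows the same test phase to be downloaded with and without the injected fault.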
7. Conclusions
In this paper, we have described a BIST approach for programmable logic blocks in SRAM-based
FPGAs. Our approach is applicable at any level of testing and, unlike conventional BIST, it does not introduce any area overhead or delay penalty. Every PLB is pseudoexhaustively tested in all its modes of
operation, so the tests are practically complete for any logic fault model. We have shown that in practice,
our method detects any combination of faulty PLBs. Our architecture facilitates the testing of all PLBs in
only two test sessions, and the number of test configurations is also independent of the size of the FPGA.
Its regular structure allows easy scalability with the size of the FPGA and algorithmic generation of placement and routing data for every test phase based only on the size of the array.
We also presented the first diagnosis algorithm that can accurately locate any single and most multiple
faulty PLBs with maximum diagnostic resolution. The combinations of multiple faulty PLBs that cannot be diagnosed
appear to arise only in very restrictive situations that are unlikely to occur in practice. Our method can also identify defective
subcircuits inside a PLB. This diagnostic information can then be used for yield enhancement in the manufacturing process or for repair strategies in fault-tolerant applications. We have successfully verified our
approach with faults injected using a fault emulator and with actual defective FPGAs.
An open problem to be investigated in the future is diagnosis under a mixed-fault model, that is, locating faults in a chip that has both faulty PLBs and faulty interconnect.
Acknowledgments
The authors acknowledge the essential contributions to this project by Eric Lee (Lucent Technologies), Sajitha Wijesuriya (Lucent Technologies), and Carter Hamilton (Xilinx) during their graduate work
in the VLSI-FPGA Design & Test Laboratory at the University of Kentucky. We also thank the reviewers
for their very useful comments that helped us improve the paper. Finally, the authors acknowledge the support, assistance, and encouragement of C.T. Chen, Al Dunlop, and Carolyn Spivak of Lucent
Technologies.
References
[1] M. Abramovici, M. A. Breuer, and A. D. Friedman, “Digital Systems Testing and Testable Design,”
(revised printing) IEEE Press, 1995.
[2] M. Abramovici, C. Stroud, S. Wijesuriya, C. Hamilton, and V. Verma, “Using Roving STARs for On-Line Testing and Diagnosis of FPGAs in Fault-Tolerant Applications,” Proc. IEEE International Test
Conf., pp. 973-982, 1999.
[3] M. Abramovici, C. Stroud, B. Skaggs, and J. Emmert, “Improving On-Line BIST-Based Diagnosis
for Roving STARs”, Proc. IEEE International On-Line Test Workshop, pp. 31-39, July 2000.
[4] M. Abramovici and C. Stroud, “BIST-Based Detection and Diagnosis of Multiple Faults in FPGAs,”
Proc. IEEE International Test Conf., October 2000.
[5] Altera Corp., http://www.altera.com/html/products/products.html
[6] A. Burress and P. Lala, “On-Line Testable Logic Design for FPGA Implementation”, Proc. International Test Conf., pp. 471-478, 1997.
[7] B. Culbertson et al., “Defect Tolerance on the Teramac Custom Computer,” Proc. IEEE Symp. on
Field-Programmable Custom Computing Machines, pp. 140-147, 1997.
[8] T. A. Dahbura, M. U. Uyar, and C. W. Yau, “An Optimal Test Sequence for the JTAG/IEEE P1149.1
Test Access Port Controller,” Proc. IEEE International Test Conf., pp. 55-62, 1989.
[9] D. Das and N.A. Touba, “A Low Cost Approach for Detecting, Locating, and Avoiding Interconnect
Faults in FPGA-Based Reconfigurable Systems,” Proc. IEEE International Conf. on VLSI Design, pp.
266-269, January 1999.
[10] S. D’Angelo, C. Metra, G. Sechi, “Transient and Permanent Fault Diagnosis for FPGA-Based Systems,” IEEE International Symp. on Defect and Fault Tolerance in VLSI, pp. 330-338, November
1999.
[11] J. Emmert, C. Stroud, B. Skaggs, and M. Abramovici, “Dynamic Fault Tolerance in FPGAs via Partial Reconfiguration,” Proc. 8th Annual IEEE Symp. on Field-Programmable Custom Computing
Machines, April 2000.
[12] A. van de Goor, Testing Semiconductor Memories: Theory and Practice, John Wiley and Sons, 1991.
[13] C. Hamilton, G. Gibson, S. Wijesuriya, and C. Stroud, “Enhanced BIST-Based Diagnosis of FPGAs
via Boundary Scan Access,” Proc. IEEE VLSI Test Symp., pp. 413-418, May 1999.
[14] F. Hanchek and S. Dutt, “Methodologies for Tolerating Logic and Interconnect Faults in FPGAs,”
IEEE Trans. on Computers, pp. 15-33, Jan. 1998.
[15] I.G. Harris and R. Tessier, “Interconnect Testing in Cluster-Based FPGA Architectures,” Proc.
Design Automation Conf., June 2000.
[16] F. Hatori et al., “Introducing Redundancy in Field Programmable Gate Arrays,” Proc. IEEE Custom
Integrated Circuits Conf., pp. 7.1.1-7.1.4, 1993.
[17] W. K. Huang and F. Lombardi, “An Approach to Testing Programmable/Configurable Field Programmable Gate Arrays,” Proc. IEEE VLSI Test Symp., pp. 450-455, 1996.
[18] W. K. Huang, F. J. Meyer, N. Park, and F. Lombardi, “Testing Memory Modules in SRAM-based
Configurable FPGAs,” IEEE International Workshop on Memory Tech., Design and Testing, August
1997.
[19] W. K. Huang, F. J. Meyer, X. Chen, and F. Lombardi, “Testing Configurable LUT-Based FPGAs,”
IEEE Trans. on VLSI Systems, Vol. 6, No. 2, pp. 276-283, June 1998.
[20] W. K. Huang, F. J. Meyer, and F. Lombardi, “An Approach for Detecting Multiple Faulty FPGA
Logic Blocks”, IEEE Trans. on Computers, Vol. 49, No. 1, pp. 48-54, 2000.
[21] T. Inoue, S. Miyazaki, and H. Fujiwara, “Universal Fault Diagnosis for Lookup Table FPGAs,”
IEEE Design & Test of Computers, Vol. 15, No. 1, pp. 39-44, Jan. 1998.
[22] C. Jordan and W. P. Marnane, “Incoming Inspection of FPGAs,” Proc. European Test Conf., pp. 371-377, 1993.
[23] J. L. Kelly and P. A. Ivey, “Defect Tolerant SRAM Based FPGAs,” Proc. International Conf. on
Computer Design, pp. 479-482, 1994.
[24] V. Lakamraju and R. Tessier, “Tolerating Operational Faults in Cluster-based FPGAs,” Proc. ACM/
SIGDA International Symp. on FPGAs, pp. 187-194, Febr. 2000.
[25] F. Lombardi, D. Ashen, X. Chen, and W. K. Huang, “Diagnosing Programmable Interconnect Systems for FPGAs,” Proc. ACM/SIGDA International Symp. on FPGAs, pp. 100-106, Febr. 1996.
[26] Lucent Technologies, Inc., http://www.micro.lucent.com/micro/fpga
[27] E. McCluskey, “Verification Testing - A Pseudoexhaustive Test Technique,” IEEE Trans. on Computers, Vol. C-33, No. 6, pp. 541-546, June, 1984.
[28] C. Metra, G. Mojoli, S. Pastore, D. Salvi, and G. Sechi, “Novel Technique for Testing FPGAs,”
IEEE European Design and Test Conf., pp. 89-94, Febr., 1998.
[29] J. Narasimhan et al., “Yield Enhancement of Programmable ASIC Arrays by Reconfiguration of Circuit Placements,” IEEE Trans. on CAD, Vol. 13, No. 8, pp. 976-986, August 1994.
[30] M. Renovell, J. Figueras, and Y. Zorian, “Test of RAM-Based FPGA: Methodology and Application
to Interconnects,” Proc. IEEE VLSI Test Symp., pp. 230-237, 1997.
[31] M. Renovell, J. Portal, J. Figueras, and Y. Zorian, “Testing the Interconnect of RAM-Based FPGAs,”
IEEE Design & Test of Computers, Vol. 15, No. 1, pp. 45-50, Jan. 1998.
[32] M. Renovell, J.M. Portal, J. Figueras and Y. Zorian, “SRAM-based FPGA: Testing the LUT/RAM
Modules”, Proc. IEEE International Test Conf., pp. 1102-1111, 1998.
[33] M. Renovell, J.M. Portal, J. Figueras and Y. Zorian, “SRAM-based FPGA: Testing the Embedded
RAM Modules”, J. of Electronic Testing: Theory and Application (JETTA), Vol. 14, No. 1/2, pp. 159-167, Jan./Feb. 1999.
[34] N. R. Shnidman, W. H. Mangione-Smith, and M. Potkonjak, “On-line Fault Detection for Bus-Based
Field Programmable Gate Arrays,” IEEE Trans. on VLSI Systems, Vol. 6, No. 4, pp. 656-666, Dec.
1998.
[35] “Standard Test Access Port and Boundary-Scan Architecture,” IEEE Standard P1149.1-1990, May
1990.
[36] A. Steininger and C. Scherrer, “On the Necessity of On-Line BIST in Safety-Critical Applications,”
Proc. 29th Fault-Tolerant Computing Symp., pp. 208-215, 1999.
[37] C. Stroud, P. Chen, S. Konala, and M. Abramovici, “Evaluation of FPGA Resources for Built-In
Self-Test of Programmable Logic Blocks,” Proc. ACM/SIGDA International Symp. on FPGAs, pp.
107-113, 1996.
[38] C. Stroud, S. Konala, P. Chen, and M. Abramovici, “Built-In Self-Test for Programmable Logic
Blocks in FPGAs (Finally, A Free Lunch: BIST Without Overhead!)”, Proc. IEEE VLSI Test Symp.,
pp. 387-392, 1996.
[39] C. Stroud, E. Lee, S. Konala, and M. Abramovici, “Using ILA Testing for BIST in FPGAs”, Proc.
IEEE International Test Conf., pp. 68-75, 1996.
[40] C. Stroud, E. Lee, and M. Abramovici, “BIST-based Diagnostics for FPGA Logic Blocks,” Proc.
IEEE International Test Conf., pp. 539-547, 1997.
[41] C. Stroud, S. Wijesuriya, C. Hamilton, and M. Abramovici, “Built-In Self-Test of FPGA Interconnect,” Proc. International Test Conf., pp. 404-411, 1998.
[42] S.-J. Wang and T.-M. Tsai, “Test and Diagnosis of Faulty Logic Blocks in FPGAs,” Proc. IEEE
International Conf. on Computer Aided Design, pp. 722-727, 1997.
[43] Xilinx, Inc., http://www.xilinx.com/products