EMBEDDED DETERMINISTIC TEST FOR SYSTEMS-ON-A-CHIP
BY
ADAM KINSMAN, B. ENG. (ENGINEERING PHYSICS)
JUNE 2005
A thesis submitted to the Department of Electrical and Computer Engineering
and the Committee on Graduate Studies of McMaster University
in partial fulfillment of the requirements for the degree of
Master of Applied Science
Abstract
Acknowledgments
Many more people have contributed in some way to this work than can be listed here, but a few certainly deserve special mention. I am thankful to the administrative and technical staff of the ECE department at McMaster for much practical assistance. I am also grateful to Dr. Terry Todd and Dr. Alex Jeremic for their time and valuable feedback, which helped to bring this thesis to its final form.
To many friends both personal and professional I owe thanks, especially those
in the Computer Aided Design and Test Research group of McMaster, Qiang Xu,
Ho Fai Ko, David Lemstra, David Leung and Ehab Anis. To them I am grateful
for many hours of stimulating discussion and technical advice which have helped me
during the course of my study. To my supervisor Dr. Nicola Nicolici I am deeply
indebted for much guidance in both research and in life and for many hours spent in
motivating discussion and in helpful revision of this work. This thesis would not have
been possible without him.
I am appreciative of my sister-in-law Amanda and my brothers Joshua, Matthew and Philip for their emotional support and for providing an environment of relaxation in which I can recharge. My father- and mother-in-law Bruce and Jane and my dad and mom Bruce and Jan deserve a great deal of thanks for instilling in me the values which have been so important for doing good research, and for supporting me through difficult decisions. For her inexhaustible patience with me, my wife Pamela deserves more credit for this thesis than any of us, myself included. Her perpetual support has made this work possible.
Above all I am grateful to God for the ability and desire to do research, and for
placing these people in my life who have supported me in my endeavors. All things
are possible through Him who gives me strength.
Glossary
Fault a model which captures the effect of one or many types of defects
Fault coverage the percentage of faults detected by a test set
Fault simulation evaluation of a test pattern to determine what faults it detects
FDR frequency directed run-length encoding
FSM finite state machine, algorithmic hardware control
Functional inputs the inputs of the device used during normal operation
Functional test tests focused on circuit operation, in contrast to structural test
Golomb a type of statistical coding
HDL hardware description language, syntax for RTL descriptions e.g. Verilog/VHDL
Huffman a type of statistical coding
IC integrated circuit, many transistors integrated in a single package, a device
IEEE 1500 a standard for interfacing to IP-protected cores
IP intellectual property, details of a core which are not disclosed
IP-consistency respecting IP-protection by not requiring IP details
ISCAS89 benchmarks benchmark circuits widely used in literature on test
ITRS International Technology Roadmap for Semiconductors
JTAG (IEEE 1149.1) a widely used device programming interface
Layout low level description of a circuit in terms of transistor placements
LFSR linear feedback shift register, a type of PRPG
Low cost tester an ATE without special expensive features
Manufacturing test comparison of manufactured devices against the design
MISR multiple input signature register
Modularity ease of pluggability into a system architecture
Moore's law an empirical observation of exponential transistor density growth
Netlist a low-mid level description of a circuit in terms of logic gate connections
Phase shifter a network placed on a PRPG output to reduce shift correlations
Phase shifter reconfiguration use of muxes to modify a phase shifter
Physical layout see layout
Programmability (with respect to TDC) ability of a decompressor to pass any test set
PRPG pseudo-random pattern generator, produces a pseudo-random sequence
Pseudo-random BIST BIST by random patterns, contrast to deterministic BIST
RTL register transfer level, a mid-high level description using register operations
Run-length encoding compression by encoding long runs of 1s or 0s as a symbol
Scalability the ability to cope with increasing circuit size, VTD, etc.
Scan access to the internal state registers of an IC
Scan chain reconfiguration linking of scan chains end to end
Scan chains shift register structure facilitating scan
Scan frequency the clock frequency of the scan chains
SCU scan chain utilization - useful portion of decompressor output
SOC system-on-a-chip, an integrated circuit comprised of cores
Specification the formal description of the design requirements for a system
Structural test tests focused on circuit logic pathways, contrast to functional test
Stuck-at-0 a fault model assuming a node shorted to ground
Stuck-at-1 a fault model assuming a node shorted to power
STUMPS Self-Test Using a MISR and Parallel Shift register sequence generator
Synthesis transformation of a description to a lower level of abstraction
Synthesis (architectural) behavioral → RTL
Synthesis (logic) logic equations → netlist
Synthesis (RTL) RTL → logic equations
System integrator one who links cores to form a system, contrast to core provider
TCU tester channel utilization - useful portion of decompressor input
TDC test data compression, reduction in the amount of test data applied to a CUT
Test cube a partly specified test pattern (containing x-bits)
Test data compression see TDC
Test pattern a vector of circuit input values applied with the intent of testing
Test set a collection of test patterns
Test set compaction merging of test cubes which do not conflict
Test set dependent decompression methods tuned to a given test set
Test set independent decompression methods equally applicable to many test sets
Testbench stimuli and expected responses used by simulation during verification
VLSI very large scale integrated circuits, containing millions of transistors
VTD volume of test data, the amount of data applied to a CUT
Contents
Abstract
Acknowledgments
Glossary
1 Introduction
  1.1 VLSI Design
    1.1.1 Design Reuse
    1.1.2 Abstraction
    1.1.3 Automation
    1.1.4 VLSI Design Process
  1.2 VLSI Test
    1.2.1 Structural Testing and ATPG
    1.2.2 Scan Based Testing
    1.2.3 Test Application
  1.3 Test Data Compression
    1.3.1 Automated Test Equipment
    1.3.2 Built-in Self-test
    1.3.3 Embedded Deterministic Test
  1.4 Thesis Organization
2 Related Work
  2.1 VTD Reduction by ATPG and ATE
  2.2 BIST
  2.3 Test Data Compression
  2.4 Test Set Dependent Methods
  2.5 Temporal Decompression
  2.6 Spatial Decompression
    2.6.1 Combinational Decompression
    2.6.2 Sequential Decompression
  2.7 Multi-level Compression
  2.8 Summary
    3.6.1 Experimental Setup
    3.6.2 Tester Channel Utilization
    3.6.3 Uncompressed Test Sets
    3.6.4 Single Mode Testing VTD
    3.6.5 Multi-mode Testing VTD
    3.6.6 Area Overhead Estimation
    3.6.7 Algorithm Execution Time
    3.6.8 Comparison to Existing Solutions
  3.7 Summary
4 Conclusion
  4.1 Advantages
  4.2 Limitations
  4.3 Future Work
List of Tables
List of Figures
1.1 Scaling of transistor count (© Intel [29]).
1.2 VLSI design flow.
1.3 Manufacturing defects responsible for device failure [86].
1.4 Stuck-at fault models for different defects.
1.5 Structural test vector generation.
1.6 Merging of two test vectors into one by compaction.
1.7 General sequential circuit.
1.8 General sequential circuit with scan DFT.
1.9 Functional test application.
1.10 Scan test application.
1.11 Scaling of volume of test data.
1.12 Test pattern application with an ATE.
1.13 BIST test pattern application.
1.14 BIST fault coverage (© Kluwer [13]).
1.15 EDT - a combination of BIST and ATE.
2.9 Combinational decompressors.
2.10 Averaging of care bits in a test pattern.
2.11 General sequential decompressor.
2.12 Seed buffering architecture.
2.13 Variable injection architecture.
2.14 Implementation of stalling.
3.1 IP-consistency.
3.2 Modularity.
3.3 Scalability.
3.4 Programmability.
3.5 Low cost test platform.
3.6 Choices for SOC test application (each square = 1 core).
3.7 Non-uniform distributions enabling improved compression.
3.8 Illustration of frames and blocks.
3.9 Decompression scheme for seed-only packing.
3.10 Transformation from buffering seeds at the core level to the system level.
3.11 Modeling of LFSR and phase shifter in GF(2).
3.12 Generating equations for a test cube.
3.13 Calculation of seeds for StreamBIST.
3.14 Compression flow for StreamPack.
3.15 Packing of seed streams and generation of control.
3.16 NOOP insertion to address reseed conflicts.
3.17 Test time comparison for StreamPack.
3.18 Core level decompressor for seed and mux packing.
3.19 Multi-mode core level decompressor.
3.20 Tester channel utilization improvement.
3.21 Core level area (8, 16 bit LFSRs).
3.22 Core level area (32 bit LFSR, multi-mode).
3.23 System level area.
List of Algorithms
1 StreamBIST
2 CompressPattern
3 StreamPack
4 MMStreamBIST
Chapter 1
Introduction
Integrated circuits (ICs) are thin chips made out of semiconductor material (e.g.,
silicon or germanium), which consist of an interconnection of semiconductor devices like transistors and passive components such as resistors or capacitors. The devices on an
IC can be used to build logic gates that process electronic signals using the formalism
of the Boolean algebra. The ICs based predominantly on logic gates are called digital
with examples being microprocessors, microcontrollers, digital signal processors and
application specific integrated circuits (ASICs). From simple electronic gadgets such
as digital clocks to more sophisticated appliances like cellular phones and personal
computers, our lives are influenced by digital ICs. They even affect our lives in ways
which are not directly evident. For instance, banks rely heavily on complex computer
systems for financial management and security. Although the circuits themselves in
all these applications may be very diverse, what they all have in common is that
they require design and manufacture. In almost all cases the complexity of the device
is well beyond human management at the transistor level. This has spawned the
electronic design automation (EDA) industry which provides tools for dealing with
designs of ever increasing levels of complexity.
The remainder of this chapter discusses the design flow for very large scale integrated (VLSI) circuits (Section 1.1), provides the justification for manufacturing test (Section 1.2) and, in turn, for test data compression (Section 1.3), the focus of this work. The thesis organization is outlined in Section 1.4.
Figure 1.1: Scaling of transistor count (© Intel [29]).
for particular applications and if the IP is not protected others may reap benefits
instead of the company that put the work into developing the core. The issue of IP
protection raises challenges, especially in test, which will be addressed in Section 1.2.
Simply note for now that SOCs may consist of IP-protected and non-IP-protected cores, that is, some cores for which the system integrator does not have access to the internal circuit details, and some for which access is granted.
1.1.2 Abstraction
In addition to relieving some of the design effort there is a sense in which reuse
enables levels of abstraction by providing convenient borders on which to divide the
design. Coming back to the adder, there is no need to consider the inner workings
of an adder once they are in place and the design may be handled at a level of
abstraction where the adder is considered a black box. This level is often referred
to as register transfer level (RTL). Although reuse facilitates abstraction it is not
necessary for abstraction. Levels of abstraction simply provide a way of linking the
very high level design specification to the very low level transistor implementation
in silicon. Instead of jumping directly from specification to implementation, tasks
which the system must perform can be divided and each task handled individually,
then each task may be divided into separate blocks and so on with more detail added
at each level. For designs of current day complexity the specification essentially gives
the behavioral description of the circuit. The job of the designer is to translate the
behavioral description into an RTL description using standard hardware description
languages (HDLs) such as Verilog HDL or VHDL [98, 99]. For small designs the designer would then be able to progress deeper to the gate level and the transistor level, but the complexity of today's chips is beyond the scope of what people can reasonably manage. This has resulted in the development of design automation tools.
1.1.3 Automation
In order to cope with such large complexity, the designer will generally not proceed past the RTL when designing a circuit. Once a complete RTL description in some HDL has been amassed, it is given to a logic synthesis tool. This process transforms the RTL description into a gate level description of the SOC called a gate level netlist. Note that, because of the intractability of the problem, the gate level netlist will not be optimal for the design at hand; however, the results obtained in the time required far outweigh the extra area and performance overhead which may arise from handing control of this level to the tool. Once a gate level netlist has been obtained, it is fed into a physical design/layout tool to determine the locations of transistors and how the wires (vias, metal layers) must be routed to obtain the circuit described by the gate level netlist. This process, too, is sub-optimal; however, were time to be spent searching for an optimal solution, likely no devices would ever be constructed.
Figure 1.2: VLSI design flow. An application- and market-driven specification is manually refined by the designer into a behavioural description; architectural synthesis (control path, data path) produces an HDL description, RTL synthesis produces logic functions, and logic synthesis produces a gate level netlist, from which the physical layout, mask set and wafer fab yield the manufactured devices. Design verification accompanies synthesis, manufacturing test accompanies manufacturing, and soft, firm and hard cores from a library of existing designs enter at the corresponding levels.
It is to be expected that some design errors will occur in light of the fact that
subtle mistakes can be made when handling such a degree of complexity, and that
the stages from RTL synthesis onward are handled by software tools which have no
explicit understanding about the function of the device. At each level of abstraction
a tool merely adds detail according to rules applied to input from the level above.
To cope with such design errors, design verification methods have been developed.
Formal verification methods exist, as do simulation based methods which are used
much more frequently. Generally speaking testbenches are developed which, through
software simulation, apply stimulus to a simulation model of the circuit to check
whether operation of the model will align with the specifications. A simulation model
exists at each level of abstraction with lower level models containing more detail to
reflect the greater detail in the circuit description itself. Because of this greater detail,
simulation is more computationally expensive at lower levels of abstraction, and so verification is performed at all levels during synthesis to catch and correct as many errors as possible as early as possible, thereby avoiding spending more computational effort on simulation later on. The other reason for verification at each level is that each time a circuit change is made, synthesis must be re-done; simulation on the gate level model, for instance, does not require place and route to be re-run, which saves time on each design correction. At the end, design verification should ensure, to the limits of (1) the accuracy of the simulation models and (2) the extensiveness of the testbenches, the correlation between the specification and the fabrication instructions (layout).
Other than its role in the design process, design verification is outside the scope of this thesis. It is mentioned mainly to provide contrast to the process of manufacturing test, since the distinction between test and verification is often not made, especially in software development. We have seen in the previous paragraph that the role of verification is to ensure that what we designed is what we intended to design. The role of manufacturing test is to ensure that what we have manufactured is what we have designed. The right extreme of Figure 1.2 shows the levels between which comparison occurs for both verification (upper section) and test (lower section). It is immediately clear that while verification is done through software simulation, as we have
discussed, manufacturing test makes comparisons at the device level and is therefore
a physical process. At the simplest level this process consists of deciding a set of test
vectors which are applied to each circuit, and the response returned by each circuit is
compared against the expected response which is obtained by simulation. The details
of this process, including the generation and application of vectors, are provided in
the next section.
possible circuit inputs and check that the circuit output is correct for each given input combination. This method is called exhaustive functional testing and although it could give extremely high confidence about the operability of the device, it becomes entirely impractical for anything more than the most trivial of devices. For example, consider a 64 bit ripple carry adder. This circuit would have 129 inputs (64+64+1), and thus there are 2^129 possible input combinations. Assuming automated test equipment (ATE) is used to apply these patterns at 1 GHz (extremely fast by today's standards), the test would take 2.16 × 10^22 years to complete [13].
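The arithmetic is easy to reproduce; a back-of-the-envelope check in Python (illustrative only) confirms the figure:

# Exhaustive test of the 64 bit adder: 2^129 input combinations
# applied at one pattern per nanosecond (1 GHz).
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

patterns = 2 ** 129                 # 64 + 64 + 1 inputs
seconds = patterns / 1e9            # 1 GHz application rate
print(f"{seconds / SECONDS_PER_YEAR:.2e} years")   # ~2.16e+22 years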
Figure 1.4: Stuck-at fault models for different defects: (a) defect free AND gate; (b) stuck-at-0 fault; (c) stuck-at-1 fault.
Figure 1.4 shows the fault modeling of defects. In Figure 1.4(a), the transistor
level detail of a logical AND gate has been shown. The most straightforward example
of a stuck-at fault is the case where the node in question has actually become shorted
to a power or ground rail. Figure 1.4(b) shows the case where the output has become
shorted to ground, resulting in a stuck-at-0 fault on the output. It is not only explicit shorts to power or ground that manifest as stuck-at faults, however. Take for example Figure 1.4(c), in which the output has become shorted to one of the inputs. In this case, when testing the gate (applying patterns to the inputs and observing the output) it will behave as though the other input (the one to which the output is not shorted) is stuck at 1, since the shorted input will always be passed directly to the output. Hence, a very important observation is that most short or open defects that do not cause any particular node to be connected to the power/ground rails will still be detected by stuck-at test patterns. In this way a small number of fault models are able to cover many classes of defects. It should be noted that after manufacturing test the failing devices are analyzed and the failing pattern information is used for diagnosis, which is a separate problem not covered in this thesis.
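As a toy illustration of the model (a hypothetical sketch, not a tool described in this thesis): a pattern detects a stuck-at fault exactly when the faulty and fault-free outputs differ.

# Fault simulation for a 2-input AND gate with a stuck-at fault on
# its output node.
def and_gate(a: int, b: int) -> int:
    return a & b

def faulty_and(a: int, b: int, stuck_at: int) -> int:
    return stuck_at                 # output node tied to a rail

for sa in (0, 1):
    detecting = [(a, b) for a in (0, 1) for b in (0, 1)
                 if and_gate(a, b) != faulty_and(a, b, sa)]
    print(f"stuck-at-{sa} on the output detected by: {detecting}")
# stuck-at-0 is detected only by (1, 1); stuck-at-1 by the other three.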
With fault models in place, the insurmountable task of targeting every kind of defect reduces to the task of targeting each fault at each node, a still difficult but more manageable problem. To this end, a tool for automatic test pattern generation (ATPG) is employed, which systematically works through the nodes in the circuit description and generates tests according to the selected fault models, these being the aforementioned structural tests. Figure 1.5 shows the process of generating a test for a particular stuck-at fault in a circuit. At the basic level, a test for a given stuck-at
10
1.2. VLSI Test M.A.Sc. - A.B. Kinsman - McMaster
Figure 1.5: Structural test vector generation (backward justification of input values and forward propagation of the stuck-at-0 fault effect to an output).
Figure 1.6: Merging of two test vectors into one by compaction (x11xx0 and 01x0x0 merge into 0110x0).
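The merge in Figure 1.6 amounts to a position-wise combination that fails on conflicting specified bits; a minimal sketch (not an algorithm from this thesis):

# Test set compaction: two test cubes merge when no position carries
# conflicting specified bits; 'x' marks a don't-care bit.
def merge_cubes(a: str, b: str):
    """Return the merged cube, or None if the cubes conflict."""
    out = []
    for p, q in zip(a, b):
        if p == 'x':
            out.append(q)
        elif q == 'x' or p == q:
            out.append(p)
        else:
            return None             # conflicting specified bits
    return ''.join(out)

print(merge_cubes("x11xx0", "01x0x0"))   # -> 0110x0, as in Figure 1.6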
Figure 1.7: General sequential circuit (primary inputs and functional outputs, combinational logic, clocked sequential memory elements).

Figure 1.8: General sequential circuit with scan DFT.
then at some point a test should fail and thus the chip will be rejected anyway. As
a precaution, before any other test patterns are applied, the sequence ...0110... is shifted through the scan chains. Since this sequence contains all transitions (0→0, 0→1, 1→0, 1→1), an uncorrupted sequence emerging from the scan chain will also give a high degree of confidence about the reliability of the scan hardware.
By increasing the number of Scan-in and Scan-out pins, the FFs may be linked
into multiple scan chains to reduce scan chain length and thereby the time required
to shift in a new current state. For example, if a circuit contains 1 million FFs, this
may be organized as a single scan chain of length 1 million (using two scan pins)
or 2 scan chains with length 500,000 (using 4 scan pins) requiring half the time to
scan in a new state. The number of chains can therefore be extended as much as
required, provided there are sufficient package pins available. The introduction of the muxes obviously incurs area and performance penalties. The area penalty is obvious, and the performance penalty arises because the current state, instead of passing only through the combinational logic to arrive at the FF inputs, must now pass through the scan muxes as well, adding extra propagation delay. This increases the time which must be left between consecutive clock cycles and therefore decreases the maximum operating frequency of the circuit. These penalties are relatively small and well worth accepting given the increased observability and controllability gained. By introducing scan, the combinational block may be accessed directly, and thus combinational ATPG as discussed in the previous section may be used to generate test vectors, removing the need for intractable sequential ATPG.
In addition to the small area and performance penalty brought about by the
introduction of scan, the problem of volume of test data (VTD) arises when scan
based testing is used. Without scan the size of each pattern corresponds to the
number of primary inputs to the circuit, roughly a few hundred at most for even
current day large designs. Once scan is integrated however, each pattern must now
drive not only the primary inputs but also all the flip-flops for the entire design, a
number exceeding 1 million for current day large designs. Delivery of this enormous
VTD to the circuit under test (CUT) has posed significant challenges.
Figure 1.9: Functional test application (stimulus on the primary inputs, response on the primary outputs; the memory elements hold the current state).
Figure 1.10: Scan test application. (a) With Scan en asserted, the new state is shifted in through Scan-in while the previous response is shifted out through Scan-out; (b) with Scan en de-asserted, the scanned-in stimulus is applied to the combinational circuit and the response is captured.
state in the memory influences the values appearing on the outputs. As mentioned
above, this approach is very efficient in terms of VTD since each pattern consists
of only the values for primary inputs, plus primary outputs for the response. It is
also much faster since each test requires only a single clock cycle rather than many
as is the case for scan. Where this method fails is in its inability to ensure adequate fault coverage with reasonable computational effort, and thus we also require scan, as discussed above.
Figure 1.10 shows the process of applying a scan test to a circuit. The main
difference between this and functional testing is that a given state can be reached
by scanning it in as opposed to reaching it by manipulation through the primary
inputs. In fact, scan enables application of patterns which may not be possible in
functional testing since normally unreachable states may be scanned in. This is one of
the mechanisms by which scan testing enables higher fault coverage. To apply a scan
pattern two steps must occur. The first step is setup of the new state through scan,
shown in Figure 1.10(a). Here the scan input arrow has been shaded to emphasize
the path along which the new state data is provided. The scan output path has also
been highlighted to show that simultaneous to the setup of a new state, the result
from application of the previous pattern to the circuit (which is stored in the FFs) is
scanned out and compared with the expected response. Note also that the Scan en
pin has been asserted which links the scan cells into chains as covered in Section 1.2.2.
The second step of scan testing is to apply the newly updated state to the circuit
and capture the response as in Figure 1.10(b). During this step the Scan en pin is
de-asserted to put the circuit into normal functional mode so that each FF draws its
data from its respective output of the combinational logic. A single clock cycle is
applied to the circuit so that the combinational output is captured in the FFs, thus
it may be scanned out as the next pattern is scanned in. At the same time that the
values from the sequential elements are applied and the results captured, the primary
inputs are stimulated and the primary outputs observed, as indicated by the shaded
arrows in the figure. Note that scan FFs may be added to drive/capture the primary inputs/outputs in test mode, avoiding the need to apply and observe them directly. Because of this, henceforth in this thesis we refer only to data in the scan chains, without loss of generality.
In contrast to test per clock, this method is often called test per scan because
each test pattern requires multiple clock cycles amounting to a single scan to apply.
The increased number of clock cycles required to scan in a new state results in much
longer test times than the test per clock scheme. In addition to longer test times, the
data for each pattern now consists of values for each FF (likewise for the expected
responses) which drastically increases the VTD. Addressing this issue of increased
VTD is the topic of the next section.
Figure 1.11: Scaling of volume of test data with gate count (millions of gates) - assumes 20 gates/FF, 1 pattern/10 FFs, 256 scan channels and 20 MHz scan clock.
Figure 1.11 shows how VTD is affected by circuit size. The values in the graph are calculated from the gate count according to the assumptions given in the caption. It is assumed that there are 20 gates per flip-flop, a widely accepted empirical result used by many (e.g., [77]). We have assumed here 1 pattern for every 10 FFs, another empirical observation which we believe to be optimistic; in reality we believe there to be more patterns per FF, giving an even larger VTD than we obtain here. The assumption of 256 scan channels is justifiable since 256 channels must be used for each of the input and output scan channels, giving 512 in total, a significant number of pins for the range of device sizes given. Finally, a 20 MHz scan clock is assumed mainly for the sake of containing power during test. For volume of test data only the stimulus has been considered, so the total ATE memory required will be twice the value in the graph (stimulus + response). It has been calculated as the product patterns × FFs, or Gates^2/200 (based on the assumptions). The test time is patterns × time-per-pattern, where time-per-pattern = FFs/scan-channels, giving Gates^2/51200, or VTD/256. As mentioned above, the situation is only aggravated by Moore's law, which is fueling the increasing gate count. With gates scaling as the square of time, and VTD scaling as the square of gates, VTD has become a serious concern. This problem is further aggravated by the fact that the gate count (and hence scan flip-flops and test patterns) will continue to increase at a faster rate than the tester channel count (determined by the packaging constraints on the chip pin count) and the scan clock frequency (limited by the power constraints).
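The curve in Figure 1.11 can be re-derived from the caption's assumptions; the sketch below follows them literally (treat it as illustrative, since the exact constants depend on how the pattern count is combined with the flip-flop count):

# VTD and test time estimate under the Figure 1.11 assumptions:
# 20 gates/FF, 1 pattern per 10 FFs, 256 scan channels, 20 MHz scan clock.
def vtd_bits(gates: float) -> float:
    ffs = gates / 20                    # 20 gates per flip-flop
    patterns = ffs / 10                 # 1 pattern per 10 FFs
    return patterns * ffs               # stimulus bits only

def test_time_s(gates: float) -> float:
    ffs = gates / 20
    patterns = ffs / 10
    cycles = patterns * ffs / 256       # 256 scan-in channels
    return cycles / 20e6                # 20 MHz scan clock

for gates in (10e6, 30e6, 50e6):
    print(f"{gates/1e6:4.0f}M gates: {vtd_bits(gates):.3e} stimulus bits, "
          f"{test_time_s(gates):8.1f} s")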
Having established that VTD is large and is only getting larger, we must evaluate
its impact on current test application methods and ensure the problem is addressed
in new methods which are proposed. We have seen in Section 1.2.3 that functional testing leads to comparatively small volumes of test data; however, as discussed, it cannot provide the fault coverage that is necessary for the required product quality.
Constrained to use scan based testing, let us consider different pattern application
methods and the impact of VTD scaling on them.
Figure 1.13: BIST test pattern application (on-chip BIST logic driving the scan chains of the circuit under test (CUT)).
associated challenges arising in test using ATEs, a method known as built-in self-test
(BIST) has been explored, as discussed in the next section.
Figure 1.14: BIST fault coverage (© Kluwer [13]).
use some form of pseudo-random pattern generator (PRPG) which continually cycles, providing pseudo-random sequences to the circuit. The benefit of PRPGs is a very compact implementation; however, it is accompanied by a reduction in fault coverage.
Figure 1.14 shows the fault coverage obtained by BIST in the ideal case and in
practice. In the ideal case random pattern testing gives very good fault coverage, as
indicated in the upper curve. However, because of the existence of random pattern
resistant faults, fault coverage as shown in the lower curve is what can be expected for
realistic cases. This level of fault coverage is clearly unacceptable and, given the trend in the curve, a much larger number of patterns would have to be applied to obtain a reasonable fault coverage level. The use of control points in the circuit may alleviate this problem; nevertheless, it will introduce both hardware and performance overhead,
in addition to complicating the design process. Thus the fundamental tradeoff for BIST is established: either unacceptably large area overhead or unacceptably long test times are necessary to reach the required fault coverage. As expressed, this
tradeoff results from the aforementioned random pattern resistant faults, aside from
which random testing is able to substantially reduce the volume of test data. For
this reason, hybrid testing has been proposed where pseudo-random patterns are first
applied to cover easily detectable faults, after which deterministic pattern application
is used to catch the random pattern resistant faults. This marriage of random and
deterministic testing is mainly what has sustained BIST research and due to the
Test data compression is made possible by the presence of the x-bits in the test
set produced by ATPG, as seen in Section 1.2.1. Since these bits may take on any
value, they are irrelevant information. The compression is accomplished by building
a decompressor in such a way that the tester is able to communicate the location
and value of the specified bits to it, and having it fill the x-bit values in an alternate
way. The irrelevant information is thereby stripped from the test set, greatly reducing the volume of test data which must be applied to the chip. What makes EDT challenging is the development of new CAD (or EDA) algorithms for compressing test sets and generating hardware architectures that effectively trade off compression factor, embedded hardware complexity, ATE features and capacity, performance impact, test time, design time and methodology. It should be noted that a full EDT
solution also provides compression for the output responses since often the number of
internal scan chains is more than the available pins, and also to reduce the volume of
stored output responses to a level on par with the stimuli. This task of compressing
output responses comes with its own unique challenges [67, 81, 106] and as a result
is considered a separate problem, not in the scope of this work.
Chapter 2
Related Work
Chapter 1 of this thesis introduced the VLSI design flow, explained the relevant steps
in integrated circuit testing, and motivated the need for test data compression. Given
the importance of the test data compression problem, this chapter attempts to answer
the question: why do we need yet another approach for embedded deterministic test?
This is done by reviewing the recent related approaches in the field and summarizing
their main strengths and weaknesses.
In Section 2.1 we start by reviewing the ATPG and ATE solutions for test data
compression. Some recent BIST approaches relevant to EDT technology are outlined
in Section 2.2. Then Section 2.3 gives the basic equation of test data compression
which is used for introducing the different research directions in EDT that have been
undertaken recently. Section 2.4 reviews several test set dependent approaches that
use the specific test cube data to hardwire the on-chip decompression engine and
stresses their main limitation. The opposite research direction is labeled as test set
independent and can focus on either temporal or spatial decompression. On the one hand, the key feature of temporal decompression, detailed in Section 2.5, is that it exploits the ratio between the scan frequency and the ATE frequency. On the other hand, spatial decompression exploits the ratio between internal scan chains and tester channels and, due to its practical relevance, has been widely adopted, as outlined in
Section 2.6. Section 2.7 overviews multi-level test data compression and Section 2.8
summarizes this chapter.
Figure 2.1: Scan chains driven directly by the ATE (one tester channel per scan chain, from ATE to ATE).
of data stored on the tester. As an added benefit they accomplish VTD reduction
in a non-intrusive way since no decompression hardware must be placed on the chip.
A drawback, however, is that although they reduce the VTD stored on the tester, the bandwidth between the ATE and the CUT is not reduced at all, as is clearly shown in Figure 2.1. This means that a tester channel is required for each scan chain, and since the number of scan chains is limited by the package pin constraints (Section 1.3), long pattern application times result which cannot be compensated by the reduction in the number of patterns, mainly because conflicting test patterns limit pattern reduction.
In addition, ATE based methods can be expensive since the special features on which
they rely cost extra. Besides which, if the tester at hand does in fact have these
features they can be used to augment an EDT approach. Furthermore, application
of more patterns improves the probability that a defect (not modeled by the targeted
fault model) will be covered, and thus improves the quality of the test. It is therefore
better to concentrate on reducing the VTD per pattern as accomplished by EDT,
rather than targeting only a reduction in the number of patterns.
2.2 BIST
The concept of BIST, first introduced in Section 1.3.2, is the holy grail of EDT. By
placing test hardware on the chip it seeks to remove entirely the need for external
input, which reduces test cost to essentially nothing in comparison to external test
application using expensive ATEs. Although BIST cannot entirely remove the need for external data application, it does provide the basis for EDT, and as a result some of the previous work done in BIST is summarized here.
A basic single scan chain architecture for BIST is shown in Figure 2.2(a) where
pseudo-random patterns are fed into the scan chains and applied to the circuit. The
response is shifted out and captured in a linear feedback shift register (LFSR) that
has been configured as a signature register. Figure 2.2(b) shows BIST applied to
multiple scan chains, which substantially improves the test time by cutting down
the time required to shift each pattern into the scan chains. This configuration is
known as the STUMPS configuration for Self-Test Using a MISR and Parallel Shift
register sequence generator [7, 13]. To drive multiple chains, all the FFs in the LFSR
Figure 2.2: BIST pattern application. (a) Single scan chain BIST; (b) multiple scan chain BIST (an LFSR driving a phase shifter and multiple scan chains, whose outputs are collected in a MISR).
are used instead of just one as in Figure 2.2(a). In order to break up the strong
shift correlations between adjacent FFs in the LFSR, a phase shifter is introduced as
illustrated. By taking linear combinations of the LFSR outputs, a separation (called
the channel separation) in the pseudo-random sequences can be obtained, hence the
name phase shifter. More details regarding the requirement and design of phase
shifters can be found in [79]. At the output, the response across the multiple scan
chains is collected into the multiple input signature register (MISR). As mentioned
earlier, most BIST methods rely on some sort of pseudo-random pattern generator,
and LFSRs are very commonly chosen as the PRPG. Figure 2.3 details an example of
an LFSR and a MISR. In the LFSR, the state is modified by the feedback polynomial
only, whereas in the MISR the next state depends on both the feedback and the
inputs, which allows it to be used for response collection.
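A minimal behavioural sketch of this difference (with an arbitrary example tap set, not a polynomial from any cited design):

# A Fibonacci-style LFSR step and a MISR step. The LFSR's next state
# depends only on the feedback taps; the MISR additionally XORs in the
# scan chain outputs, compacting them into a signature.
def lfsr_step(state, taps):
    fb = 0
    for t in taps:                      # GF(2) sum of the tapped bits
        fb ^= state[t]
    return [fb] + state[:-1]

def misr_step(state, taps, inputs):
    return [s ^ i for s, i in zip(lfsr_step(state, taps), inputs)]

state = [1, 0, 0, 0]                    # any non-zero seed
for _ in range(5):
    state = lfsr_step(state, taps=[0, 3])
    print(state)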
Although many existing BIST methods use LFSRs, other types of PRPGs operate
on the same principle and are equally applicable to BIST. For example, [70] uses linear
hybrid cellular automata to drive the scan chains through a phase shifter. In [14], a
twisted ring counter is used as the PRPG, but the work attempts to encode seeds for
the ring counter to enable increased determinism. Self reseeding hardware is used in
[2] for the same reason as in [14] but is applied to LFSRs. An LFSR is also used by
[4] where a seed selection algorithm is proposed, and also in [42] which explores the
relationship between seeds of a given LFSR. A method similar to reseeding the PRPG
is the insertion of state skipping logic which enables dropping of entire sections of the
state sequence if desired, as in [96].
Figure 2.3: An example LFSR (producing a pseudo-random sequence) and MISR (compacting scan chain outputs into a signature).
can be disrupted) requiring feedback in the design flow, which complicates matters.
Moreover, circuit modifications do not fully respect the reuse aspect of design since
for IP-protected cores no access to internal details of the core is provided, thereby
invalidating methods which rely on circuit modifications. The final type of method
that will be discussed here is the kind that places certain constraints on the circuit to
be tested. As an example take [90] which requires that the SOC have an embedded
processor and an FPGA core on which the BIST machinery may be formed. Such
methods can be effective in their domain of applicability but often that domain is not
nearly general enough.
In addition to the BIST limitations outlined in the previous chapter, another consideration is that BIST hardware can be ineffective during diagnosis of a circuit, given its randomness and resistance to determinism. For diagnosis, a high degree of control is desired to enable tracking down a defect. Where
BIST cannot effectively provide such a level of control EDT can work well because
of its built-in determinism. In light of these facts and the discussion in Section 1.3.2,
we can conclude that EDT is the most viable choice as a solution to the problem
of VTD. The next section introduces the governing relationship of EDT, which has
motivated its main research directions.
Figure 2.4: Test data compression architecture (tester channels → decompressor → scan chains → compactor → tester channels).
CR = BW_out / BW_in .    (2.1)

Equation 2.1 is the very basic relationship between the compression ratio (CR) and the bandwidth (BW) into and out of a decompressor. Since we may write the bandwidth as the product of the bus-width carrying the data and the frequency at which that data is sent, we can obtain:

CR = (BITS_out × f_out) / (BITS_in × f_in) .    (2.2)

Now consider that the test set must contain some amount of useful data, which cannot be modified by the decompressor and thus remains the same on both the input and output sides. This gives:

(BITS_out × f_out) / (BITS_in × f_in) = (f_out / f_in) × (BITS_out / BITS_useful) / (BITS_in / BITS_useful) .    (2.3)

Finally, if we write f_r for the frequency ratio between the CUT and the ATE (f_r = f_out / f_in), and if we use TCU and SCU to denote tester channel utilization and scan chain utilization respectively, that is, the percentage of data in the tester channels or in the scan chains respectively which is useful, we obtain:

CR = f_r × TCU / SCU .    (2.4)
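As a toy numerical illustration of Equation 2.4 (the values below are made up for the example):

# CR = fr * TCU / SCU. SCU is fixed by the care bit density of the
# ATPG test set; fr and TCU are the free design parameters.
def compression_ratio(fr: float, tcu: float, scu: float) -> float:
    return fr * tcu / scu

# A test set with 2% care bit density (SCU = 0.02), fully useful tester
# data (TCU = 1.0) and equal scan/ATE clock rates (fr = 1.0):
print(compression_ratio(fr=1.0, tcu=1.0, scu=0.02))   # -> 50.0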
We have established that automatic test pattern generation (ATPG) provides test
vectors which are not completely specified. This unspecified portion of the information
is obviously useless, and so it can be seen that the measure of SCU relates to the
concept of care bit density, the percentage of bits in the test set which are useful or
take on a specified value (as opposed to being left free as x-bits). In this way the
SCU value is fixed by the ATPG process and the only free parameters in determining
compression ratio are f_r and TCU. These two parameters embody the two main directions of research in test data compression. We call those methods which focus on f_r temporal compression methods (Section 2.5), and those which focus on TCU spatial compression methods (Section 2.6). In addition to these divisions, a distinction
can be made regarding what are called test set dependent and test set independent
compression methods. Test set dependent methods exploit the structure of the test
set itself (e.g., specific locations and values of care bits) to produce compression,
while test set independent methods exploit only very general features (e.g., average
care bit density). While the sections just mentioned above will focus on the test
set independent methods of their respective approaches, Section 2.4 gives some work
done on test set dependent compression methods and a justification as to why test
set independent approaches are favorable.
test patterns at hand as do [10, 82], and [94] uses the test set to determine switch
connections for a reconfigurable switch based decompressor. A finite state machine
(FSM) is used in [63] to drive a combinational expander which is derived from the test
set. A linear decompression network is derived from the linear dependencies of the
test set by [83], and [5] does the same but includes inversion of the scan outputs when
determining the network. In [88] consideration is given to power by reordering the
test set and partitioning scan chains based on examination of the test set and [11, 68]
construct a scan tree where the output of a scan cell branches to multiple chains and
the structure of the tree (i.e. where to branch) is determined by dependencies in the
test set. In [74] scan chains are separated into groups by evaluating the test set and
determining which groups may be tested using random, periodic or external tests
respectively, somewhat like hybrid testing, but across different sections of the circuit
instead of across different parts of the test set.
Although test set dependent methods can often be used to obtain better compression results than test set independent methods, they suffer some serious drawbacks. Aside from the IP issues discussed previously with respect to BIST for those methods which change the circuit (e.g., modify scan cell order), there are methods which do not interfere with the circuit itself; however, they still do not allow alternate pattern sets to be used. It is very desirable to reuse the existing manufacturing test infrastructure on-chip for embedded debug and diagnosis when tracing a problem in the circuit. Related to this is the application of failure analysis patterns when more detail is required than the simple pass/fail provided by manufacturing test. Further, new tests may need to be added even for manufacturing test to detect an unanticipated fault (perhaps due to an unmodeled defect) which is resulting in test escapes (devices which pass manufacturing test but which are in reality defective and are caught later in-field). In all these cases, the additional patterns would almost certainly be outside the expansion space of the decompressor, since it would have been crafted to contain just the original test set (to increase compression). To avoid all these problems, the practical approach is to design a decompressor which obtains relatively extensive coverage of the entire potential space of test patterns and in addition provides a mechanism whereby any pattern not in the output space of the decompressor may
Figure 2.5: Temporal decompression (a pattern generator between data in and data out, with a sync data signal back to the tester).
been applied to the test data compression problem in [52, 107]. In [107] the boundary
scan cells are modified to enable low overhead, while [52] reuses on-chip memory for
decompression.
One of the main strengths of statistical coding methods for test data compression
is the ability to compress highly specified test sets. This makes them particularly
suitable for application to IP-protected circuits. Because IP-protection implies that
there is no disclosure of internal circuit information, ATPG and fault simulation may
not be performed by the system integrator and thus test generation is the responsi-
bility of the core provider while the system integrator must translate the core level
tests of each core into an entire system level test. To reduce the amount of data that
must be communicated between the core provider and system integrator the core
tests may arrive highly compacted. As will be shown later, other approaches suffer
loss of compression when the care bit density is high which makes statistical coding
favorable in such cases.
On the other hand, the synchronization referred to towards the beginning of this section is a substantial drawback of this approach. Essentially, most ATEs are unequipped with any type of reactive functionality, that is, they operate in stream mode, and so the aforementioned sync data pin shown in Figure 2.5 cannot control the tester. Since codewords vary in length, as do the sequences which they represent, a short codeword may arrive representing a long sequence, and so the decompressor may not finish expanding a given codeword before the next one arrives. Addressing this synchronization problem leads to additional overhead either in terms of area or compression ratio and, more importantly, it complicates the implementation process due to the number of distinct clock domains that need to be considered in the decompressor.
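A toy run-length code (deliberately simpler than the Golomb or FDR codes cited, and not the scheme of any particular reference) makes the rate mismatch visible: each fixed-size codeword can stand for anywhere between one and several scanned bits, so the decompressor's output rate varies while the ATE streams codewords at a constant rate.

# Encode runs of '0's terminated by a '1' as k-bit run lengths; the
# all-ones codeword is an escape meaning "max run of zeros, no 1 yet".
def rle_encode(bits: str, k: int = 3):
    codewords, run, maxrun = [], 0, 2 ** k - 1
    for b in bits:
        if b == '1':
            codewords.append(format(run, f'0{k}b'))
            run = 0
        else:
            run += 1
            if run == maxrun:       # escape: emit max run, keep going
                codewords.append(format(maxrun, f'0{k}b'))
                run = 0
    return codewords

print(rle_encode("0000001010000001"))   # 16 bits -> ['110', '001', '110']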
Another difficulty encountered by statistical coding lies in the fact that the scan
chains are clocked at a higher frequency. Because the scanning shifts values through
all the flip-flops in the scan chains simultaneously, the number of transitions during
any clock cycle is greatly increased over the normal transitions occurring in the native
mode. As a result, the power rails may be unable to deliver sufficient current to switch
a flip-flop, or worse, the chip may not be able to dissipate quickly enough the power
Figure 2.6: Spatial decompression (a test data decompressor expanding a few tester channels onto many internal scan chains).
clock frequency of the ATE and the CUT scan chains may be kept the same, and the ATE memory needed may be reduced either by using a smaller number of channels from the ATE, or by driving a larger number of scan chains from the same number of tester channels, resulting in reduced test time. Such decompressors come in combinational and sequential varieties, each of which is described in the following sections.
Figure 2.7: Combinational decompression network driving the scan chains.
Figure 2.8: ROM-based decompression (a 2^k × n ROM; the tester supplies the address and the n-bit data output drives the scan chains).
Figure 2.11: General sequential decompressor (a clocked test data decompressor between the tester channels and the scan chains).
Figure 2.12: Seed buffering architecture (ATE → buffer → LFSR → scan chains).
Figure 2.13: Variable injection architecture (ATE variables injected directly into the LFSR driving the scan chains).
Figure 2.14: Implementation of stalling (ATE → buffer → scan chains, with a stall signal).

2.7 Multi-level Compression
2.8 Summary
This chapter provided a survey of the recent advancements in embedded deterministic
test and it was established that spatial decompression based on linear decompressors is
the most suitable approach for large scale integrated circuits. Nevertheless, given the
continuing increase in circuit complexity and the new emerging design methodologies
based on the core reuse philosophy, adapting spatial decompression to core based
SOCs using IP-protected blocks faces its own challenges, as shown in the following
chapter where a novel approach is introduced.
Chapter 3
IP-consistency: We desire a method that does not require access to the internal details of the core to provide compression; thus fault simulation and ATPG should not be relied upon (Section 3.1.1);
Low Cost Test Platform: No special requirements should be made about the ATE and the features it has; we should assume a very low cost, basic tester which operates only in stream mode without any sequencing-per-pin options (Section 3.1.5).
3.1.1 IP-consistency
We have seen that the SOC design paradigm facilitates design reuse by enabling
construction of complex systems through the assembly of pre-designed modules called
cores. This brings with it some unique test challenges as some cores may be IP-
protected, in which case no internal details of the circuit are available. The IEEE 1500 standard has been developed to ensure core test reuse and isolation during test [35]; however, no provisions have yet been made for test data compression. Many compression methods rely heavily on integrated fault simulation/ATPG to obtain small volumes of test data, and these methods are inapplicable to IP-protected cores. As shown in Figure 3.1, methods which require fault simulation restrict compression to the domain of the core provider. This poses a problem if some core providers choose not to provide compression hardware, or if the method of compression they use does not suit the system level test methodology. By creating the decompression engine without the need for IP disclosure, freedom to perform compression on either the core provider's or the system integrator's side is enabled. This gives the system integrator the ability to
perform compression in such a way as to facilitate packing of the compressed streams
at the system level. Moreover, compression can still be done by the core provider as
long as the compression method used fits the system level methodology.
[Figure 3.1: Test flows with and without IP-protection. Conventional flow (circuit details required): scan insertion, ATPG and fault simulation are performed for each core, followed by core level compression. IP-protected SOC test flow (circuit details NOT required): core level compression is done in a way that facilitates system level compression, which is done once for the whole system, along with test-access mechanism design. Legend: responsibility of core provider / responsibility of system integrator / may be done by either.]
[Figure 3.2: Discontinuous use of tester channels. All these types of decompressors (e.g., LFSR-based) exhibit discontinuous use of tester channels; at the system level, the type of decompressor does not matter as long as there is slack in the demand for data. Example: data stream 20 FD 3E 2C E4 5B 54 D1 with control stream 1 0 1 2 0 2 0 1, of which only 20, 3E and D1 are consumed (20 xx 3E xx xx xx xx D1).]
3.1.3 Scalability
To remain applicable in the long term, scalability needs to be taken into account. Even if the hardware and computational overhead are acceptable for current-day devices, if they do not scale effectively they cannot be used in the future. Figure 3.3 shows how the hardware overhead can be kept relatively low, even when many cores are used. Since the core level decompressor size is linked to the size of the core it accompanies, the area overhead for this hardware will be kept from scaling to an unreasonable level. At the system level, some small control logic is required. The size of the control logic should scale roughly logarithmically with the number of cores, and thus the area for this hardware will remain reasonable as well. In terms of computation, the execution time for core level compression should scale linearly with the number of cores since it must be performed for each core, while system level compression must also remain practical even as the system level design size increases.
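As a rough illustration of the logarithmic scaling claim, the following minimal sketch (our own, not code from the thesis; the function name is hypothetical) computes the control stream width needed to address one of n cores:

# Selecting which of n_cores cores receives the seed lines in a given
# block needs only ceil(log2(n_cores)) control bits, which a small
# decoder then expands into per-core reseed enables.
import math

def control_bits(n_cores: int) -> int:
    """Width of the control stream needed to address one of n_cores cores."""
    return max(1, math.ceil(math.log2(n_cores)))

for n in (2, 8, 32, 128):
    print(f"{n:3d} cores -> {control_bits(n)} control bits")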
[Figure 3.3: Scalable system-level architecture: one decompression engine per embedded core (Embedded Core 1 ... Embedded Core n) sharing the data lines, with a small control decode logic distributing control at the system level.]
3.1.4 Programmability
Programmability is of great importance when forming a test compression methodology, since changes to the test set (modified patterns, extra patterns) may be required. By keeping the hardware as generic as possible and providing a multi-mode feature, any desired test set may be passed through the decompressor. Although the compression will suffer as care bit density is increased, through multi-mode configuration the decompressor retains the ability to pass any pattern through (Figure 3.4).
[Figure 3.4: Programmability. A test set with the same general attributes as the original test set passes through the decompression engine; a test set with increased care bit density gives reduced compression but still passes through the decompressor.]
[Figure 3.5: ATE repeat feature vs. stream mode. With the repeat feature (entries such as r3, r4 denoting repeated bits), less tester memory is required, but the repeat feature results in a more expensive ATE; a stream-mode ATE is inexpensive but uses more memory to hold the fully expanded streams.]
[Figure panels: choice (a), many tester channels and short test time; choice (b), cores tested in series with few tester channels; choice (c), few tester channels.]
Figure 3.6: Choices for SOC test application (each square = 1 core).
of tester channels required is much smaller, as one set of channels is allocated and used by the cores in sequence, i.e., the cores are tested in series. However, the test time now becomes a problem and the volume of test data is still large. Both problems are solved through part (c) of the figure, where a small number of tester channels is allocated and the cores are tested simultaneously. This is only possible if the cores being tested do not demand 100% exclusivity of the tester channels. In other words, cores may only be tested simultaneously provided that for each core there are times during which the core does not need the channels, an issue addressed in the next paragraph.
In order to exploit the observation that simultaneously tested cores use fewer tester channels and require less time to test (reducing VTD), the test sets of the cores must satisfy the condition that the cores do not require 100% bandwidth from the tester during 100% of the testing time. Figure 3.7 shows empirically how this is enabled (notes: the test set used here is from the ISCAS89 benchmark set [12], and Comp-(number) gives a relative average care bit density of the test set; both will be discussed further in Section 3.6). In Figure 3.7(a), a test set has been broken into blocks¹ which
¹A block is a set of consecutive clock cycles and will be further detailed in Section 3.3.
[Figure 3.7: (a) percentage of blocks vs. care bit density, and (b) percentage of patterns vs. reseed density, for Comp 50 (2398 patterns), Comp 150 (738 patterns), Comp 400 (314 patterns), Comp 1000 (236 patterns) and Mintest (134 patterns).]
are binned according to care bit density, and the number of blocks in each bin is plotted on the vertical axis against the care bit density of that bin. It is immediately clear that a large percentage of the blocks have a very low care bit density, while only a few have a very high care bit density. Recalling from Section 2.6 that there is a direct correlation between care bits and seed bits required from the tester, we would expect a correlation between the care bit density and the required density of reseeds. This correlation can be seen in Figure 3.7(b). This figure was obtained in a similar way to Figure 3.7(a): each pattern from the test set was broken into blocks and as many blocks as possible were satisfied from a single seed, so some patterns were reseeded frequently (those with high care bit densities) and some reseeded infrequently (those with low care bit densities). The patterns were then binned according to the number of reseeds per pattern and graphed as number of patterns against number of reseeds. As expected, nearly all of the patterns have very few reseeds, while only a few require reseeding in every block. Because the tester operates in stream mode, the test data during the time in which reseeding is not required is wasted, i.e., the tester channels are under-utilized. It is this fact that we exploit: during the time in which one core does not require the seed channels for reseeding, we use the same tester channels to reseed another core and thereby improve the tester channel utilization, which leads to reduced VTD.
While the intuition behind what has been done may seem obvious, some of the obstacles which needed to be overcome in order to enable this approach are discussed here, along with an overview of how we have addressed these challenges. The first and
most obvious challenge is the design of a hardware architecture which can support this time-multiplexing of tester channels. Because the cost in area should be kept to a minimum, the hardware should be relatively simple, yet it must be robust and modular so that it can be generated in an automated way. This desire for automation brings to light the other main challenge of enabling this work: the CAD support. In addition to being able to generate efficient hardware structures automatically, we want an integrated flow into which we can feed the test specifications and then collect our compressed data, which we can transfer directly to the tester. In order to cope with ever increasing volumes of test data, human involvement in managing design and test flows should be kept to a minimum. On the hardware front, ideas have been borrowed from previous work in compression, with modifications to suit the time-multiplexing proposed. With this hardware in mind, an algorithmic framework for compression at the core level has been developed which we call StreamBIST, and a pattern reordering algorithm which we call StreamPack for merging of the core level compressed streams has been tackled at the system level. In the next section, these hardware, core level CAD and system level CAD solutions are discussed in more detail.
[Figure: a scan-in stream divided into frames, with consecutive frames grouped into blocks (block sizes of 4 and 8 shown).]
[Figure: core-level decompressor for seed-only packing. A control register and a 4/2 decoder steer the TAM lines, producing the reseed signals; the seed stream arriving on the seed lines is buffered in a seed register; a reseed indicator, gated by a modulo block-length counter, loads the seed into the linear decompressor (DCMP), an LFSR whose outputs drive the scan chains through a phase shifter.]
[Figure 3.10: Transformation from buffering seeds at the core level to the system level. (a) One seed buffer per decompressor: as originally constructed, each core is provided with its own seed buffer. (b) Seed buffer shared at the system level: when time-multiplexed, only one core needs the seed at a time, so the buffer can be shared, with reseed enables selecting the active core. Time-multiplexing thus reduces not only the overall VTD, but also the decompressor area.]
3.3.2 Seed-only CAD Support - Core Level
The CAD flow for compressing test sets for an SOC onto this decompression architecture begins with calculation of seeds at the core level. Before discussing seed calculation, we will first cover hardware modeling. Figure 3.11 shows explicitly how the model equations for a given hardware configuration are obtained. Since the XOR gate is equivalent to addition in GF(2), the operations are linear and may be represented as matrix manipulations. The LFSR feedback polynomial determines which LFSR FFs are used to calculate the value of the most significant bit of the LFSR in the next clock cycle, and as such becomes the top row of the LFSR transition matrix (shown in the figure). Since the remainder of the LFSR is a simple shift register, each remaining row in the matrix is simply the basis vector corresponding to the immediately adjacent LFSR cell. This matrix therefore can be used to compute the next LFSR state from the current state, and of course multiplication of the current state with the nth power of the transition matrix will give the state after n clock cycles. It follows that the LFSR matrix is a square matrix with number of rows and columns equal to the LFSR size.
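As an illustration of this matrix model, the following minimal sketch (ours, not the thesis implementation) builds the transition matrix from the feedback polynomial and steps the state over GF(2), using the 4-bit LFSR of Figure 3.11 (L3 = L1 XOR L0) as the example:

def lfsr_matrix(feedback_cols, size):
    # Top row: coefficients of the cells XORed to form the next MSB
    # (indices into the state vector [L_{size-1}, ..., L0]).
    M = [[0] * size for _ in range(size)]
    for j in feedback_cols:
        M[0][j] = 1
    # Remaining rows: each cell takes the value of the adjacent cell,
    # i.e., the basis vector of the immediately adjacent LFSR cell.
    for i in range(1, size):
        M[i][i - 1] = 1
    return M

def step(M, state):
    # One clock cycle: next_state = M * state over GF(2).
    return [sum(m * s for m, s in zip(row, state)) % 2 for row in M]

# Figure 3.11: L3 = L1 xor L0, state ordered [L3, L2, L1, L0],
# so the feedback taps are at indices 2 (L1) and 3 (L0).
M = lfsr_matrix(feedback_cols=[2, 3], size=4)
state = [0, 0, 0, 1]
for _ in range(3):          # applying M three times = multiplying by M^3
    state = step(M, state)
print(state)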
[Figure 3.11: Obtaining the model equations. LFSR feedback: L3 = L1 ⊕ L0, so the transition matrix has top row (0 0 1 1), with the shifted identity rows (1 0 0 0), (0 1 0 0), (0 0 1 0) below. Phase shifter: P3 = L3, P2 = L3 ⊕ L0, P1 = L2 ⊕ L1 ⊕ L0, P0 = L3 ⊕ L2, giving the phase shifter matrix with rows (1 0 0 0), (1 0 0 1), (0 1 1 1), (1 1 0 0). Seed register, LFSR, phase shifter and scan chains are connected as in the seed-only decompressor.]
[Figure 3.12: Seed calculation example. With t = 0 at the reseed point, the phase shifter outputs after t cycles are obtained by multiplying the phase shifter matrix by the t-th power of the LFSR transition matrix and applying the result to the seed (a3 a2 a1 a0). Each care bit of the test cube (columns X X X 0 / X 0 X X / X X 1 X / X X 1 0) selects one row of the corresponding product matrix, yielding one linear equation over the seed bits per care bit; the resulting system is solved to give the seed 1110.]
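Since the seed solve above reduces to Gaussian elimination over GF(2), a minimal self-contained sketch is given below (our illustration, not the thesis code; the example equations are hypothetical, not those of Figure 3.12):

def solve_gf2(equations, n):
    # equations: list of (mask, rhs) over seed bits a0..a_{n-1}; bit i of
    # mask set means a_i appears in the XOR sum. Returns [a0, ..., a_{n-1}]
    # or None on a linear inconsistency (0 = 1).
    pivots = {}                               # pivot column -> (mask, rhs)
    for mask, rhs in equations:
        for col, (pm, pr) in pivots.items():  # reduce by existing pivot rows
            if (mask >> col) & 1:
                mask ^= pm
                rhs ^= pr
        if mask == 0:
            if rhs:
                return None                   # inconsistent: 0 = 1
            continue                          # redundant equation
        col = mask.bit_length() - 1
        for c in list(pivots):                # Gauss-Jordan: clear new column
            pm, pr = pivots[c]
            if (pm >> col) & 1:
                pivots[c] = (pm ^ mask, pr ^ rhs)
        pivots[col] = (mask, rhs)
    seed = [0] * n                            # free variables default to 0
    for col, (_, rhs) in pivots.items():
        seed[col] = rhs
    return seed

# Hypothetical system: a3^a2 = 1, a1^a0 = 0, a2^a1^a0 = 1.
print(solve_gf2([(0b1100, 1), (0b0011, 0), (0b0111, 1)], 4))  # [0, 0, 1, 0]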
Algorithm 1: StreamBIST
Input: test cubes, parameters (see Table 3.1)
Output: compressed_stream_core, incompressible_core, decomp_hardware_core
1  decomp_hardware_core = GenerateCoreHardware(parameters);
2  compressed_stream_core = NULL;
3  incompressible_core = NULL;
4  foreach test pattern Pi in test cubes do
5      compressed_Pi = CompressPattern(Pi, parameters); // Algorithm 2
6      if (compressed_Pi ≠ NULL) then
7          Add compressed_Pi to compressed_stream_core;
       else
8          Add Pi to incompressible_core;
       end
   end
9  Return compressed_stream_core, incompressible_core, decomp_hardware_core;
and/or Verilog HDL) description (decomp_hardware_core) to which the core can be connected at the system level. Actual compression begins in the next step, as the algorithm progresses through the test set one pattern at a time, attempting to compress each one. For each pattern that is compressible with the given hardware configuration, the compressed data or seed (compressed_Pi) is stored in the overall compressed stream for the core (compressed_stream_core). At the end, a list of all incompressible patterns (incompressible_core) is provided in addition to the compressed stream and the decompression hardware description. Note that the presence of incompressible patterns will negatively affect fault coverage, since the faults in the incompressible patterns which cannot be applied may not be covered by other patterns. In this case the designer has the option to increase the size of the LFSR and thus decrease the probability of an incompressible pattern; however, this results in poorer compression. This has motivated the modifications to the architecture which will be discussed in Sections 3.4 and 3.5. In the next paragraph we will focus on the details involved in compressing a single pattern: the CompressPattern function, which is actually the core of the compression engine, where the packability of patterns is augmented.
Algorithm 2 gives the core of the compression procedure for a single pattern, which creates slack in the demand for seed bits, thus facilitating the packing algorithm which will be used later to merge the compressed streams of all the cores at the system level. The pattern to be compressed (Pi) is first divided into blocks according to the block size determined by the designer (Block_size). The basic idea behind making patterns easier to pack is reducing the number of reseed points required, that is, expanding as many blocks as possible from the same seed. To do this, blocks are checked for compatibility and strung together in a chain (if compatible). The counter Block_count keeps track of the starting point of the current chain, and the counter Chain_count keeps track of how many blocks are in the current chain. Let us examine the algorithm in detail with an example.
Example 1 Figure 3.13 illustrates the process of compressing a pattern using Algorithm 2. The test pattern shown in the figure has been divided (line 1) into blocks with Block_size = 4 (upper right portion of the figure). Line 2 assigns Block_count = 1 and the outer WHILE loop (line 3) is entered. Next, loop conditions for the inner
Algorithm 2: CompressPattern
Input: Pi, parameters
Output: compressed_Pi
1  Divide Pi into Num_blocks blocks (Bk) of length Block_size;
2  Block_count = 1;
3  while (Block_count ≤ Num_blocks) do
4      prev_equations = NULL;
5      compressible = TRUE;
6      Chain_count = 0;
7      k = Block_count + Chain_count;
8      while ((k ≤ Num_blocks) AND (compressible)) do
9          new_equations = GenerateEquations(Bk, parameters);
10         compressible = CheckEquations(prev_equations, new_equations);
11         if (compressible) then
12             prev_equations = prev_equations ∪ new_equations;
13             if (k = Num_blocks) then
14                 Seed_k = CalculateSeed(prev_equations);
15                 Add Seed_k to compressed_Pi;
16                 Assign reseed point at Block_count in compressed_Pi;
               end
17             Chain_count = Chain_count + 1;
18             k = Block_count + Chain_count;
           else
19             if (Chain_count ≠ 0) then
20                 Seed_k = CalculateSeed(prev_equations);
21                 Add Seed_k to compressed_Pi;
22                 Assign reseed point at Block_count in compressed_Pi;
23                 Block_count = k;
               else
24                 compressed_Pi = NULL;
25                 Block_count = Num_blocks + 1;
               end
           end
       end
   end
26 Return compressed_Pi;
WHILE loop (line 8) are set up. Inside the inner WHILE loop we begin by generating the equations for block number k, Bk (line 9). Since at this point Block_count = 1, Chain_count = 0 and k = Block_count + Chain_count = 1, we generate equations for the first block. Equations are generated in the same way as in the example given above in Figure 3.12; in fact, B1 in this figure is the same block as in Figure 3.12, so the call GenerateEquations(B1, parameters) would follow precisely the steps outlined in Figure 3.12 (with the exception of seed calculation, since we intend to string multiple blocks together if possible). In line 10 the consistency of the equations is checked by CheckEquations, which determines whether there are any linear incompatibilities among the equations in new_equations and prev_equations. Since we are on the first block of a chain at the moment, prev_equations is still empty, so we are checking for linear inconsistencies in just the equations of B1. Since the equations for this block (as shown in Figure 3.13) are consistent, CheckEquations returns TRUE to compressible and so the first branch of the IF statement on line 11 is taken. On line 12, the equations of the block are merged with the rest of the equations in the chain (again, this is currently the first block in the chain, so after line 12 prev_equations will contain only the equations for B1). Line 13 checks if we are currently at the last block, and if so assigns a seed since there are no more blocks to chain. Lines 17 and 18 update Chain_count to 1 and k to 2 respectively. We now begin the second iteration of the inner WHILE loop. The equations for B2 (since k = 2) are obtained as shown in the figure beside the equations for B1. Only the first equation for B2 has been given in the figure, since this equation causes a linear inconsistency. As a result of the inconsistency, the second branch of the IF on line 11 is taken. The check at line 19 is to ensure that this is not the first block of the chain. If this were the case, the equations for a single block would be inconsistent and thus the pattern would be incompressible (and thus NULL would be returned by the algorithm). Since Chain_count = 1 at this point, the first branch is taken. At line 20, CalculateSeed uses Gaussian elimination to obtain the reduced set of equations (not including those of the most recent block) and thereby the seed. As the figure shows (uppermost two streams), the resulting seed is assigned to the compressed stream and a reseed point is asserted to accompany the seed. Also, Block_count, which indicates the starting position
of the current chain, is updated to block B2, the block which could not be added to prev_equations because of a linear inconsistency.
Because at this point compressible is false, the inner WHILE loop exits and a second iteration of the outer WHILE loop is performed. This is illustrated as the second row of systems of equations (middle of the three) in the figure, beginning with a3 ⊕ a2 = 1. It should be noted that this equation is different from the equation for the same care bit when that block (B2) was in the previous chain, a2 ⊕ a1 ⊕ a0 = 1. This is because in the last chain B2 was the second block, so the LFSR power (t from Figure 3.12) for the associated care bit would have been 4. Now, B2 is the first block in the chain, so t = 0 for the care bit in question (bottom left 1 of B2). Moving on to the next iteration of the inner WHILE loop, we have the next set of equations in Figure 3.13, corresponding to block B3, where Block_count = 2 and Chain_count = 1. Since the equations remain consistent, we move to the next block B4 with Block_count = 2 and Chain_count = 2. Finally, for Block_count = 2 and Chain_count = 3 (B5), an inconsistency arises. The seed is calculated (line 20) and passed to the compressed stream at position 2 (Block_count). Notice that the seed is calculated based on prev_equations, which at this point contains equations for B2, B3 and B4. Next the outer WHILE loop iterates again, with Block_count = 5. This process continues until either an inconsistency arises in a single block (Chain_count = 0), in which case the pattern is incompressible, or until all the blocks have been covered, in which case the compressed stream is returned (compressed_Pi).
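The chaining logic just traced can be rendered compactly in Python (our sketch, not the thesis code; GenerateEquations, CheckEquations and CalculateSeed are abstracted as callables passed in):

def compress_pattern(blocks, gen_equations, consistent, calc_seed):
    # Returns (seeds, reseed_points) with one seed per chain of blocks,
    # or None if some single block is itself inconsistent (incompressible).
    seeds, reseed_points = [], []
    start = 0                      # Block_count, 0-based here
    while start < len(blocks):
        eqs = []
        k = start                  # first block of the current chain
        while k < len(blocks):
            new_eqs = gen_equations(blocks[k])
            if not consistent(eqs + new_eqs):
                break              # linear inconsistency ends the chain
            eqs = eqs + new_eqs    # merge and try to extend the chain
            k += 1
        if k == start:             # a single block failed on its own
            return None
        seeds.append(calc_seed(eqs))     # one seed covers blocks start..k-1
        reseed_points.append(start)      # reseed point at the chain start
        start = k                        # next chain begins at the failure
    return seeds, reseed_points

Paired with a GF(2) solver like the one sketched earlier (serving as both consistent and calc_seed), this reproduces the kind of seed and reseed-point streams that Figure 3.13 illustrates.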
By applying this method to every pattern in the test set, a compressed stream containing gaps, during which the seed lines are not needed, can be formed for each core. The next section details the CAD support at the system level, which reorders the patterns from the multiple cores to enable sharing of a single set of tester channels among all the cores, by interlacing periods of activity on the seed lines for one core with periods of inactivity for the other cores.
[Figure 3.13: Compressing a pattern with Algorithm 2. The test vectors (e.g., XXX0XXX1XX1XXXXXXX0X1XX0 per scan chain) are divided into blocks of 4 and expanded through the LFSR and phase shifter. Equations are accumulated block by block until a linear inconsistency arises (Fail), at which point the accumulated system is solved: prev seed = 0111 with the next reseed after block 1, then prev seed = 0110 with the next reseed after block 4, then prev seed = 0001, and so on. Reseed indicator (generated on-chip): 1000 1000 0000 0000 1000 0000 ...; resultant seed stream: 0111 0110 XXXX XXXX 0001 XXXX ...]
[Figure: CAD flow for the SOC. StreamBIST (hardware generation) is run for each core, and StreamPack (pattern reordering) is run once at the system level.]
Algorithm 3: StreamPack
Input: compressed_stream_{1,2,...}, decomp_hardware_{1,2,...}, core_connect
Output: compressed_stream_SOC, decomp_hardware_SOC
1  decomp_hardware_SOC = GenSOCHW(decomp_hardware_{1,2,...}, core_connect);
2  foreach Chain_i in core_connect do
3      foreach Core_j in Chain_i do
4          foreach Pattern_k in compressed_stream_j (Core_j) do
5              Read reseed profile for Pattern_k into list IN_i;
           end
       end
6      Sort patterns in IN_i by decreasing number of reseed points;
7      List OUT_i = NULL;
   end
8  while some IN_i still contains patterns do
9      Current_OUT = shortest OUT_i such that IN_i still contains patterns;
10     Current_IN = the IN_i which corresponds to Current_OUT;
11     Fit = FALSE;
12     while (Fit ≠ TRUE) do
13         Current_pattern = first pattern in Current_IN;
14         Fit = FALSE;
15         while ((Fit ≠ TRUE) AND (not end of Current_IN)) do
16             Fit = CheckFit(Current_pattern, OUT_{1,2,...});
17             if (Fit = TRUE) then
18                 Move Current_pattern from Current_IN to Current_OUT;
               else
19                 Current_pattern = next pattern in Current_IN;
               end
           end
20         if (Fit ≠ TRUE) then
21             Insert NOOP block in Current_IN;
           end
       end
   end
22 {Data, Control} = ParseLists(OUT_{1,2,...}, decomp_hardware_SOC);
23 compressed_stream_SOC = Data ∪ Control;
24 Return compressed_stream_SOC, decomp_hardware_SOC;
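The placement loop can be sketched as follows (a simplified illustration of ours, not the thesis code: each reseed profile lists the block indices needing the seed lines, a single global set of occupied block slots stands in for CheckFit against all OUT streams, and a pattern is assumed to span blocks up to its last reseed):

def check_fit(profile, offset, occupied):
    # The seed lines are free for this pattern iff none of its reseed
    # blocks, shifted by the stream offset, is taken by another core.
    return all((offset + b) not in occupied for b in profile)

def stream_pack(in_lists):
    # in_lists: {core: [reseed profiles]}, pre-sorted by decreasing
    # number of reseed points, as in lines 2-7 of Algorithm 3.
    out = {core: [] for core in in_lists}          # placed (offset, profile)
    length = {core: 0 for core in in_lists}        # OUT length in blocks
    occupied = set()                               # block slots with a reseed
    while any(in_lists.values()):
        core = min((c for c in in_lists if in_lists[c]),
                   key=lambda c: length[c])        # shortest OUT with work left
        placed = False
        while not placed:
            for i, prof in enumerate(in_lists[core]):
                if check_fit(prof, length[core], occupied):
                    occupied.update(length[core] + b for b in prof)
                    out[core].append((length[core], prof))
                    length[core] += (max(prof) + 1) if prof else 1
                    in_lists[core].pop(i)
                    placed = True
                    break
            else:
                length[core] += 1                  # insert a NOOP block, retry
    return out

For example, stream_pack({1: [[0]], 2: [[0, 2]]}) shifts core 1's pattern with NOOPs so that the two cores' reseed blocks never collide; the final ParseLists-style merge into Data/Control streams is omitted.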
compressed_stream1 Core 1: 0100 XXXX XXXX XXXX XXXX XXXX XXXX 1111 ...
compressed_stream2 Core 2: 1101 XXXX XXXX 1001 XXXX XXXX XXXX 0001 ...
compressed_stream3 Core 3: 0111 XXXX XXXX XXXX 0001 XXXX XXXX XXXX ...
compressed_stream4 Core 4: 0111 XXXX 0111 XXXX XXXX XXXX 1010 XXXX ...
Core 1: 0100 XXXX XXXX XXXX XXXX XXXX XXXX 1111 ...
Re-arranged to Core 2: NOOP 1101 XXXX XXXX 1001 XXXX XXXX XXXX 0001 ...
enable sharing Core 3: NOOP NOOP 0111 XXXX XXXX XXXX 0001 XXXX XXXX XXXX ...
Core 4: NOOP NOOP NOOP 0110 XXXX 0111 XXXX XXXX XXXX 1010 XXXX ...
Data: 0100 1101 0111 0110 1001 0111 0001 1111 0001 1010 XXXX ...
compressed_streamSOC Control: 00XX 10XX 01XX 11XX 10XX 11XX 01XX 00XX 10XX 11XX XXXX ...
[Figure: NOOP blocks inserted ahead of the remaining portions of Pattern 1 ... Pattern k to defer their reseed points.]
[Schematic: the seed register, reseed indicator and LFSR as in the seed-only decompressor, with 2-to-1 muxes on the phase shifter inputs and outputs (the programmable phase shifter) selected by the Mux Lines, gated by Mux EN, driving the scan chains.]
Figure 3.18: Core level decompressor for seed and mux packing.
3.4 Seed and Mux Packing
As extensively covered thus far in the thesis, the use of linear machines as test data
decompressors is largely a result of the reduced complexity of seed calculation enabled
by the use of Gaussian elimination. Unfortunately, this benefit comes at the cost of
increased probability of lockout caused by correlations (linear dependencies) between
different points in the test set. If one of these linear dependencies is violated by
a given test pattern, that pattern cannot pass through the decompressor. In order
to cope with these linear dependencies, the most immediately obvious solution is to
introduce some non-linearity into the output sequence of the decompressor to enable
compression of an extended set of patterns. That is precisely the approach that has
been used in the seed and mux packing architecture, described in this section.
output. Likewise on the output side (output muxes), each scan chain may be driven either by its respective phase shifter output or by the adjacent one. The Mux Lines select the configuration in which to place the phase shifter. The end result of this additional hardware is increased flexibility in which equations may be used when calculating a seed. Each configuration value on the mux lines corresponds to a different phase shifter, accompanied by a distinct phase shifter matrix. Thus, if equations in one frame conflict with equations from another, the value on the mux lines may be altered during one of the clock cycles to break the linear dependency. This brings a great deal of benefit, as much denser test cubes can now be compressed; however, this comes at the cost of slightly increased area, extra data required and computational overhead. The additional computational complexity is addressed in Section 3.4.2, while area is covered in the next paragraph.
Observing Figure 3.18, the additional hardware required amounts to two muxes
per scan chain and two AND gates (discussed in the next paragraph). Note that
although 2-to-1 muxes are shown in the figure, any size mux may be used. The choice
of 2-to-1 has been made for a few reasons. First, by keeping only 2-to-1 muxes the
area overhead is kept relatively low. Also, it has been observed that as the size of
the mux is increased more mux lines are obviously needed, and the benefit of solving
denser blocks is lost against the increased data which must be supplied. In addition,
the size of the mux used has a severe impact on computational complexity as will be
seen in Section 3.4.2. As a result of these facts, 2-to-1 muxes are preferred. Now we
will discuss hardware at the system level.
Just as seeds are sometimes frequently demanded while at other times seed requirements are scarce, it has been observed that some blocks are satisfiable without the need for a reconfigurable phase shifter while some require it. This stems from exactly the same reason as in the case of seeds: the clustering of care bits. As a result, the mux lines can be shared in the same way as the seed lines. Analogous to the reseed enable signal for the LFSR, which instructs it either to accept or ignore the seed buffered in its seed register, the Mux EN signal tells the core either to use the values coming on the mux lines or to ignore them. This is accomplished through a simple AND gate: when the Mux EN signal is 0, the mux lines are
masked and the phase shifter configuration is the same as in the seed-only case. Also similar to the reseed enable, the mux enable must be driven at the system level by some control. The control is almost identical to that for seeds, consisting of a control register to buffer the incoming control stream and a decoder implemented for both input and output muxes, which allows the input and output mux lines to be passed independently between cores on a block-by-block basis (during one block, one core may receive the seed lines, another the input mux lines and another the output mux lines). Although the area of the control logic is now triple the control area of the seed-only case, it is still well within the acceptable overhead range. Next we detail some of the implications on CAD when using this approach.
To solve for a seed to a block when a reconfigurable phase shifter is used, the equations are added frame by frame, as in the seed-only architecture. The difference is that in the seed-only case, once the equations for a frame have been added there is no room to maneuver: if a linear inconsistency exists, the block must be abandoned. In the case of the reconfigurable phase shifter, if adding a frame causes a linear inconsistency, the other configurations are tried for that frame. If the linear inconsistency occurs for all the configurations in that frame, the algorithm will backtrack, change the configuration for the previous frame and try all the configurations again. Having recorded what configuration the phase shifter used in each clock cycle, the mux line stream can be determined immediately in the case that the block is compressible. It is clear that if a block is incompressible, all the configurations will have to be tried, resulting in explosive runtimes. In order to combat this, a backtrack limit has been imposed which allows only a set number of reconfiguration trials before abandoning the block. Although this moderately affects the compression, since some patterns which may be compressible after a large number of reconfigurations are abandoned, the runtime is significantly improved. The loss of compression arising from the backtrack limit is more than balanced by the increase in block length which it permits. Since a seed (of fixed width = LFSR size) is loaded during one block, if the block length is increased the number of seed lines required is reduced. Thus we desire a longer block length for improved compression; however, this would seriously impact computational time during compression if the backtrack limit were not in place.
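A minimal sketch of this bounded backtracking (ours, with the consistency check abstracted; a sketch under stated assumptions, not the thesis implementation) is:

def solve_block(n_frames, configs, consistent, backtrack_limit):
    # Try configurations frame by frame; on a linear inconsistency, fall
    # back and re-try earlier frames until the backtrack budget is spent.
    # consistent(partial) must check the linear system implied by the
    # per-frame configuration choices made so far.
    state = {"backtracks": 0}

    def search(partial):
        if len(partial) == n_frames:
            return partial                 # all frames satisfied
        for cfg in configs:
            trial = partial + [cfg]
            if consistent(trial):
                found = search(trial)
                if found is not None:
                    return found
            state["backtracks"] += 1
            if state["backtracks"] > backtrack_limit:
                return None                # budget spent: abandon the block
        return None

    return search([])

With 2-to-1 muxes, configs has only two entries per frame, which is precisely why larger muxes blow up this search.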
It is therefore for reasons of both area overhead and computational complexity that 2-to-1 muxes are preferred. As for hardware, we have discussed the hardware at the system level which governs whether the mux lines driven from the ATE (outside the chip) are observed by a given core or ignored. In the next section we discuss the impact of this modification on the algorithm at the system level.
[Figure 3.19: Multi-mode core-level decompressors. (a) Seed-only decompressor with scan chain reconfiguration: seed register, reseed indicator, LFSR and phase shifter as before, with a Re-Config Control driving muxes (SC Re-Config) between the phase shifter outputs and the scan chain inputs. (b) Seed-mux decompressor with scan chain reconfiguration: the programmable phase shifter (Mux Lines, Mux EN) combined with the same SC Re-Config stage.]
may be expanded by increasing the set of equations from which we may choose in any clock cycle. The architecture given here instead uses reconfiguration to reduce the number of care bits which appear in any block, by reducing the size of the block through shrinking the number of scan chains. Note that both varieties of decompressor may employ scan chain reconfiguration: both the seed-only packing (Figure 3.19(a)) and the seed-mux packing (Figure 3.19(b)). In the forthcoming discussion, explanations will be centered around the seed-only architecture for clarity, but everything discussed applies equally to the multi-mode seed-mux architecture.
To see scan chain reconfiguration in effect, consider a design using 32 scan chains and a 32-bit LFSR with a block length of 4. This results in 8 seed lines being needed. During a given block, 4 × 32 = 128 bits are expanded from the LFSR while only 4 × 8 = 32 are shifted in on the seed lines. This indicates that the block may contain 32 care bits and still remain compressible, which translates to 32/128 = 25% care bit density. After the first reconfiguration (we are now in mode 2), the muxes between the phase shifter outputs and the scan chain inputs in Figure 3.19(a) are activated through the Re-Config Control. This is done in such a way as to link the input of every second scan chain to the output of the previous chain, thereby doubling the length of each chain but cutting the number of chains in half. If the block length is kept at 4, we now have 4 × 16 = 64 bits expanded in each block, but the number of blocks which we must expand per pattern has doubled. Since the seed end has not changed, we may still satisfy 32 care bits, and so the maximum tolerable care bit density is 32/64 = 50%. Thus, by reducing the number of scan chains, the allowable care bit density is increased at the expense of scan time, which lowers compression for the patterns tested in later modes.
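The arithmetic above generalizes per mode as in this small sketch (ours; mode 3 is our extrapolation, not a figure from the thesis):

LFSR_SIZE, BLOCK_LEN, SEED_LINES = 32, 4, 8

chains = 32
for mode in (1, 2, 3):
    expanded = BLOCK_LEN * chains          # bits expanded per block
    supplied = BLOCK_LEN * SEED_LINES      # seed bits shifted in per block
    print(f"mode {mode}: {chains} chains, density <= "
          f"{supplied}/{expanded} = {supplied / expanded:.0%}")
    chains //= 2                           # each reconfiguration halves chains

This prints 25% for mode 1 and 50% for mode 2, matching the figures in the text.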
In a sense, this architecture also exploits clustering of care bits within a test set, but across patterns instead of within a pattern. It has been observed that some patterns have very high average care bit densities while others are very sparse. Thus we can use the high compression mode (which will tolerate only lower care bit densities) to compress the sparse patterns, and only the very dense patterns need to suffer the reduced compression of the later scan chain reconfiguration modes. In this way the average compression of the entire test set is improved. Of high importance in this
Algorithm 4: MMStreamBIST
Input: test cubes, parameters
Output: compressed_stream_core^{1,2,...}, decomp_hardware_core
[Figure: results plotted against compaction level (Comp 50, Comp 150, Comp 400, Comp 1000, Mintest).]
Compaction   No. SC   Scan Time   VTD
Comp-50      8        1980667     15845330
             16       990334
             32       495167
Comp-150     8        488078      3904623
             16       244039
             32       122020
Comp-400     8        199724      1597792
             16       99862
             32       49931
Comp-1000    8        168267      1346136
             16       84134
             32       42067
Mintest      8        77904       623230
             16       38952
             32       19476

Table 3.2: Uncompressed VTD for 5 ISCAS89 benchmarks.
calculated as VTD / number of scan chains. Also, in this scheme the number of data lines is the same for the uncompressed and compressed cases, which shows more clearly the tradeoff between control information overhead and test application time. Said another way, additional control channels from the ATE increase VTD, but simultaneously enable packing, which reduces scan time (as in Figure 3.17) and thereby VTD. By keeping the number of seed channels the same in the compressed and uncompressed cases we can more clearly see whether the benefit in scan time outweighs the cost in extra control channels; thus we can evaluate the effect of packing directly. In the next section, VTD and scan time are discussed for the seed-only and seed-mux decompressors not using scan chain reconfiguration.
             Seed-only                    Seed-mux
Compaction   Scan Time   VTD             Scan Time   VTD
Comp-50      438752      1316256 (16)    422128      2110640 (16)
Comp-150     120752      603760 (8)      122240      855680 (8)
Comp-400     56464       564640 (4)      81720       653760 (8)
Comp-1000    49300       493000 (4)      42968       558584 (4)
Mintest      23480       234800 (4)      22172       288236 (4)

Table 3.4: Seed-only and seed-mux multi-mode VTD results (Section 3.5).
architecture. As we have seen earlier, multi-mode addresses incompressible patterns in a different way than seed-mux, and the next section shows the benefits of multi-mode testing in this respect: any pattern may be applied, and at the same time the VTD and scan time can be further reduced.
[Figure: results vs. block size (1-5) for four configurations: 16-bit LFSR seed-mux, 16-bit LFSR seed-only, 8-bit LFSR seed-mux and 8-bit LFSR seed-only.]
[Figure 3.23: system-level area (gates) vs. number of cores (0-35); the uppermost curve is control+cores with individual seed buffers, the middle curve control+cores with a shared seed buffer.]
The uppermost curve (control+cores, individual seed buffers) of Figure 3.23 shows the total area required at the system level, as the sum of the core level decompressors and the control hardware. Since the area is dominated by the core level decompressors, it is no surprise that it grows linearly with the number of cores. This case includes a seed buffer for each core (see Figure 3.10(a)); however, the earlier discussion surrounding Figure 3.10(b) showed how the seed buffer can be shared because the channels are time-multiplexed. As a result, the area may be reduced to the values shown in the middle curve (control+cores, shared seed buffer) of Figure 3.23. It should be noted that the system level area for our approach is not dependent on the circuit size, only on the number of scan chains and the number of cores. In the case of 30 cores and 32 scan chains per core, we may make a pessimistic assumption of 100 FFs per scan chain and 20 gates per FF, giving 1 920 000 gates for the total circuit area. Figure 3.23 shows an area overhead of 20 000 gates for this case, which amounts to an overhead of about 1%, and it is expected that even with variations in FFs/chain and gates/FF it is very unlikely that the overhead will exceed 2%. Furthermore, as the number of cores is increased the area saved will increase (i.e., the number of seed buffers replaced by a single seed buffer will increase), and if the phase shifter area is optimized as mentioned before, the seed buffer will constitute a greater portion of each core level decompressor and as a result the savings will further increase. In this way, time-multiplexing has enabled increased compression not at the expense of extra area, but with reduced area compared to using [105] for each core.
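As a quick check of the estimate above (the figures are the pessimistic assumptions stated in the text; the overhead value is read off Figure 3.23):

cores, chains_per_core = 30, 32
ffs_per_chain, gates_per_ff = 100, 20            # pessimistic assumptions
circuit_gates = cores * chains_per_core * ffs_per_chain * gates_per_ff
overhead_gates = 20_000                          # from Figure 3.23
print(circuit_gates, f"{overhead_gates / circuit_gates:.1%}")  # 1920000 1.0%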
finally, the resulting stream is parsed into the form required by the user (e.g., functional simulation streams or ATE pin vector sequences). The core processing occurs, in terms of sophistication, in the middle two steps and, in terms of execution time, in StreamBIST in particular. Input parsing and output parsing are straightforward algorithms which essentially run in linear time (they make a single pass through the test set to process it). For the experiments provided, the duration was less than a second (per core) for input parsing and less than 1 minute for output parsing at the system level. Experiments were performed on a 3.06 GHz Pentium 4 Xeon processing node with 1 GB of memory running Red Hat Linux 7.3.
As just mentioned, StreamBIST dominated the overall execution time. This was a result of a number of factors. First, it is the complexity of StreamBIST which produces the streams of increased packability and thereby facilitates easier and faster execution on StreamPack's part. In addition, StreamPack operates once at the system level, while StreamBIST must be performed on each core in the system. At the heart of StreamBIST (Algorithm 1) is the CompressPattern function (Algorithm 2). The number of calls to CompressPattern is clearly linear in the number of test patterns in the test set. Inspecting CompressPattern, it can be seen that the number of calls to CheckEquations is essentially linear in the number of blocks per pattern. At the core of CheckEquations, Gaussian elimination is performed to reduce the system of equations and check for linear inconsistencies, which is of worst-case quadratic complexity in the number of equations. Since the number of equations depends on the number of care bits it is difficult to quantify; however, it is strongly linked to the block size. In the seed-only case, then, we can say the entire algorithm depends linearly on the total number of blocks in the test set and quadratically on the size of each block, not entirely unmanageable. In the seed-mux case, however, recall that backtracking is used to search the possible systems of equations, leading to explosive runtimes if left uncapped (Section 3.4.2). To avoid this, a backtrack limit has been imposed, but runtimes remain significantly longer than in the seed-only case. For the experiments performed, the runtimes per core for StreamBIST ranged anywhere from under a minute to in excess of 40 hours, which can become serious since StreamBIST must be performed on each core. At the same time, this large result of 40 hours is for
the seed-mux case just mentioned, and as we have seen in the VTD discussion earlier, seed-only packing actually provides better compression, removing the need for the computationally expensive seed-mux case, especially since reconfiguration takes care of the programmability issue which seed-mux was devised to address. If seed-mux packing were required, this CPU runtime bottleneck could be considered a limitation of this methodology, as discussed in Section 4.2; however, the next paragraph discusses why it is more important to have fast system level algorithms than core level ones.
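As a rough summary of the seed-only complexity argument above (our notation, not the thesis's): with $N_B$ the total number of blocks in the test set and $E$ the number of equations per block (strongly linked to the block size),

$$T_{\text{seed-only}} \approx O\!\left(N_B \cdot E^2\right),$$

while the seed-mux search multiplies this by the (capped) number of reconfiguration trials per block.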
Before moving on, note that in multi-mode StreamBIST, StreamBIST is called iteratively but with a reduced number of patterns each time. The execution times for the multi-mode StreamBIST approaches essentially matched those of their respective single-mode partners, largely due to the fact that incompressible patterns are often spotted before too much execution time is wasted trying to compress them.
The good news in light of the above is that once time has been spent generating the streams, StreamPack's operation is very fast. Looking at Algorithm 3, it can be seen that lists of patterns awaiting placement are cycled through, calling CheckFit each time. Because CheckFit is a very simple function, which checks only the reseed points of the pattern to be placed against other placed patterns, it runs quickly. With each placed pattern reducing the size of the remaining lists, operation continuously accelerates. For this reason, StreamPack execution times have been well below a minute for all the experiments run. This can be used advantageously to address the StreamBIST runtime, by storing with each core its StreamBIST-compressed test set and simply plugging it in to StreamPack when the core is used in a system. StreamBIST would only need to be re-run when there are changes to the test set, likely an infrequent occurrence, especially for IP-protected cores (at which this method is targeted). So in this case it is advantageous to have the runtime-intensive part at the core level, where changes are less frequent, as per the design reuse philosophy. Since a core will likely be used more times in its lifetime than it will undergo changes, increased cost at the core level is acceptable provided it facilitates reduced cost at the system level, which is exactly the case here.
Having summarized the main aspects of the proposed test methodology and discussed the tradeoffs and their influencing factors within it, we will now turn our attention to how this method fits within the framework of test data compression methods at large, by comparison with other methods.
Before discussing other approaches with distinct features, note first of all that since this method builds on the architecture proposed in [105] (see Section 3.3.1), a comparison against that work has been given inherently in the tester channel utilization discussion in Section 3.6.2. The improved tester channel utilization showed how VTD could be reduced relative to [105] while simultaneously reducing the area (and thereby providing scalability) through system level exploitations, as discussed in Section 3.6.6. Furthermore, the modifications we have provided enable programmability and IP-consistency. Nevertheless, [105] can be deemed modular, given that the modularity of our approach has been established and that [105] provided a good starting point for the modifications we have made. The remainder of this section gives comparisons against other works with a variety of combinations of the desired features.
The first method shown in Table 3.5 was proposed by [95] and uses a very simple compression code consisting of nine codewords. Because the circuit is not modified in any way, the method is IP-consistent. A fairly significant drawback of this approach is that it requires an acknowledgement signal to drive the tester for the sake of synchronization. Because this means the design cannot be integrated easily into a test infrastructure, the method is not modular. In spite of this, the method is scalable and programmable, due to well-scaling hardware sizes and a codeword set up to allow any word of test data to be passed. Since results for the circuit s35932 were not provided by [95], comparison against the proposed approach is done for a different hypothetical SOC, discussed in the next paragraph, which yields a total VTD of 212160 for our approach, 12% more than [95] but without the need for an acknowledgement signal from the tester.
The hypothetical SOC mentioned in the preceding paragraph is composed of only s13207, s15850, s38417 and s38584 from the ISCAS89 benchmark circuits³. The same comparison applies for [5, 41, 59, 80] in the forthcoming paragraphs. The s35932 benchmark circuit is commonly neglected in experimental results for compression methods (especially linear decompressor based methods) since it is random-pattern resistant, which means the patterns contain long runs of 1s and 0s and are therefore not easily encodable in pseudo-random sequences like those produced by linear
³VTDs marked by + in the table indicate s35932 is not included.
[95] [41] [59] [5] [80] [28] [84] [68] [63] [82] Proposed
IP-consistent X X X X X
Modular X X X X X X X
Scalable X X X
Programmable X X X X X
Low cost ATE
Table 3.6: Criteria satisfied for compared approaches.
decompressors. In light of the programmability criterion, any complete compression method should be able to address random pattern resistant circuits as well as random pattern amenable ones, and lack of results for s35932 is often an indicator of lack of programmability. Although [95] is programmable, it will be shown in the following paragraphs that [5, 41, 59, 80] are not. A strength of our approach is the ability to apply test sets for random pattern resistant circuits, even if some patterns in the test set are 100% specified (containing no x-bits), hence the programmability of our method. Furthermore, application of these fully specified patterns can be done with a reduced VTD, both by providing compression for the other partially specified patterns in the test set and by time-multiplexing at the system level.
The next approach listed in the table was proposed in [41], which forms a virtual
scan chain appearing shorter than the number of actual scan cells contained therein,
accomplished through the use of an LFSR. A drawback of the approach is the use of integrated fault simulation/ATPG to obtain compressible test cubes, removing IP-consistency. A very nice feature of this approach, on the other hand, is that after compression the virtual scan chain appears exactly like a regular scan chain, only shorter (and with different values, of course). Such an approach would certainly be modular, aside from the fact that there is no apparent system level exploitable feature of the test sets. The method is scalable, since the overhead in hardware and computation is reasonable, but since no mechanism for application of an arbitrary pattern has been provided, the method is not programmable. This method is able to achieve lower VTD than our approach by compromising on the features of IP-consistency, modularity and programmability.
Next in the table is [59], which proposes a multi-level scheme using Huffman decoders to decompress LFSR seeds, which the LFSRs then expand into test patterns. This method provides IP-consistency, since no details internal to the circuit are required. Modularity is lacking, however, because the Huffman compression takes advantage of seed stream features for compression, leaving little to exploit at the system level. With regard to scalability, no challenges should be faced, since the hardware and algorithms seem manageable for increased design sizes; however, this is difficult to assess since no details regarding implementation of the statistical coding part have been provided. Statistical coding methods by and large are known to suffer either from synchronization difficulty as a result of different ATE and CUT clock frequencies, or from the requirement of a synchronization signal from the CUT to the ATE, and so this method is likely to meet with methodology issues when implemented. Finally, the method cannot be classified as programmable, since the patterns must pass through the LFSR, which fails when the care bit density is high. Again, the VTD obtained is better than that of the proposed method, but it comes at the cost of modularity and programmability.
Another method against which we compare is [5]. In this approach the output space of the linear decompressor is analyzed with the test set to determine a set of scan cells to invert, which will increase the alignment between the output space and the test set. An advantage of this method is that it can be applied to any linear type of decompressor (LFSRs, ring generators, etc.) and can guarantee improved encoding.
However, this comes at the cost of IP-consistency, as a result of scan hardware modifications requiring access to internal circuit details, and of programmability, since there is no mechanism for application of patterns outside the output space of the decompressor. On the other hand, this method may still leave some unspecified bits in seeds or allow multiple patterns to be expanded from a single seed, so it leaves room for exploitation at the system level and is therefore modular. In addition, the method is scalable, since 1) the heart of the method is scan cell inversion used to improve the compression of the linear decompressor, which does not add any appreciable area, and 2) the algorithmic framework is centered around Gaussian elimination, which is tractable. By not remaining IP-consistent or programmable, this method is also able to improve VTD.
The next method we consider is [80], which uses ring generator structures to provide compression through variable injection, as discussed much earlier along with Figure 2.13. During compression, ATPG and fault simulation are employed, so the method is not IP-consistent, and no mechanism for application of high care bit density patterns is provided, so the method is not programmable. Explicit techniques for VTD reduction by the exploitation of ATE features are provided in the work, but they are not required to achieve the reported compression, so the method can be used on a low cost test platform. At the same time, although the framework for the results reported on the ISCAS circuits is not scalable, the method is targeted at large circuits, indicating good scalability of the method on the whole. This method also exhibits good modularity, since there is slack present in the variable streams which drive the ring generator. Such streams could be packed using StreamPack, and thus this method could be plugged immediately into the core level in our approach. As in the previous paragraph, this method achieves very good compression in the case where IP-consistency and programmability are of no concern.
The previously discussed approaches neglected reporting of results for s35932. In [28], the opposite is true: more cores have been included than in our experimental SOC⁴. This method uses statistical coding not reliant on the internal circuit details in
⁴VTDs so marked in the table indicate s5378 and s9234 have also been included.
any way and thus is IP-consistent. The issue of ATE stall synchronization (the requirement of the Sync data signal of Figure 2.5; see Section 2.5) has been addressed in this method by passing data from the tester to other cores on the chip while a given core is still busy decompressing. This system level exploitation of a feature is indicative of modularity, and in essence makes this method the temporal analogue of our spatial approach. The method also scales reasonably (one decompressor per core and some system level decompression logic), and as is usually the case with statistical coding, any test set may be applied, giving programmability. On the whole this approach meets all the criteria. The only drawback here is the issue of synchronizing the ATE clock and the CUT clock, which run at different frequencies. In addition, power issues can arise due to fast clocking of the scan chains, as discussed before. With respect to VTD, this method reports 279837 bits for the 7 largest ISCAS89 cores: s5378, s9234, s13207, s15850, s35932, s38417 and s38584. Our result for VTD in this case is 262000, an improvement of 6%.
The approach given in [84] places a register and decoder before the scan chains, and compresses data by encoding the bits which need to change relative to the previous vector. Since no netlist interaction is required, this method can be considered IP-consistent. However, because the compressed streams leave no room for maneuvering, there can be no system level exploitation and thus the method cannot be considered modular. At the same time, although the hardware remains small, increased numbers of scan chains will lead to long runtimes to determine the mapping of care bits and the order in which mutations should be applied, making this method not scalable. A benefit is that the decoder architecture allows as many bits to be mutated as desired, and this approach is therefore programmable. With respect to the VTD comparison, it should be noted that a statement in the paper leaves some ambiguity as to what data has been included in the reported results (some control information is used by the decompressor which must be driven by the tester, and it is not clear whether this has been reported). In the best case, if all the data has been reported, this method is only about 4% better than the proposed approach while not providing as many features, as shown in Table 3.6.
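The essence of mutation encoding fits in a few lines. The sketch below is our simplified reading of the idea, not the encoder of [84]: only the positions where the next test cube's care bits disagree with the vector already held in the scan register need to be transmitted.

```python
# Sketch of mutation-style encoding: ship only the bit positions that must
# flip between consecutive vectors, instead of the full vectors themselves.

def mutations(previous, cube):
    """previous: list of 0/1; cube: list of 0/1/None (None = don't care).
    Returns the positions that must flip before applying the next vector."""
    return [i for i, (p, c) in enumerate(zip(previous, cube))
            if c is not None and c != p]

held = [0, 1, 1, 0, 1, 0, 0, 1]
next_cube = [1, None, 1, None, 0, None, None, 1]
print(mutations(held, next_cube))   # [0, 4]: two mutations instead of 8 bits
```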
Scan trees are used by [68] to reduce the volume of test data; as a result, access to the scan structure is required, which costs the method its IP-consistency. Because the main benefits come from dependencies in the test set, the compressed data will likely not have any features which are sufficiently exploitable at the system level to deem this approach modular. In addition, the large MISRs needed to collect responses from the increased number of scan tree outputs make this method not scalable. With respect to programmability, however, a serial mode is provided by which any pattern may be scanned in, and thus the method is programmable. Although the reported VTD is quite good, the number of features suffers significantly.
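Scan trees rest on the observation that scan cells which never receive conflicting care bits can share a single input branch. The following sketch is a toy compatibility grouping under that simplified model (our illustration, not the tree construction procedure of [68]).

```python
# Sketch: greedily grouping scan cells into compatible classes. Two cells
# are compatible if no test cube assigns them conflicting care bits, so one
# tree branch can feed both.

def compatible(col_a, col_b, cubes):
    return all(c[col_a] is None or c[col_b] is None or c[col_a] == c[col_b]
               for c in cubes)

def greedy_classes(n_cells, cubes):
    classes = []
    for cell in range(n_cells):
        for cls in classes:
            if all(compatible(cell, other, cubes) for other in cls):
                cls.append(cell)       # join the first class with no conflict
                break
        else:
            classes.append([cell])     # start a new class
    return classes

cubes = [[1, None, 1, 0], [None, 0, None, 0], [1, 0, 1, None]]
print(greedy_classes(4, cubes))        # [[0, 2], [1, 3]]: two branches, not four
```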
A dictionary based coding method is used in [63] which does not require internal circuit details and is thus IP-consistent. The decompression hardware may potentially leave some room for system level exploitation, but due to other methodology related issues (such as scan clock gating) the method cannot be deemed modular. In this approach the hardware does not appear to grow too quickly with the number of scan chains, and at the same time provision is made to apply patterns which do not appear in the dictionary, so this method is programmable. With most of the desired features in place, the method performs well, at 40% better than StreamPack in VTD for the same configuration. Unfortunately, as mentioned just above, the method suffers from methodology issues related to scan clock gating, but if these can be overcome the method has good value.
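The escape mechanism that makes dictionary coding programmable is easy to illustrate. Below is a minimal sketch in the spirit of [63], with the word format and parameters being our assumptions: frequent scan slices are replaced by short dictionary indices, while slices missing from the dictionary are sent verbatim after an escape flag.

```python
# Sketch: dictionary coding of scan slices with a literal escape. Hits cost
# a flag plus a short index; misses cost a flag plus the raw slice, which is
# what lets any pattern be applied.
from collections import Counter

def encode(slices, dict_size):
    width = max(1, (dict_size - 1).bit_length())         # index width in bits
    dictionary = [s for s, _ in Counter(slices).most_common(dict_size)]
    index = {s: i for i, s in enumerate(dictionary)}
    words = []
    for s in slices:
        if s in index:
            words.append("0" + format(index[s], f"0{width}b"))  # hit
        else:
            words.append("1" + s)                               # miss: literal
    return dictionary, words

slices = ["0110", "0110", "1000", "0110", "1111", "1000"]
dictionary, words = encode(slices, dict_size=2)
print(dictionary, sum(map(len, words)), "bits vs", 4 * len(slices))  # 15 vs 24
```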
The final approach against which we make a comparison is [82]. Table 3.5 shows that this method provides superior results for VTD. However, as Table 3.6 shows, this comes at the high cost of all features except use of a low cost test platform. VTD is improved through 1) integrated fault simulation (removing IP-consistency), 2) crafting the phase shifter around the test set (removing programmability) and 3) using unreasonably short internal scan chains (removing scalability). After squeezing out this level of compression by exploiting these multiple facets, there is nothing left to exploit at the system level, and thus the approach lacks modularity. In addition, the scheme is in some respects similar to that of [84], and as with that approach it is unclear whether control data has been included in the reported totals. It is important to note that our results for VTD may likewise be improved through the
application of techniques used by this approach and others (e.g., fault simulation), but we have deemed the provision of desired features to be more important than VTD reduction alone. Nevertheless, this method will provide good compression where IP-consistency, modularity, scalability and programmability are unimportant.
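The core idea of seed overlapping can be shown in toy form. The sketch below is our illustration only, and the actual scheme of [82] is considerably more involved: seeds are crafted so that the tail bits of one seed double as the head bits of the next, so n seeds of length L cost fewer than nL tester bits.

```python
# Toy illustration of seed overlapping: consecutive seeds share their
# boundary bits, shortening the stream the ATE must store and transfer.

def overlap_stream(seeds, overlap):
    stream = seeds[0]
    for s in seeds[1:]:
        # Seeds must be computed so that adjacent ones agree on the overlap.
        assert stream[-overlap:] == s[:overlap], "seeds not crafted to overlap"
        stream += s[overlap:]
    return stream

seeds = ["110100", "001011", "110010"]
stream = overlap_stream(seeds, overlap=2)
print(len(stream), "bits instead of", sum(map(len, seeds)))   # 14 instead of 18
```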
In light of all these discussed methods, what is clear is that tradeoffs exist among all aspects of a compression methodology, features and compression results included. A system integrator with carefully formulated requirements can judiciously select, from the many available methods, one that delivers the best compression possible while still retaining all the features they need.
3.7 Summary
In this chapter we have devised five basic objectives for test data compression methodologies by analyzing the needs which EDT is set up to address in the first place. These objectives of IP-consistency, modularity, scalability, programmability and the use of low cost test platforms have been set up as criteria for the design of a new test data compression scheme. As a result, StreamBIST and StreamPack have been proposed to address challenges in core based SOC testing, and improvements to the architecture have been made and discussed. The new architecture and its supporting algorithms have been developed and described in detail. We have seen how this new architecture is able to produce improved tester channel utilization, which results in lower VTD. In many cases the VTD for StreamPack has rivaled other approaches, but it has been achieved while satisfying all five criteria discussed in the first part of the chapter. In light of this fact, a comparison to other approaches in terms of both VTD and the features they provide has been given. It has been shown that some levels of compression come at the expense of one or more features, and it is up to the system integrator to make careful decisions when planning for test. Advantages and drawbacks of this method have also been outlined, and will be discussed further in the next chapter.
Chapter 4
Conclusion
In this thesis, we have provided a justification for manufacturing test of digital integrated circuits. Considerations of test data volume for those circuits have led us to conclude that test data compression methods must be applied to keep up with scaling trends in circuit size and the resulting growth in test data. In light of this conclusion, a survey
of recent test data compression methods has been given which provides a structured
framework for evaluation of these methods. This has led to the development of five
desirable criteria for compression methods, specifically those in a core based SOC
design paradigm. A new method for test data compression in core based SOCs called
StreamBIST/StreamPack has been proposed in this thesis which provides competi-
tive compression while simultaneously addressing all five of the key objectives laid
out for compression methods. The proposed method achieves this compression at
acceptable expense in area and computational complexity, by sharing tester channels
among cores in an SOC at the system level, made possible by the discontinuous de-
mand for data which occurs at the core level. In proposing this method, we have
opted to ensure that we provide the five discussed criteria as features rather than
focus narrowly on minimization of test data. As the comparison with other methods in the previous chapter shows, we have been able to provide all five features at a comparable level of compression. In this final chapter, the advantages and limitations of the proposed method are discussed, leading to considerations for future work.
4.1 Advantages
As has been emphasized throughout this thesis, the main advantage of this work is that it enables the five criteria set forth for compression methods in core based SOCs. By providing compression in a way that does not require any internal details, the method is particularly suitable for SOCs containing IP-protected cores. The structured way in which compression is performed, and in which the decompression hardware is generated, enables smooth integration into existing tool flows, making this method conducive to automation. Furthermore, the scalability considerations which have been provided will enable this method to be applied to the SOCs of the future, not just those of the present. Moreover, movement toward widespread use of the core based SOC paradigm has received recent attention through an initiative to standardize test data compression [24]. Because of the features provided by StreamBIST/StreamPack, the compression methodology proposed in this thesis fits very well into a standardized framework and thus has significant potential for applicability to designs conforming to emerging standards. Finally, although our compression method does not rely on certain techniques (e.g., fault simulation) which compromise the provision of any of the five discussed features, neither does it inherently exclude them. As a result, these techniques could be applied to our method to further reduce the volume of test data in cases where some features are not required. Having discussed the advantages of this work, we discuss its limitations in the next section.
4.2 Limitations
As with all things in design, there is always room for improvement. Test data compression methods can be summarized by four fundamental aspects: the compression achieved, area overhead, CPU runtime and the features they provide. Having exhausted the discussion of features, focus will be given here to how the other three aspects may be improved.
The issue of CPU runtime was addressed in Chapter 3, where it was noted that the complexity of StreamBIST can lead to excessive runtimes in some cases; as such, there is obvious room for improvement here. As a brief example, the branch and bound portion of StreamBIST for seed-mux is a clear processing bottleneck. Instead of approaching each block naively frame by frame, equations could be added in order of decreasing care bit count, since frames with more care bits are likely to be the source of lockout. In this way, the solution space of the branch and bound algorithm is pruned more effectively, leading to shorter runtimes, as the sketch below illustrates. Other aspects of the overall runtime may be reduced in a similar way, by developing more sophisticated algorithms which exploit features of the tasks to which they are applied.
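A minimal sketch of this ordering heuristic follows; the frame representation is our assumption, not StreamBIST's actual data structures.

```python
# Sketch: sort frames by descending care-bit count before the branch and
# bound search adds their equations, so the frames most likely to cause
# lockout are examined first and dead branches are pruned early.

def care_bits(frame):
    """Number of specified (non-X) positions in a frame."""
    return sum(bit is not None for bit in frame)

def ordered_frames(frames):
    return sorted(frames, key=care_bits, reverse=True)

frames = [
    [None, None, 1, None],    # 1 care bit
    [1, 0, 1, 0],             # 4 care bits -> examined first
    [0, None, None, 1],       # 2 care bits
]
for f in ordered_frames(frames):
    print(care_bits(f), f)    # equations added in order 4, 2, 1
```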
Turning attention to area overhead, we have seen that at the system level the control overhead is minuscule compared to the overhead brought about by the core level decompressors. Thus, although there is not much room for improvement on the control side, area may be improved by focusing on the core level decompressor. Again, as a single example, the phase shifter within the core level decompressor, which consists of many XOR gates, can play a significant role in the overhead of the entire decompressor. By exploiting overlap in the linear expressions which describe the channels of the phase shifter, portions of the network may be shared and the overall area thereby reduced, as sketched below. As before, other opportunities for area improvement exist as well.
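As an illustration of the kind of sharing we have in mind, the toy pass below greedily factors the most frequent tap pair out of a set of phase shifter equations. It is a sketch of common subexpression extraction under our simplified gate cost model, not the algorithm of any particular synthesis tool.

```python
# Sketch: greedy sharing of XOR terms across phase shifter outputs. Each
# output is a set of LFSR tap indices; the most frequent tap pair becomes
# an intermediate signal reused by every output containing it. A pair used
# by n outputs costs 1 shared gate instead of n, saving n-1 gates.
from collections import Counter
from itertools import combinations

def share_pairs(outputs):
    outputs = [set(o) for o in outputs]
    new_id, gates_saved = max(max(o) for o in outputs) + 1, 0
    while True:
        pairs = Counter(p for o in outputs
                        for p in combinations(sorted(o), 2))
        pair, n = pairs.most_common(1)[0] if pairs else (None, 0)
        if n < 2:                       # no pair is reused: nothing to share
            return outputs, gates_saved
        for o in outputs:
            if pair[0] in o and pair[1] in o:
                o -= set(pair)
                o.add(new_id)           # reuse the shared XOR's output signal
        gates_saved += n - 1
        new_id += 1

# Phase shifter channels as sets of LFSR stage indices (hypothetical).
channels = [{0, 1, 2}, {0, 1, 3}, {1, 2, 3}, {0, 1, 2, 3}]
print(share_pairs(channels))            # 3 two-input XOR gates saved (9 -> 6)
```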
With respect to compression, the comparison to prior work in the previous chapter has made it clear that improvement can be made in terms of VTD. The challenge is to provide improvement without compromising any features. One way in particular that this could be achieved is by mutual integration of StreamBIST and StreamPack. By providing feedback between the two processes, reseed positions could be recalculated as needed, thus reducing the insertion of NOOPs and improving test time and VTD. This comes with its own challenges, however, and would require significant effort. Another avenue of opportunity for improved VTD is multi-level compression, as discussed in the next section.
4.3 Future Work
Bibliography
Scan Test Efficiency. IEEE Design and Test of Computers, 19(5):65–73, September 2002.

[9] I. Bayraktaroglu and A. Orailoglu. Test Volume and Application Time Reduction through Scan Chain Concealment. In Proc. IEEE/ACM Design Automation Conference (DAC), pages 151–155, 2001.

[17] A. Chandra and K. Chakrabarty. Test Resource Partitioning for SOCs. IEEE Design and Test of Computers, 18(5):80–91, September 2001.

[19] A. Chandra and K. Chakrabarty. Test Resource Partitioning and Reduced Pin-Count Testing Based on Test Data Compression. In Proc. IEEE/ACM Design, Automation and Test in Europe (DATE), pages 598–603, 2002.

[20] A. Chandra and K. Chakrabarty. A Unified Approach to Reduce SOC Test Data Volume, Scan Power and Testing Time. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(3):352–363, March 2003.

[21] A. Chandra and K. Chakrabarty. Test Data Compression and Test Resource Partitioning for System-on-a-Chip using Frequency-Directed Run-Length (FDR) Codes. IEEE Transactions on Computers, 52(8):1076–1088, August 2003.

[22] D. Das and N. A. Touba. Reducing Test Data Volume using External/LBIST Hybrid Test Patterns. In Proc. IEEE International Test Conference (ITC), pages 115–122, 2000.

[30] I. Hamzaoglu and J. H. Patel. Test Set Compaction Algorithms for Combinational Circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(8):957–963, August 2000.

[37] V. Iyengar and A. Chandra. A Unified SOC Test Approach Based on Test Data Compression and TAM Design. In Proc. IEEE Defect and Fault Tolerance in VLSI Systems (DFT), pages 511–518, 2003.

[39] A. Jas, J. Ghosh-Dastidar, M.-E. Ng, and N. A. Touba. An Efficient Test Vector Compression Scheme using Selective Huffman Coding. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(6):797–806, June 2003.

[41] A. Jas, B. Pouya, and N. A. Touba. Test Data Compression Technique for Embedded Cores using Virtual Scan Chains. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 12(7):775–781, July 2004.

[45] D. Kay, S. Chung, and S. Mourad. Embedded Test Control Schemes for Compression in SOCs. In Proc. IEEE/ACM Design Automation Conference (DAC), pages 679–684, 2002.

[46] D. Kay and S. Mourad. Compression Technique for Interactive BIST Application. In Proc. IEEE VLSI Test Symposium (VTS), pages 103–108, 2001.

[47] A. Khoche, E. Volkerink, J. Rivoir, and S. Mitra. Test Vector Compression using EDA-ATE Synergies. In Proc. IEEE VLSI Test Symposium (VTS), pages 97–102, 2002.

[54] B. Koenemann. LFSR-Coded Test Patterns for Scan Designs. In Proc. IEEE European Test Conference (ETC), pages 237–242, 1991.

[55] B. Koenemann. Care Bit Density and Test Cube Clusters: Multi-Level Compression Opportunities. In Proc. IEEE International Conference on Computer Design (ICCD), pages 320–325, 2003.

[58] C. V. Krishna, A. Jas, and N. A. Touba. Test Vector Encoding using Partial LFSR Reseeding. In Proc. IEEE International Test Conference (ITC), pages 885–893, 2001.

[59] C. V. Krishna and N. A. Touba. Reducing Test Data Volume using LFSR Reseeding with Seed Compression. In Proc. IEEE International Test Conference (ITC), pages 321–330, 2002.

[62] L. Lai, T. Rinderknecht, W.-T. Cheng, and J. H. Patel. Logic BIST using Constrained Scan Cells. In Proc. IEEE VLSI Test Symposium (VTS), pages 199–205, 2004.

[64] L. Li and K. Chakrabarty. Test Set Embedding for Deterministic BIST using a Reconfigurable Interconnection Network. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(9):1289–1305, September 2004.

[65] H.-G. Liang, S. Hellebrand, and H.-J. Wunderlich. Two-Dimensional Test Data Compression for Scan-Based Deterministic BIST. In Proc. IEEE International Test Conference (ITC), pages 894–902, 2001.

[66] S. Mitra and K. S. Kim. XMAX: X-Tolerant Architecture for MAXimal Test Compression. In Proc. IEEE International Conference on Computer Design (ICCD), pages 326–330, 2003.

[68] K. Miyase, S. Kajihara, and S. M. Reddy. Multiple Scan Tree Design with Test Vector Modification. In Proc. IEEE Asian Test Symposium (ATS), pages 76–81, 2004.

[71] G. Mrugalski, J. Rajski, and J. Tyszer. Ring Generators - New Devices for Embedded Test Applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(9):1306–1320, September 2004.

[76] I. Pomeranz and S. M. Reddy. Static Test Compaction for Full-Scan Circuits Based on Combinational Test Sets and Nonscan Input Sequences and a Lower Bound on the Number of Tests. IEEE Transactions on Computers, 53(12):1569–1581, December 2004.

[77] J. Rajski. DFT for High-Quality Low Cost Manufacturing Test. In Proc. IEEE Asian Test Symposium (ATS), pages 3–8, 2001.

[81] J. Rajski, J. Tyszer, C. Wang, and S. M. Reddy. Finite Memory Test Response Compactors for Embedded Test Applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(4):622–634, April 2005.

[82] W. Rao, I. Bayraktaroglu, and A. Orailoglu. Test Application Time and Volume Compression through Seed Overlapping. In Proc. IEEE/ACM Design Automation Conference (DAC), pages 732–737, 2003.

[83] W. Rao, A. Orailoglu, and G. Su. Frugal Linear Network-Based Test Decompression for Drastic Test Cost Reductions. In Proc. IEEE/ACM International Conference on Computer Aided Design (ICCAD), pages 721–725, 2004.

[84] S. Reda and A. Orailoglu. Reducing Test Application Time through Test Data Mutation Encoding. In Proc. IEEE/ACM Design, Automation and Test in Europe (DATE), pages 387–393, 2002.

[88] O. Sinanoglu and A. Orailoglu. Test Data Manipulation Techniques for Energy-Frugal, Rapid Scan Test. In Proc. IEEE Asian Test Symposium (ATS), pages 202–207, 2003.

[90] C. Stroud, J. Sunwoo, S. Garimella, and J. Harris. Built-In Self-Test for System-on-Chip: A Case Study. In Proc. IEEE International Test Conference (ITC), pages 837–846, 2004.

[91] X. Sun, L. Kinney, and B. Vinnakota. Combining Dictionary Coding and LFSR Reseeding for Test Data Compression. In Proc. IEEE/ACM Design Automation Conference (DAC), pages 944–947, 2004.

[96] N. A. Touba. Circular BIST with State Skipping. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 10(5):668–672, October 2002.

[100] E. H. Volkerink and S. Mitra. Efficient Seed Utilization for Reseeding Based Compression. In Proc. IEEE VLSI Test Symposium (VTS), pages 232–237, 2003.

[102] L.-T. Wang, X. Wen, H. Furukawa, F.-S. Hsu, S.-H. Lin, S.-W. Tsai, K. S. Abdel-Hafez, and S. Wu. VirtualScan: A New Compressed Scan Technology for Test Cost Reduction. In Proc. IEEE International Test Conference (ITC), pages 916–925, 2004.

[111] D. Xiang, M.-J. Chen, K.-W. Li, and Y.-L. Wu. Scan-Based BIST using an Improved Scan Forest Architecture. In Proc. IEEE Asian Test Symposium (ATS), pages 88–93, 2004.

[113] G. Zeng and H. Ito. Non-Intrusive Test Compression for SOC using Embedded FPGA Core. In Proc. IEEE Defect and Fault Tolerance in VLSI Systems (DFT), pages 413–421, 2004.
Index
Layout, 5
LFSR, 27
Lockout, 38
Low cost tester, 51
Manufacturing test, 7
MISR, 28
Modularity, 50
Moore's law, 2
Multi-mode, 79
Netlist, 5
NP-complete, 11
Phase shifter, 28
Phase shifter reconfiguration, 75
Physical layout, 5
Process variation, 8
Programmability, 51
PRPG, 22
Pseudo-random BIST, 22
RTL, 4
Run-length encoding, 26
Scalability, 51
Scan, 13
Scan chain, 13
Scan chain reconfiguration, 79
Scan frequency, 25
SCU, 31
Seed calculation, 58
Seed-mux, 75
Seed-only, 58
SOC, 3
Spatial compression, 32
Specification, 5
StreamBIST, 61
StreamPack, 69
Structural test, 9
Stuck-at-0, 9
Stuck-at-1, 9
STUMPS, 27
Synthesis, 5
System integrator, 36
TCU, 31
Temporal compression, 32
Test cube, 11
Test data compression, 18
Test pattern, 10
Test set, 20
Test set dependent, 32
Test set independent, 32
Testbench, 7
Verilog, 4
VHDL, 4
VLSI, 1
VTD, 14
X-bits, 11