Guide to FPGA Implementation of Arithmetic Functions
Volume 149
Jean-Pierre Deschamps
University Rovira i Virgili
Tarragona
Spain
Enrique Cantó
University Rovira i Virgili
Tarragona
Spain
Gustavo D. Sutter
School of Computer Engineering
Universidad Autónoma de Madrid
Ctra. de Colmenar Km. 15
28049 Madrid
Spain
ISSN 1876-1100    e-ISSN 1876-1119
ISBN 978-94-007-2986-5    e-ISBN 978-94-007-2987-2
DOI 10.1007/978-94-007-2987-2
Preface
Field Programmable Gate Arrays constitute one of the technologies at hand for
developing electronic systems. They form an attractive option for small production
quantities as their non-recurrent engineering costs are much lower than those
corresponding to ASICs. They also offer flexibility and fast time-to-market.
Furthermore, in order to reduce their size and, hence, the unit cost, an interesting
possibility is to reconfigure them at run time so that the same programmable
device can execute different predefined functions.
Complex systems, generally, are made up of processors executing programs,
memories, buses, input-output interfaces, and other peripherals of different types.
Those components are available under the form of Intellectual Property (IP) cores
(synthesizable Hardware Description Language descriptions or even physical
descriptions). Some systems also include specific components implementing
algorithms whose execution on an instruction-set processor is too slow. Typical
examples of such complex algorithms are: long-operand arithmetic operations,
floating-point operations, encoding and processing of different types of signals,
data ciphering, and many others. The way those specific components can be
developed is the central topic of this book. Thus, it is addressed both to FPGA
users interested in developing new specific components (generally for reducing
execution times) and to IP core designers interested in extending their catalog of
specific components.
This book distinguishes itself with the following aspects:
The main topic is circuit synthesis. Given an algorithm executing some complex
function, how can it be translated to a synthesizable circuit description? Which
are the choices that the designer can make in order to reduce the circuit cost,
latency, or power consumption? Thus, this is not a book on algorithms. It is a
book on how to efficiently translate an algorithm to a circuit using, for that
purpose, techniques such as parallelism, pipelining, loop unrolling, and others.
In particular, this is not a new version of a previous book by two of the authors.¹

¹ Deschamps JP, Bioul G, Sutter G (2006) Synthesis of Arithmetic Circuits. Wiley, New York.
Overview
This book is divided into sixteen chapters. In the first chapter the basic building
blocks of digital systems are briefly reviewed, and their VHDL descriptions are
presented. It constitutes a bridge with previous courses, or books, on Hardware
Description Languages and Logic Circuits.
Chapters 2–4 constitute a first part whose aim is the description of the basic
principles and methods of algorithm implementation. Chapter 2 describes the
breaking up of a circuit into Data Path and Control Unit, and tackles the scheduling
and resource assignment problems. In Chaps. 3 and 4 some special topics of Data
Path and Control Unit synthesis are presented.
Chapter 5 recalls important electronic concepts that must be taken into account
for getting reliable circuits and Chap. 6 gives information about the main Electronic Design Automation (EDA) tools that are available for developing systems
on FPGAs.
Chapters 7–13 are dedicated to the main arithmetic operations, namely addition
(Chap. 7), multiplication (Chap. 8), division (Chap. 9), other operations such as
square root, logarithm, exponentiation, trigonometric functions, base conversion
(Chap. 10), decimal arithmetic (Chap. 11), floating-point arithmetic (Chap. 12),
and finite-field arithmetic (Chap. 13). For every operation, several configurations
are considered (combinational, sequential, pipelined, bit serial or parallel), and
several generic models are available, thus, constituting a library of virtual
components.
The development of Systems on Chip (SoC) is the topic of Chaps. 14–16. The
main concepts are presented in Chap. 14: embedded processors, memories, buses,
IP components, prototyping boards, and so on. Chapter 15 presents two case
studies, both based on commercial EDA tools and prototyping boards. Chapter 16
is an introduction to dynamic reconfiguration, a technique that allows reducing the
area by modifying the device configuration at run time.
Acknowledgments
The authors thank the people who have helped them in developing this book,
especially Dr. Matthew Copley, for correcting the text. They are grateful to the
following universities for providing them the means for carrying this work through
to a successful conclusion: University Rovira i Virgili (Tarragona, Spain) and
Autonomous University of Madrid (Spain).
Contents

6 EDA Tools
  6.1 Design Flow in FPGA EDA Tools
    6.1.1 Design Entry
    6.1.2 Synthesis
7 Adders
  7.1 Addition of Natural Numbers
  7.2 Binary Adder
  7.3 Radix-2^k Adder
  7.4 Carry Select Adders
  7.5 Logarithmic Adders
  7.6 Long-Operand Adder
  7.7 Multioperand Adders
    7.7.1 Sequential Multioperand Adders
    7.7.2 Combinational Multioperand Adders
    7.7.3 Parallel Counters
  7.8 Subtractors and Adder-Subtractors
  7.9 FPGA Implementations
    7.9.1 Binary Adder
    7.9.2 Radix-2^k Adders
    7.9.3 Carry Select Adder
    7.9.4 Logarithmic Adders
    7.9.5 Long-Operand Adder
    7.9.6 Sequential Multioperand Adders
    7.9.7 Combinational Multioperand Adders
    7.9.8 Comparison
  7.10 Exercises
  References
8 Multipliers
  8.1 Basic Algorithm
  8.2 Combinational Multipliers
    8.2.1 Ripple-Carry Parallel Multiplier
    8.2.2 Carry-Save Parallel Multiplier
    8.2.3 Multipliers Based on Multioperand Adders
    8.2.4 Radix-2^k and Mixed-Radix Parallel Multipliers
  8.3 Sequential Multipliers
    8.3.1 Shift and Add Multiplier
    8.3.2 Shift and Add Multiplier with CSA
  8.4 Integers
    8.4.1 Mod 2^(n+m) Multiplication
    8.4.2 Modified Shift and Add Algorithm
    8.4.3 Post Correction Multiplication
    8.4.4 Booth Multiplier
  8.5 Constant Multipliers
  8.6 FPGA Implementations
    8.6.1 Combinational Multipliers
    8.6.2 Radix-2^k Parallel Multipliers
    8.6.3 Sequential Multipliers
    8.6.4 Combinational Multipliers for Integers
    8.6.5 Sequential Multipliers for Integers
  8.7 Exercises
  References
9 Dividers
  9.1 Basic Digit-Recurrence Algorithm
  9.2 Radix-2 Division
    9.2.1 Non-Restoring Divider
    9.2.2 Restoring Divider
    9.2.3 Binary SRT Divider
    9.2.4 Binary SRT Divider with Carry-Save Adder
    9.2.5 Radix-2^k SRT Dividers
  9.3 Radix-B Dividers
  9.4 Convergence Algorithms
  9.5 FPGA Implementations
    9.5.1 Digit-Recurrence Algorithms
    9.5.2 Convergence Algorithms
  9.6 Exercises
  References
10 Other Operations
  10.1 Binary to Radix-B Conversion (B even)
  10.2 Radix-B to Binary Conversion (B even)
11 Decimal Operations
  11.1 Addition
    11.1.1 Decimal Ripple-Carry Adders
    11.1.2 Base-B Carry-Chain Adders
    11.1.3 Base-10 Carry-Chain Adders
    11.1.4 FPGA Implementation of the Base-10 Carry-Chain Adders
  11.2 Base-10 Complement and Addition: Subtraction
    11.2.1 Ten's Complement Numeration System
    11.2.2 Ten's Complement Sign Change
    11.2.3 10's Complement BCD Carry-Chain Adder-Subtractor
    11.2.4 FPGA Implementations of Adder-Subtractors
  11.3 Decimal Multiplication
    11.3.1 One-Digit by One-Digit BCD Multiplication
    11.3.2 N by One BCD Digit Multiplier
    11.3.3 N by M Digits Multiplier
  11.4 Decimal Division
    11.4.1 Non-Restoring Division Algorithm
    11.4.2 An SRT-Like Division Algorithm
    11.4.3 Other Methods for Decimal Division
  11.5 FPGA Implementation Results
    11.5.1 Adder-Subtractor Implementations
    11.5.2 Multiplier Implementations
    11.5.3 Decimal Division Implementations
  11.6 Exercises
  References
13 Finite-Field Arithmetic
  13.1 Operations Modulo m
    13.1.1 Addition and Subtraction Mod m
    13.1.2 Multiplication Mod m
  13.2 Division Modulo p
  13.3 Operations Over Z2[x]/f(x)
    13.3.1 Addition and Subtraction of Polynomials
    13.3.2 Multiplication Modulo f(x)
  13.4 Division Over GF(2^m)
  13.5 FPGA Implementations
  13.6 Exercises
  References
14 Systems on Chip
  14.1 System on Chip
  14.2 Intellectual Property Cores
  14.3 Embedded Systems
    14.3.1 Embedded Microprocessors
    14.3.2 Peripherals
    14.3.3 Coprocessors
    14.3.4 Memory
    14.3.5 Busses
  References
Index
Chapter 1
[Figure: symbols of the basic logic gates NOT, AND2, OR2, NAND2, NOR2, NAND3, NOR3, XOR2, XNOR2, XOR3, XNOR3]
The same basic Boolean operations can be used to define sets of logic gates
working in parallel. As an example, assume that signals a = (an-1, an-2, …, a0) and
b = (bn-1, bn-2, …, b0) are n-bit vectors. The following assignment defines a set of
n AND2 gates that compute c = (an-1·bn-1, an-2·bn-2, …, a0·b0):
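The assignment can be sketched as a single VHDL concurrent statement; as in the book's other listings, library and entity declarations are omitted here:

```vhdl
-- n AND2 gates working in parallel:
-- c(i) <= a(i) and b(i), for i = n-1 downto 0
c <= a and b;
```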
and the two following instructions define a 1-bit full adder (a 3-input 2-output
component implementing s = (a + b + cin) mod 2 and cout = a·b ∨ a·cin ∨ b·cin, Fig. 1.3).
[Fig. 1.3: full adder cell FA with carry input cin and carry output cout]
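A minimal sketch of the two instructions, using the names from the text:

```vhdl
-- 1-bit full adder (Fig. 1.3): two concurrent assignments
s    <= a xor b xor cin;                          -- (a + b + cin) mod 2
cout <= (a and b) or (a and cin) or (b and cin);  -- majority (carry) function
```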
1.1.2 Tables
Combinational components can also be defined by their truth tables, without the
necessity to translate their definition to Boolean equations. As a first example,
consider a 1-digit to 7-segment decoder, that is a 4-input 7-output combinational
circuit.
Example 1.3
The behavior of a 4-to-7 decoder (Fig. 1.4) can be described by a conditional
assignment instruction
[Fig. 1.4: 4-to-7 decoder with 4-bit input digit and 7 segment outputs, segment0 to segment6]
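A sketch of such a description as a selected signal assignment; the segment patterns below assume a common ordering segments = (segment6, …, segment0) with '1' meaning a lit segment, which is an illustrative convention rather than the book's:

```vhdl
-- 4-to-7 decoder (Fig. 1.4); blank output assumed for non-decimal codes
with digit select segments <=
  "0111111" when "0000",  -- 0
  "0000110" when "0001",  -- 1
  "1011011" when "0010",  -- 2
  "1001111" when "0011",  -- 3
  "1100110" when "0100",  -- 4
  "1101101" when "0101",  -- 5
  "1111101" when "0110",  -- 6
  "0000111" when "0111",  -- 7
  "1111111" when "1000",  -- 8
  "1101111" when "1001",  -- 9
  "0000000" when others;
```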
[Fig. 1.5: a 4-input Boolean function implemented as a 16-bit ROM storing the truth vector of f; inputs a3, a2, a1, a0 and output b = f(a3, a2, a1, a0)]
Example 1.4
The following entity defines a 4-input Boolean function whose truth vector is
stored within a generic parameter (Fig. 1.5). Library declarations are omitted.
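A possible form of this entity (library declarations included here for self-containment; the default truth vector is an arbitrary example):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- 4-input Boolean function defined by its 16-bit truth vector (Fig. 1.5)
entity lut4 is
  generic (truth_vector : std_logic_vector(15 downto 0) := x"6996");
  port (a : in  std_logic_vector(3 downto 0);
        b : out std_logic);
end lut4;

architecture behavior of lut4 is
begin
  -- the input word selects one bit of the stored truth vector
  b <= truth_vector(to_integer(unsigned(a)));
end behavior;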
[Fig. 1.6: n-bit 2-to-1 multiplexer MUX2-1 (inputs a, b, select input sel) and n-bit 4-to-1 multiplexer MUX4-1 (inputs a, b, c, d, 2-bit select input sel)]
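The multiplexer descriptions themselves do not survive in this copy; from Fig. 1.6 they can be sketched as a conditional and a selected assignment (output names y and z are chosen here for illustration):

```vhdl
-- n-bit 2-to-1 multiplexer (MUX2-1)
y <= a when sel = '0' else b;

-- n-bit 4-to-1 multiplexer (MUX4-1)
with sel select z <=
  a when "00",
  b when "01",
  c when "10",
  d when others;
```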
Comment 1.1
The VHDL model of the preceding Example 1.4 can be used for simulation
purposes, independently of the chosen FPGA vendor. Nevertheless, in order to
implement an actual circuit, the corresponding vendor's primitive component
should be used instead (Chap. 5).
Demultiplexers, address decoders and tri-state buffers are other components frequently used for implementing controllable connections such as buses.
Example 1.6
The following equations define a 1-bit 1-to-2 demultiplexer (Fig. 1.7a)
Fig. 1.7 Demultiplexers: (a) 1-bit 1-to-2 demultiplexer DEMUX1-2 with select input sel; (b) n-bit 1-to-4 demultiplexer DEMUX1-4 with 2-bit select input sel
In the second case the signal types must be compatible with the conditional
assignments: a, b, c, d and e are assumed to be n-bit vectors for some constant
value n.
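The two descriptions can be sketched as follows (two independent fragments; in the second, zero stands for an all-0s constant of the right width):

```vhdl
-- 1-bit 1-to-2 demultiplexer (Fig. 1.7a): Boolean equations
b <= a and not sel;
c <= a and sel;

-- n-bit 1-to-4 demultiplexer (Fig. 1.7b): conditional assignments
b <= a when sel = "00" else zero;
c <= a when sel = "01" else zero;
d <= a when sel = "10" else zero;
e <= a when sel = "11" else zero;
```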
An address decoder is a particular case of 1-bit demultiplexer whose input is 1.
Example 1.7
The following equations define a 3-input address decoder (Fig. 1.8).
[Fig. 1.8: 3-input address decoder with input address and outputs row(0) to row(7)]
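A sketch of those equations, written as conditional assignments (row(i) raised when address encodes i):

```vhdl
-- 3-input address decoder (Fig. 1.8)
row(0) <= '1' when address = "000" else '0';
row(1) <= '1' when address = "001" else '0';
row(2) <= '1' when address = "010" else '0';
row(3) <= '1' when address = "011" else '0';
row(4) <= '1' when address = "100" else '0';
row(5) <= '1' when address = "101" else '0';
row(6) <= '1' when address = "110" else '0';
row(7) <= '1' when address = "111" else '0';
```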
[Fig. 1.9: n-bit tri-state buffer with enable input. Fig. 1.10: data bus connecting several sources through demultiplexers and tri-state buffers, with enable signals enable_a to enable_e driving an 8-bit data_bus]
Example 1.8
The following conditional assignment defines a tri-state buffer (Fig. 1.9): when the
enable signal is equal to 1, output b is equal to input a, and when the enable signal
is equal to 0, output b is disconnected (high impedance state). The signal types
must be compatible with the conditional assignments: a and b are assumed to be
n-bit vectors for some constant value n.
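The assignment can be sketched as:

```vhdl
-- tri-state buffer (Fig. 1.9): high impedance when enable = '0'
b <= a when enable = '1' else (others => 'Z');
```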
An example of the use of demultiplexers and tri-state buffers, namely a data bus, is
shown in Fig. 1.10.
Example 1.9
The circuit of Fig. 1.10, made up of demultiplexers and three-state buffers, can be
described by Boolean equations (the demultiplexers) and conditional assignments
(the three-state buffers).
Generally, this second implementation is considered safer than the first one. In
fact, tri-state buffers should not be used within the circuit core. They should only
be used within I/O-port components (Sect. 1.4).
[Fig. 1.12: (a) mod 2^n adder with inputs a, b and carry input cin; (b) the same adder extended with an output carry cout. Also shown: n-bit by m-bit multiplier with (n+m)-bit output product]
By adding a most significant bit 0 to one (or both) of the n-bit operands a and b, an
n-bit adder with output carry cOUT can be defined (Fig. 1.12b). The internal signal
sum is an (n+1)-bit vector.
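A minimal sketch of this construction, assuming unsigned types from numeric_std and an auxiliary 1-bit unsigned signal carry to bring cin into the addition:

```vhdl
-- n-bit adder with output carry (Fig. 1.12b);
-- a, b : unsigned(n-1 downto 0); sum : unsigned(n downto 0)
carry(0) <= cin;                           -- carry : unsigned(0 downto 0)
sum      <= ('0' & a) + ('0' & b) + carry; -- operands extended with a '0'
cout     <= sum(n);
result   <= sum(n-1 downto 0);
```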
Comment 1.2
In most VHDL models available at the authors' web page, the type unsigned has
been used, so that bit-vectors can be treated as natural numbers. In some cases, it
could be better to use the signed type, for example when bit-vectors are interpreted
as two's complement integers, and when magnitude comparisons or sign-bit
extensions are performed.
Fig. 1.14 D flip-flops: (a) edge-triggered flip-flop with inputs D and clk; (b) flip-flop triggered by the negative edge of clkb, with asynchronous PRESET and CLEARb inputs and complementary outputs Q and Qb
while the following, which is controlled by the negative edge of clkb, has two
asynchronous inputs clearb (active at low level) and preset (active at high
level), having clearb priority, and has two complementary outputs q and qb
(Fig. 1.14b).
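A sketch of that flip-flop (q is assumed readable, e.g. declared as an internal signal or buffer so that qb can be derived from it):

```vhdl
-- D flip-flop of Fig. 1.14b: negative-edge triggered, with asynchronous
-- clearb (active low, priority) and preset (active high)
process (clkb, clearb, preset)
begin
  if clearb = '0' then
    q <= '0';
  elsif preset = '1' then
    q <= '1';
  elsif falling_edge(clkb) then
    q <= d;
  end if;
end process;

qb <= not q;  -- complementary output
```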
Comment 1.3
The use of level-controlled, instead of edge-controlled, components is not
recommended. Nevertheless, if it were necessary, a D-latch could be modeled as
follows:
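A minimal sketch of such a latch:

```vhdl
-- level-controlled D latch: transparent while clk = '1',
-- holds its value while clk = '0'
process (clk, d)
begin
  if clk = '1' then
    q <= d;
  end if;
end process;
```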
[Fig. 1.15: n-bit parallel register with data input d, clock-enable input ce, clock clk and output q]
1.2.2 Registers
Registers are sets of D-flip-flops controlled by the same synchronization and
control signals, and connected according to some regular scheme (parallel, left or
right shift, bidirectional shift).
Example 1.12
The following component is a parallel register with ce (clock enable) input,
triggered by the positive edge of clk (Fig. 1.15).
As a second example (Fig. 1.16), the next component is a right shift register with
parallel input (controlled by load) and serial input (controlled by shift).
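The two registers can be sketched as follows (in the shift register, load is assumed to have priority over shift):

```vhdl
-- parallel register with clock enable (Fig. 1.15)
process (clk)
begin
  if rising_edge(clk) then
    if ce = '1' then
      q <= d;
    end if;
  end if;
end process;

-- right shift register with parallel and serial inputs (Fig. 1.16)
process (clk)
begin
  if rising_edge(clk) then
    if load = '1' then
      q <= parallel_in;
    elsif shift = '1' then
      q <= serial_in & q(n-1 downto 1);  -- shift right, serial_in enters at the left
    end if;
  end if;
end process;
```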
1.2.3 Counters
A combination of registers and arithmetic operations permits the definition of
counters.
[Fig. 1.16: n-bit right shift register with parallel_in, serial_in, load and shift inputs. Fig. 1.17: up/down counter with parallel_in, load and count inputs]
Example 1.13
This defines an up/down counter (Fig. 1.17) with control signals load (input
parallel_in), count (update the state of the counter) and upb_down (0: count up, 1:
count down).
The following component is a down counter (Fig. 1.18) with control signals load
(input parallel_in) and count (update the state of the counter). An additional binary
output equal_zero is raised when the state of the counter is zero (all 0s vector).
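The counters can be sketched as follows, with q declared as an unsigned signal:

```vhdl
-- up/down counter (Fig. 1.17)
process (clk)
begin
  if rising_edge(clk) then
    if load = '1' then
      q <= unsigned(parallel_in);
    elsif count = '1' then
      if upb_down = '0' then
        q <= q + 1;  -- count up
      else
        q <= q - 1;  -- count down
      end if;
    end if;
  end if;
end process;

-- additional output of the down counter (Fig. 1.18)
equal_zero <= '1' when q = 0 else '0';
```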
[Fig. 1.18: n-bit down counter with parallel_in, load, count and clk inputs, output q and additional output equal_zero]
[Fig. 1.19: structure and timing of a Mealy machine: combinational circuit 1 (delay t1) computes the next state from the input and internal states; combinational circuit 2 (delay t2) computes the output state; tSUinput is the input set-up time]
[Fig. 1.20: structure and timing of a Moore machine: combinational circuit 2 (delay t2) computes the output state from the internal state only]
The set up and hold times of the register (Chap. 6) have not been taken into
account.
Example 1.14
A Moore machine is shown in Fig. 1.21. It is the control unit of a programmable
timer (Exercise 2.6.2). It has seven internal states, three binary input signals start,
zero and reference, and two output signals operation (2 bits) and done. It can be
described by the following processes.
[Fig. 1.21: state diagram of the Moore machine (transitions depend on start, zero and reference) and its output table:
state:      0   1   2   3   4   5   6
operation: 00  00  11  00  00  00  01
done:       1   1   0   0   0   0   0]
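The next-state process depends on details of the state diagram that are only partly recoverable here, but the output process is fully determined by the state/operation/done table of Fig. 1.21 and can be sketched as follows (current_state assumed declared as natural range 0 to 6):

```vhdl
-- output process of the Moore machine of Fig. 1.21
process (current_state)
begin
  case current_state is
    when 0 | 1  => operation <= "00"; done <= '1';
    when 2      => operation <= "11"; done <= '0';
    when 6      => operation <= "01"; done <= '0';
    when others => operation <= "00"; done <= '0';  -- states 3, 4, 5
  end case;
end process;
```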
Example 1.15
Consider the Mealy machine of Table 1.1. It has four internal states, two
binary inputs x1 and x0, and one binary output z. Assume that x0 and x1 are
Table 1.1 A Mealy machine: next state/z

         x1 x0:  00    01    10    11
     A          A/0   B/0   A/1   D/1
     B          B/1   B/0   A/1   C/0
     C          B/1   C/1   D/0   C/0
     D          A/0   C/1   D/0   D/1
periodic, but out of phase, signals. Then the machine detects whether x0 changes
before x1 or x1 changes before x0. In the first case the sequence of internal
states is A B C D A B … and z = 0. In the second case the sequence is D C B
A D C … and z = 1.
It can be described by the following processes.
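The combinational part of those processes can be sketched directly from Table 1.1 (assuming a declaration such as type state_type is (A, B, C, D), with x the 2-bit input x1 & x0; the state register, not shown, updates current_state with next_state on the clock edge):

```vhdl
-- next state / z computation for the Mealy machine of Table 1.1
process (current_state, x)
begin
  case current_state is
    when A =>
      case x is
        when "00"   => next_state <= A; z <= '0';
        when "01"   => next_state <= B; z <= '0';
        when "10"   => next_state <= A; z <= '1';
        when others => next_state <= D; z <= '1';
      end case;
    when B =>
      case x is
        when "00"   => next_state <= B; z <= '1';
        when "01"   => next_state <= B; z <= '0';
        when "10"   => next_state <= A; z <= '1';
        when others => next_state <= C; z <= '0';
      end case;
    when C =>
      case x is
        when "00"   => next_state <= B; z <= '1';
        when "01"   => next_state <= C; z <= '1';
        when "10"   => next_state <= D; z <= '0';
        when others => next_state <= C; z <= '0';
      end case;
    when D =>
      case x is
        when "00"   => next_state <= A; z <= '0';
        when "01"   => next_state <= C; z <= '1';
        when "10"   => next_state <= D; z <= '0';
        when others => next_state <= D; z <= '1';
      end case;
  end case;
end process;
```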
[Fig. 1.22: ROM whose stored data is a generic parameter; input address, output word (m bits). Fig. 1.23: synchronous RAM with inputs data_in, address, write and clk, and output data_out]
Then the following component instantiation defines a ROM storing 16 4-bit words,
namely
Example 1.17
The following entity defines a synchronous Random Access Memory storing 2n
m-bit words (Fig. 1.23). A write input enables the writing operation. Functionally,
it is equivalent to a Register File made up of 2n m-bit registers whose clock enable
inputs are connected to write, plus an address decoder. Library declarations are
omitted.
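A possible form of this entity (library declarations included here for self-containment; an asynchronous read port is assumed, matching the register-file equivalence):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- synchronous RAM storing 2**n m-bit words (Fig. 1.23)
entity ram is
  generic (n : natural := 4; m : natural := 8);
  port (clk, write : in  std_logic;
        address    : in  std_logic_vector(n-1 downto 0);
        data_in    : in  std_logic_vector(m-1 downto 0);
        data_out   : out std_logic_vector(m-1 downto 0));
end ram;

architecture behavior of ram is
  type mem_type is array (0 to 2**n - 1) of std_logic_vector(m-1 downto 0);
  signal mem : mem_type;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if write = '1' then                       -- write enabled by write
        mem(to_integer(unsigned(address))) <= data_in;
      end if;
    end if;
  end process;

  data_out <= mem(to_integer(unsigned(address)));  -- read port
end behavior;
```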
[Fig. 1.24: I/O-port components: input buffers (one with a pull-up resistor), a tri-state output buffer and bidirectional buffers with enable inputs]
An input buffer with a pull-up resistor can be defined as follows (Fig. 1.24b).
The definition of a tri-state output buffer (Fig. 1.24c) is the same as in Example 1.8,
that is
1.6 Exercises
1. Generate the VHDL model of a circuit that computes y = ax where a is a bit,
and x and y are n-bit vectors, so that y = (axn-1, axn-2, …, ax0).
2. Generate several models of a 1-bit full subtractor (Boolean equations, table,
LUT instantiation).
3. Generate a generic model of an n-bit 8-to-1 multiplexer.
Reference
1. Hamblen JO, Hall TS, Furman MD (2008) Rapid prototyping of digital systems. Springer,
New York
Chapter 2
This chapter describes the classical architecture of many digital circuits and presents,
by means of several examples, the conventional techniques that digital circuit
designers can use to translate an initial algorithmic description to an actual circuit. The
main topics are the decomposition of a circuit into Data Path and Control Unit and the
solution of two related problems, namely scheduling and resource assignment.
In fact, modern Electronic Design Automation tools have the capacity to directly
generate circuits from algorithmic descriptions, with performances (latency, cost,
consumption) comparable with those obtained using more traditional methods.
Those development tools are one of the main topics of Chap. 5. So, it is possible
that, in the future, the concepts and methods presented in this chapter will no longer
be of interest to circuit designers, allowing them to concentrate on algorithmic
innovative aspects rather than on scheduling and resource assignment optimization.
The same method can be used for computing the square root of x. For that, the loop
execution is controlled by the condition s ≤ x.
Algorithm 2.1: Square root
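The algorithm body can be reconstructed from the surrounding discussion; a sketch in VHDL-like sequential statements (as they would appear within a process):

```vhdl
-- Algorithm 2.1 (reconstruction): r and s satisfy the
-- invariant s = (r + 1)^2 at the start of every iteration
r := 0; s := 1;
while s <= x loop
  s := s + 2*(r + 1) + 1;  -- new s = (r + 2)^2
  r := r + 1;
end loop;
-- on exit: r^2 <= x < (r + 1)^2
```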
The loop is executed as long as s ≤ x, that is, (r + 1)² ≤ x. Thus, at the end of the
loop execution,

r² ≤ x < (r + 1)².

Obviously, this is not a good algorithm as its computation time is proportional to
the square root itself, so that for great values of x (x ≅ 2^n) the number of steps is
of the order of 2^(n/2). Efficient algorithms are described in Chap. 10.
In order to implement Algorithm 2.1, the list of operations executed at each
clock cycle must be defined. In this case, each iteration step includes three
operations: evaluation of the condition s ≤ x, s + 2(r + 1) + 1 and r + 1. They can
be executed in parallel. On the other hand, the successive values of r and s must be
stored at each step. For that, two registers are used. Their initial values (0 and 1
respectively) are controlled by a common load signal, and their updating at the end
of each step by a common ce (clock enable) signal. The circuit is shown in Fig. 2.1.
To complete the circuit, a control unit in charge of generating the load and ce
signals must be added. It is a finite state machine with one input greater (detection
of the loop execution end) and two outputs, load and ce. A start input and a done
output are added in order to allow the communication with other circuits. The
finite state machine is shown in Fig. 2.2.
The circuit of Fig. 2.1 is made up of five blocks whose VHDL models are the
following:
computation of next_r:
computation of next_s:
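The two combinational blocks can be sketched as follows (r, s, next_r and next_s assumed unsigned, with widths chosen so that no overflow occurs):

```vhdl
-- combinational blocks of Fig. 2.1
next_r <= r + 1;
next_s <= s + shift_left(r, 1) + 3;  -- s + 2(r+1) + 1 = s + 2r + 3
```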
25
adder
r+1
x2
adder
next_s
next_ r
load
ce
register
initial value : 0
s+2(r+1)+1
load
ce
register
initial value : 1
s
x
comparator
greater
Fig. 2.1 Square root computation: data path
[Fig. 2.2 Square root computation: control unit. Two-state diagram with commands nop, begin and update; command table: nop: ce = 0, load = 0, done = 1; begin: ce = 0, load = 1, done = 0; update: ce = 1, load = 0, done = 0]
register r:
register s:
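The two registers can be sketched as follows (r and s assumed unsigned, numeric_std in use):

```vhdl
-- registers r and s: common load and ce, initial values 0 and 1
process (clk)
begin
  if rising_edge(clk) then
    if load = '1' then
      r <= (others => '0');
    elsif ce = '1' then
      r <= next_r;
    end if;
  end if;
end process;

process (clk)
begin
  if rising_edge(clk) then
    if load = '1' then
      s <= to_unsigned(1, s'length);
    elsif ce = '1' then
      s <= next_s;
    end if;
  end if;
end process;
```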
The control unit is a Mealy finite state machine that can be modeled as follows:
next state computation:
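A sketch of the control unit, with the command encoding taken from the command table of Fig. 2.2 (transitions as far as recoverable from the state diagram; states numbered 0 and 1 as in the figure):

```vhdl
-- Mealy control unit of Fig. 2.2
-- assuming: signal current_state, next_state : natural range 0 to 1;
process (current_state, start, greater)
begin
  ce <= '0'; load <= '0'; done <= '0';  -- defaults
  next_state <= current_state;
  case current_state is
    when 0 =>
      if start = '0' then
        done <= '1';                    -- nop
      else
        load <= '1'; next_state <= 1;   -- begin
      end if;
    when others =>
      if greater = '1' then
        done <= '1'; next_state <= 0;   -- nop: computation finished
      else
        ce <= '1';                      -- update
      end if;
  end case;
end process;

process (clk)  -- state register
begin
  if rising_edge(clk) then
    if reset = '1' then
      current_state <= 0;
    else
      current_state <= next_state;
    end if;
  end if;
end process;
```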
The circuit of Fig. 2.1 includes three n-bit adders: a half adder for computing
next_r, a full adder for computing next_s and another full adder (actually a
subtractor) for detecting the condition s > x. Another option is to use one adder
and to decompose each iteration step into three clock cycles. For that,
Algorithm 2.1 is slightly modified.
Algorithm 2.2: Square root, version 2
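The body of Algorithm 2.2 is not recoverable here; one possible decomposition, consistent with the three commands update_r, update_s and update_greater of Fig. 2.5 (a hedged reconstruction, not necessarily the book's exact ordering):

```vhdl
-- Algorithm 2.2 (hedged reconstruction): one addition/subtraction
-- per cycle, so that a single adder suffices
r := 0; s := 1;
greater := (s > x);       -- one subtraction
while not greater loop
  r := r + 1;             -- update_r
  s := s + 2*r + 1;       -- update_s (uses the updated r)
  greater := (s > x);     -- update_greater
end loop;
```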
[Fig. 2.3: programmable resource: an adder whose operands (operand1, operand2) and carry input are selected by a 2-bit operation code; outputs result(n-1..0) and result(n)]
register s:
[Fig. 2.4: square root computation, version 2: one programmable adder/subtractor, registers r and s with load, ce_r and ce_s inputs, and a flip-flop storing greater with ce_greater input]
[Fig. 2.5: control unit of version 2: four-state diagram with commands nop, begin, update_r, update_s and update_greater, and the corresponding command table defining ce_r, ce_s, ce_greater, load and done for each command]
flip-flop greater
The control unit is a Mealy finite state machine whose VHDL model is the
following:
next state computation:
Complete VHDL models (square_root.vhd) of both circuits (Figs. 2.1, 2.4) are
available at the authors' web page.
Fig. 2.6 Structure of a digital circuit: data path and control unit. The data path contains the data registers and the computation resources (delay t3), which generate next_data and the conditions (delay t4); the control unit contains the internal-state register, the next-state computation (delay t1) and the command generation (delay t2); data_in, start and done connect the circuit to its environment
The data path is a Moore machine (the output state only depends on the internal
state), while the control unit could be a Moore or a Mealy machine. An important
point is that, when two finite state machines are interconnected, one of them must
be a Moore machine in order to avoid combinational loops.
According to the chronograms of Fig. 2.6, there are two critical paths: from the
data registers to the internal state register, and from the data registers to the data
registers. The corresponding delays are

T(data→state) = t4 + t1    (2.1)

and

T(data→data) = t4 + t2 + t3,    (2.2)
where t1 is the computation time of the next internal state, t2 the computation time
of the commands, t3 the maximum delay of the computation resources and t4 the
computation time of the conditions (the set up and hold times of the registers have
not been taken into account).
The clock period must satisfy
Tclk > max{t4 + t1, t4 + t2 + t3}.    (2.3)
If the control unit were a Moore machine, there would be no direct path from the
data registers to the data registers, so that (2.2) and (2.3) should be replaced by
Tstate→data = t2 + t3   (2.4)

and

Tclk > max{t4 + t1, t2 + t3}.   (2.5)
In fact, it is always possible to use a Moore machine for the control unit. Generally
it has more internal states than an equivalent Mealy machine and the algorithm
execution needs more clock cycles. If the values of t1 to t4 do not substantially
vary, the conclusion could be that the Moore approach needs more, but shorter,
clock cycles. Many designers also consider that Moore machines are safer than
Mealy machines.
In order to increase the maximum frequency, an interesting option is to insert a
command register at the output of the command generation block. Then relation
(2.2) is substituted by
Tdata→commands = t4 + t2 and Tcommands→data = t3,   (2.6)
so that
Tclk > max{t4 + t1, t4 + t2, t3}.   (2.7)
With this type of registered Mealy machine, the commands are available one cycle later than with a non-registered machine, so that additional cycles must sometimes be inserted in order to keep the data path and its control unit synchronized.
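A quick numerical check of the three bounds (2.3), (2.5) and (2.7), with illustrative delay values of our own choosing:

```python
def min_clock_periods(t1, t2, t3, t4):
    """Lower bounds on Tclk for a Mealy, a Moore and a registered Mealy
    control unit, following (2.3), (2.5) and (2.7)."""
    mealy = max(t4 + t1, t4 + t2 + t3)             # (2.3)
    moore = max(t4 + t1, t2 + t3)                  # (2.5)
    registered_mealy = max(t4 + t1, t4 + t2, t3)   # (2.7)
    return mealy, moore, registered_mealy

print(min_clock_periods(2, 1, 10, 3))  # (14, 11, 10): here the registered
# Mealy machine allows the shortest clock period
```

With these values the registered Mealy option gives the shortest period, at the price of one extra cycle of command latency.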
To summarize, the implementation of an algorithm is based upon a decomposition of the circuit into a data path and a control unit. The data path is in charge
of the algorithm operations and can be roughly defined in the following way:
associate registers to the algorithm variables, implement resources able to execute
the algorithm operations, and insert programmable connections (multiplexers)
between the register outputs (the operands) and the resource inputs, and between
the resource outputs (the results) and the register inputs. The control unit is a finite
state machine whose internal states roughly correspond to the algorithm steps, the
input states are conditions (flags) generated by the data path, and the output states
are commands transmitted to the data path.
In fact, the definition of a data path poses a series of optimization problems,
some of them being dealt with in the next sections, for example: scheduling of the
operations, assignment of computation resources to operations, and assignment of
registers to variables. It is also important to notice that minor algorithm modifications sometimes yield major circuit optimizations.
y1 + y2 = x1 + x2 + x3.   (2.8)
It is made up of 1-bit full adders working in parallel. An example where x1, x2 and
x3 are 4-bit numbers, and y1 and y2 are 5-bit numbers, is shown in Fig. 2.7.
The delay of a carry-save adder is equal to the delay TFA of a 1-bit full adder,
independently of the number of bits of the operands. Let CSA be the function
associated to (2.8), that is
(y1, y2) = CSA(x1, x2, x3).   (2.9)
[Figure 2.7: carry-save adder for 4-bit operands x1, x2, x3: four FA cells working in parallel produce y1 and y2.]
op1: (a1, a2) := CSA(x1, x2, x3);
op2: (b1, b2) := CSA(x4, x5, x6);
op3: (c1, c2) := CSA(a2, b2, x7);
op4: (d1, d2) := CSA(a1, b1, c1).   (2.10)
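The four operations of (2.10) can be checked with a small Python model of the CSA function (a behavioral sketch; the identity y1 + y2 = x1 + x2 + x3 holds because the full-adder cells operate bitwise in parallel):

```python
def csa(x1, x2, x3):
    """Carry-save adder: returns (y1, y2) with y1 + y2 == x1 + x2 + x3."""
    y1 = x1 ^ x2 ^ x3                              # sum bits
    y2 = ((x1 & x2) | (x1 & x3) | (x2 & x3)) << 1  # carry bits, weight doubled
    return y1, y2

def counter_7_to_3(x):
    """7-to-3 counter: the four CSA operations of (2.10)."""
    a1, a2 = csa(x[0], x[1], x[2])   # op1
    b1, b2 = csa(x[3], x[4], x[5])   # op2
    c1, c2 = csa(a2, b2, x[6])       # op3
    d1, d2 = csa(a1, b1, c1)         # op4
    return d1, d2, c2                # y1, y2, y3

y1, y2, y3 = counter_7_to_3([1, 2, 3, 4, 5, 6, 7])
assert y1 + y2 + y3 == 28            # the sum of the seven operands
```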
[Figure 2.8: precedence graph of (2.10): op1 and op2 feed op3 and op4; the outputs are y1, y2 = d1, d2 and y3 = c2.]
The computation time is equal to 3·Tclk, where Tclk > TFA, and the cost is equal to 2·CCSA, plus the cost of the additional registers, controllable connections and control unit.
3. A data path including one carry-save adder and several registers. The computation is executed in four cycles:
The computation time is equal to 4·Tclk, where Tclk > TFA, and the cost is equal to CCSA, plus the cost of the additional registers, controllable connections and control unit.
In conclusion, there are several implementations, with different costs and delays,
corresponding to the set of operations in (2.10). In order to get an optimized
circuit, according to some predefined criteria, the space for possible implementations must be explored. For that, optimization methods must be used.
opJ: (xi, xk, …) := f(xl, xm, …),   (2.11)
where xi, xk, xl, xm, … are variables of the algorithm and f is one of the algorithm operation types (computation primitives). Then, the precedence graph (or data flow graph) is defined as follows:
- associate a vertex to each operation opJ;
- draw an arc between vertices opJ and opM if one of the results generated by opJ is used by opM.
An example was given in Sect. 2.3.1 (operations (2.10) and Fig. 2.8).
Assume that the computation times of all operations are known. Let tJM be the
computation time, expressed in number of clock cycles, of the result(s) generated
by opJ and used by opM. Then, a schedule of the algorithm is an application Sch
from the set of vertices to the set of naturals that defines the number Sch(opJ) of
the cycle at the beginning of which the computation of opJ starts. A necessary
condition is that
Sch(opM) ≥ Sch(opJ) + tJM.   (2.12)
Fig. 2.9 7-to-3 counter: a ASAP schedule. b ALAP schedule. c Admissible schedule
initial step: Sch(opM) = latest admissible starting cycle of opM, for all final vertices opM;
step number n + 1: choose an unscheduled vertex opJ all of whose successors, say opP, opQ, …, have already been scheduled, and define Sch(opJ) = minimum{Sch(opP) − tJP, Sch(opQ) − tJQ, …}.
Applied to (2.10), with Sch(op4) = 4, the ALAP algorithm generates the data
flow graph of Fig. 2.9b.
Let ASAP_Sch and ALAP_Sch be ASAP and ALAP schedules, respectively.
Obviously, if opM is a final operation, the previously specified value ALAP_Sch(opM) must be greater than or equal to ASAP_Sch(opM). More generally,
assuming that the latest admissible starting cycle for all the final operations has been
previously specified, for any admissible schedule Sch the following relation holds:
ASAP_Sch(opJ) ≤ Sch(opJ) ≤ ALAP_Sch(opJ), ∀ opJ.   (2.15)
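Both schedules can be computed mechanically. The following Python sketch (the function names are ours) applies the ASAP and ALAP rules to the precedence graph of (2.10), with every tJM = 1, as in Fig. 2.9:

```python
def asap(succ, t=1):
    """ASAP schedule: every operation starts as soon as all its
    predecessors have produced their results (t cycles after they start)."""
    pred = {v: [u for u in succ if v in succ[u]] for v in succ}
    sch = {}
    while len(sch) < len(succ):
        for v in succ:
            if v not in sch and all(u in sch for u in pred[v]):
                sch[v] = max((sch[u] + t for u in pred[v]), default=1)
    return sch

def alap(succ, last, t=1):
    """ALAP schedule: every operation starts as late as its successors allow."""
    sch = {}
    while len(sch) < len(succ):
        for v in succ:
            if v not in sch and all(w in sch for w in succ[v]):
                sch[v] = min((sch[w] - t for w in succ[v]), default=last)
    return sch

# Precedence graph of (2.10): op1, op2 feed op3 and op4; op3 feeds op4.
succ = {"op1": ["op3", "op4"], "op2": ["op3", "op4"],
        "op3": ["op4"], "op4": []}
assert asap(succ) == {"op1": 1, "op2": 1, "op3": 2, "op4": 3}
assert alap(succ, last=4) == {"op1": 2, "op2": 2, "op3": 3, "op4": 4}
```

Any admissible schedule lies between these two, in accordance with (2.15).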
In fact, the preceding algorithm computes the values of four variables xA, zA, xB and zB as a function of k and xP. A final, not included, step would be to compute the coordinates of kP as a function of the coordinates of P (xP and yP) and of the final values of xA, zA, xB and zB.
Consider one step of the main iteration of Algorithm 2.3, and assume that k(m−i) = 0. The following computation scheme computes the new values of xA, zA, xB and zB as a function of their initial values and of xP. The available computation primitives are the addition, multiplication and squaring in GF(2^m) (Chap. 13).
[Figure 2.10: computation scheme combining xA, zA, xB, zB and xP with multiplications, additions (+) and squarings (squ); figure omitted.]
[Figures 2.11 and 2.12: ASAP and ALAP schedules of the computation scheme, with operation start cycles ranging from 1 to 1501 (multiplication latency on the order of 300 cycles); figures omitted.]
2. Assuming that the number of available computation resources of each type has
been previously specified, minimize the computation time.
An important concept is the computation width w(f) with respect to the computation primitive (operation type) f. First define the activity intervals of f. Assume
that f is the primitive corresponding to the operation opJ, that is
opJ: (xi, xk, …) := f(xl, xm, …).

Then the corresponding activity interval of f is

[Sch(opJ), Sch(opJ) + maximum{tJM}].
[Figures 2.13 and 2.14: examples of activity intervals with the corresponding incompatibility graphs, and the schedule of the computation scheme showing the activity intervals of the multiplication; figures omitted.]
They do not overlap; hence the incompatibility graph does not include any edge and can be colored with one color. The computation width with respect to the multiplication is equal to 1.
Thus, the two optimization problems mentioned above can be expressed in
terms of computation widths:
1. Assuming that the maximum computation time has been previously specified,
look for a schedule that minimizes some cost function
C = c1·w(f1) + c2·w(f2) + ··· + cm·w(fm),   (2.16)

where f1, f2, …, fm are the computation primitives and c1, c2, …, cm their corresponding costs.
2. Assuming that the maximum computation width w(f) with respect to every
computation primitive f has been previously specified, look for a schedule that
minimizes the computation time.
Both are classical problems of scheduling theory. They can be expressed in
terms of integer linear programming problems whose variables are xIt for all
Fig. 2.15 Example 2.3: schedule corresponding to the first optimization problem [figure omitted]
A key concept for assigning registers to variables is the lifetime [tI, tJ] of every
variable: tI is the number of the cycle during which its value is generated, and tJ is
the number of the last cycle during which its value is used.
Example 2.4
Consider the computation scheme of Example 2.1 and the schedule of Fig. 2.14.
The computation width is equal to 1 for all primitives (multiplication, addition and
squaring). The computation is executed as follows:
In order to compute the variable lifetimes, it is assumed that the multiplier reads the values of the operands during some initial cycle, say number I, and generates the result during cycle number I + tm − 1 (or sooner), so that this result can be stored at the end of cycle number I + tm − 1 and is available for any operation beginning at cycle number I + tm (or later). As regards the variables xA, zA, xB and
zB, in charge of passing values from one iteration step to the next (Algorithm 2.3),
their initial values must be available from the first cycle up to the last cycle during
which those values are used. At the end of the computation scheme execution they
must be updated with their new values. The lifetime intervals are given in
Table 2.1.
The definition of a minimum number of registers can be expressed as a graph
coloring problem. For that, associate a vertex to every variable and draw an edge
between two variables if their lifetime intervals are incompatible, which means
that they have more than one common cycle. As an example, the lifetime intervals
of j and k are compatible, while the lifetime intervals of b and d are not.
The following groups of variables have compatible lifetime intervals:
Table 2.1 Lifetime intervals

a  [300, 901]
j  [1, 2]
k  [2, 3]
l  [3, final]
b  [600, 901]
h  [900, 901]
c  [601, 602]
d  [602, final]
f  [1200, 1501]
i  [901, final]
e  [1500, 1501]
g  [1501, final]
xA [initial, 601]
zA [initial, 601]
xB [initial, 301]
zB [initial, 1]
zB (initial→1), j (1→2), k (2→3), l (3→final);
xB (initial→301), b (600→901), f (1200→1501), g (1501→final);
zA (initial→601), c (601→602), d (602→final);
xA (initial→601), h (900→901), e (1500→1501);
a (300→901), i (901→final).
Thus, the computing scheme can be executed with five registers, namely xA, zA, xB,
zB and R:
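The greedy left-edge algorithm reproduces this result. A Python sketch with the lifetimes of Table 2.1, mapping "initial" and "final" to 0 and 1502 (our convention), where two intervals are compatible when they share at most one cycle:

```python
life = {"a": (300, 901), "j": (1, 2), "k": (2, 3), "l": (3, 1502),
        "b": (600, 901), "h": (900, 901), "c": (601, 602), "d": (602, 1502),
        "f": (1200, 1501), "i": (901, 1502), "e": (1500, 1501),
        "g": (1501, 1502), "xA": (0, 601), "zA": (0, 601),
        "xB": (0, 301), "zB": (0, 1)}

def allocate(life):
    """Left-edge register allocation: scan variables by increasing start
    time; append each one to the first register whose last occupant is dead
    (at most one common cycle), else open a new register."""
    regs = []                          # each register: list of (start, end, name)
    for name, (s, e) in sorted(life.items(), key=lambda kv: kv[1]):
        for r in regs:
            if s >= r[-1][1]:          # compatible with the last occupant
                r.append((s, e, name))
                break
        else:
            regs.append([(s, e, name)])
    return regs

assert len(allocate(life)) == 5        # five registers suffice
```

Since the incompatibility graph of lifetime intervals is an interval graph, this greedy scan yields a minimum coloring, confirming the five registers xA, zA, xB, zB and R.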
The data processed by Algorithm 2.4 are m-bit vectors (polynomials of degree smaller than m over the binary field GF(2)) and the computation resources are field multiplication, addition and squaring. Field addition amounts to bit-by-bit modulo 2 additions (XOR functions). On the other hand, VHDL models of computation resources executing field squaring and multiplication are available at the authors' web page, namely classic_squarer.vhd and interleaved_mult.vhd (Chap. 13). The classic_squarer component is a combinational circuit. The interleaved_mult component reads and internally stores the input operands during the first cycle after detecting a positive edge on start_mult and raises an output flag mult_done when the multiplication result is available.
Consider now the storing resources. Assuming that xP and k remain available
during the whole algorithm execution, there remain five variables that must be
internally stored: xA, xB, zA, zB and R. The origin of the data stored in every register
must be defined. For example, the operations that update xA are
[Figure 2.16: computation resources: the interleaved_mult multiplier with operand multiplexers controlled by sel_p1 and sel_p2 (operands taken from zA, zB, R, xA and xB), XOR gates with multiplexers sel_a1 and sel_a2 producing adder_out, and the classic_squarer with input multiplexer sel_sq producing square.]
So, the updated value can be 1 (initial value), product, adder_out or zB. A similar
analysis must be done for the other registers. Finally, the part of the data path
corresponding to the registers and the multiplexers that select their input data is
shown in Fig. 2.17. The corresponding VHDL model is easy to generate. As an
example, the xA register, with its input multiplexers, can be described as follows.
[Figure 2.17: registers xA, xB, zA, zB (initial values 1, xP, 0 and 1) and R, each preceded by an input multiplexer (sel_xA, sel_xB, sel_zA, sel_zB, sel_R) and controlled by an enable signal (en_xA, en_xB, en_zA, en_zB, en_R).]
It is made up of:
- the data path;
- a shift register allowing sequential reading of the values of k(m−i);
- a counter for controlling the loop execution;
- a finite state machine in charge of generating all the control signals, that is start_mult, load, shift, en_xA, en_xB, en_zA, en_zB, en_R, sel_p1, sel_p2, sel_a1, sel_a2, sel_sq, sel_xA, sel_xB, sel_zA, sel_zB and sel_R. In particular, the control of the multiplier operations is performed as follows: the control unit generates a positive edge on the start_mult signal, along with the values of sel_p1 and sel_p2 that select the input operands; then, it enters a wait loop until the mult_done flag is raised (instead of waiting for a constant time, namely 300 cycles, as was done for scheduling purposes); during the wait loop start_mult is lowered while the sel_p1 and sel_p2 values are maintained; finally, it generates the signals for updating the register that stores the result. As an example, assume that the execution of the fourth instruction of the main loop, that is xB := xB·zA, starts at state 6 and uses identifiers start4, wait4 and end4 for representing the corresponding commands. The corresponding part of the next-state function is
- a command decoder (Chap. 4). Command identifiers have been used in the definition of the finite state machine output function, so that a command decoder must be used to generate the actual control signal values in function of the identifiers. For example, the command start4 initializes the execution of xB := xB·zA and is decoded as follows:
In the case of operations such as the first of the main loop, that is R := xA·zB, zB := xA + zA, the 1-cycle operation zB := xA + zA is executed in parallel with the final cycle of R := xA·zB and not in parallel with the initial cycle. This makes the algorithm execution a few cycles (3) longer, but this is not significant as tm is generally much greater than 3. Thus, the control signal values corresponding to the identifier end1 are:
The control unit also detects the start signal and generates the done flag. A complete model scalar_product.vhd is available at the authors' web page.
Comment 2.1
The interleaved_mult component is also made up of a data path and a control unit,
while the classic_squarer component is a combinational circuit. An alternative
solution is the definition of a data path able to execute all the operations, including
those corresponding to the interleaved_mult and classic_squarer components. The
so-obtained circuit could be more efficient than the proposed one as some computation resources could be shared between the three algorithms (field multiplication, squaring and scalar product). Nevertheless, the hierarchical approach
consisting of using pre-existing components is probably safer and allows a
reduction in the development times.
Instead of explicitly disassembling the circuit into a data path and a control unit, another option is to describe the operations that must be executed at each cycle, and to let the synthesis tool define all the details of the final circuit. A complete model scalar_product_DF2.vhd is available at the authors' web page.
Comment 2.2
Algorithm 2.4 does not compute the scalar product kP. A final step is missing:
The design of a circuit that executes this final step is left as an exercise.
2.6 Exercises
1. Generate several VHDL models of a 7-to-3 counter. For that purpose use the
three options proposed in Sect. 2.3.1.
2. Generate the VHDL model of a circuit executing the final step of the scalar product algorithm (Comment 2.2). For that purpose, the following entity, available at the authors' web page, is used:
4. Design a circuit for computing the greatest common divisor of two natural
numbers, based on the following Euclidean algorithm.
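One common form of the Euclidean algorithm, as a behavioral Python reference for this exercise (the circuit itself would iterate on registers and a subtractor or divider):

```python
def gcd(x, y):
    """Euclidean algorithm: replace (x, y) by (y, x mod y) until y = 0."""
    while y != 0:
        x, y = y, x % y
    return x

assert gcd(48, 36) == 12
```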
5. The distance d between two points (x1, y1) and (x2, y2) of the (x, y)-plane is equal to d = ((x1 − x2)² + (y1 − y2)²)^0.5. Design a circuit that computes d with only one subtractor and one multiplier.
6. Design a circuit that, within a three-dimensional space, computes the distance
between two points (x1, y1, z1) and (x2, y2, z2).
7. Given a point (x, y, z) of the three-dimensional space, design a circuit that
computes the following transformation.
    [xt]   [a11 a21 a31] [x]
    [yt] = [a21 a22 a32] [y]
    [zt]   [a31 a32 a11] [z]
8. Design a circuit for computing z = e^x using the formula
e^x = 1 + x/1! + x²/2! + x³/3! + ···
9. Design a circuit for computing x^n, where n is a natural, using the following relations: x^0 = 1; if n is even then x^n = (x^(n/2))², and if n is odd then x^n = x·(x^((n−1)/2))².
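A direct Python transcription of these relations (a behavioral reference, not the circuit):

```python
def power(x, n):
    """Square-and-multiply: x**n via the recurrences of the exercise."""
    if n == 0:
        return 1                              # x^0 = 1
    if n % 2 == 0:
        return power(x, n // 2) ** 2          # n even
    return x * power(x, (n - 1) // 2) ** 2    # n odd

assert power(3, 5) == 243 and power(2, 10) == 1024
```

The recursion depth, and hence the number of iteration steps of the circuit, is proportional to the number of bits of n.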
10. Algorithm 2.4 (scalar product) can be implemented using more than one
interleaved_multiplier. How many multipliers can operate in parallel? Define
the corresponding schedule.
References
1. Hankerson D, Menezes A, Vanstone S (2004) Guide to elliptic curve cryptography. Springer, New York
2. Deschamps JP, Imaña JL, Sutter G (2009) Hardware implementation of finite-field arithmetic. McGraw-Hill, New York
3. López J, Dahab R (1999) Improved algorithm for elliptic curve arithmetic in GF(2n). Lect Notes Comput Sci 1556:201–212
Chapter 3
3.1 Pipeline
A very useful implementation technique, especially for signal processing circuits,
is pipelining [1, 2]. It consists of inserting additional registers so that the maximum
clock frequency and input data throughput are increased. Furthermore, in the case
of FPGA implementations, the insertion of pipeline registers has a positive effect
on the power consumption.
[Figure 3.1: a the combinational 7-to-3 counter built from four carry-save adders; b the same circuit with pipeline registers (clk) inserted between successive stages.]
The combinational implementation of the 7-to-3 counter is shown in Fig. 3.1a. As previously commented, this is probably a bad circuit because its cost is high and its maximum clock frequency is low.
Consider now the circuit of Fig. 3.1b in which registers have been inserted in
such a way that operations scheduled in successive cycles, according to the ASAP
schedule of Fig. 2.9a, are separated by a register. The circuit still includes four
carry-save adders, but the minimum clock period of a synchronous circuit
including this counter must be greater than TFA, plus the set-up and hold times of
the registers, instead of 3TFA. Furthermore, the minimum data introduction
interval is now equal to Tclk: as soon as a1, a2, b1 and b2 have been computed, their
values are stored within the corresponding output register, and a new computation,
with other input data, can start; at the same time, new computations of c1 and c2,
and of d1 and d2 can also start. Thus, at time t, three operations are executed in
parallel:
To summarize, assuming that the set-up and hold times are negligible,
Tclk > TFA,   latency = 3·Tclk,   r = Tclk,
[Figure 3.2: two-stage implementation: stage 1 (one CSA computing a1, a2 and b1, b2 in two cycles, with programmable connections) and stage 2 (one CSA producing c1, c2 and then y1, y2, y3), separated by a pipeline register.]
where latency is the total computation time and r is the minimum data introduction
interval.
Another implementation, based on the admissible schedule of Fig. 2.9c, is shown in Fig. 3.2. In this case the circuit is made up of two stages separated by a pipeline
register. Within every stage the operations are executed in two cycles. During the
first cycle the following operations are executed
and during the second cycle, the following ones are executed
The circuit of Fig. 3.2 includes two carry-save adders instead of four, and its
timing constraints are the following:
3.1.2 Segmentation
Given a computation scheme and its precedence graph G, a segmentation of G is an ordered partition {S1, S2, …, Sk} of G. The segmentation is admissible if it respects the precedence relationship. This means that if there is an arc from opJ ∈ Si to opM, then either opM belongs to the same segment Si or it belongs to a different segment Sj with j > i. Two examples are shown in Fig. 3.3, in which the segments are separated by dotted lines.
[Figure 3.3: two admissible segmentations {S1, S2, S3} of a precedence graph with operations op1 to op6.]
stage 1: op1 and op2;
stage 2: op3 and op4;
stage 3: op5 and op6.
[Figure 3.5: segmentation {S1, S2, S3, S4, S5} of the computation scheme operating on xA, zA, xB, zB and xP.]
Segment 1:
Segment 2:
Segment 3:
Segment 4:
Segment 5:
So, every segment includes a product over a finite field plus some additional 1-cycle operations (finite field additions and squarings) in segments 2, 4 and 5. The corresponding pipelined circuit, in which it is assumed that the output results are g, d, l and i, is shown in Fig. 3.6.
A finite field product is a complex operation whose maximum computation time tm, expressed in number of clock cycles, is much greater than 1. Thus, the latency T and the time interval δ between successive data inputs of the complete circuit are

T ≅ 5·tm and δ ≅ tm.
The cost of the circuit is very high. It includes five multipliers, three adders, four squarers and four pipeline registers. Furthermore, if used within the scalar product circuit of Sect. 2.5, the fact that the time interval between successive data inputs has been reduced (δ ≅ tm) does not reduce the execution time of Algorithm 2.4, as the input variables xA, zA, xB and zB are updated at the end of every main loop execution.
As regards the control of the pipeline, several options can be considered. A simple one is to previously compute the maximum multiplier computation time tm and to choose δ > tm + 2 (computation time of segment 2). The control unit updates the pipeline registers and sends a start pulse to all multipliers every δ cycles. In the following VHDL process, time_out is used to enable the pipeline register clock every delta cycles and sync is a synchronization procedure (Sect. 2.5):
In order to describe the complete circuit, five multipliers, three adders and four
squarers are instantiated, and every pipeline register, for example the segment 1
output register, can be described as follows:
Fig. 3.6 Pipelined circuit (∀s: s' = s(t−1), s'' = s(t−2), s''' = s(t−3), s^iv = s(t−4))
[Figure: the five-segment pipeline with its multipliers (*), adders (+) and squarers (sq); figure omitted.]
Assume that the critical path crosses six cells and seven connections of the circuit and that all inputs come from register outputs and all outputs go to register inputs. Then the minimum clock cycle TCLK is defined by the following relation:
TCLK > 6·tcell + 7·tconnection + tSU + tP,   (3.1)
where tSU and tP are the minimum set-up and propagation times of the registers
(Chap. 6).
If the period defined by condition (3.1) is too long, the circuit must be segmented. A 2-stage segmentation is shown in Fig. 3.8. Registers must be inserted in
64
all positions where a connection crosses the dotted line. Thus, seven registers must
be added. Assuming that the propagation time of every part of a segmented
connection is still equal to tconnection, the following condition must hold:
TCLK > 3·tcell + 4·tconnection + tSU + tP.   (3.2)
[Figures 3.9 and 3.10: a 128-bit adder decomposed into four 32-bit adders operating on x(127..96), x(95..64), x(63..32) and x(31..0), without and with pipeline registers between stages.]
A 4-stage segmentation is shown in Fig. 3.11. Every stage includes one 32-bit adder, so that the minimum clock cycle, as well as the minimum time interval between successive data inputs, is equal to Tadder. The corresponding circuit is shown in Fig. 3.12. In total, (7·32 + 1) + (6·32 + 1) + (5·32 + 1) = 579 additional flip-flops are necessary in order to separate the pipeline stages.
[Figure 3.12: the 4-stage pipelined 128-bit adder; figure omitted.]
Comments 3.1
- The extra cost of the pipeline registers could appear to be prohibitive. Nevertheless, the basic cell of a field programmable gate array includes a flip-flop, so that the insertion of pipeline registers does not necessarily increase the total cost, computed in terms of used basic cells. The pipeline registers could consist of flip-flops not used in the non-pipelined version.
- Most FPGA families also permit implementing with LUTs those registers that do not need reset signals. This can be another cost-effective option.
- The insertion of pipeline registers also has a positive effect on the power consumption: the presence of synchronization barriers all along the circuit drastically reduces the number of generated spikes.
[Figure 3.13: schedule of the operations on the pipelined multiplier (*) and adder, with segments S1 to S7.]
Assume that the circuit uses a pipelined multiplier and a pipelined adder, both of them working at the same frequency 1/Tclk and with the same pipeline rate r = 1. The operations can be scheduled as shown in Fig. 3.13.
Some inputs and outputs must be delayed: input c must be delayed 2 cycles, input
d must be delayed 5 cycles, and output g must be delayed 5 cycles. The corresponding additional registers, which maintain the correct synchronization of the
data, are sometimes called skewing (c and d) and deskewing (g) registers.
An alternative solution, especially in the case of large circuits, is self-timing.
As a generic example, consider the pipelined circuit of Fig. 3.14a. To each
stage, for example number i, are associated a maximum delay tMAX(i) and an
average delay tAV(i). The minimum time interval between successive data inputs is
δ = max{tMAX(1), tMAX(2), …, tMAX(n)},   (3.4)

and the latency is T = n·δ.   (3.5)
A self-timed version of the same circuit is shown in Fig. 3.14b. The control is
based on a Request/Acknowledge handshaking protocol:
- a req_in signal to stage 1 is raised by an external circuit; if stage 1 is free, the input data is registered (ce = 1) and an ack_out signal is issued;
- the start signal of stage 1 is raised; after some amount of time, the done signal of stage 1 is raised, indicating the completion of the computation;
- a req_out signal to stage 2 is issued by stage 1; if stage 2 is free, the output of stage 1 is registered and an ack_out signal to stage 1 is issued; and so on.
If the probability distribution of the internal data were uniform, inequalities (3.4)
and (3.5) would be substituted by the following:
[Figure 3.14: a an n-stage synchronous pipeline; b the self-timed version, in which every stage has an associated handshaking protocol block with req_in, ack_out, start, done, req_out and ack_in signals. In inequalities (3.6) and (3.7) the maximum delays tMAX(i) of (3.4) and (3.5) are replaced by the average delays tAV(i).]
Example 3.3
The following process describes a handshaking protocol component. As before,
sync is a synchronization procedure (Sect. 2.5):
[Figure 3.15: chronogram of the handshaking signals req(i−1), ce(i), ack(i), start(i), done(i) and req(i) during the execution of segment i.]
[Figure 3.16: first segment of the self-timed circuit: a handshaking protocol block (req1, ack1, req2, ack2) controlling a multiplier (*), an adder (+) and a squarer (sq) processing xP, xA, xB and zA.]

Table 3.1 Dual-rail encoding of a signal s
s1 s0  meaning
0  0   reset or in transition
1  0   s = 1
0  1   s = 0
With regards to the generation of the done signal in the case of combinational components, an interesting method consists of using a redundant encoding of the binary signals (Sect. 10.4 of [3]): every signal s is represented by a pair (s1, s0) according to the definition of Table 3.1.
The circuit will be designed in such a way that during the initialization (reset), and as long as the value of s has not yet been computed, (s1, s0) = (0, 0). Once the value of s is known, s1 = s and s0 = not(s).
Assume that the circuit includes n signals s1, s2, …, sn. Every signal si is substituted by a pair (si1, si0). Then the done flag is computed as follows:

done = (s11 + s10)·(s21 + s20) ··· (sn1 + sn0).
During the initialization (reset) and as long as at least one of the signals is in
transition, the corresponding pair is equal to (0, 0), so that done = 0. The done
flag will be raised only when all signals have a stable value.
In the following example, only the signals belonging to the critical path of the
circuit are encoded.
Example 3.5
Generate an n-bit ripple-carry adder (Chap. 7) with end of computation detection.
For this purpose, all signals belonging to the carry chain, that is c0, c1, c2, …, cn−1, are represented by pairs (c0, cb0), (c1, cb1), (c2, cb2), …, (cn−1, cbn−1). During the initialization, all ci and cbi are equal to 0. When reset goes down,
[Figure 3.17: a the true carry chain: FA cells computing ci+1 from xi, yi and ci, with c0 = cin and reset; b the complemented chain computing cbi+1 together with the end-of-computation signals eoci.]
c0 = cin,   c(i+1) = xi·yi + xi·ci + yi·ci,   ∀i ∈ {0, 1, …, n−1};
cb0 = not(cin),   cb(i+1) = not(xi)·not(yi) + not(xi)·cbi + not(yi)·cbi,   ∀i ∈ {0, 1, …, n−1}.
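The two chains can be modelled in Python to check the completion condition (a behavioral sketch with names of our own; in the steady state every pair (ci, cbi) is complementary, so done is raised):

```python
def dual_rail_adder(x, y, cin, n):
    """n-bit ripple-carry adder with dual-rail carries (c, cb); the done
    flag is raised only when every carry pair holds a stable (c, not c)
    value, i.e. no pair is left in the (0, 0) transition state."""
    c, cb = [cin], [1 - cin]
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        c.append(xi & yi | xi & c[i] | yi & c[i])            # true chain
        cb.append((1 - xi) & (1 - yi)
                  | (1 - xi) & cb[i] | (1 - yi) & cb[i])     # complemented chain
    done = all(ci ^ cbi for ci, cbi in zip(c, cb))           # completion detection
    s = sum((((x >> i) & 1) ^ ((y >> i) & 1) ^ c[i]) << i for i in range(n))
    return s + (c[n] << n), done

assert dual_rail_adder(5, 7, 0, 4) == (12, True)
```

In the actual circuit the two chains settle at different times for different operands, and done tracks the real (data-dependent) carry propagation delay.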
[Figure 3.18: a combinational (completely unrolled) implementation: p chained iterative_operations blocks from initial_values to final_results; b sequential implementation: one iterative_operations block, registers (initially: initial_values) and a control block.]
[Figure 3.19: partially unrolled implementation with s = 3 serially connected iterative_operations blocks, registers and control.]
An example, with s = 3, is shown in Fig. 3.19. Obviously, the clock cycle, say T′clk, must be longer than in the sequential implementation of Fig. 3.18b (Tclk). Nevertheless, it will generally be shorter than s·Tclk. On the one hand, the critical path length of s serially connected combinational circuits is generally shorter than the critical path length of a single circuit, multiplied by s. For example, the delay
the critical path length of a single circuit, multiplied by s. For example, the delay
of an n-bit ripple-carry adder is proportional to n; nevertheless the delay of two
serially connected adders, that compute (a ? b) ? c, is proportional to n ? 1, and
not to 2n. On the other hand, the register delays are divided by s. Furthermore,
when interconnecting several circuits, some additional logical simplifications can
be performed by the synthesis tool. So,
Cunrolled ≅ s·Ccomponent + Cregisters + Ccontrol,   Tunrolled = ⌈p/s⌉·T′clk, where T′clk < s·Tclk.
Example 3.6
Given two naturals x and y, with x < y, the following restoring division algorithm computes two fractional numbers q = 0.q−1 q−2 … q−p and r < y·2^−p such that x = q·y + r and, therefore, q ≤ x/y < q + 2^−p:
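Working with integers scaled by 2^i at step i, the recurrence can be modelled in Python (a behavioral sketch of the restoring scheme, not the book's Algorithm 3.1 listing):

```python
def restoring_div(x, y, p):
    """Fractional restoring division: for 0 <= x < y, returns the p quotient
    bits q (so q/2**p <= x/y < q/2**p + 2**-p) and the scaled remainder r;
    the invariant x * 2**p == q * y + r holds at the end."""
    q, r = 0, x
    for _ in range(p):
        r = 2 * r                    # shift the partial remainder
        if r >= y:                   # trial subtraction succeeds: bit = 1
            r, bit = r - y, 1
        else:                        # restore: keep r unchanged, bit = 0
            bit = 0
        q = 2 * q + bit
    return q, r

q, r = restoring_div(1, 3, 4)        # 1/3 with p = 4 fractional bits
assert (q, r) == (5, 1) and 1 * 2**4 == q * 3 + r
```

Each loop iteration corresponds to one clock cycle of the divider of Fig. 3.20: a shift, a trial subtraction and a sign test.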
74
2r
q-i
sign
subtractor
next_r
4r
sign1
subtractor
z1
sign1,sign2,sign3
q-2i+1
q-2i
1--
75
01-
2y
sign2
subtractor
z2
001
3y
sign3
subtractor
z3
000
comb.
circ.
next_r
76
Example 3.7
Consider again a restoring divider (Example 3.6). Algorithm 3.1 is modified in
order to generate two quotient bits (D = 2) at each step.
Algorithm 3.2: Base-4 restoring division algorithm (p even)
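Assuming the natural radix-4 generalization (each step selects a digit d ∈ {0, 1, 2, 3} from the signs of 4r − y, 4r − 2y and 4r − 3y), a behavioral Python sketch:

```python
def restoring_div_base4(x, y, p):
    """Base-4 restoring division: two quotient bits per step (p even);
    same result as the radix-2 version, in half the number of steps."""
    q, r = 0, x
    for _ in range(p // 2):
        r = 4 * r
        d = r // y           # in {0, 1, 2, 3} since r < 4*y before division;
        r -= d * y           # the circuit derives d from three trial subtractions
        q = 4 * q + d
    return q, r

assert restoring_div_base4(1, 3, 4) == (5, 1)   # matches the radix-2 divider
```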
The components of Fig. 3.20 (initial restoring algorithm) and Fig. 3.21 (digit-serial restoring algorithm, with D = 2) have practically the same delay, namely the computation time of an n-bit subtractor, so that the minimum clock periods of the corresponding dividers are practically the same. Nevertheless, the first divider needs p clock periods to perform a division while the second only needs p/2. So, in this example, the digit-serial approach practically halves the divider latency.
On the other hand the second divider includes three subtractors instead of one.
Loop unrolling and digit-serial processing are techniques that allow the exploration of cost/performance tradeoffs, in searching for intermediate options between completely combinational (maximum cost, minimum latency) and completely sequential (minimum cost, maximum latency) circuits. Loop unrolling can be directly performed at circuit level, whatever the implemented algorithm, while digit-serial processing looks more like an algorithm transformation. Nevertheless, it is not always so clear that they are different techniques.
3.4 Exercises
1. Generate VHDL models of different pipelined 128-bit adders.
2. Design different digit-serial restoring dividers (D = 3, D = 4, etc.).
3. The following algorithm computes the product of two natural numbers x and y:
[Figure: data path for the exercise: registers xA, xB, zA, zB (initially 1, xP, 0 and 1) and acc, a multiplier (mult), XOR gates and a squarer.]
6. The following pipelined floating-point components are available: fpmul computes the product in 2 cycles, fpadd computes the sum in 3 cycles, and fpsqrt computes the square root in 5 cycles, all of them with a rate r = 1.
a. Define the schedule of a circuit that computes the distance d between two points (x1, y1) and (x2, y2) of the (x, y)-plane.
b. Define the schedule of a circuit that computes the distance d between two points (x1, y1, z1) and (x2, y2, z2) of the three-dimensional space.
c. Define the schedule of a circuit that computes d = a + ((a - b)c)1/2.
d. In every case, how many registers must be added?
Chapter 4
Modern Electronic Design Automation tools have the capacity to synthesize the control unit from a finite state machine description, or even to extract and synthesize the control unit from a functional description of the complete circuit (Chap. 5). Nevertheless, in some cases the digital circuit designer may wish to perform part of the control unit synthesis himself. Two specific synthesis techniques are presented in this chapter: command encoding and hierarchical decomposition [1]. Both of them pursue a double objective: on the one hand, they aim at reducing the circuit cost; on the other hand, they can make the circuit easier to understand and to debug. The latter is probably the most important aspect.
The use of components whose latency is data-dependent has been implicitly dealt with in Sect. 2.5. Some additional comments about variable-latency operations are made in the last section of this chapter.
[Figure: the control unit split into a block that generates encoded commands, from the p conditions and the internal state, followed by a command decoder that produces the complete command word.]

(m + 1) · 2^(p+n) bits,    (4.1)

(4.2)
Obviously, this complexity measure only takes into account the numbers of outputs and inputs of the combinational blocks, and not the functions they actually implement.
Another generic complexity measure is the minimum number of LUTs (Chap. 1) necessary to implement the functions, assuming that no LUT is shared by two or more functions. If k-input LUTs are used, the minimum number of LUTs for implementing a function of r variables is

⌈(r - 1)/(k - 1)⌉ LUTs,

and the minimum delay of the circuit is

⌈log_k r⌉ · T_LUT,

where T_LUT is the delay of a k-input LUT.
The complexities corresponding to the two previously described options are

(m + 1) · ⌈(p + n - 1)/(k - 1)⌉ LUTs    (4.3)

(4.4)

and

(4.5)
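These bounds are easy to evaluate numerically. The following Python helper is my own sketch of the book's formulas; with the figures of Example 4.1 below (27 output functions of p + n = 10 variables, k = 4) it reproduces the 81-LUT total of (4.8):

```python
from math import ceil

def min_luts(r, k=4):
    """Minimum number of k-input LUTs for one function of r variables."""
    return ceil((r - 1) / (k - 1))

def min_delay_levels(r, k=4):
    """Minimum depth, in LUT levels (multiples of T_LUT), of an r-input tree."""
    levels = 0
    while r > 1:
        r = ceil(r / k)   # each level of k-input LUTs reduces the inputs k-fold
        levels += 1
    return levels

total = 27 * min_luts(10)   # (m + 1) * ceil((p + n - 1)/(k - 1)) = 27 * 3 = 81
```

The integer loop in `min_delay_levels` avoids the floating-point hazards of computing ⌈log_k r⌉ directly.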
Example 4.1
Consider the circuit of Sect. 2.5 (scalar_product.vhd, available at the authors' web page). The commands consist of 26 bits: eight one-bit signals
(4.6)

(4.7)

and the complexities in numbers of LUTs (4.3 and 4.4), assuming that 4-input LUTs are used, are

(m + 1) · ⌈(p + n - 1)/3⌉ = 27 · ⌈9/3⌉ = 81 LUTs,    (4.8)

(4.10)

(4.11)

The second complexity measure (number of LUTs) is surely more accurate than the first one. Thus, according to (4.8)–(4.11), the encoding of the commands hardly reduces the cost and increases the delay. So, in this particular case, the main advantages are clarity, flexibility and ease of debugging, and not cost reduction.
[Figure: hierarchical decomposition: the data path (interleaved multiplier) with its own control unit, the other components, and the main control.]
z = (x2 + y2)1/2:
A first solution is to use three components: a squaring circuit, an adder and a square
rooting circuit, for example that of Sect. 2.1. The corresponding circuit would include
two adders, one for computing c, and the other within the square_root component
(Fig. 2.3). Another option is to substitute, in the preceding algorithm, the call to
square_root with the corresponding sequence of operations. After scheduling the
operations and assigning registers to variables, the following algorithm is obtained:
that are encoded with three bits. The following process describes the command
decoder:
[Figure: data path made up of a squaring circuit (input selected by sel_sq) and an adder (inputs selected by sel_a1 and sel_a2; carry signals cy_in, cy_out), with registers r, s, c and signb controlled by sel_r, sel_s, en_r, en_c, en_s, en_signb and load, together with two communicating control units: control unit 1 (main algorithm, inputs start and conditions, output done) and control unit 2 (square root), linked by start_root and root_done.]
The two control units communicate through the start_root and root_done signals. The first control unit has six states corresponding to a wait-for-start loop, four steps of the main algorithm (operations 1, 2, 3, and the set of operations 4-8), and an end-of-computation detection. It can be described by the following process:
The second control unit has five states corresponding to operations 4, 5, 6, and 7, and the end-of-root-computation detection:
The code corresponding to nop is 000, so that the actual command can be
generated by ORing the commands generated by both control units:
This type of approach to control unit synthesis is more a question of clarity (a well-structured control unit) and ease of debugging and maintenance than of cost reduction (control units are not expensive).
In an implementation that strictly respects the schedule of Fig. 2.14, these particular sentences should be substituted by constructions equivalent to
In fact, the pipelined circuit of Fig. 3.6 (pipeline_DF2.vhd) has been designed using such an upper bound of tm. For that, a generic parameter delta was defined and a signal time_out generated by the control unit every delta cycles. On the other hand, the self-timed version of this same circuit (Example 3.4) used the mult_done flags generated by the multipliers.
Thus, in the case of variable-latency components, two options can be considered: the first is to precompute an upper bound of their computation times, if such a bound exists; the other is to use a start-done protocol: done is lowered on the start positive edge, and raised when the results are available. The second option is more general and generates circuits whose average latency is shorter. Nevertheless, in some cases, for example for pipelining purposes, the first option is better.
Comment 4.2
A typical case of data-dependent computation time corresponds to algorithms that include while loops: some iteration is executed as long as some condition holds true. Nevertheless, for unrolling purposes, the algorithm should be modified and the while loop substituted by a for loop including a fixed number of steps, such as for i
4.4 Exercises
1. Design a circuit that computes z = (x1 - x2)1/2 + (y1 - y2)1/2 with a hierarchical control unit (separate square rooter control units, see Example 4.2).
2. Design a 2-step self-timed circuit that computes z = (x1 - x2)1/4 using two square rooters controlled by a start/done protocol.
3. Design a 2-step pipelined circuit that computes z = (x1 - x2)1/4 using two square rooters, with a start input, whose maximum latencies are known.
4. Consider several implementations of the scalar product circuit of Sect. 2.5,
taking into account Comment 2.2. The following options could be considered:
Reference
1. De Micheli G (1994) Synthesis and optimization of digital circuits. McGraw-Hill, New York
Chapter 5
This chapter is devoted to the electronic aspects that are important for digital circuit design. Digital devices are built with analog components, so some considerations should be taken into account in order to obtain good and reliable designs.
Some important electronic aspects related to circuit design, timing and synchronization are discussed in this chapter. Most of those details are hidden in programmable logic for simplicity, but hiding them does not eliminate their consequences.
Fig. 5.1 CMOS transistors. a Switching action for p-type and n-type. b Some basic gates (NOR, NAND)
Table 5.1 Fan-in, fan-out, internal and external delay of typical gates

          fan-in   fan-out  t_int_hl  t_int_lh  t_ext_hl  t_ext_lh
          (pf)     (pf)     (ns)      (ns)      (ns/pf)   (ns/pf)
INV       0.003    0.337    0.079     0.151     2.710     4.891
BUF       0.004    0.425    0.265     0.056     1.334     2.399
AND2      0.003    0.334    0.105     0.144     4.470     4.271
AND3      0.007    0.673    0.211     0.131     1.362     2.376
NAND2     0.004    0.197    0.105     0.144     4.470     4.271
NAND3     0.003    0.140    0.071     0.192     6.088     7.212
NOR2      0.004    0.205    0.091     0.162     3.101     8.035
XOR2      0.008    0.645    0.279     0.331     1.435     2.560
CKBUF     0.006    1.160    0.355     0.350     0.782     1.782
CKBUF_N   0.006    1.460    0.183     0.537     0.628     1.881
DFF       0.003    0.703    0.402     0.354     1.256     2.360
capacitance that an input of a gate has. Thus, the fan-out is the maximum capacitance controllable by a gate while providing voltage levels in the guaranteed range. The fan-out really depends on the amount of electric current a gate can source or sink while driving other gates. Table 5.1 shows examples of fan-in and fan-out for gates in a 0.25 µm technology (a ten-year-old technology). Observe, for example, that a NAND2 gate can drive up to 49 similar gates (0.197 pf/0.004 pf) if we neglect the capacitance of the interconnection.
In the case of FPGAs the concepts of fan-in and fan-out are simplified and measured in numbers of connections. Remember that most of the logic is implemented in look-up tables (LUTs). The fan-in is then the number of inputs a computing block has: a two-input AND gate implemented in a LUT has a fan-in of two, and a three-input NAND gate a fan-in of three. The concept of fan-out is used to express the number of gates connected at a gate's output. In some cases a maximum fan-out is specified: the maximum number of connections that a gate can drive.
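The fan-out budget can be checked directly from the table. The following Python sketch copies a few capacitance values from Table 5.1; the helper itself is mine:

```python
FAN_IN_PF  = {"INV": 0.003, "NAND2": 0.004, "DFF": 0.003}    # input capacitance
FAN_OUT_PF = {"INV": 0.337, "NAND2": 0.197, "CKBUF": 1.160}  # drivable capacitance

def max_loads(driver, load, wire_pf=0.0):
    """Number of 'load' gate inputs a 'driver' output can control within its
    fan-out capacitance budget, optionally reserving wire capacitance."""
    return int((FAN_OUT_PF[driver] - wire_pf) / FAN_IN_PF[load])

n = max_loads("NAND2", "NAND2")   # 0.197 / 0.004 -> 49 similar gates
```

Any capacitance budgeted for the interconnection directly reduces the number of gate inputs that can be driven.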
[Figures: a input pads with pull-up and pull-down resistors fixing the internal value; b a three-state buffer on a bus (in, G, out: out = in when G = 1, high impedance Z when G = 0), with a pull-up resistor on the bus.]
[Normalized voltage levels: Logic 1 and Logic 0 noise margins separated by an invalid region; rise and fall times measured between the 0.1 and 0.9 levels.]
Fig. 5.4 Logic values and transition times (rise and fall time)
[Derating factor as a function of temperature (-60 °C to +60 °C) and of supply voltage (1.8 V to 3.6 V); the typical case has a derating factor of 1.]
Fig. 5.5 Typical derating curves for temperature and supply voltage
[Table 5.2 Derating factors as a function of junction temperature (from -20 °C to 120 °C) and supply voltage; the factors range from 0.91 (low temperature, high voltage) to 1.39 (high temperature, low voltage).]
connection is equal to 0.1 pf. What is the time required for the INV gate to propagate the 0-to-1 and 1-to-0 transitions?

Thl = t_int_hl + t_ext_hl × capacitance = 0.079 ns + 2.710 ns/pf × (10 × 0.003 pf + 0.1 pf) = 0.4313 ns
Tlh = t_int_lh + t_ext_lh × capacitance = 0.151 ns + 4.891 ns/pf × (10 × 0.003 pf + 0.1 pf) = 0.7868 ns

In FPGA technologies these concepts are hidden: only a propagation delay is given for internal nodes (LUTs, multiplexers, nets, etc.), with no distinction between high-to-low and low-to-high transitions.
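The same computation can be scripted. A Python sketch (delay values from Table 5.1; the structure is mine):

```python
# (t_int_hl, t_int_lh) in ns and (t_ext_hl, t_ext_lh) in ns/pf, from Table 5.1
T_INT = {"INV": (0.079, 0.151)}
T_EXT = {"INV": (2.710, 4.891)}

def transition_times(gate, load_pf):
    """High-to-low and low-to-high delays: internal delay plus the
    external delay coefficient times the driven capacitance."""
    t_hl = T_INT[gate][0] + T_EXT[gate][0] * load_pf
    t_lh = T_INT[gate][1] + T_EXT[gate][1] * load_pf
    return t_hl, t_lh

# An INV driving ten INV inputs (10 x 0.003 pf) plus 0.1 pf of wiring:
t_hl, t_lh = transition_times("INV", 10 * 0.003 + 0.1)
```

With these values the function returns the 0.4313 ns and 0.7868 ns computed above; note that the low-to-high transition of this library's inverter is almost twice as slow as the high-to-low one.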
[Figures: glitch generation and propagation. A comparator output (a < b, a > b) can glitch between t0 and t2 when its inputs change through unbalanced paths; a second figure shows how glitches on signals b, c and d propagate in cascade through gates B, C, D, E, F and G between t0 and t4.]
Fig. 5.8 Typical single phase synchronization: a combinational circuit between flip-flops ff_i and ff_j, both clocked by clk; the combinational delay must fit within the clk period
Fig. 5.9 Setup time (t_su), hold time (t_h) and propagation delay (t_pr) of a register: the data input must be stable t_su before and t_h after the clock edge; the output is stable t_pr after the edge
the data are reliably sampled. These times are specified for any device or technology, and typically range from a few tens of picoseconds to a few nanoseconds in modern devices.
The data stored in the flip-flop are visible at its output a propagation delay (tpr) after the clock edge. This timing value is also known as the clock-to-output delay. Another related concept is the minimum clock pulse of a flip-flop: the minimum width of the clock pulse necessary to control the register.
5.2.3 Metastability
Whenever there is a setup or a hold time violation, the flip-flop can enter a metastable state (a quasi-stable state). In this state the flip-flop output is unpredictable, and it is considered a failure of the logic design. At the end of a metastable state, during which the output may show an in-between value, the flip-flop settles down to either 1 or 0. This whole process is known as metastability. Figure 5.10 illustrates this situation. The duration of the metastable state is a random variable that depends on the technology of the flip-flop. Circuit vendors provide information about metastability in their devices; as an example, the main FPGA vendors provide this kind of information [3-5].
Comments 5.1
1. Not all setup-hold window violations imply a metastable state. It is a probabilistic event.
2. Not all metastable states cause design failures. In fact, if the data output signal resolves to a valid state before the next register captures the data, then the metastable signal does not negatively impact the system operation.
[Fig. 5.10: a flip-flop whose input IN violates the setup-hold window (t_su, t_h) around a clock edge may enter a metastable state; the output OUT settles to a valid value only after a delay longer than the normal t_pr.]
MTBF = e^(K2 · Trec) / (K1 · fclk · fdata)    (5.1)

where:
fclk is the frequency of the clock receiving the asynchronous signal;
fdata is the toggling frequency of the asynchronous input data signal;
[MTBF, from 1e-6 s up to 1e18 s (from microseconds to a billion years), as a function of the recovery time Trec from 0.5 ns to 3.0 ns.]
Fig. 5.11 MTBF for a 300 MHz clock and 50 MHz data in a 130 nm technology
Trec is the available metastability settling time (recovery time), i.e. the time until the potentially metastable signal must reach a known value, 0 or 1;
K1 (in ns) and K2 (in 1/ns) are constants that depend on the device process and on the operating conditions.
If we can tolerate more time to recover from metastability (Trec), the MTBF grows exponentially. Faster clock frequencies (fclk) and faster-toggling data (fdata) worsen (reduce) the MTBF, since the probability of a setup-hold window violation increases. Figure 5.11 shows a typical MTBF graph, with average data for a 300 MHz clock and 50 MHz data in a 130 nm technology.
Comments 5.2
1. Observe the sensitivity to the recovery time. Following Fig. 5.11, if you are able to wait 2 ns to recover from metastability, the MTBF is less than two weeks; but if you can wait 2.5 ns, the MTBF is more than three thousand years.
2. Suppose the previous data (fclk = 300 MHz, fdata = 50 MHz, Trec = 2.5 ns) give an MTBF of around 3200 years. If we have a system with 256 input bits, the MTBF is reduced to 12.5 years; and if we additionally produce 100,000 systems, we have an MTBF of 1 h!
3. Measuring the time between metastability events using real designs under real operating conditions is impractical, because it is in the order of years. FPGA vendors determine the constant parameters in the MTBF equation by characterizing the FPGA for metastability using a detector circuit designed to have a short, measurable MTBF [3, 5].
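Equation (5.1) is easy to explore numerically. The sketch below uses the constants of Exercise 7 (K1 = 0.1 ns, K2 = 2 ns^-1, hypothetical values, not those of Fig. 5.11):

```python
from math import exp

def mtbf_seconds(t_rec_ns, f_clk_hz, f_data_hz, k1_ns=0.1, k2_per_ns=2.0):
    """MTBF = e^(K2*Trec) / (K1 * fclk * fdata), Eq. (5.1), result in seconds."""
    return exp(k2_per_ns * t_rec_ns) / (k1_ns * 1e-9 * f_clk_hz * f_data_hz)

# 100 MHz clock, 1 MHz data: each extra nanosecond of recovery time
# multiplies the MTBF by e^2, so it grows by orders of magnitude.
for t_rec in (1, 5, 10, 20):
    print(t_rec, "ns ->", mtbf_seconds(t_rec, 100e6, 1e6), "s")
```

With these constants the MTBF goes from under a millisecond at Trec = 1 ns to hundreds of thousands of years at Trec = 20 ns, which is exactly the exponential sensitivity discussed in Comment 5.2.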
[Figure: synchronization chain. An asynchronous input feeds a first flip-flop, whose output INT may go metastable; a second flip-flop, clocked one period later, samples the settled value and produces OUT.]
Fig. 5.13 Synchronizing asynchronous inputs in a design due to the metastability problem. a No registers. b One synchronization register. c Two registers in a synchronization chain
Fig. 5.14 Typical clock tree distribution. a Buffer tree (clk driving levels ck1, ck2, ck3). b Ideal H-routing distribution
Fig. 5.15 Setup and hold violations due to clock skew. a Example circuit: launching flip-flop(s) ff_i (clock ck_i) feeding capturing flip-flop(s) ff_j (clock ck_j) through a combinational circuit. b Setup violation or long-path fault. c Hold violation or race-through
64,000 FFs we need at least three levels of those buffers. Real clock trees use special buffers with more driving capability and other special characteristics, and have additional problems related to the interconnections. Figure 5.14 shows a classic clock tree that presents multiple levels of buffers, and the typical H-routing used to reduce the skew.
skew_i,j = t_i - t_j    (5.2)

This leads to two types of clock skew: positive skew (skew_i,j > 0) and negative skew (skew_i,j < 0). Positive skew occurs when the transmitting register (launching flip-flop, ff_i) receives the clock later than the receiving register (capturing flip-flop, ff_j); negative skew is the opposite: the sending register gets the clock earlier than the receiving register. Two types of timing violation (synchronization failures) can be caused by clock skew: setup and hold violations. They are described in what follows.
(5.3)

Observe that this inequation can always be satisfied if the clock period (T) is incremented. In other words, a positive clock skew, as shown in Fig. 5.15b, places a lower bound on the allowable clock period (or an upper bound on the operating frequency) as follows:

T > t_prop^max + t_comb^max + t_su^max + skew_i,j^max    (5.4)
[Figure: clock jitter: the actual clock edges deviate from their ideal positions by the jitter amount.]
double-clocking (because two FFs could capture on the same clock tick) or race-through, and is common in shift registers.
To describe this situation, consider the example in Fig. 5.15c. At time t_i^min, the clock edge triggers flip-flop ff_i, and the signal propagates through the flip-flop and the combinational logic in t_prop^min + t_comb^min (we use the shortest propagation delays, because we are considering the possibility of data racing). The input signal at ff_j has to remain stable for t_hold^max after the clock edge of the same clock cycle arrives (t_j^max). We use maximum values since we are considering the worst case. To summarize, the constraint to avoid a race condition (hold violation) is:

t_j^max + t_hold^max < t_i^min + t_prop^min + t_comb^min    (5.5)

(5.6)

Observe that a hold violation is more serious than a setup violation because it cannot be fixed by increasing the clock period.
In the previous analysis we considered several launching and capturing flip-flops. If all the FFs have the same characteristics, we do not need to distinguish maximum and minimum setup and hold times in Eqs. 5.3, 5.4, 5.5 and 5.6.
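The two constraints can be sketched as follows (my own helper functions; skew is defined as t_i - t_j, as above, and all the numeric values are hypothetical):

```python
def min_clock_period(t_prop_max, t_comb_max, t_su_max, skew_max):
    """Setup constraint (5.4): smallest clock period T such that
    T > t_prop + t_comb + t_su + skew, with skew = t_i - t_j."""
    return t_prop_max + t_comb_max + t_su_max + skew_max

def hold_ok(t_j_max, t_hold_max, t_i_min, t_prop_min, t_comb_min):
    """Hold constraint (5.5): the earliest new data must arrive after the
    capturing flip-flop's hold window has closed."""
    return t_j_max + t_hold_max < t_i_min + t_prop_min + t_comb_min

# Positive skew lengthens the required period:
T = min_clock_period(1.2, 6.0, 0.5, 0.4)       # 8.1 ns -> f below ~123 MHz
# A race is avoided only if the fast path is slower than the hold window:
ok = hold_ok(t_j_max=3.9, t_hold_max=0.3, t_i_min=2.1,
             t_prop_min=0.9, t_comb_min=1.5)   # 4.2 < 4.5 -> no race
```

Note that increasing T changes nothing in `hold_ok`: the hold constraint involves no clock period at all, which is why a hold violation cannot be fixed by slowing the clock.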
There are several sources of clock jitter. The analog component of the clock
generation circuitry and the clock buffer tree in the distribution network are both
significant contributors to clock jitter. There are also environmental variations that
cause jitter, such as power supply noise, temperature gradients, coupling of
adjacent signals, etc.
Fig. 5.17 Clock gating. a Register with enable. b Register with clock gating. c Clock gating risk
Fig. 5.18 A typical delay locked loop used to synchronize the internal clock: a phase controller adjusts a delay line so that the feedback clock clk_fb aligns clk_out, at the end of the clock network, with clk_in
Today's FPGAs have several of those components, for example the Xilinx Digital Clock Managers (DCM) and the Altera Phase-Locked Loops (PLL). One of the main actions performed by those components is to act as a Delay Locked Loop (DLL).
In order to illustrate the necessity of a DLL, consider the clock tree described in Sect. 5.3.1, Fig. 5.14, based on the cells of Table 5.1, that is, 64,000 FFs using three levels of buffers. The interconnection load is neglected (an unreal case). Assume, additionally, that we decide to connect 40 buffers to the first buffer, 40 buffers at the output of each buffer of the second level, and finally 40 FFs at the output of each of the last 1600 buffers. We calculate the transition time of a rising edge at input clk using the concepts of Sect. 5.1.2.4 (Tlh = t_int_lh + t_ext_lh × capacitance). In this scenario the signal ck1 will see the rising edge 0.44 ns later (0.056 ns + 2.399 ns/pf × (40 × 0.004 pf + interconnection)). The ck2 rising edge will occur 0.44 ns after ck1, and finally ck3 will be propagated 0.344 ns after ck2. That is, the clk signal will arrive at the FFs 1.22 ns later (remember that for simplicity we do not consider the interconnection load). If we want to work with a clock frequency of 400 MHz (2.5 ns period), the clock will arrive half a cycle late.
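The numbers above can be reproduced with a few lines of Python (cell data from Table 5.1; the script itself is mine):

```python
BUF_T_INT_LH = 0.056   # ns, BUF internal low-to-high delay (Table 5.1)
BUF_T_EXT_LH = 2.399   # ns/pf, BUF external delay coefficient
BUF_FAN_IN   = 0.004   # pf, one BUF input
DFF_FAN_IN   = 0.003   # pf, one DFF clock input

def buf_level_delay(n_loads, load_pf):
    """Rising-edge delay of one buffer level driving n_loads inputs
    (interconnection capacitance neglected)."""
    return BUF_T_INT_LH + BUF_T_EXT_LH * n_loads * load_pf

# Three levels: 1 buffer -> 40 buffers -> 1600 buffers -> 64,000 DFFs
total = (buf_level_delay(40, BUF_FAN_IN)     # ck1: ~0.44 ns
         + buf_level_delay(40, BUF_FAN_IN)   # ck2: ~0.44 ns
         + buf_level_delay(40, DFF_FAN_IN))  # ck3: ~0.344 ns
```

The total is about 1.22 ns, i.e. roughly half a period at 400 MHz: exactly the misalignment a DLL is inserted to cancel.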
[Figure: clock domain crossing. a System 1 (100 MHz, clk1) sends a signal that is an asynchronous input to System 2 (75 MHz, clk2). b A two-flip-flop synchronization chain in System 2 turns it into a synchronous input.]
[Figure: two bits of a bus (sig0, sig1) registered in another clock domain; because of different propagation delays, the two bits can be captured on different clock edges, producing an inconsistent value.]
Using synchronizer.
Handshake signaling method.
Asynchronous FIFO.
Open loop communication.
[Figure: handshake signaling between System 1 (clk1) and System 2 (clk2). A sending handshaking FSM asserts req when new_value loads data_s; req and ack cross the clock boundary through two-flip-flop synchronizers; a receiving handshaking FSM captures data_r and answers with ack.]
[Fig. 5.22: asynchronous FIFO. a Interface: write port (data_in, we, full) clocked by clk1 and read port (data_out, read, empty) clocked by clk2. b Implementation: a dual-port memory with FIFO write logic (write pointer WR_ptr, port 1) and FIFO read logic (read pointer RD_ptr, port 2).]
Dual-port FIFOs write with one clock and read with another. The FIFO storage provides buffering to help match rates between the different frequencies. Flow control is needed in case the FIFO gets totally full or totally empty. These signals are generated with respect to the corresponding clock: the full signal is used by system 1 (when the FIFO is full, we do not want system 1 to write data, because that data would be lost or would overwrite existing data), so it is driven by the write clock; similarly, the empty signal is driven by the read clock.
FIFOs of any significant size are implemented using an on-chip dual-port RAM (it has two independent ports). The FIFO is managed as a circular buffer using pointers: a write pointer to determine the write address and a read pointer to determine the read address (Fig. 5.22b). To generate the full/empty conditions, the write logic needs to see the read pointer and the read logic needs to see the write pointer. That leads to more synchronization problems (in certain cases they can produce metastability), which are solved using synchronizers and Gray encoding.
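The reason Gray encoding makes pointer crossing safe is that successive pointer values differ in exactly one bit, so a pointer sampled in the other clock domain is never off by more than one position. A Python sketch (my own; the depth is hypothetical, and a real FIFO would pass the Gray-coded pointers through two-flip-flop synchronizers):

```python
def bin_to_gray(n):
    """Gray encoding: consecutive values differ in exactly one bit."""
    return n ^ (n >> 1)

DEPTH = 16   # hypothetical FIFO depth; pointers carry one extra wrap bit

def is_empty(wr_ptr, rd_ptr):
    return wr_ptr == rd_ptr

def is_full(wr_ptr, rd_ptr):
    # full when the write pointer is one full lap ahead of the read pointer
    return (wr_ptr - rd_ptr) % (2 * DEPTH) == DEPTH

# Every increment of a Gray-coded pointer flips a single bit:
changes = [bin(bin_to_gray(i) ^ bin_to_gray(i + 1)).count("1")
           for i in range(2 * DEPTH - 1)]
```

The extra pointer bit is the standard trick that lets full and empty be distinguished even though both conditions compare the two pointers.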
Comments 5.4
1. Asynchronous FIFOs are used where performance matters, that is, when one does not want to waste clock cycles on handshake signals.
2. Most of today's FPGA vendors offer blocks of on-chip RAM that can also be configured as asynchronous FIFOs.
3. A FIFO is the hardware implementation of the data stream used in some computation models.
[Figure: open-loop communication between System 1 (clk1) and System 2 (clk2), assuming clk1 ≈ clk2 ± 5%: data_s is held for several cycles and captured by the receiver without an acknowledge.]
Fig. 5.24 Dynamic power consumption. a An inverter. b A general CMOS gate (pMOS and nMOS networks) charging a load capacitance CL, with Q(0→1) = CL · Vdd
A clock in a system has an activity factor a = 1, since it rises and falls every cycle. Most data have an activity factor lower than 0.5, i.e. they switch less than once per clock cycle. But real systems can have internal nodes with activity factors greater than one, due to glitches.
If the correct load capacitance on each node is calculated together with its activity factor, the dynamic power dissipation of the whole system can be calculated as:

P = f · Σ_i a_i · c_i · Vdd²    (5.7)

Another part of the dynamic component is the short-circuit power dissipation. During a transition, due to the rise/fall time, both the pMOS and nMOS transistors are on for a small period of time, in which current finds a path directly from Vdd to gnd, creating a short-circuit current. In power estimation tools this current can also be modeled as an extra capacitance at the output of the gates.
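Equation (5.7) in Python (my own sketch, with hypothetical node data):

```python
def dynamic_power(f_hz, nodes, vdd):
    """Eq. (5.7): P = f * sum_i(a_i * c_i) * Vdd^2, capacitances in farads."""
    return f_hz * sum(a * c for a, c in nodes) * vdd ** 2

# Hypothetical 100 MHz design at Vdd = 1.2 V: a clock net (a = 1)
# plus two data nodes with activity factors below 0.5:
nodes = [(1.0, 10e-12), (0.3, 50e-12), (0.2, 40e-12)]
p_dyn = dynamic_power(100e6, nodes, 1.2)   # ~4.75 mW
```

The quadratic dependence on Vdd is visible immediately: lowering the supply from 1.2 V to 1.0 V would cut this figure by roughly 30%.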
(5.8)

The energy is associated with the battery life. Thus, less energy indicates less power to perform a calculation at the same frequency. Energy is thus independent of the clock frequency: reducing the clock speed alone will degrade performance, but will not achieve savings in battery life (unless you also change the voltage).
That is why, typically, the consumption is expressed in mW/MHz when comparing circuits and algorithms that produce the same amount of data per clock cycle, and in nJ (nanojoules) when comparing the total consumption of alternatives that require different numbers of clock cycles to compute.
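The power-versus-energy distinction can be illustrated with two hypothetical systems performing the same task (all numbers invented for the example):

```python
def power_watts(voltage, current):
    return voltage * current

def energy_joules(voltage, current, time_s):
    """Energy = V * I * t: what the battery actually delivers for the task."""
    return voltage * current * time_s

# System A: faster but at a higher supply; System B: slower, lower supply.
p_a = power_watts(1.2, 1.5e-3)             # 1.8 mW
p_b = power_watts(1.0, 1.0e-3)             # 1.0 mW
e_a = energy_joules(1.2, 1.5e-3, 10e-6)    # 18 nJ for the whole task
e_b = energy_joules(1.0, 1.0e-3, 25e-6)    # 25 nJ for the whole task
```

System A draws more power, yet finishes the task with less energy, so it is the better choice for battery life: comparing in nJ per task, rather than mW, is what reveals this.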
5.5 Exercises
1. Redraw Fig. 5.5 (glitches for unbalanced paths) considering: (a) a NAND gate, (b) a NOR gate, (c) an XOR gate.
2. Redraw Fig. 5.6 (cascade effect of glitches) considering the transition (a) from 0000 to 1111; (b) from 0100 to 1101; (c) from 0101 to 1101.
3. Suppose one has a simple four-bit ripple-carry adder (Fig. 7.1). Analyze the glitch propagation when the operation changes from 0000 + 0000 to 1111 + 1111.
4. What are the rise time and fall time of an AND2 gate that connects four XOR2 gates at its output? Assume a total interconnection load of 12 pf (use the data of Table 5.1).
5. How many DFFs can a CKBUF drive, assuming the unreal case of no interconnection capacitance? What is the propagation delay of the rising edge?
6. Assuming the derating factors of Table 5.2, what are the delays of exercises 4 and 5 at 80 °C and a supply voltage of 3.0 V?
7. Determine the MTBF for K1 = 0.1 ns and K2 = 2 ns^-1, with a clock frequency of 100 MHz and data arriving at 1 MHz, for recovery times of 1, 5, 10 and 20 ns.
8. What is the MTBF expected for an asynchronous input that uses two synchronization flip-flops working at 100 MHz, using the data of the previous problem? The FFs have a setup time of one ns.
9. What is the delay of the falling edge with the data used in Sect. 5.3.4, i.e. 64,000 FFs using three levels of BUF, neglecting the interconnection load?
10. Calculate the levels of clock buffers (CKBUF) necessary to control 128,000 registers (DFF) of Table 5.1. Suppose additionally that every interconnection has a load of 0.006 pf.
11. For the previous clock tree, determine the propagation delay from the input clock signal to the clk input of a FF.
12. Draw a timing diagram of a two-stage handshaking protocol. Assume that the sending clock is faster than the receiving clock.
13. For the circuit of the figure, suppose that the FFs have a propagation delay between 0.9 and 1.2 ns, a setup time between 0.4 and 0.5 ns and a hold time between 0.2 and 0.3 ns.

[Figure: launching flip-flop(s) ff_i (clock ck_i) feeding capturing flip-flop(s) ff_j (clock ck_j) through a combinational circuit, both clocks derived from clk.]

The clock arrives at the different FFs of level i with a delay between 2.1 ns and 3.3 ns, and at level j with delays between 2.5 ns and 3.9 ns. What is the maximum combinational delay acceptable to work at 100 MHz?
14. Using the data of the previous exercise, what is the minimum combinational delay necessary to ensure correct functionality?
15. A system A works with a supply voltage of 1.2 V and needs 1.3 mA during 10 s to perform a computation. A second system B powered at 1.0 V consumes an average of 1.2 mA and needs 40 s to perform the same task. Which consumes less power, and which less energy?
16. In the shift register of the figure, assuming that all the flip-flops have a propagation delay of 0.9 ns, a setup time of 0.3 ns and a hold time of 0.2 ns, what is the maximum skew tolerated if the interconnections have delays (d1 and d2) of 0.1 ns?
[Figure: shift register with flip-flops ff_i, ff_j and ff_k clocked by ck_i, ck_j and ck_k, with interconnection delays d1 (between ff_i and ff_j) and d2 (between ff_j and ff_k), all clocks derived from clk.]
17. For the previous problem, what is the maximum frequency of operation?
18. The following FF (with the same temporal parameters as in exercise 16) is used to divide the clock frequency. What is the minimum delay d of the inverter and interconnection necessary for it to work properly?
[Figure: flip-flop ff_i clocked by clk, with its output fed back to its D input through an inverter and a delay d.]
References
1. Rabaey JM, Chandrakasan A, Nikolic B (2003) Digital integrated circuits, 2nd edn. Prentice-Hall, Englewood Cliffs
2. Wakerly JF (2005) Digital design principles and practices, 4th edn. Prentice-Hall, Englewood Cliffs. ISBN 0-13-186389-4
3. Xilinx Corp (2005) Metastable recovery in Virtex-II Pro FPGAs, XAPP094 (v3.0). http://www.xilinx.com/support/documentation/application_notes/xapp094.pdf
4. Actel Corp (2007) Metastability characterization report for Actel antifuse FPGAs. http://www.actel.com/documents/Antifuse_MetaReport_AN.pdf
5. Altera Corp (2009) White paper: understanding metastability in FPGAs. http://www.altera.com/literature/wp/wp-01082-quartus-ii-metastability.pdf
6. Pedram M (1996) Tutorial and survey paper: power minimization in IC design: principles and applications. ACM Trans Design Autom Electron Syst 1(1):3-56
7. Rabaey JM (1996) Low power design methodologies. Kluwer Academic Publishers, Dordrecht
Chapter 6
EDA Tools
[Figure: FPGA design flow. From the system specification: design entry, synthesis (producing a proprietary netlist), implementation (mapping, place & route), programming-file generation, and a programming tool downloading the bitstream. Design verification accompanies each stage: behavioural simulation, functional simulation, post-implementation and timing simulation (using back annotation), and in-circuit testing.]
Today's EDA tools allow mixing different design entries in a hierarchical structure. It is common to see a schematic top level with several predesigned IPs, some specific components developed in VHDL and/or Verilog, and subsystems designed in an ESL language.
HDL is the de facto design entry in most digital designs. In ASIC design Verilog is much more used, but FPGA designers use either VHDL or Verilog, or both. All FPGA tools support both languages, and even projects mixing the two.
The Verilog hardware description language has been in use longer than VHDL; it has been used extensively since it was launched by Gateway Design Automation in 1983. Cadence bought Gateway and opened Verilog to the public domain in 1990. It became IEEE standard 1364 in 1995, was updated in 2001 (IEEE 1364-2001) and had a minor update in 2005 (IEEE 1364-2005).
On the other hand, VHDL was originally developed at the request of the U.S. Department of Defense in order to document the behavior of existing and future ASICs. VHDL stands for VHSIC HDL (Very High Speed Integrated Circuit Hardware Description Language); it became IEEE standard 1076 in 1987, was updated in 1993 (IEEE standard 1076-1993), and was last updated in 2008 (IEEE 1076-2008, published in January 2009).
Xilinx ISE and Altera Quartus have text editors with syntax highlighting and language templates for VHDL and Verilog to help with editing.
6.1.2 Synthesis
Synthesis (or logic synthesis) is the process by which an abstract form of the circuit behavior (an HDL description) is transformed into a design implementation in terms of logic gates and interconnections. The output is typically a netlist and various reports. In this context, a netlist describes the connectivity of the electronic design, using instances, nets and, perhaps, some attributes.
There are several proprietary netlist formats, but most synthesizers can generate EDIF (Electronic Design Interchange Format), a vendor-neutral format to store electronic netlists.
FPGA vendors have their own synthesizers (Xilinx XST, Altera Quartus II Integrated Synthesis), but the main EDA vendors also offer synthesizers for FPGAs (Precision by Mentor Graphics and Synplify by Synplicity) that can be integrated into the FPGA EDA tools.
The optimizations available depend on the synthesizer but, most typical optimizations are present in all of them. Typical optimizations are for the area
reduction, the speed optimization, the low power consumption, and the target
frequency of the whole design.
Nevertheless, many more details can be controlled, such as:
Hierarchy Preservation: controls whether the synthesizer flattens the design, to get better results by optimizing across entity and module boundaries, or maintains the hierarchy during synthesis.
Add I/O buffers: enables or disables the automatic input/output buffer insertion; this option is useful to synthesize a part of a design to be instantiated later on.
FSM Encoding: selects the Finite State Machine (FSM) coding technique: automatic selection, one-hot, Gray, Johnson, user defined, etc.
Use of embedded components: use embedded memory or multiplier blocks, or use general purpose LUTs to implement these functionalities.
Maximum fan-out: limits the fan-out (maximum number of connections) of nets or signals.
Register duplication: allows or limits register duplication to reduce fan-out and to improve timing.
Retiming or Register Balancing: automatically moves registers across combinational gates or LUTs to improve timing while maintaining the original behavior.
The complete description of the synthesis optimizations can be found in the synthesis tool documentation (for example, for FPGAs, [3, 4, 6, 8]).
The synthesis behavior and optimizations can be controlled using synthesis constraints. The constraints can be introduced using the integrated environment, a constraint file, or by embedding them in the HDL code.
Constraints can select how to synthesize the HDL (encoding of FSMs, type of memories or multipliers) or request a specific optimization (duplicate registers, register balancing, maximum fan-out, etc.). In VHDL this is done using attributes defined in the declarative part. The following lines of code define the maximum fan-out of signal a_signal as 20 connections in XST [8].
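The exact attribute lines were lost in this copy; a plausible sketch of the XST syntax (the attribute name and its string form follow the XST user guide, but check [8] before relying on it):

```vhdl
-- Declare the XST synthesis attribute, then apply it to the signal.
attribute max_fanout : string;
attribute max_fanout of a_signal : signal is "20";
```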
The inputs to the implementation are the netlist(s) generated during synthesis and the design implementation constraints. The output is a proprietary placed and routed netlist (an ncd file in Xilinx, an internal database representation in Altera) and several reports summarizing the results. Typically the design implementation uses timing and area constraints; a general description is given in Sect. 6.2.
[Figure: basic timing paths in a design: input pad to first level of registers (input speed), register-to-register combinational paths (clock speed), and last level of registers to output pad (output speed), together with pin and logic location constraints.]
6.3.1 Simulation
Logic simulation is the primary tool used for verifying the logical correctness of a hardware design. In many cases, logic simulation is the first activity performed in the design process. There are different points at which you can simulate your design; the three most relevant are:
RTL-level (behavioral) simulation. No timing information is used.
Post-synthesis simulation. In order to verify the synthesis result.
[Figure: design verification flow: the HDL design is checked by RTL (behavioral) simulation with a testbench providing stimulus; after synthesis, a post-synthesis (gate-level) simulation uses the vendor libraries; after implementation (place & route) and back annotation, a timing simulation uses the timing libraries; finally, the programming file (bitstream) is generated with the programming tool and in-circuit testing is performed.]
Post-implementation (post place & route) simulation. Also known as timing simulation because it includes block and net delays.
The behavioral (RTL-level) simulation enables you to verify or simulate a description at the system level. This first-pass simulation is typically performed to verify code syntax and to confirm that the code is functioning as intended. At this step, no timing information is available, and simulation is performed in unit-delay mode to avoid the possibility of a race condition. The RTL simulation is not architecture-specific, but it can contain instantiations of architecture-specific components, in which case additional libraries are necessary to simulate.
Post-synthesis simulation allows you to verify that your design has been synthesized correctly, and makes you aware of any differences due to the lower level of abstraction. Most synthesis tools can write out a post-synthesis HDL netlist to be used by a simulator. If the synthesizer uses architecture-specific components, additional libraries should be provided.
The timing simulation (post-implementation, or post place & route, full timing) is performed after the circuit is implemented. The general functionality of the design was defined at the beginning, but timing information cannot be accurately calculated until the design has been placed and routed.
After the implementation tools have run, a timing simulation netlist can be created; this process is known as back annotation. The result of back annotation is an HDL file describing the implemented design in terms of low-level components, plus an additional SDF (Standard Delay Format) file with the internal delays, which allows you to see how your design behaves in the actual circuit.
Xilinx ISE has its own simulator (ISE Simulator, ISim) but can also operate with Mentor ModelSim or Questa and with other external third-party simulation tools.
An external logic analyzer provides visibility but can see only a limited number of signals, which must be determined ahead of time.
The reprogrammability of FPGAs opens new ways to debug designs. It is possible to add an internal logic analyzer within the programmed logic. Xilinx ChipScope Pro Analyzer and Altera SignalTap II Logic Analyzer are tools that allow performing in-circuit verification, also known as on-chip debugging. They use the internal RAM to store values of internal signals and communicate with the external world using the JTAG connection.
Another, intermediate solution consists of inserting probes of internal nets anywhere in the design and connecting the selected signals to unused pins (using Xilinx FPGA Editor, Altera SignalProbe, or Actel Designer Probe Insertion).
139
IN
D
dff
IN
SIN
test
OUT
IN
OUT
IN
OUT
IN
...
Q
IN
OUT
...
...
OUT
...
OUT
clk
reset
IN
D
IN
IN
OUT
IN
IN
OUT
...
IN
OUT
OUT
IN
OUT
OUT
OUT
Since timing analysis is capable of verifying every path, it can detect problems caused by clock skew (Sect. 5.3) or by glitches (Sect. 5.1.3).
Timing analysis can be performed interactively, asking for different paths, but it is typically used to report the slack with respect to the timing requirements expressed in the timing constraints (Sect. 6.2.1).
The FPGA tools, after implementing the design, perform a default timing analysis to determine the system performance of the design. The analysis is based on the basic types of timing paths: clock period; input pads to first level of registers; last level of registers to output pads; and pad to pad (in asynchronous paths). Each of these paths goes through a sequence of routing and logic. In Xilinx ISE the tool is called Timing Analyzer, and in Altera Quartus II it is TimeQuest. More advanced analyses can be performed using the specific tool.
The power analyzer takes the design netlist, the activity of the circuit, the supply voltage and the ambient temperature, and reports the consumed current (power) and the junction temperature. Nevertheless, the junction temperature itself depends on the ambient temperature, the voltage level, and the total current supplied. But the total current supplied includes the static current as a component that depends on temperature and voltage, so a clear circular dependency exists. The tools use a series of iterations to converge on an approximation of the static power for given operating conditions.
A significant test bench that models the real operation of the circuit provides the
necessary activity of each node. The simulation result of the activity is saved in a
file (SAIF or VCD file) that is later used in conjunction with the capacitance
information by the power analyzer.
The Value Change Dump (VCD) file is an ASCII file containing the value
change details for each step of the simulation and can be generated by most
simulators. The computation time of this file can be very long, and the resulting
file size is typically huge. On the other hand, the SAIF (Switching Activity
Interchange format) file contains toggle counts (number of changes) on the signals
of the design and is supported also for most simulators. The SAIF file is smaller
than the VCD file, and recommended for power analysis.
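The way toggle counts feed a power estimate can be illustrated with the classic dynamic-power formula P = Σ 0.5·α·C·Vdd²·f. The sketch below is purely illustrative (it is not any vendor's algorithm, and the net names, counts and capacitances are made up):

```python
def dynamic_power(toggles, caps, vdd, f_clk, t_sim):
    """Illustrative dynamic-power estimate (not a vendor algorithm):
    P = sum over nets of 0.5 * alpha * C * Vdd^2 * f_clk, where alpha
    is the net's toggle rate per clock cycle derived from SAIF-style
    toggle counts collected over t_sim seconds of simulation."""
    p = 0.0
    for net, count in toggles.items():
        alpha = count / (t_sim * f_clk)      # toggles per clock cycle
        p += 0.5 * alpha * caps[net] * vdd ** 2 * f_clk
    return p

# Two hypothetical nets, 100 MHz clock, 1 ms of simulated time.
toggles = {"a": 50_000, "b": 200_000}        # counts from a SAIF/VCD file
caps = {"a": 10e-15, "b": 25e-15}            # net capacitances in farads
print(dynamic_power(toggles, caps, 1.2, 100e6, 1e-3))
```

The real tools additionally iterate on static power versus junction temperature, as explained above.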
With regard to automatic EDA improvements for low power: as mentioned previously, in synthesis it is possible to have power reduction as a target. In implementation, it is possible to request routing that reduces power consumption. In this case, an activity file (VCD or SAIF) can be specified to guide place & route when it optimizes for power reduction.
floating point adder of Chap. 12, which we will implement in both vendor tools. The simple example uses three VHDL files: FP_add.vhd, FP_right_shifter.vhd, and FP_leading_zeros_and_shift.vhd.
If you want to begin with no previous experience, it is highly recommended to start with the tutorial included in the tools. In the case of Quartus, follow the Quartus II Introduction for VHDL/Verilog Users, accessible from the help menu. If you start using ISE, we recommend the ISE In-Depth Tutorial, accessible using Help → Xilinx on the Web → Tutorials.
Fig. 6.8 Assigning timing constraints, and the generated ucf text file
After synthesis you can review the graphical representation of the synthesized circuit (View RTL Schematic) or review the synthesis report (.syr file). This report gives relevant information: the HDL synthesis part of the report describes the inferred components as a function of the HDL, and the final report summarizes the resource utilization and gives a preliminary timing report.
The result of the implementation is a proprietary placed and routed netlist (ncd file) and several reports. There are three main reports generated: the map report (.mrp), the place and route report (.par), and the post-PAR static timing report (.twr) (Fig. 6.9).
The map report shows detailed information about the used resources (LUTs, slices, IOBs, etc.); the place and route report gives a clock network report, including the skew, and states which constraints were met and which were not. Finally, the post-PAR static timing report gives the worst case delay with respect to the specified constraints.
You can obtain additional graphical information about the place and route results using FPGA Editor and PlanAhead. Use the FPGA Editor to view the actual design layout of the FPGA (in the Processes pane, expand Place & Route, and double-click View/Edit Routed Design (FPGA Editor)). The PlanAhead software can be used to perform post place & route design analysis. You can observe, graphically, the timing paths on the layout, and also perform floorplanning of the design. In order to open PlanAhead, in the Processes pane expand Place & Route and double-click Analyze Timing/Floorplan Design (PlanAhead).
and the implementation that we obtain is slower. Then, some input pattern combinations violate the setup time and, consequently, give errors. You can modify the test bench in order to operate at a lower frequency (Fig. 6.10).
for the various steps of the implementation process (Design Utilities → View Command Line Log File). This allows you to verify the options being used or to create a command batch file to replicate the design flow.
Quartus II as was explained for Xilinx ISE (Sect. 6.6.1). Altera has an application note, Altera Design Flow for Xilinx Users [1], where this process is carefully explained.
We start by creating a new project (File → New Project Wizard) with name fp_add. We add the three VHDL source files to the project (step 2 of 5). We select a Stratix III device, EP3SL50F484C3. Leave the default synthesis and simulation options (Fig. 6.14).
For the rest of the steps in the implementation and simulation flow, Table 6.1 summarizes the GUI (graphical user interface) names for similar tasks in Xilinx ISE and Altera Quartus II.
6.7 Exercises
1. Implement the floating-point adder of Chap. 12 in Altera Quartus on a Stratix III device. What are the area results? Add implementation constraints in order to add FP numbers at 100 MHz. Are the constraints met?
2. Implement the floating-point multiplier of Chap. 12 in Xilinx ISE on a Virtex 5 device. What are the area results? Is it possible to multiply at 100 MHz? Remember to add implementation constraints. Multiplying at 20 MHz, what is the expected power consumption? Use the XPower tool and the provided test bench.
Table 6.1 GUI names for similar tasks in Xilinx ISE and Altera Quartus II

GUI feature                  | Xilinx ISE                                                    | Altera Quartus II
HDL design entry             | HDL editor                                                    | HDL editor
Schematic entry              | Schematic editor                                              | Schematic editor
IP entry                     | CoreGen and Architecture Wizard                               | MegaWizard Plug-In Manager
Synthesis                    | Xilinx Synthesis Technology (XST), third-party EDA synthesis  | Quartus II Integrated Synthesis (QIS), third-party EDA synthesis
Synthesis constraints        | XCF (Xilinx constraint file)                                  | Same as implementation
Implementation constraints   | UCF (user constraint file)                                    | QSF (Quartus II settings file) and SDC (Synopsys design constraints file)
Timing constraint wizard     | Constraints Editor                                            | SDC editor
Pin constraint wizard        | PlanAhead (PinPlanning)                                       | Pin Planner
Implementation               | Translate, Map, Place and Route                               | Quartus II Integrated Synthesis (QIS), Fitter
Static timing analysis       | Xilinx Timing Analyzer                                        | TimeQuest Timing Analyzer
Generate programming file    | BitGen                                                        | Assembler
Power estimator              | XPower Estimator                                              | PowerPlay Early Power Estimator
Power analysis               | XPower Analyzer                                               | PowerPlay Power Analyzer
Simulation                   | ISE Simulator (ISim), third-party simulation tools            | ModelSim-Altera, third-party simulation tools
Co-simulation                | ISim co-simulation                                            | ModelSim-Altera co-simulation
In-chip verification         | ChipScope Pro                                                 | SignalTap II Logic Analyzer
View and edit placement      | PlanAhead, FPGA Editor                                        | Chip Planner
Configure device             | iMPACT                                                        | Programmer
3. Implement the pipelined adder of Chap. 3. Analyze the area-time-power trade-off for different logic depths in the circuit. Use, for the experiment, a Virtex 5 device and compare the result with the new-generation Virtex 7. What happens in Altera when comparing Stratix III devices with Stratix V?
4. Implement the adders of Chap. 7 (use the provided VHDL models). Add input and output registers to the models in order to estimate the maximum frequency of operation.
5. Compare the implementation results of the radix-2^k adders of Sect. 7.3 in Xilinx devices using muxcy and using a behavioural description of the multiplexer (use the provided VHDL models of Chap. 7).
6. Implement the restoring, non-restoring, radix-2 SRT and radix-2^k SRT dividers of Chap. 9 in Xilinx and Altera devices (use the provided VHDL models of Chap. 9).
7. Compare the results of the square root methods of Chap. 10 in the Altera and Xilinx design flows.
References
1. Altera Corp. (2009) AN 307: Altera design flow for Xilinx users. http://www.altera.com/literature/an/an307.pdf
2. Altera Corp. (2011a) Altera Quartus II software environment. http://www.altera.com/
3. Altera Corp. (2011b) Design and synthesis. In: Quartus II integrated synthesis, Quartus II handbook version 11.0, vol 1. http://www.altera.com/
4. Mentor Graphics (2011) Mentor Precision synthesis reference manual 2011a. http://www.mentor.com/
5. Microsemi SoC Products Group (2010) Libero integrated design environment (IDE) v9.1. http://www.actel.com/
6. Synopsys (2011) Synopsys FPGA synthesis user guide (Synplify, Synplify Pro, or Synplify Premier). http://www.synopsys.com/
7. Xilinx Inc. (2011a) Xilinx ISE (Integrated Software Environment) design suite. http://www.xilinx.com/
8. Xilinx Inc. (2011b) XST (Xilinx Synthesis Technology) user guide, UG687 (v 13.1). http://www.xilinx.com/
Chapter 7
Adders
[Figure: n-digit ripple-carry adder: n 1-digit adders connected in cascade; the carries c_1, ..., c_n propagate from the least significant position (inputs x_0, y_0, c_0) to the most significant one (outputs z_{n-1} and c_n).]
[Figure: carry-logic cell: a LUT computes the propagate function of x_i and y_i, which controls a multiplexer selecting either the incoming carry c_i or x_i as the outgoing carry c_{i+1}.]
The total adder delay is approximately equal to
T_adder(n) ≅ n·T_mux + constant terms, (7.1)
and the delay from the input carry to the output carry is equal to
T_carry-to-carry(n) ≅ n·T_mux. (7.2)
7:2
Comment 7.1
Most FPGAs include the basic components to implement the structure of Fig. 7.3, and the synthesis tools automatically generate this optimized adder from a simple VHDL expression such as an addition of unsigned vectors (for instance z <= x + y with the ieee.numeric_std package).
T_adder(n) ≅ T_adder(k) + m·T_mux + T_half-adder(k), (7.3)
and the delay from the input carry to the output carry to
T_carry-to-carry(n) ≅ m·T_mux. (7.4)
[Fig. 7.4: k-bit half adder built from the carry-chain structure of Fig. 7.3, with x = 0, y = s and c_0 = 1; the multiplexers propagate the carry and produce z_{n-1}, ..., z_0.]
The following VHDL model describes the basic cell of Fig. 7.4.
The corresponding circuit is a k-bit half adder that computes t = (s mod 2^k) + 1. The most significant bit t_k of t is equal to 1 if, and only if, all the bits of (s mod 2^k) are equal to 1. As mentioned above (Comment 7.1), most FPGAs include the basic components to implement the structure of Fig. 7.3. In the particular case where x = 0, y = s and c_0 = 1, the circuit of Fig. 7.4 is obtained. The apparently unnecessary XOR gates are included because there is generally no direct connection between the adder inputs and the multiplexer control inputs. Actually, the
[Fig. 7.5: radix-2^k adder: m blocks of k bits; within each block, a k-bit adder computes s_i = x_i + y_i, a chain of 2-input AND gates computes the block propagate signal p_i, a multiplexer controlled by p_i lets the incoming carry c_i skip the block, and a k-bit half adder adds c_i to s_i in order to produce z_i and c_{i+1}.]
XOR gates are LUTs whose outputs are permanently connected to the carry-logic multiplexers.
A complete generic model base_2k_adder.vhd is available at the authors' web page, and examples of FPGA implementations are given in Sect. 7.9.
According to (7.3), the non-constant terms of T_adder(n) are:
m·T_mux,
k·T_mux, included in T_adder(k) according to (7.1),
k·T_mux, included in T_half-adder(k) according to (7.1).
Thus, the sum of the non-constant terms of T_adder(n) is equal to (2k + m)·T_mux. The value of 2k + m, with m·k = n, is minimum when 2k ≅ m, that is, when k ≅ (n/2)^1/2. With this value of k, the sum of the non-constant terms of T_adder(n) is equal to (8n)^1/2·T_mux. Thus, the computation time is O(n^1/2) instead of O(n).
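The optimization above can be checked numerically. A small Python sketch (illustrative only; the function names are ours, and only the non-constant delay terms, in units of T_mux, are counted):

```python
import math

def skip_delay_terms(n, k):
    """Non-constant delay terms of the radix-2^k adder, in units of
    T_mux: 2k + m, with m = ceil(n/k) groups of k bits."""
    m = math.ceil(n / k)
    return 2 * k + m

def best_k(n):
    # Exhaustive search over the group size k.
    return min(range(1, n + 1), key=lambda k: skip_delay_terms(n, k))

print(best_k(128), math.isqrt(128 // 2))  # the optimum is close to (n/2)**0.5
```

For n = 128 the search returns k = 8, matching (n/2)^1/2 = 8.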
Fig. 7.6 Carry select adder
[Figure: for every k-bit group, two sums z_i^0 and z_i^1 are computed in parallel, one with incoming carry 0 (a row of 1-bit adders) and one with incoming carry 1 (a row of 1-bit half adders); the actual carry c_i selects, through multiplexers, the correct sum z_i and the outgoing carry c_{i+1}.]
Comments 7.2
1. The circuit of Fig. 7.4 is an example of carry-skip adder. For every group of k bits, both the carry-propagate and carry-generate functions are computed. If the carry-propagate function is equal to 1, the input carry is directly propagated to the carry output of the k-bit group, thus skipping k bits.
2. A mixed-radix numeration system could be used. Assume that n = k_1 + k_2 + ... + k_m; then a radix (2^k_1, 2^k_2, ..., 2^k_m) representation can be considered. The corresponding adder consists of m blocks, similar to that of Fig. 7.3, whose sizes are k_1, k_2, ..., and k_m, respectively. Nevertheless, within an FPGA it is generally better to use adders that fit within a single column. Assuming that the chosen device has r carry-logic cells per column, a good option could be a fixed-radix adder with k ≤ r. In order to minimize the computation time, k must be approximately equal to (n/2)^1/2, so that n must be smaller than 2r^2, which is a very large number.
[Fig. 7.7: modified carry-select cell: two independent k-bit adders compute, in parallel, z_i^0 (carry-in c_i^0 = 0) and z_i^1 (carry-in c_i^1 = 1).]
T_adder(n) ≅ T_adder(k) + 2(m − 1)·T_mux + T_mux, (7.5)
and the delay from the input carry to the output carry to
T_carry-to-carry(n) ≅ m·T_mux. (7.6)
The following VHDL model describes the basic cell of Fig. 7.6.
Comment 7.3
As before (Comments 7.2), a mixed-radix numeration system could be considered.
As a matter of fact, the FPGA implementation of a half adder is generally not more cost-effective than the implementation of a full adder. So, the circuit of Fig. 7.6 can be slightly modified: instead of computing c_i^0 and c_i^1 with a full adder and a half adder, two independent full adders of any type can be used (Fig. 7.7).
The following VHDL model describes the modified cell:
[Fig. 7.8: sequential (digit-serial) adder: k-to-1 multiplexers select the s-digit groups of x and y, an s-digit adder computes one group of z per cycle, a D flip-flop (initialized to cin) stores the carry between cycles, and enable signals en_0, ..., en_{k-1} load the output register groups z_{s-1..0}, ..., z_{n-1..n-s}; the final carry gives z_n.]
The complete circuit (Fig. 7.8, with k = n/s) is made up of an s-digit adder, connection resources (k-to-1 multiplexers) giving access to the s-digit groups, a D flip-flop which stores the carries (c_i in Algorithm 7.1), an output register storing z, and a control unit whose main component is a k-state counter.
The following VHDL model describes the circuit of Fig. 7.8 (B = 2).
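The behavior of this circuit can also be sketched in Python (an illustrative model of the cycle-by-cycle operation, not the book's VHDL; the function name is ours):

```python
def serial_add(x, y, n, s, cin=0):
    """Behavioral model of the digit-serial adder of Fig. 7.8 (B = 2):
    the n-bit operands are processed s bits per clock cycle, and the
    inter-group carry is stored in a flip-flop between cycles."""
    assert n % s == 0
    mask = (1 << s) - 1
    carry, z = cin, 0
    for i in range(n // s):                  # k = n/s clock cycles
        xi = (x >> (i * s)) & mask
        yi = (y >> (i * s)) & mask
        t = xi + yi + carry
        z |= (t & mask) << (i * s)
        carry = t >> s                       # carry stored for next cycle
    return z | (carry << n)                  # the last carry is z_n

print(hex(serial_add(0xABCD, 0x1234, 16, 4)))  # 0xbe01
```

The trade-off is the one discussed in the text: a small s-digit adder reused over k cycles, at the price of a k-fold longer computation time.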
[Fig. 7.9: sequential multioperand adder: an m-to-1 multiplexer (controlled by sel) selects one operand x_j per cycle; an n-digit adder accumulates it into an n-digit register, initially 0, under control of en_acc and load.]
Its computation time is approximately equal to m·T_adder(n). (7.8)
[Fig. 7.10: carry-save adder basic component: n 1-digit adders working in parallel; position i receives x_i, y_i and c_i and produces z_i and the stored carry d_i.]
In order to reduce the computation time, a carry-save adder can be used. The basic component is shown in Fig. 7.10: it consists of n 1-digit adders working in parallel. Given two n-digit numbers x and y, and an n-bit number c, it expresses the sum (x + y + c) mod B^n under the form z + d, where z is an n-digit number and d an n-bit number. In other words, the carries are stored within the output binary vector d instead of being propagated (stored-carry encoding). As all cells work in parallel, the computation time is independent of n. Let CSA be the function implemented by the circuit of Fig. 7.10, that is
CSA(x, y, c) = (z, d),
where
z_i = (x_i + y_i + c_i) mod B, d_i = ⌊(x_i + y_i + c_i)/B⌋, ∀i ∈ {0, 1, ..., n − 1}.
Assume that at every step of Algorithm 7.2 the value of the accumulator is represented under the form u + v, where u is an n-digit number and v an n-bit number. Then, at step j, the following operation must be executed:
(u, v) := CSA(u, x_j, v).
The following formal algorithm computes z.
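The CSA function just defined is easy to model; a small Python sketch (illustrative only, with little-endian digit lists and a helper of our own to evaluate them):

```python
def csa(x, y, c, B=2):
    """Carry-save addition (the CSA function of Fig. 7.10): digit-wise
    sums and carries, with no carry propagation between positions."""
    z = [(xi + yi + ci) % B for xi, yi, ci in zip(x, y, c)]
    d = [(xi + yi + ci) // B for xi, yi, ci in zip(x, y, c)]
    return z, d

def value(digits, B=2):
    # Interpret a little-endian digit list as an integer.
    return sum(v * B ** i for i, v in enumerate(digits))

x, y, c = [1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 0, 0]
z, d = csa(x, y, c)
# Stored-carry invariant: x + y + c == z + B*d.
print(value(x) + value(y) + value(c) == value(z) + 2 * value(d))  # True
```

Since every position is computed independently, the delay of the hardware version is that of a single 1-digit adder, whatever n.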
Fig. 7.11 Carry save adder
[Figure: sequential carry-save accumulator: an m-to-1 multiplexer (sel) selects operand x_j; an n-digit register and an n-bit register, both initially 0, store the accumulator in stored-carry form, updated every cycle under control of en_acc and load; a final n-digit adder resolves the result.]
[Fig. 7.12: combinational multioperand carry-save adder: an array of carry-save adder rows reduces the m operands x_0, ..., x_{m-1}; a final 2-operand adder produces z_{n-1}, ..., z_0.]
T_csa(m, n) ≅ m·T_adder(1) + T_adder(n). (7.10)
The following VHDL model describes the circuit of Fig. 7.12 (B = 2). As before, x is the concatenation of x_0, x_1, ..., x_{m-1}.
[Fig. 7.13: adder tree: carry-save adders arranged as a tree reduce the m operands x_0, ..., x_{m-1} to two vectors u and v.]
The depth of the tree is equal to ⌈log_2 m⌉ and its computation time (one of the critical paths has been shaded) is approximately equal to
T_adder-tree(m, n) ≅ (n + ⌈log_2 m⌉ − 1)·T_adder(1). (7.12)
The following VHDL model describes the circuit of Fig. 7.13 (B = 2).
T_csa(m, n) ≅ (m − 2)·T_adder(1) + T_adder(n). (7.13)
The following VHDL model describes a 2-operand carry-save adder, also called 3-to-2 counter (Sect. 7.7.3). It corresponds to a row of the circuit of Fig. 7.14.
Fig. 7.15 6-to-3 counter
[Figure: n 6-operand 1-bit adders working in parallel; bit slice i receives x_{0,i}, ..., x_{5,i} and produces u_i, v_i and w_i.]
[Figure: a 24-to-3 counter built from a tree of 6-to-3 counters.]
[Figure: 24-operand adder: a 24-to-3 counter, followed by a 3-to-2 counter and a 2-operand adder computing z.]
Then, by connecting in parallel n circuits of this type, a binary 6-to-3 counter is obtained (Fig. 7.15).
The counter of Fig. 7.15 can in turn be used as a building block for generating counters with a greater number of operands.
[Figure: radix-B adder built around an n-digit adder with carry-in c_in, producing z_n and z_{n-1..0}.]
Fig. 7.19 Radix-B B's complement subtractor
[Figure: the (B−1)'s complement of y_{n-1..0} is computed and added to x_{n-1..0} with c_in = 1, producing z_n and z_{n-1..0}.]
[Figure: B's complement adder-subtractors built around an (n+2)-bit adder, with sign extension of the operands, producing z_{n+1..0}.]
They use simple logic circuitry, and permit the use of efficient design techniques such as pipelining and digit-serial processing.
Table 7.1 Binary adders

n      LUTs   Delay
32     32     2.25
64     64     2.98
128    128    4.44
256    256    7.35
512    512    13.1
1024   1024   24.82
n      k    LUTs   Delay
32     4    88     2.92
64     4    176    3.11
64     5    143    3.05
64     8    152    3.64
64     16   140    4.95
128    8    304    3.85
128    12   286    4.73
128    16   280    5.04
256    16   560    5.22
256    12   572    4.98
256    13   551    4.99
512    16   1120   5.59
1024   16   2240   6.31
1024   22   2303   6.15
1024   23   2295   6.13
1024   32   2304   6.41
n      k    LUTs   Delay
32     6    84     4.83
32     8    72     3.99
32     4    60     3.64
64     8    144    6.06
64     16   199    4.17
64     4    120    4.03
128    12   417    5.37
128    16   399    4.87
256    16   799    5.69
256    32   783    5.26
256    13   819    5.64
512    16   1599   6.10
512    32   1567   6.09
512    23   1561   6.16
1024   16   3199   6.69
1024   64   3103   6.74
1024   32   3135   6.52
n      k    Delay
32     8    3.99
256    16   5.69
512    32   6.09
1024   32   6.52
n      n1   n2   n3   LUTs   Delay
256    16   4    4    1452   6.32
256    4    16   4    684    6.81
512    8    8    8    2120   10.20
512    4    16   8    1364   7.40
512    16   4    8    2904   7.33
1024   16   4    16   5808   10.33
1024   16   16   4    6242   7.79
n      s    k    FF     LUTs   Period   Total time
128    8    16   135    107    3.21     51.36
128    16   8    134    97     3.14     25.12
128    32   4    133    132    3.18     12.72
128    64   2    132    137    3.45     6.90
256    16   16   263    187    3.40     54.40
256    32   8    262    177    3.51     28.08
256    64   4    261    234    3.87     15.48
512    16   32   520    381    3.92     125.44
512    32   16   519    347    3.78     60.48
512    64   8    518    337    4.26     34.08
1024   16   64   1033   757    4.20     268.80
1024   32   32   1034   717    4.32     138.24
1024   64   16   1031   667    4.55     72.80
2048   32   64   2063   1427   4.45     284.80
2048   64   32   2056   1389   5.04     161.28
n    m    FF   LUTs   Period   Total time
8    4    12   23     2.25     9.00
8    8    13   32     2.37     18.96
16   16   22   90     2.71     43.36
16   8    21   56     2.57     20.56
32   32   39   363    3.72     119.04
32   16   38   170    3.09     49.44
32   64   40   684    3.89     248.96
64   64   72   1356   4.62     295.68
64   32   71   715    4.48     143.36
64   16   70   330    4.41     70.56
n    m    FFs   LUTs   Period   Total time
8    4    19    37     1.81     7.24
8    8    20    46     1.81     14.48
16   16   37    120    1.87     29.92
16   8    36    86     1.84     14.72
32   32   70    425    2.57     82.24
32   16   69    232    1.88     30.08
32   64   71    746    2.68     171.52
64   64   135   1482   2.69     172.16
64   32   134   841    2.61     83.52
64   16   133   456    1.9      30.40
n    m    LUTs   Delay
8    4    21     2.82
8    8    47     5.82
16   16   219    11.98
16   8    103    6.00
32   32   947    24.32
32   8    215    6.36
32   16   459    12.35
32   64   1923   47.11
64   64   3939   49.98
64   32   1939   25.04
64   16   939    13.07
n    m    LUTs   Delay
8    4    22     2.93
8    8    68     5.49
16   16   314    10.26
16   8    135    5.59
32   32   1388   20.03
32   8    279    5.95
32   16   649    10.65
32   64   2868   37.75
64   64   5844   39.09
64   32   2828   20.31
64   16   1321   11.35
n    LUTs   Delay
8    50     3.78
16   106    3.97
32   218    4.33
64   442    5.06
n    LUTs   Delay
8    157    4.59
16   341    4.77
24   525    4.95
32   709    5.13
64   1445   5.86
[Fig. 7.21 Delay in function of the number of bits for several 2-operand adders: delay (ns, 0 to 30) versus number of bits (0 to 1200) for the normal, Base_2k, Carry_sel and log_add adders.]
7.9.8 Comparison
A comparison between four types of 2-operand adders, namely binary (normal), radix-2^k, carry-select and logarithmic adders, has been made: Fig. 7.21 gives the corresponding adder delays (ns) as a function of the number n of bits.
7.10 Exercises
1. Generate a generic model of a 2's complement adder-subtractor with overflow detection.
2. An integer x can be represented under the form (−1)^s·m, where s is the sign of x and m its magnitude (absolute value). Design an n-bit sign-magnitude adder-subtractor.
3. Design several n-bit counters, for example
7-to-3,
31-to-3,
5-to-2,
26-to-2.
4. Design a self-timed 64-bit adder with end of computation detection (done
signal).
References
1. Parhami B (2000) Computer arithmetic: algorithms and hardware design. Oxford University Press, New York
2. Ling H (1981) High-speed binary adder. IBM J Res Dev 25(3):156-166
3. Brent R, Kung HT (1982) A regular layout for parallel adders. IEEE Trans Comput C-31:260-264
4. Ladner RE, Fischer MJ (1980) Parallel prefix computation. J ACM 27:831-838
5. Ercegovac MD, Lang T (2004) Digital arithmetic. Morgan Kaufmann, San Francisco
6. Deschamps JP, Bioul G, Sutter G (2006) Synthesis of arithmetic circuits. Wiley, New York
Chapter 8
Multipliers
The basic computation primitive is
z = x·y + u + v. (8.1)
[Fig. 8.1: 1-digit by 1-digit multiplier.]
Its building block is a 1-digit by 1-digit multiplier computing
a·b + c + d = B·e + f (8.2)
(Fig. 8.1a). If B = 2, it amounts to a 2-input AND gate and a 1-digit adder (Fig. 8.1b).
An n-digit by 1-digit multiplier made up of n 1-digit by 1-digit multipliers is shown in Fig. 8.2. It computes
z = x·b + u + d, (8.3)
where x and u are n-digit numbers, b and d are 1-digit numbers, and z is an (n + 1)-digit number. Observe that the maximum value of z is
(B^n − 1)·(B − 1) + B^n − 1 + B − 1 = B^(n+1) − 1.
[Fig. 8.2: n-digit by 1-digit multiplier: a row of 1-digit by 1-digit multipliers with inputs x_{n-1}, ..., x_0, u_{n-1}, ..., u_0, b and d, producing z_n, z_{n-1}, ..., z_0.]
Using the iterative circuit of Fig. 8.2 as a computation resource, the computation of (8.1) amounts to computing the m n-digit by 1-digit products
z^(0) = x·y_0 + u + v_0,
z^(1) = (x·y_1 + v_1)·B,
z^(2) = (x·y_2 + v_2)·B^2,
...
z^(m−1) = (x·y_{m−1} + v_{m−1})·B^(m−1), (8.4)
and the sum
z = z^(0) + z^(1) + ... + z^(m−1). (8.5)
For that, one of the multioperand adders of Sect. 7.7 can be used. As an example, if Algorithm 7.2 is used, then z is computed as follows.
Algorithm 8.1: Multiplication, right to left algorithm
[Fig. 8.3: combinational multiplier (n = 4, m = 3): three rows of n-digit by 1-digit multipliers; row j multiplies x by y_j and adds v_j, passing the partial result z^(j) to the next row; the outputs are z_6, ..., z_0.]
The following VHDL model describes the circuit of Fig. 8.3 (B = 2).
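Alongside the VHDL model, the n-digit by 1-digit primitive can be checked with a short Python model (an illustrative sketch, not the book's code; B = 2 by default):

```python
def mul_1digit(x, b, u, d, n, B=2):
    """n-digit by 1-digit primitive of Fig. 8.2: z = x*b + u + d,
    computed digit by digit as the cell array does; z has n+1 digits."""
    carry, z = d, 0
    for i in range(n):
        xi = (x // B ** i) % B               # digit i of x
        ui = (u // B ** i) % B               # digit i of u
        t = xi * b + ui + carry
        z += (t % B) * B ** i
        carry = t // B
    return z + carry * B ** n                # the last carry is z_n

print(mul_1digit(13, 1, 5, 1, 4))  # 19 == 13*1 + 5 + 1
```

The combinational multiplier of Fig. 8.3 is simply m such rows chained together, one per digit of y.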
[Fig. 8.4: carry-save version of the combinational multiplier (n = 4, m = 3): the carries are kept in stored-carry form between rows and resolved at the end, producing z_6, ..., z_0.]
[Figures: multioperand-adder based multipliers: m rows of the z = x·b + u + d primitive, one per digit y_0, ..., y_{m-1} of y, produce the partial products z^(0), ..., z^(m-1), which are summed by a multioperand adder; for B = 2, groups of rows can be replaced by 7-to-3 and 3-to-2 counters followed by a ripple adder.]
[Figure: two consecutive rows (j and j + 1) of the multiplier array, showing how the cells of row j + 1 consume the outputs z^(j+1,j) and v_{j+1} of row j.]
[Fig. 8.8: (a) basic multiplier cell computing z_i from x_i, y_j and the carries c_{j,i}; (b) merged cell combining two rows (j + 1, j) with a multiplexer.]
(8.9)
with
z_n ∈ {0, 1, ..., B_2 − 1} and z_i ∈ {0, 1, ..., B_1 − 1}, ∀i in {0, 1, ..., n − 1}.
Finally, given two n-digit radix-B_1 numbers x and u, and two m-digit radix-B_2 numbers y and v, compute
z^(0) = x·y_0 + u + v_0,
z^(1) = (x·y_1 + v_1)·B_2,
z^(2) = (x·y_2 + v_2)·B_2^2,
... (8.10)
Fig. 8.9 Part of a 4n-bit by 2n-bit multiplier using 4-bit by 2-bit multiplication blocks
[Figure: each block multiplies a 4-bit slice of x by a 2-bit slice of y; the b, c, d, e, f ports of neighbouring blocks (j−1, i), (j, i) and (j, i−1) are chained together.]
The resulting structure is shown in Fig. 8.9. As before, a stored-carry encoding circuit could also be designed, but with an even more complex connection pattern. It is left as an exercise.
[Fig. 8.10: shift and add multiplier data path: (a) a shift register, initially storing y, supplies the digit y_j to the x·b + u + d block of Fig. 8.2; a register (initially u) holds the high part of the accumulator and a shift register (initially v) collects the low digits z_{m-1..0}, the high part being z_{n+m-1..m}; (b) the same circuit with a single shift register shared by y and the least significant bits of z.]
z^(0) = (u + x·y_0 + v_0)/B,
z^(1) = (z^(0) + x·y_1 + v_1)/B,
z^(2) = (z^(1) + x·y_2 + v_2)/B,
...
z^(m−1) = (z^(m−2) + x·y_{m−1} + v_{m−1})/B. (8.12)
Multiply the first equation by B, the second by B^2, and so on, and add the equations so obtained. The result is
z^(m−1)·B^m = u + (x·y_0 + v_0) + (x·y_1 + v_1)·B + ... + (x·y_{m−1} + v_{m−1})·B^(m−1) = x·y + u + v.
Algorithm 8.2: Shift and add multiplication
A data path for executing Algorithm 8.2 is shown in Fig. 8.10a. The following
VHDL model describes the circuit of Fig. 8.10a (B = 2).
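The recurrence (8.12) maps directly to a few lines of Python (an illustrative model of Algorithm 8.2, not the book's VHDL; the function name is ours):

```python
def shift_and_add(x, y, u, v, m, B=2):
    """Model of Algorithm 8.2 (shift and add multiplication): one digit
    of y per step; acc plays the role of z(j), and the shifted-out
    digits build the low part of the product. Returns x*y + u + v."""
    acc, low = u, 0
    for j in range(m):
        yj = (y // B ** j) % B
        vj = (v // B ** j) % B
        t = acc + x * yj + vj
        low += (t % B) * B ** j              # digit shifted out at step j
        acc = t // B                         # z(j) = (z(j-1) + x*yj + vj)/B
    return acc * B ** m + low

print(shift_and_add(13, 11, 5, 3, 4))  # 151 == 13*11 + 5 + 3
```

Each loop iteration corresponds to one clock cycle of the data path of Fig. 8.10a.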
The complete circuit also includes an m-state counter and a control unit. A complete generic model shift_and_add_multiplier.vhd is available at the authors' web page.
If v = 0, the same shift register can be used for storing both y and the least significant bits of z. The modified circuit is shown in Fig. 8.10b. A complete generic model shift_and_add_multiplier2 is also available.
[Fig. 8.11: sequential carry-save multiplier: a carry-save adder combines x·y_j with the stored-carry accumulator (registers initially u and v); the sum and carry vectors s_{n..1} and c_{n..1} are registered, bit s_0 is shifted into the low part of the product z_{m-1..0}, and a final adder resolves z_{n+m-1..m}.]
The complete circuit also includes an m-state counter and a control unit. A complete generic model sequential_CSA_multiplier.vhd is available at the authors' web page. The minimum clock period is equal to the delay of a 1-bit by 1-bit multiplier. Thus, the total computation time is equal to
T_multiplier(n, m) ≅ m·T_multiplier(1, 1) + T_adder(n) ≅ (n + m)·T_multiplier(1, 1). (8.14)
Comment 8.2
In sequential_CSA_multiplier.vhd the done flag is raised as soon as the final values of the adder inputs are available. A more correct control unit should raise the flag k cycles later, where k·T_clk is an upper bound of the n-bit adder delay. The value of k could be defined as a generic parameter (Exercise 8.3).
8.4 Integers
Given four B's complement integers
x = x_n x_{n−1} x_{n−2} ... x_0, y = y_m y_{m−1} y_{m−2} ... y_0, u = u_n u_{n−1} u_{n−2} ... u_0, v = v_m v_{m−1} v_{m−2} ... v_0,
belonging to the ranges
−B^n ≤ x < B^n, −B^m ≤ y < B^m, −B^n ≤ u < B^n, −B^m ≤ v < B^m,
then z = x·y + u + v belongs to the interval
−B^(n+m+1) ≤ z < B^(n+m+1).
Thus, z is a B's complement number of the form
z = z_{n+m+1} z_{n+m} z_{n+m−1} ... z_1 z_0.
Example 8.2
Assume that B = 10, n = 4, m = 3, x = 7918, y = −541, u = −7017, v = 742, and compute z = 7918·(−541) + (−7017) + 742. In 10's complement: x = 07918, y = 9459, u = 92983, v = 0742.
1. Express all operands with 9 digits: x = 000007918, y = 999999459, u = 999992983, v = 000000742.
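The 10's complement representations used in Example 8.2 are easy to verify in software (a quick Python check of the arithmetic, not part of the book):

```python
def tens_complement(value, digits):
    """B's complement (B = 10) representation of `value` on `digits` digits."""
    return value % 10 ** digits

# Operands of Example 8.2
x, y, u, v = 7918, -541, -7017, 742
z = x * y + u + v                 # = -4289913
print(tens_complement(z, 9))      # 9-digit 10's complement of z: 995710087
```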
[Figure: combinational implementation of the modified shift and add algorithm (B = 2) as an array of (j, i) cells, each computing x_j·y_i + c + d = 2e + f, with inputs u_j, v_i and outputs z_0 … z_6.]
z_{m-1} = (z_{m-2} + x·y_{m-1} + v_{m-1})/B,
z_m = (z_{m-1} − (x·y_m + v_m))/B.  (8.16)
Multiply the first equation by B, the second by B², and so on, and add the m + 1 equations so obtained. The result is

z_m·B^{m+1} = u + (x·y_0 + v_0) + (x·y_1 + v_1)·B + … + (x·y_{m-1} + v_{m-1})·B^{m-1} − (x·y_m + v_m)·B^m = x·y + u + v.
Algorithm 8.3: Modified shift and add multiplication
The computation primitives are

z = u + x·b + d,  (8.17)

and

z = u − x·b,  (8.18)

where

−B^n ≤ x < B^n, −B^n ≤ u < B^n, 0 ≤ b < B, 0 ≤ d < B.

Thus, in the first case,

−B^{n+1} ≤ z < B^{n+1},

and in the second case

−B^{n+1} + B − 1 ≤ z < B^{n+1},

so that in both cases z is an (n + 2)-digit B's complement integer and natural(z) = z mod 2B^{n+1}.
The first primitive (8.17) is implemented by the circuit of Fig. 8.13 and the second (8.18) by the circuit of Fig. 8.14. In both circuits, z_{n+1} is computed modulo 2.
As an example, the combinational circuit of Fig. 8.15 implements Algorithm 8.3 (with n = m = 2). Its cost and computation time are practically the same as in the case of a ripple-carry multiplier for natural numbers. It can be described by the following VHDL model.
[Figures 8.13 and 8.14: circuits implementing the two primitives; the circuit of Fig. 8.14 complements the digits of x with (B−1)'s complement blocks before the multiplication by b.]
Fig. 8.15 Combinational multiplier for integers (B = 2, m = n = 2)
[Figure 8.16: combinational multiplier for integers (n = 3, m = 2); the cells along the last row and column are nand cells.]
z = x·y = X_0·Y_0 + x_n·y_m·2^{n+m} − x_n·Y_0·2^n − y_m·X_0·2^m.

The (n + m + 2)-bit 2's complement representations of −x_n·Y_0·2^n and −y_m·X_0·2^m are

(2^{m+1} + 2^m + (x_n·y_{m-1})'·2^{m-1} + … + (x_n·y_0)'·2^0 + 1)·2^n mod 2^{n+m+2},

and

(2^{n+1} + 2^n + (y_m·x_{n-1})'·2^{n-1} + … + (y_m·x_0)'·2^0 + 1)·2^m mod 2^{n+m+2},

where (a·b)' stands for the complement (nand) of a·b, so that the representation of x_n·y_m·2^{n+m} − x_n·Y_0·2^n − y_m·X_0·2^m is

2^{n+m+1} + x_n·y_m·2^{n+m} + (x_n·y_{m-1})'·2^{n+m-1} + … + (x_n·y_0)'·2^n + 2^n + (y_m·x_{n-1})'·2^{n+m-1} + … + (y_m·x_0)'·2^m + 2^m mod 2^{n+m+2}.
A simple modification of the combinational multipliers of Figs. 8.3 and 8.4 allows computing x·y, where x is an (n + 1)-bit 2's complement integer and y an (m + 1)-bit 2's complement integer. An example is shown in Fig. 8.16 (n = 3, m = 2). The nand multiplication cells are similar to that of Fig. 8.1b, except that the AND gate is replaced by a NAND gate [1].
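The sign-splitting identity that justifies this modification can be verified exhaustively for small word lengths (a Python check of the algebra only; the function name is ours):

```python
def baugh_wooley_decomposition(x, y, n, m):
    """Split 2's complement x ((n+1)-bit) and y ((m+1)-bit) as
    x = -x_n*2**n + X0 and y = -y_m*2**m + Y0, then rebuild the product
    from the four partial terms used in the text."""
    xn, X0 = int(x < 0), x % (1 << n)   # sign bit and n-bit tail
    ym, Y0 = int(y < 0), y % (1 << m)
    return (X0 * Y0 + xn * ym * (1 << (n + m))
            - xn * Y0 * (1 << n) - ym * X0 * (1 << m))
```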
The following VHDL model describes the circuit of Fig. 8.16.
so that all coefficients y'_i belong to {−1, 0, 1}. Then y can be represented under the form

y = y'_m·2^m + y'_{m-1}·2^{m-1} + … + y'_1·2 + y'_0,

the so-called Booth's encoding of y (Booth [2]). Unlike the 2's complement representation, in which y_m has a specific function, all coefficients y'_i have the same function. Formally, the Booth's representation of an integer is the same as the binary representation of a natural. The basic multiplication algorithm (Algorithm 8.1), with v = 0, can be used.
Algorithm 8.4: Booth multiplication, z = x·y + u
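Booth's recoding and the resulting multiplication are easy to model in software (a Python sketch with the usual recoding rule y'_i = y_{i-1} − y_i; function names are ours, not the book's):

```python
def booth_digits(y, m):
    """Booth encoding of an (m+1)-bit 2's complement integer y:
    y'_i = y_{i-1} - y_i (with y_{-1} = 0), each y'_i in {-1, 0, 1}."""
    bits = [(y >> i) & 1 for i in range(m + 1)]
    digits, prev = [], 0
    for b in bits:
        digits.append(prev - b)
        prev = b
    return digits                 # least significant digit first

def booth_multiply(x, y, u, m):
    """Algorithm 8.4 in software: z = x*y + u, adding x*y'_i*2**i."""
    z = u
    for i, d in enumerate(booth_digits(y, m)):
        z += x * d * (1 << i)
    return z
```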
[Figure 8.17: radix-4 Booth multiplier — according to the code formed by two bits of y plus the previous one, a multiplexer selects 2x (011), x (010, 001), 0 (111, 000), −x (110, 101) or −2x (100); an (n + 1)-bit adder accumulates, a register initially holds u and a shift register initially holds y.0; outputs z_{n+m+1..m+1} and z_{m..0}.]
The computation time is approximately equal to ((m + 1)/2)·T_adder(n + 3). With respect to a radix-2 shift and add multiplier (Sect. 8.2.1), the computation time has been divided by 2.
The following VHDL model describes the circuit of Fig. 8.17.
Fig. 8.18 LUT implementation of a k-bit by n-bit constant multiplier
[Figure 8.19: k-bit by n-bit constant multiplier (k = 6) — the component of Fig. 8.18 produces w_{n+5..0} from b_{5..0}, and an (n + 6)-bit adder adds u.]
The circuit of Fig. 8.18 can be used as a component for generating constant multipliers. As an example, a sequential n-bit by m-bit constant multiplier is synthesized. First define a component similar to that of Fig. 8.2, with x constant. It computes z = c·b + u, where c is an n-bit constant natural, b a k-bit natural, and u an n-bit natural. The maximum value of z is

(2^n − 1)·(2^k − 1) + 2^n − 1 = 2^{n+k} − 2^k,

so it is an (n + k)-bit number. It consists of a k-bit by n-bit multiplier (Fig. 8.18) and an (n + k)-bit adder (Fig. 8.19).
Finally, the circuit of Fig. 8.19 can be used to generate a radix-2^k shift and add multiplier that computes z = c·y + u, where c is an n-bit constant natural, y an m-bit natural, and u an n-bit natural. The maximum value of z is

(2^n − 1)·(2^m − 1) + 2^n − 1 = 2^{n+m} − 2^m,

so z is an (n + m)-bit number. Assume that the radix-2^k representation of y is Y_{m/k−1} Y_{m/k−2} … Y_0, where each Y_i is a k-bit number. The circuit implements the following set of equations:

z_0 = (u + c·Y_0)/2^k,
z_1 = (z_0 + c·Y_1)/2^k,
z_2 = (z_1 + c·Y_2)/2^k,
…
z_{m/k−1} = (z_{m/k−2} + c·Y_{m/k−1})/2^k.  (8.19)

Thus,

z_{m/k−1}·2^{k·(m/k)} = u + c·Y_0 + c·Y_1·2^k + … + c·Y_{m/k−1}·2^{k·(m/k−1)},

that is to say

z_{m/k−1}·2^m = c·y + u.
The circuit is shown in Fig. 8.20.
The computation time is approximately equal to

T ≅ (m/k)·(T_LUT-k + T_adder(n + k)).
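The recurrence (8.19) retires k bits of y per step. A minimal software sketch of the same arithmetic (Python, not the book's VHDL; the function name is ours):

```python
def constant_mult_radix2k(c, y, u, m, k):
    """Radix-2^k shift-and-add constant multiplier: z = c*y + u.
    y is an m-bit natural processed k bits at a time (m divisible by k)."""
    acc, out = u, 0
    for j in range(m // k):
        Yj = (y >> (j * k)) & ((1 << k) - 1)       # next radix-2^k digit
        acc += c * Yj                              # LUT output + running sum
        out |= (acc & ((1 << k) - 1)) << (j * k)   # k bits retired per step
        acc >>= k
    return (acc << m) | out
```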
[Figure 8.20: sequential radix-2^k constant multiplier — the component of Fig. 8.19 processes one digit Y_i per cycle; a register initially holds u and a shift register initially holds y; outputs z_{n+k-1..k}, z_{k-1..0} and finally z_{n+m-1..m}, z_{m-1..0}.]
The following VHDL model describes the circuit of Fig. 8.20 (k = 6).
n   m   LUTs  Delay
8   8   96    13.29
16  16  384   28.26
32  16  771   36.91
32  32  1536  57.46
64  32  3073  74.12
64  64  6181  119.33
n   m   LUTs  Delay
8   8   102   8.05
16  16  399   15.42
32  16  788   16.99
32  32  1580  29.50
64  32  3165  32.08
64  64  6354  60.90
n   m   LUTs  Delay
8   8   113   5.343
16  16  435   6.897
32  16  835   7.281
32  32  1668  7.901
64  64  6460  11.41
64  32  3236  9.535
32  64  3236  9.535
n   m   LUTs  DSPs  Delay
8   8   0     2     4.926
16  16  0     2     4.926
32  16  77    2     6.773
32  32  93    4     9.866
64  64  346   12    12.86
64  32  211   6     11.76
32  64  211   6     11.76
m   n   k   m/k  n/k  LUTs  Delay
16  16  8   2    2    452   10.23
16  16  4   4    4    448   17.40
32  32  16  2    2    1740  12.11
32  32  8   4    4    1808  20.29
64  32  16  4    2    3480  15.96
64  64  16  4    4    6960  22.91
m   n   k   m/k  n/k  LUTs  Delay
16  16  8   2    2    461   8.48
16  16  4   4    4    457   10.09
32  32  16  2    2    1757  10.36
32  32  8   4    4    1821  11.10
64  32  16  4    2    3501  12.32
64  64  16  4    4    6981  14.93
m   n   k   m/k  n/k  DSPs  LUTs  Delay
16  16  8   2    2    8     0     12.58
16  16  4   4    4    32    0     27.89
32  32  16  2    2    8     0     12.58
32  32  8   4    4    32    0     27.89
64  32  16  4    2    16    0     16.70
64  64  16  4    4    32    0     27.89
m   n   k   m/k  n/k  DSPs  LUTs  Delay
16  16  8   2    2    8     15    9.90
16  16  4   4    4    32    15    17.00
32  32  16  2    2    8     31    10.26
32  32  8   4    4    32    31    17.36
64  32  16  4    2    16    63    11.07
64  64  16  4    4    32    63    18.09
n   m   FFs  LUTs  Period  Total time
8   8   29   43    2.87    23.0
8   16  46   61    2.87    45.9
16  8   38   72    4.19    33.5
16  16  55   90    4.19    67.0
32  16  71   112   7.50    120.0
32  32  104  161   7.50    240.0
64  32  136  203   15.55   497.6
64  64  201  306   15.55   995.2
n   m   FFs  LUTs  Period  Total time
8   8   29   43    1.87    15.0
16  8   47   64    1.88    15.0
16  16  56   74    1.92    30.7
32  16  88   122   1.93    30.9
32  32  106  139   1.84    58.9
64  32  170  235   1.84    58.9
64  64  203  268   1.84    117.8
n   m   LUTs  Delay
8   8   179   12.49
8   16  420   18.00
16  8   421   20.41
16  16  677   25.86
32  16  1662  42.95
32  32  2488  55.69
n   m   LUTs  Delay
8   8   122   15.90
8   16  230   27.51
16  8   231   20.20
16  16  435   31.81
32  16  844   39.96
32  32  1635  62.91
n   m   LUTs  Delay
8   8   106   14.18
8   16  209   24.60
16  8   204   18.91
16  16  407   30.60
32  16  794   39.43
32  32  1586  62.91
Another option is the modified shift and add algorithm of Sect. 8.4.2 (Fig. 8.15;
Table 8.12).
In Table 8.13, examples of post correction implementations are reported.
As a last option, several Booth multipliers have been implemented
(Table 8.14).
n   m   LUTs  Delay
8   8   188   13.49
8   16  356   25.12
16  8   332   13.68
16  16  628   25.31
32  16  1172  25.67
32  32  2276  49.09
n   m   FFs  LUTs  Period  Total time
8   9   25   58    2.90    26.1
8   17  34   68    2.90    49.3
16  9   33   125   3.12    28.1
16  17  42   135   3.12    53.0
32  17  58   231   3.48    59.2
32  33  75   248   3.48    114.8
64  33  107  440   4.22    139.1
64  65  140  473   4.22    274.0
8.7 Exercises
1. Generate the VHDL model of a mixed-radix parallel multiplier (Sect. 8.2.4).
2. Synthesize a 2n-bit by 2n-bit parallel multiplier using n-bit by n-bit multipliers
as building blocks.
3. Modify the VHDL model sequential_CSA_multiplier.vhd so that the done flag
is raised when the final result is available (Comment 8.2).
4. Generate the VHDL model of a carry-save multiplier with post correction
(Sect. 8.4.3).
5. Synthesize a sequential multiplier based on Algorithm 8.3.
6. Synthesize a parallel constant multiplier (Sect. 8.5).
7. Generate models of constant multipliers for integers.
8. Synthesize a constant multiplier that computes z = c_1·y_1 + c_2·y_2 + … + c_s·y_s + u.
References
1. Baugh CR, Wooley BA (1973) A two's complement parallel array multiplication algorithm. IEEE Trans Comput C-31:1045–1047
2. Booth AD (1951) A signed binary multiplication technique. Q J Mech Appl Math 4:126–140
3. Dadda L (1965) Some schemes for parallel multipliers. Alta Frequenza 34:349–356
4. Wallace CS (1964) A suggestion for fast multipliers. IEEE Trans Electron Comput EC-13:14–17
Chapter 9
Dividers
x = q·y + r, where |r| < |y|.  (9.1)

In fact, Eq. (9.1) has two solutions: one with r ≥ 0 and another with r < 0.
[Figure 9.1: Robertson diagram — r_{i+1} as a function of B·r_i, with diagonals r_{i+1} = B·r_i − k·y for k = −B, −(B−1), −(B−2), …, −1, 0, 1, …, B−2, B−1, B.]

B·r_0 = q_1·y + r_1, B·r_1 = q_2·y + r_2, …, B·r_{p-1} = q_p·y + r_p, with r_0 = x.  (9.2)
From (9.2) the following relation is obtained:

x·B^p = (q_1·B^{p-1} + q_2·B^{p-2} + … + q_{p-1}·B^1 + q_p·B^0)·y + r_p, with −y ≤ r_p < y.  (9.3)

Thus,

q = q_1·B^{p-1} + q_2·B^{p-2} + … + q_{p-1}·B^1 + q_p·B^0 and r = r_p.
At each step of (9.2), r_{i+1} and q_{i+1} must satisfy

r_{i+1} = B·r_i − q_{i+1}·y, where −y ≤ r_{i+1} < y.  (9.4)
The Robertson diagram of Fig. 9.1 defines the set of possible solutions: the dotted lines define the domain {(B·r_i, r_{i+1}) | −B·y ≤ B·r_i < B·y and −y ≤ r_{i+1} < y}, and the diagonals correspond to the equations r_{i+1} = B·r_i − k·y with k ∈ {−B, −(B−1), …, −1, 0, 1, …, B−1, B}. If k·y ≤ B·r_i < (k+1)·y, there are two possible solutions for q_{i+1}, namely k and k+1. To the first one corresponds a non-negative value of r_{i+1}, and to the second one a negative value.
If the values q_{i+1} = −B and q_{i+1} = B are discarded, the solutions of (9.4) are the following:

if B·r_i ≥ (B−1)·y, then q_{i+1} = B−1;
if k·y ≤ B·r_i < (k+1)·y, then q_{i+1} = k or k+1, for all k in {−(B−1), …, −1, 0, 1, …, B−2};
if B·r_i < −(B−1)·y, then q_{i+1} = −(B−1);

so all q_i's are radix-B digits or negative radix-B digits. The quotient q is obtained under the form
q = q_1·B^{p-1} + q_2·B^{p-2} + … + q_p·B^0, with q_i ∈ {−(B−1), …, B−1}.  (9.5)

[Figure 9.2: P-D diagram — B·r_i as a function of y; the lines B·r_i = k·y, for k = −(B−1), …, B−1, delimit the regions where q_{i+1} = k or k+1 can be chosen.]
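One admissible digit choice per step of the recurrence is the ordinary quotient digit, which always keeps the remainder non-negative. A short software model of the digit-recurrence (Python sketch; the function name is ours):

```python
def digit_recurrence_division(x, y, p, B=10):
    """One valid digit choice per step of (9.2)-(9.4): pick
    q_{i+1} = floor(B*r_i / y), which keeps 0 <= r_{i+1} < y,
    so that x * B**p == q*y + r on exit."""
    assert 0 <= x < y
    r, q = x, 0
    for _ in range(p):
        r *= B
        d = r // y               # one of the two admissible digits
        q = q * B + d
        r -= d * y
    return q, r
```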
Fig. 9.3 Robertson diagram (a) and P-D diagram (b), when B = 2
Then

q'_i ∈ {0, 1} and q = −2^p + q'_0·2^p + q'_1·2^{p-1} + q'_2·2^{p-2} + … + q'_{p-1}·2^1 + 2^0.

In other words, the vector

1 q'_0 q'_1 q'_2 … q'_{p-1} 1

is the 2's complement representation of q. The corresponding modified algorithm is the following.
Algorithm 9.3: Division, non-restoring algorithm, version 2
(9.6)
Fig. 9.4 Non-restoring divider
The complete circuit also includes a p-state counter and a control unit. A complete generic model non_restoring.vhd is available at the Authors' web page.
T_divider(n, p) ≅ p·T_adder(n).  (9.7)
Comment 9.1
The least significant bit of the quotient is always equal to 1, while the sign of the remainder could be different from that of the dividend. For some applications a final correction step could be necessary.
Fig. 9.5 Restoring divider
The complete circuit also includes a p-state counter and a control unit. A complete generic model restoring.vhd is available at the Authors' web page.
[Figure 9.6: P-D diagram of the binary SRT divider — 2·r_i as a function of y; the lines 2·r_i = ±y delimit the regions where q_{i+1} = −1, 0 or 1 (two choices, 0 or ±1, in the overlap regions).]
[Figures 9.7 and 9.8: on-the-fly conversion cells (registers q and qm, initially 0 and 1, updated according to the quotient digit) and SRT divider data path — the quotient selection function examines r_{n..n-2}, an adder/subtractor computes 2r − y, 2r or 2r + y, and the remainder register initially holds x.]
The complete circuit also includes a p-state counter and a control unit. A complete generic model srt_divider.vhd is available at the Authors' web page.
The computation time is approximately the same as that of a non-restoring
divider (9.7).
(9.8)
Fig. 9.9 SRT divider with carry-save adder
Thus v_i ≤ c_i/2^{n-2} + s_i/2^{n-2} = r_i/2^{n-2} and v_i > (c_i/2^{n-2} − 1) + (s_i/2^{n-2} − 1) = r_i/2^{n-2} − 2, so that (9.8), with y_min = 2^{n-1}, holds. Lower and upper bounds of v_i are computed as follows:

v_i ≤ r_i/2^{n-2} < 2^n/2^{n-2} = 4 and v_i > r_i/2^{n-2} − 2 ≥ −2^n/2^{n-2} − 2 = −6.

Thus, v_i belongs to the range −5 ≤ v_i ≤ 3 and is a 4-bit 2's complement integer. In order to compute v_i, both c_i and s_i must be represented with 4 + (n − 2) = n + 2 bits.
The complete circuit is shown in Fig. 9.9.
The following VHDL model describes the circuit of Fig. 9.9. The dividend x = x_n x_{n-1} … x_0 and the remainder r = r_n r_{n-1} … r_0 are (n + 1)-bit 2's complement integers, the divisor y = 1 y_{n-2} … y_0 is an n-bit normalized natural, the quotient q = q_0 q_1 … q_p is a (p + 1)-bit 2's complement integer, and condition (9.6) must hold.
The complete circuit also includes a p-state counter and a control unit. A complete generic model srt_csa_divider.vhd is available at the Authors' web page. The computation time is approximately equal to

T_divider(n, p) ≅ p·T_adder(1) + T_adder(n) ≅ (n + p)·T_adder(1).  (9.9)
[Figure 9.10: P-D diagram of the radix-4 SRT divider — 4·r_i as a function of y (2^{n-1} ≤ y < 2^n), with the lines 4·r_i = ±y, ±2y, ±3y delimiting the regions where q_{i+1} = 0 or ±1, ±1 or ±2, ±2 or ±3, ±3.]
if 2^{n+1} ≤ 4·r_i, then q_{i+1} = 3;
if 3·2^{n-1} ≤ 4·r_i < 2^{n+1} and y < 3·2^{n-2}, then q_{i+1} = 3;
if 3·2^{n-1} ≤ 4·r_i < 2^{n+1} and y ≥ 3·2^{n-2}, then q_{i+1} = 2;
if 2^n ≤ 4·r_i < 3·2^{n-1}, then q_{i+1} = 2;
if 2^{n-1} ≤ 4·r_i < 2^n, then q_{i+1} = 1;
if −2^{n-1} ≤ 4·r_i < 2^{n-1}, then q_{i+1} = 0;
if −2^n ≤ 4·r_i < −2^{n-1}, then q_{i+1} = −1;
if −3·2^{n-1} ≤ 4·r_i < −2^n, then q_{i+1} = −2;
Table 9.1 Digit selection
r_n r_{n-1} r_{n-2} r_{n-3}   r/2^{n-2}   q_{i+1}
0000                          0           0
0001                          1           1
0010                          2           2
0011                          3           2 if y_{n-2} = 1, 3 if y_{n-2} = 0
0100                          4           3
0101                          5           3
0110                          6           3
0111                          7           3
1000                          −8          −3
1001                          −7          −3
1010                          −6          −3
1011                          −5          −3
1100                          −4          −3 if y_{n-2} = 0, −2 if y_{n-2} = 1
1101                          −3          −2
1110                          −2          −1
1111                          −1          0
[Figure 9.11: on-the-fly conversion for the radix-4 quotient — registers q and qm, initially 0…00 and 0…03, are updated from q^{(i)} and qm^{(i)} according to the value and sign of the digit q_i.]
Fig. 9.12 Radix-4 SRT divider
The complete circuit also includes a p-state counter and a control unit. A complete generic model radix_four_divider.vhd is available at the Authors' web page.
the case of any radix B [6]. Nevertheless, the corresponding FPGA implementations are generally very costly in terms of used cells.
Another method for executing radix-B division consists of converting the radix-B operands to binary operands, executing the division with any type of binary divider, and converting the binary results to radix-B results. Radix-B to binary and binary to radix-B converters are described in Chap. 10. Decimal dividers (B = 10) based on this method are reported in Chap. 12.
An alternative option, similar to the previous one, is to use a classical binary algorithm (Sect. 9.2) and to execute all the operations in base B. Throughout this section it is assumed that B is even.
Consider the following digit-recurrence algorithm (Algorithm 9.1 with B = 2):
Algorithm 9.7: Division, binary digit-recurrence algorithm
(9.10)
Algorithm 9.7 can be executed whatever the radix used for representing the numbers. If B's complement radix-B representation is used, then the following operations must be available: radix-B doubling, adding, subtracting and halving.

Algorithm 9.8: Division, binary digit-recurrence algorithm, version 2
[Figure 9.13: data path for the binary digit-recurrence algorithm — a doubling circuit (×2) computes 2·r_{i-1}, adders add −y and ±ulp under control of the quotient selection, and registers initially hold x, 0 and 2^{-1}.]
[Figure 9.14: radix-B doubling circuit with carry input — digitwise ×2 blocks produce sum and carry digits that are combined into z_n … z_0, with c_in = 0.]
[Figure 9.15: radix-B halving circuit — each digit is shifted and its least significant bit contributes x_i0·(B/2) to the next lower digit through (k−1)-bit adders.]
If all operations were executed with full precision, then the maximum error after i steps would be given by the following relation:

|y − y_i| < 1/2^{2^i}.  (9.11)
Another option is the Goldschmidt algorithm. Given two real numbers x and y belonging to the intervals 1 ≤ x < 2 and 1 ≤ y < 2, it generates two sequences of real numbers a_0, a_1, a_2, … and b_0, b_1, b_2, …, and the second one constitutes a sequence of successive approximations of x/y.

Algorithm 9.10: Goldschmidt's algorithm
If all operations were executed with full precision, then the maximum error after
i steps would be defined by the following relation:
|x/y − b_i| < 1/2^{2^i − 1}.  (9.12)
The following VHDL model describes the corresponding circuit; n is the number
of fractional bits of x and y, and p is the number of fractional bits of the internal
variables a, b and c.
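The idea behind Goldschmidt division can be sketched in a few lines of Python (one common formulation, multiplying numerator and denominator by the same factor each step; the exact statement of Algorithm 9.10 may differ in variable naming):

```python
def goldschmidt(x, y, steps):
    """Goldschmidt division: multiply numerator and denominator by the
    same factor f = 2 - d each step, driving the denominator to 1.
    For 1 <= x, y < 2 the numerator converges to x/y (fastest when y
    is close to 1)."""
    n, d = x, y
    for _ in range(steps):
        f = 2.0 - d
        n, d = n * f, d * f
    return n
```

Unlike the digit-recurrence dividers, the two multiplications of each step are independent, which is what makes the method attractive for pipelined multipliers.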
n   p   FFs  LUTs  Period  Total time
8   8   22   28    2.21    19.9
16  16  39   46    2.43    41.3
24  27  59   64    2.63    73.6
32  16  55   78    2.82    47.9
32  32  72   79    2.82    93.1
53  56  118  123   3.30    188.1
64  32  104  143   3.55    117.2
64  64  137  144   3.55    230.8
n   p   FFs  LUTs  Period  Total time
8   8   21   26    2.46    22.1
16  16  38   44    2.68    45.6
24  27  58   62    2.90    81.2
32  16  54   76    3.08    52.4
32  32  71   77    3.08    101.6
53  56  117  121   3.56    202.9
64  32  103  141   3.82    126.1
64  64  136  142   3.82    248.3
n   p   FFs  LUTs  Period  Total time
8   8   33   44    2.43    21.9
16  16  58   78    2.57    43.7
24  27  89   118   2.76    77.3
32  16  75   110   2.94    50.0
32  32  107  143   2.94    97.0
53  56  177  235   3.42    194.9
64  32  139  207   3.66    120.8
64  64  204  272   3.66    237.9
The following dividers have been implemented:
non-restoring dividers,
restoring dividers,
binary SRT dividers,
binary SRT dividers with carry-save adders.
Radix-4 SRT dividers have also been implemented. The quotient is a (2p + 1)-bit 2's complement integer (Table 9.6).
Examples of decimal dividers are given in Chap. 11.
Table 9.5 Binary SRT dividers with carry-save adders
n   p   FFs  LUTs  Period  Total time
8   8   40   86    2.81    25.3
16  16  73   143   2.86    48.6
24  27  112  215   2.86    80.1
32  16  105  239   2.86    48.6
32  32  138  272   2.86    94.4
53  56  229  448   2.87    163.6
64  32  202  464   2.87    94.7
64  64  267  529   2.87    186.6
n   p   2p   FFs  LUTs  Period  Total time
8   4   8    30   70    3.53    17.7
16  8   16   55   112   3.86    34.7
24  14  28   88   170   4.08    61.2
32  8   16   71   175   4.27    38.4
32  16  32   104  209   4.27    72.6
53  28  56   174  343   4.76    138.0
64  16  32   136  337   5.01    85.2
64  32  64   201  402   5.01    165.3
n   p   (?)  FFs  LUTs  DSPs  Period  Total time
8   10  4    25   38    2     5.76    23.0
8   16  5    38   42    2     5.68    28.4
16  18  5    43   119   2     7.3     36.5
16  32  7    71   76    8     10.5    73.5
24  27  6    62   89    8     10.5    63.0
53  56  8    118  441   24    14.42   115.4
9.6 Exercises
1. Synthesize several types of combinational dividers: restoring, non-restoring, SRT, radix-2^k.
2. Generate a generic VHDL model of a Newton-Raphson inverter.
References
1. Freiman CV (1961) Statistical analysis of certain binary division algorithms. IRE Proc 49:91–103
2. Robertson JE (1958) A new class of division methods. IRE Trans Electron Comput EC-7:218–222
3. Cocke J, Sweeney DW (1957) High speed arithmetic in a parallel device. IBM technical report, Feb 1957
4. Deschamps JP, Sutter G (2010) Decimal division: algorithms and FPGA implementations. In: 6th southern conference on programmable logic (SPL)
5. Lang T, Nannarelli A (2007) A radix-10 digit-recurrence division unit: algorithm and architecture. IEEE Trans Comput 56(6):1–13
6. Deschamps JP (2010) A radix-B divider. Contact: [email protected]
Chapter 10
Other Operations
This chapter is devoted to arithmetic functions and operations other than the four
basic ones. The conversion of binary numbers to radix-B ones, and conversely, is
dealt with in Sects. 10.1 and 10.2. An important particular case is B = 10 as human
interfaces generally use decimal representations while internal computations are
performed with binary circuits. In Sect. 10.3, several square rooting circuits are
presented, based on digit-recurrence or convergence algorithms. Logarithms and
exponentials are the topics of Sects. 10.4 and 10.5. Finally, the computation of
trigonometric functions, based on the CORDIC algorithm [2, 3], is described in
Sect. 10.6.
Given an n-bit natural

x = x_{n-1}·2^{n-1} + x_{n-2}·2^{n-2} + … + x_1·2 + x_0,  (10.1)

and taking into account that B > 2, the bits x_i can be considered as radix-B digits, and a simple conversion method consists of computing (10.1) in radix B.

Algorithm 10.1: Binary to radix-B conversion
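The Horner-style evaluation of (10.1) in radix B can be modeled digit-exactly in software (a Python sketch of the arithmetic performed by the converter; the function name is ours):

```python
def binary_to_radix_b(x, n, B=10):
    """Horner evaluation of (10.1) in radix B: process the bits of x
    from most significant to least, doubling and adding in base B."""
    digits = [0]                      # radix-B accumulator, LSD first
    for i in reversed(range(n)):
        carry = (x >> i) & 1          # c_in = x_{n-i}
        for j in range(len(digits)):  # w <- 2*w + bit, digit by digit
            t = 2 * digits[j] + carry
            digits[j], carry = t % B, t // B
        if carry:
            digits.append(carry)
    return digits[::-1]               # most significant digit first
```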
Fig. 10.1 Binary to radix-B converter

[Figure 10.1: the component of Fig. 9.14, with carry input x_{n-i}, feeds an m-digit radix-B register (initially 0); outputs y_{m-1..0} (base B).]
In order to compute z·2 + x_{n-i}, the circuit of Fig. 9.14, with c_in = x_{n-i} instead of 0, can be used. A sequential binary to radix-B converter is shown in Fig. 10.1. It is described by the following VHDL model.
The complete circuit also includes an n-state counter and a control unit. A complete model BinaryToDecimal2.vhd is available at the Authors' web page (B = 10).
The computation time of the circuit of Fig. 10.1 is equal to n·T_clk, where T_clk must be greater than T_LUT-k (Sect. 9.3). Thus

T_binary-to-radix-B ≅ n·T_LUT-k.  (10.2)
10.2 Radix-B to Binary Conversion
[Figure 10.2: radix-B to binary converter — the component of Fig. 9.15 processes y_{m-1..0}; the bits w_0 are shifted into a binary shift register.]
so

z = q_n·2^n + r_{n-1}·2^{n-1} + r_{n-2}·2^{n-2} + … + r_1·2 + r_0.

As z is smaller than 2^n, then q_n = 0, and the binary representation of z is constituted by the set of remainders r_{n-1} r_{n-2} … r_1 r_0.

Algorithm 10.2: Radix-B to binary conversion

Observe that if q_i = q_{i+1}·2 + r_i, then q_i·(B/2) = q_{i+1}·B + r_i·(B/2), where r_i·(B/2) < 2·(B/2) = B.

Algorithm 10.3: Radix-B to binary conversion, version 2
In order to compute (B/2)·q_i, the circuit of Fig. 9.15 can be used. A sequential radix-B to binary converter is shown in Fig. 10.2. It is described by the following VHDL model.
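The division-by-2 recurrence of Algorithm 10.2 can be modeled in software (a Python sketch; the function name is ours). Each pass over the digits performs one radix-B halving and produces one binary digit:

```python
def radix_b_to_binary(digits, B=10):
    """Algorithm 10.2 in software: repeatedly halve the radix-B number;
    the remainders r_0, r_1, ... are the binary digits of z."""
    q = list(digits)                  # most significant digit first
    bits = []
    while any(q):
        r = 0
        for j in range(len(q)):       # one radix-B division by 2
            t = r * B + q[j]
            q[j], r = t // 2, t % 2
        bits.append(r)                # next binary digit (LSB first)
    return sum(b << i for i, b in enumerate(bits))
```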
The complete circuit also includes an n-state counter and a control unit. A complete model DecimalToBinary2.vhd is available at the Authors' web page (B = 10).
The computation time of the circuit of Fig. 10.2 is equal to n·T_clk, where T_clk must be greater than T_LUT-k (Sect. 9.3). Thus

T_radix-B-to-binary ≅ n·T_LUT-k.  (10.3)
10.3 Square Rooters
such that

0 ≤ R_i < (1 + 2·Q_i)·2^{2(n-i)}.  (10.5)

The value of q_{n-i} is chosen in such a way that condition (10.5) holds. Consider two cases:

If R_{i-1} < (1 + 4·Q_{i-1})·2^{2(n-i)}, then q_{n-i} = 0, Q_i = 2·Q_{i-1}, R_i = R_{i-1}. As R_i = R_{i-1} < (1 + 4·Q_{i-1})·2^{2(n-i)} = (1 + 2·Q_i)·2^{2(n-i)} and R_i = R_{i-1} ≥ 0, condition (10.5) holds.

If R_{i-1} ≥ (1 + 4·Q_{i-1})·2^{2(n-i)}, then q_{n-i} = 1, Q_i = 2·Q_{i-1} + 1, R_i = R_{i-1} − (1 + 4·Q_{i-1})·2^{2(n-i)}, so that R_i ≥ 0 and R_i < (1 + 2·Q_{i-1})·2^{2(n-i+1)} − (1 + 4·Q_{i-1})·2^{2(n-i)} = (3 + 4·Q_{i-1})·2^{2(n-i)} = (1 + 2·Q_i)·2^{2(n-i)}.
Algorithm 10.4: Square root, restoring algorithm

Q_i is an i-bit number, R_i < (1 + 2·Q_i)·2^{2(n-i)} = Q_i&1&00…0 is a (2n − i + 1)-bit number, and P_{i-1} = (1 + 4·Q_{i-1})·2^{2(n-i)} = Q_{i-1}&01&00…0 a (2n − i + 2)-bit number.
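The two cases above translate directly into a software model of Algorithm 10.4 (a Python sketch, exact on integers; the function name is ours):

```python
def restoring_sqrt(X, n):
    """Restoring square root of a 2n-bit natural X (Algorithm 10.4).
    Invariant: R = X - Q**2 * 2**(2*(n-i)), kept in [0, (1+2Q)*2**(2*(n-i)))."""
    Q, R = 0, X
    for i in range(1, n + 1):
        P = (1 + 4 * Q) << (2 * (n - i))
        if R < P:
            Q = 2 * Q                   # q_{n-i} = 0
        else:
            Q, R = 2 * Q + 1, R - P     # q_{n-i} = 1
    return Q, R                         # Q = floor(sqrt(X)), R = X - Q*Q
```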
[Figure: data path of the restoring square rooter — an (n + 3)-bit subtractor computes r_{3n..2n-2} − Q&01, the sign of the difference selects q_{n-i}, and a (3n + 1)-bit register initially holds 000.x_{2n-1} … x_0.]
The only computational resource is an (n + 3)-bit subtractor, so that the computation time is approximately equal to n·T_adder(n). The complete circuit includes an n-bit counter and a control unit. A generic VHDL model SquareRoot.vhd is available at the Authors' web page.
Another equivalent algorithm is obtained if P_i, Q_i and R_i are replaced by

p_i = P_i/2^{2(n-i)+1}, q_i = Q_i/2^i, r_i = R_i/2^{2(n-i)}.
Algorithm 10.6: Square root, restoring algorithm, version 3
[Figure: data path of the non-restoring square rooter — an (n + 2)-bit adder/subtractor computes r_{3n+1..2n-2} ± Q&r_{3n+1}1, an n-bit shift register collects the result bits, and a (3n + 2)-bit register initially holds 000.x_{2n-1} … x_0.]
Comment 10.1
The previous method can also be used for computing the square root of a natural X = x_{2n-1} x_{2n-2} … x_1 x_0 with an accuracy of p bits: represent X as an (n + p)-bit fractional number x_{2n-1} x_{2n-2} … x_1 x_0.00…0 and use the preceding method.
[Figure: Newton-Raphson iteration on f(x) = x² − X — successive tangents from x_i intersect the axis at x_{i+1}.]
accuracy of p + n fractional bits, using any division algorithm, so that X·2^{n+p} = q·x_i·2^n + r, with r < x_i·2^n, and X = Q·x_i + R, where Q = q·2^{-p} and R = (r/x_i)·2^{-(n+p)} < 2^{-p}.
An example of implementation SquareRootNR4.vhd is available at the Authors' web page. The corresponding data path is shown in Fig. 10.6. The initial value x_0 must be defined in such a way that x_0² > X. In the preceding example, initial_y = x_0 is defined so that initial_y(n+p .. n+p−4)·2^{-2} is an approximation of the square root of X(2n−1 .. 2n−4) and that initial_y² is greater than X.
[Figure 10.6: Newton-Raphson square rooter — a table supplies x_0, a divider (start_div/div_done) computes X/x_i, an adder forms x_{i+1}, and an (n + p + 1)-bit register initially holds x_0.]
Comment 10.2
Every iteration step includes a division, an operation whose complexity is similar
to that of a complete square root computation using a digit recurrence algorithm.
Thus, this type of circuit is generally not time effective.
Another method is to first compute X^{-1/2}. A final multiplication computes X^{1/2} = X^{-1/2}·X. The following iteration can be used for computing X^{-1/2}:

x_{i+1} = (x_i/2)·(3 − x_i²·X),

where the initial value x_0 belongs to the range 0 < x_0 ≤ X^{-1/2}. The corresponding graphical construction is shown in Fig. 10.7.
The corresponding circuit does not include dividers, only multipliers and an adder.
The implementation of this second convergence algorithm is left as an exercise.
10.4 Logarithm

Given an n-bit normalized fractional number x = 1.x_{-1} x_{-2} … x_{-n}, compute y = log₂x with an accuracy of p fractional bits. As x belongs to the interval 1 ≤ x < 2, its base-2 logarithm is a non-negative number smaller than 1, so y = 0.y_{-1} y_{-2} … y_{-p}. If y = log₂x, then x = 2^{0.y_{-1} y_{-2} … y_{-p} …}, so that x² = 2^{y_{-1}.y_{-2} … y_{-p} …}. Thus

if x² ≥ 2: y_{-1} = 1 and x²/2 = 2^{0.y_{-2} … y_{-p} …};
if x² < 2: y_{-1} = 0 and x² = 2^{0.y_{-2} … y_{-p} …}.
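The squaring recurrence produces the bits of the logarithm one at a time; a minimal Python model (floating point here, where the circuit uses fixed point; the function name is ours):

```python
def log2_bits(x, p):
    """Digit-by-digit base-2 logarithm for 1 <= x < 2: square x;
    if the square reaches 2, the next bit of log2(x) is 1 and x is halved."""
    bits = []
    for _ in range(p):
        x = x * x
        if x >= 2.0:
            bits.append(1)
            x /= 2.0
        else:
            bits.append(0)
    return bits
```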
[Figure 10.7: graphical construction of the X^{-1/2} iteration. Figure 10.8: logarithm data path — an (n + 1)-bit register initially holds x; each cycle squares it and outputs one bit of z.]
The preceding algorithm can be executed by the data path of Fig. 10.8 to which
corresponds the following VHDL model.
10.5 Exponential

Given an n-bit fractional number x = 0.x_1 x_2 … x_n, compute y = 2^x with an accuracy of p fractional bits. As x = x_1·2^{-1} + x_2·2^{-2} + … + x_n·2^{-n}, then

2^x = (2^{2^{-1}})^{x_1}·(2^{2^{-2}})^{x_2}·…·(2^{2^{-n}})^{x_n}.

If all the constant values a_i = 2^{2^{-i}} are computed in advance, then the following algorithm computes 2^x.
Algorithm 10.10: Exponential 2x
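The product-of-constants scheme is easy to prototype with fixed-point arithmetic (a Python sketch: the table entries are truncated to m fractional bits, as the accuracy analysis below assumes; the function name is ours):

```python
def exp2_fractional(x_bits, m):
    """2^(0.x_1 x_2 ... x_n) as a product of precomputed constants
    a_i = 2**(2**-i), each kept with m fractional bits (truncated)."""
    scale = 1 << m
    z = scale                                     # running product, fixed point
    for i, bit in enumerate(x_bits, start=1):
        if bit:
            a_i = int(2 ** (2.0 ** -i) * scale)   # table entry, m frac bits
            z = (z * a_i) >> m                    # multiply and round down
    return z / scale
```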
The preceding algorithm can be executed by the data path of Fig. 10.9.
The problem is accuracy. Assume that all a_i's are computed with m fractional bits, so that the actual operand a'_i is equal to a_i − e_i, where e_i belongs to the range

0 ≤ e_i < 2^{-m}.  (10.6)

Consider the worst case, that is y = 2^{0.11…1}. Then the obtained value is y' = (a_1 − e_1)·(a_2 − e_2)·…·(a_n − e_n). If second or higher order products e_i·e_j·e_k … are not taken into account, then y' ≅ y − (e_1·a_2·…·a_n + a_1·e_2·…·a_n + … + a_1·…·a_{n-1}·e_n). As all products p_1 = a_2·…·a_n, p_2 = a_1·a_3·…·a_n, etc., belong to the range 1 < p_i < 2, and e_i to (10.6), then

0 ≤ y − y' < 2n·2^{-m}.  (10.7)
[Figure 10.9: exponential data path — a table supplies a_i, a multiplier updates a parallel register (initially 1.00…0) when x_{-i} = 1; x is held in a shift register.]
Relation (10.7) would define the maximum error if all products were computed exactly, but it is not the case. At each step the obtained product is rounded. Thus Algorithm 10.8 successively computes

z_2 > a'_1·a'_2 − 2^{-m},
z_3 > (a'_1·a'_2 − 2^{-m})·a'_3 − 2^{-m} = a'_1·a'_2·a'_3 − 2^{-m}·(1 + a'_3),
z_4 > (a'_1·a'_2·a'_3 − 2^{-m}·(1 + a'_3))·a'_4 − 2^{-m} = a'_1·a'_2·a'_3·a'_4 − 2^{-m}·(1 + a'_4 + a'_3·a'_4),
…
z_n > y' − 2^{-m}·(1 + 2·(n − 2)) > y' − 2n·2^{-m}.  (10.8)

Thus, from (10.7) and (10.8), the maximum error y − z_n is smaller than 4n·2^{-m}. In order to obtain the result y with p fractional bits, the following relation must hold true: 4n·2^{-m} ≤ 2^{-p}, and thus

m = p + log₂n + 2.  (10.9)

As an example, with n = 8 and p = 12, the internal data must be computed with m = 17 fractional bits.
The following VHDL model describes the circuit of Fig. 10.9.
powers is a constant array defined within a user package; it stores the fractional
part of ai with 24 bits:
The preceding algorithm can be executed by the data path of Fig. 10.9 in which
the table is substituted by the circuit of Fig. 10.10.
Once again, the problem is accuracy. In this case there is an additional problem: in order to get all coefficients a_i with an accuracy of m fractional bits, they must be computed with an accuracy of k > m bits. Algorithm 10.11 successively computes
[Figure 10.10: on-the-fly computation of the coefficients — a multiplier squares the register contents (initially a_n), producing a_{i-1} = a_i² at each step.]
a'_n > a_n − 2^{-k};
a'_{n-1} > (a_n − 2^{-k})² − 2^{-k} = a_n² − 2·a_n·2^{-k} + 2^{-2k} − 2^{-k} > a_{n-1} − 2^{-k}·(1 + 2·a_n);
a'_{n-2} > (a_{n-1} − 2^{-k}·(1 + 2·a_n))² − 2^{-k} > a_{n-2} − 2^{-k}·(1 + 2·a_{n-1} + 4·a_{n-1}·a_n);
a'_{n-3} > a_{n-3} − 2^{-k}·(1 + 2·a_{n-2} + 4·a_{n-2}·a_{n-1} + 8·a_{n-2}·a_{n-1}·a_n);

and so on. Finally,

a'_1 > a_1 − 2^{-k}·(2n − 3).

In conclusion, a_1 − a'_1 < 2^{-k}·(2n − 3) < 2^{n-k}. The maximum error is smaller than 2^{-m} if n − k ≤ −m, that is k ≥ n + m. Thus, according to (10.9),

k = n + p + log₂n + 2.

As an example, with n = 8 and p = 8, the coefficients a_i (Fig. 10.10) are computed with k = 21 fractional bits and z (Fig. 10.9) with 13 fractional bits.
A complete VHDL model Exponential2.vhd, in which a_n, expressed with k fractional bits, is a generic parameter, is available at the Authors' web page.
Comment 10.3
Given an n-bit fractional number x and a number b > 2, the computation of y = b^x, with an accuracy of p fractional bits, can be performed with Algorithm 10.10 if the constants a_i are defined as follows:

a_i = b^{2^{-i}}.

So, the circuit is the same, but for the definition of the table which stores the constants a_i. In particular, it can be used for computing e^x or 10^x.
The constants a_{Ri} and a_{Ii} are equal to cos 2^{-i} and sin 2^{-i}, respectively. The synthesis of the corresponding circuit is left as an exercise.
A more efficient algorithm, which does not include multiplications, is CORDIC [2, 3]. It is a convergence method based on the graphical construction of Fig. 10.11. Given a vector (x_i, y_i), a pseudo-rotation by a_i radians defines a rotated vector (x_{i+1}, y_{i+1}) where

x_{i+1} = x_i − y_i·tan a_i = (x_i·cos a_i − y_i·sin a_i)·(1 + tan² a_i)^{0.5},
y_{i+1} = y_i + x_i·tan a_i = (y_i·cos a_i + x_i·sin a_i)·(1 + tan² a_i)^{0.5}.

In the previous relations, x_i·cos a_i − y_i·sin a_i and y_i·cos a_i + x_i·sin a_i define the vector obtained after a (true) rotation by a_i radians. Therefore, if an initial vector (x_0, y_0) is rotated by successive angles a_0, a_1, …, a_{n-1}, then the final vector is (x_n, y_n) where
[Figure 10.12: CORDIC data path — shifters produce 2^{-i}·x and 2^{-i}·y, three adders/subtractors controlled by d_m update x, y and z, a table supplies tan^{-1}(2^{-i}), and registers initially hold x_0, 0 and z.]
angles is a constant array defined within a user package; it stores tan^{-1}(2^{-i}), for i up to 15, with 32 bits:
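The rotation-mode CORDIC loop is short enough to model directly (a Python sketch in floating point, where the circuit uses fixed point and a stored angle table; the function name is ours):

```python
import math

def cordic_sin_cos(z, steps=16):
    """Rotation-mode CORDIC: rotate (1/K, 0) by z radians using
    pseudo-rotations through the angles atan(2**-i)."""
    K = 1.0
    for i in range(steps):
        K *= math.sqrt(1.0 + 4.0 ** -i)       # pseudo-rotation gain
    x, y = 1.0 / K, 0.0
    for i in range(steps):
        d = 1.0 if z >= 0 else -1.0           # drive residual angle to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x, y                               # (cos, sin) of original z
```

Starting from (1/K, 0) compensates for the accumulated pseudo-rotation gain, so no final multiplication is needed.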
Table 10.1 Binary to decimal converters
n   m   FFs  LUTs  Period  Total time
8   3   27   29    1.73    15.6
16  5   43   45    1.91    32.5
24  8   54   56    1.91    47.8
32  10  82   82    1.83    60.4
48  15  119  119   1.83    89.7
64  20  155  155   1.83    119.0
n   m   FFs  LUTs  Period  Total time
8   3   26   22    1.80    16.2
16  5   43   30    1.84    31.3
24  8   65   43    1.87    46.8
32  10  81   51    1.87    61.7
48  15  118  72    1.87    91.6
64  20  154  92    1.87    121.6
10.7 FPGA Implementations

10.7.1 Converters
Table 10.1 gives implementation results of several binary-to-decimal converters.
They convert n-bit numbers to m-digit numbers.
In the case of decimal-to-binary converters, the implementation results are
given in Table 10.2.
n   FFs  LUTs  Period  Total time
8   38   45    2.57    20.6
16  71   79    2.79    44.6
24  104  113   3.00    72.0
32  136  144   3.18    101.8
n   FFs  LUTs  Period  Total time
8   39   39    2.61    20.9
16  72   62    2.80    44.8
24  105  88    2.98    71.5
32  137  111   3.16    101.1
n   p   FFs  LUTs  Period
8   0   42   67    2.94
8   4   51   78    3.50
8   8   59   90    3.57
16  8   92   135   3.78
16  16  108  160   3.92
32  16  173  249   4.35
32  32  205  301   4.67
n   p   FFs  LUTs  DSPs  Period  Total time
8   10  16   20    1     4.59    45.9
16  18  25   29    1     4.59    82.6
24  27  59   109   2     7.80    210.5
32  36  44   46    4     9.60    345.6
Table 10.7 Exponential 2^x
n   p   m   FFs  LUTs  DSPs  Period  Total time
8   8   13  27   29    1     4.79    38.3
16  16  23  46   48    2     6.42    102.7
n   p   m   k   FFs  LUTs  DSPs  Period  Total time
8   8   13  21  49   17    3     5.64    45.1
16  16  23  39  86   71    10    10.64   170.2
  n    m   Cycles  FFs  LUTs  Period (ns)  Total time (ns)
 16    8     16     57   134     3.58          57.28
 32   16     32    106   299     4.21         134.72
 32   24     32    106   309     4.21         134.72
  n    m    ?   FFs  LUTs  DSPs  Period (ns)  Total time (ns)
  8    8   16    43   136     1     3.39          27.12
 16   16   32    76   297     2     4.44          71.04
 48   24   48   210   664     5     4.68         224.64
10.8 Exercises

1. Generate VHDL models of combinational binary-to-decimal and decimal-to-binary converters.
2. Synthesize binary-to-radix-60 and radix-60-to-binary converters using LUT-6.
References

1. Parhami B (2000) Computer arithmetic: algorithms and hardware design. Oxford University
Press, New York
2. Volder JE (1959) The CORDIC trigonometric computing technique. IRE Trans Electron
Comput EC-8:330–334
3. Volder JE (2000) The birth of CORDIC. J VLSI Signal Process Syst 25:101–105
Chapter 11
Decimal Operations
11.1 Addition

Addition is a primitive operation for most arithmetic functions, and thus it
deserves special attention. The general principles for addition are given in Chap. 7;
in this section we examine special considerations for the efficient implementation of
decimal addition targeting FPGAs.
[Figure: one-digit BCD ripple-carry adder cell. A binary adder built from full adders (FA) and half adders (HA) computes s4..s0 = x(i) + y(i) + c(i); the digit carry is c(i+1) = s4 ∨ s3·(s2 ∨ s1), and a second FA/HA stage applies the +6 correction to produce z3(i)..z0(i)]
For B = 10, the classic ripple-carry BCD decimal adder cell can be
implemented as suggested in Fig. 11.1. The mod-10 addition is performed by adding 6
to the binary sum of the digits whenever a carry to the next digit is generated. The
VHDL model ripple_carry_adder_BCD.vhd is available at the authors' web page.
As described in Chap. 7, the naive implementation of an adder (ripple-carry,
Fig. 7.1) has a significant critical path. In order to reduce the execution time of
each iteration step, Algorithm 11.1 can be modified as shown in the next section.
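The cell's behaviour, a binary digit sum followed by a +6 correction whenever the sum exceeds 9, can be modelled in a few lines of Python (an illustration; bcd_add is our name, digits are listed least significant first):

```python
def bcd_add(x_digits, y_digits, c_in=0):
    """Ripple-carry BCD addition, one cell per digit (LSD first).

    Each cell adds two BCD digits plus the incoming carry in binary and,
    when the sum exceeds 9, adds 6 to produce the mod-10 digit and a carry.
    """
    z, carry = [], c_in
    for x, y in zip(x_digits, y_digits):
        s = x + y + carry                        # 5-bit binary sum
        carry = 1 if s > 9 else 0
        z.append((s + 6) & 0xF if carry else s)  # +6 correction, kept mod 16
    return z, carry
```

For example, 95 + 37 gives the digits [2, 3] (i.e. 32) with an outgoing carry of 1.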
[Fig. 11.2: n-digit carry-chain adder. Generate-Propagate (G-P) cells compute g(i) and p(i) from x(i) and y(i); carry-chain (Cy.Ch.) cells propagate the carries c(1)..c(n) from c(0) = c_in; mod B sum cells produce the digits z(n−1)..z(0), and z(n) = c(n)]
So, c(i+1) = g(i) ∨ p(i)·c(i).

The use of propagate and generate functions allows generating the n-digit
carry-chain adder array of Fig. 11.2. It is based on Algorithm 11.2. The Generate-Propagate (G-P) cell calculates the generate and propagate functions, and the
carry-chain (Cy.Ch.) cell computes the next carry. Observe that the carry-chain
cells are binary circuits, whereas the generate-propagate and the mod B sum cells
are B-ary ones. As regards the computation time, the critical path is shaded in
Fig. 11.2 (it has been assumed that Tsum > TCy.Ch.).
[Fig. 11.3 a Simple G-P cell for the BCD adder: a binary FA/HA array computes the digit sum s4..s0 = x(i) + y(i), from which p(i) (p = 1 when the sum equals 9) and g(i) = s4 ∨ s3·(s2 ∨ s1) are derived. b Carry-chain BCD adder, ith digit computation]
g(i) = 1 if x(i) + y(i) > 9; g(i) = 0 otherwise.  (11.2)

g(i) = G0·(P3 ∨ G2 ∨ P2·G1) ∨ G3 ∨ P3·P2 ∨ P3·P1 ∨ G2·P1 ∨ G2·G1.  (11.4)
[Fig. 11.4: N-digit BCD adder. For each digit i, a 4-bit binary adder computes s4:0(i) = x3:0(i) + y3:0(i); a G-P cell derives p(i) and g(i); the carry chain generates c(i+1) from c(0) = c_in; and a correction adder produces z3:0(i); c(N) = cout]
For this purpose one could use formulas (11.3) and (11.4); nevertheless, in order
to minimize time and hardware consumption, the implementation of p(i) and g(i) is
revisited as follows. Remembering that p(i) = 1 whenever the arithmetic sum
x(i) + y(i) = 9, one defines a 6-input function pp(i) set to 1 whenever the
arithmetic sum of the first 3 bits of x(i) and y(i) is 4. Then p(i) may be computed as

p(i) = (x0(i) ⊕ y0(i))·pp(i).  (11.5)

On the other hand, gg(i) is defined as a 6-input function set to 1 whenever the
arithmetic sum of the first 3 bits of x(i) and y(i) is 5 or more. So, remembering that
g(i) = 1 whenever the arithmetic sum x(i) + y(i) > 9, g(i) may be computed as

g(i) = gg(i) ∨ pp(i)·x0(i)·y0(i).  (11.6)

As Xilinx LUTs may compute 6-variable functions, gg(i) and pp(i) may be
synthesized using 2 LUTs in parallel, while g(i) and p(i) are computed through an
additional single LUT, as shown in Fig. 11.5b.
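Definitions (11.5) and (11.6) can be checked exhaustively over all 100 BCD digit pairs with a short Python script (a verification sketch only; pg_bcd is our name):

```python
def pg_bcd(x, y):
    """Propagate/generate of a BCD digit pair via the pp/gg decomposition."""
    upper = (x >> 1) + (y >> 1)       # arithmetic sum of bits 3..1
    pp = 1 if upper == 4 else 0       # pp(i): upper sum equals 4
    gg = 1 if upper >= 5 else 0       # gg(i): upper sum is 5 or more
    p = pp & ((x ^ y) & 1)            # (11.5): p = (x0 xor y0) . pp
    g = gg | (pp & (x & y & 1))       # (11.6): g = gg or pp . x0 . y0
    return p, g

# exhaustive check against the arithmetic definitions over all BCD digits
ok = all(pg_bcd(x, y) == (1 if x + y == 9 else 0, 1 if x + y > 9 else 0)
         for x in range(10) for y in range(10))
```

The check confirms that the two 5-input functions of the upper bits, combined with x0 and y0, reproduce the arithmetic conditions x(i) + y(i) = 9 and x(i) + y(i) > 9.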
where x'(n−1) = x(n−1) − 10 if x(n−1) ≥ 5 and x'(n−1) = x(n−1) if x(n−1) < 5, while the sign
definition rule is the following one: if x is negative then x(n−1) ≥ 5; otherwise
x(n−1) < 5.
w3 = y3'·y2'·y1',  w2 = y2 ⊕ y1,  w1 = y1,  w0 = y0'.  (11.13)
An improvement to the adder stage could be carried out by avoiding the delay
produced by the 9's complement step. Thus, this operation may be carried out
within the first binary adder stage, where p(i) and g(i) are computed as

p0(i) = x0(i) ⊕ y0(i) ⊕ (A'/S),  p1(i) = x1(i) ⊕ y1(i),
p2(i) = x2(i) ⊕ y2(i) ⊕ y1(i)·(A'/S),
p3(i) = x3(i) ⊕ y3(i) ⊕ (y2'(i)·y1'(i)·(A'/S) ∨ y3(i)·(A'/S)),  (11.14)

g_k(i) = x_k(i), ∀k.  (11.15)
[Fig. 11.5 FPGA carry-chain for decimal addition. a P-G calculation using an intermediate addition: a LUT computes p(i) and g(i) from the digit sum s4(i)..s0(i). b P-G calculation directly from the BCD digits: two parallel LUTs compute pp(i) and gg(i) from x3:1(i) and y3:1(i), and a third LUT combines them with x0(i) and y0(i) into p(i) and g(i)]
[Fig. 11.6: 9's complement circuit. One LUT computes w3(i) from A'/S, y3(i), y2(i) and y1(i); a second LUT computes w2(i), w1(i) and w0(i) from A'/S, y1(i) and y0(i)]
A third alternative is computing G and P directly from the input data. As far as
addition is concerned, the P and G functions may be implemented according to
formulas (11.4) and (11.5). The idea is computing the corresponding functions in
the subtract mode as well, and then multiplexing according to the add/subtract control
signal A'/S.
[Fig. 11.7 FPGA implementation of the adder-subtractor. a Adder-subtractor for one BCD digit: LUTs compute p3(i)..p0(i) from the xk(i), yk(i) and A'/S, and the carry chain produces s4(i)..s0(i). b Direct computation of the P-G function: parallel LUTs compute ppa(i), pps(i), gga(i) and ggs(i); multiplexers controlled by A'/S select pp(i) and gg(i), and a final LUT produces p(i) and g(i)]
Then we can use the circuit of Fig. 11.6a to compute the P-G function using the
previously computed addition of BCD digits. The VHDL model addsub_BCD_v1.vhd,
which implements a complete decimal adder-subtractor, is available at the
authors' web page. To complete the circuit, a final correction adder (correct_add.vhd) corrects the decimal digit as a function of the carries.
The third alternative computes G and P directly from the input data. For this
purpose, assuming that the operation at hand is X + (−Y), one defines on one hand
ppa(i) and gga(i) according to (11.4) and (11.5) (Sect. 11.1.4), i.e. using the
straight values of Y's BCD components. On the other hand, pps(i) and ggs(i) are
defined using the wk(i) as computed by the 9's complement circuit (11.13). As the
wk(i) are expressed in terms of the yk(i), both pps(i) and ggs(i) may be computed
directly from the xk(i) and yk(i). Then the correct pp(i) and gg(i) signals are selected
according to the add/subtract control signal A'/S. Finally, the propagate and
generate functions are computed as:
p(i) = (x0(i) ⊕ y0(i) ⊕ A'/S)·pp(i),  (11.16)

g(i) = gg(i) ∨ pp(i)·x0(i)·(y0(i) ⊕ A'/S).  (11.17)
Figure 11.7b shows the Xilinx LUT-based implementation. The multiplexers
are implemented using dedicated muxF7 resources.
The VHDL model addsub_BCD_v2.vhd, which implements the adder-subtractor
using the P-G computation from the direct inputs (carry-chain_v2.vhd), is
available at the authors' web page.
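At the behavioural level, all the adder-subtractor versions compute the same function; a Python model of the 10's complement add/subtract, using the 9's complement of each y digit as in (11.13) plus an injected carry (our sketch, digits LSD first):

```python
def bcd_addsub(x_digits, y_digits, subtract):
    """10's complement add/subtract of BCD digit vectors (LSD first).

    In subtract mode each y digit is replaced by its 9's complement and a
    1 is injected through the carry input, so X - Y = X + (99...9 - Y) + 1.
    """
    carry = 1 if subtract else 0
    z = []
    for x, y in zip(x_digits, y_digits):
        d = 9 - y if subtract else y
        s = x + d + carry
        carry = 1 if s > 9 else 0
        z.append(s - 10 if carry else s)
    return z, carry   # in subtract mode, carry = 1 means a non-negative result
```

For example, 95 − 37 yields the digits [8, 5] (i.e. 58) with an outgoing carry of 1, while 95 + 37 yields [2, 3] with carry 1 as before.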
[Figure: one-by-one digit BCD multiplier. The binary product p6..p0 is recoded against the decimal weights 80, 40, 20, 10, 8, 4, 2, 1 to obtain the tens digit d3..d0 and the units digit c3..c0]
11.3 Decimal Multiplication
Then D = dc7 dc6 dc5 dc4 and C = dc3 dc2 dc1 dc0.  (11.20)

A better implementation can be achieved using the following relations
(Arithmetic II):

1. compute the product A·B = P = p6 p5 p4 p3 p2 p1 p0;
2. compute:
   cc = (p3 p2 p1 p0) + (0 p4 p4 0) + (0 p6 p5 0),
   dd = (p6 p5 p4) + (0 p6 p5)
   (cc is 5 bits, dd has 4 bits, computed in parallel);
3. define:
   cy1 = 1 iff cc > 19, cy0 = 1 iff 9 < cc < 20
   (cy1 and cy0 are functions of cc3 cc2 cc1 cc0, and can be computed in parallel);
4. compute:
   c = (cc3 cc2 cc1 cc0) + (cy1, cy1 ∨ cy0, cy0, 0).  (11.21)
[Fig. 11.9 N by one digit multiplication circuit. It includes N one-by-one digit BCD multipliers, producing units digits C(N−1)..C(0) and tens digits D(N−1)..D(0), and a fast carry-chain adder that sums them into Z(N)..Z(0), with Z(0) = C(0)]
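The structure of Fig. 11.9 can be mirrored behaviourally: each 1 × 1 product splits into a units digit C(i) and a tens digit D(i), and the addition of C with the one-position-shifted D vector yields the N+1-digit result. A Python sketch (illustrative only; bcd_mul_digit is our name, digits LSD first):

```python
def bcd_mul_digit(a_digits, b):
    """N x 1 digit BCD multiplication, as in the N x 1 cell array of Fig. 11.9."""
    c = [(a * b) % 10 for a in a_digits]     # units digits C(i)
    d = [(a * b) // 10 for a in a_digits]    # tens digits D(i)
    z, carry = [c[0]], 0                     # Z(0) = C(0)
    for i in range(1, len(a_digits) + 1):
        ci = c[i] if i < len(a_digits) else 0
        s = ci + d[i - 1] + carry            # C(i) + D(i-1): carry-chain adder
        carry = 1 if s > 9 else 0
        z.append(s - 10 if carry else s)
    return z                                 # N + 1 digits

digs = bcd_mul_digit([9, 3, 2], 7)           # 239 x 7, digits LSD first
```

Since each tens digit is at most 8 and each units digit at most 9, a single mod-10 correction per position suffices, which is what makes the carry-chain adder applicable.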
[Figure: sequential N × M-digit multiplier. An N × 1 BCD multiplier processes one multiplier digit b(0) at a time; an N+1-digit BCD carry-chain adder accumulates the partial sums; shift registers hold the multiplier and the growing product P(n+m−1:0); a control state machine generates the capt, shift and done signals]
[Figure: combinational N × 8-digit multiplier. Eight N × 1-digit BCD multipliers compute m(7)(n:0)..m(0)(n:0); a tree of N+1, N+2 and N+3-digit BCD carry-chain adders reduces them to the product p(n+7:0)]
11.4 Decimal Division
[Figure: quotient digit selection. q(i+1) ∈ {−10, −9, …, −1, 1, …, 9, 10} is selected according to the interval of 10·ri delimited by the multiples −9y, −8y, …, −y, 2y, …, 9y of the divisor]
According to (11.23) and (11.24), −(10^(n−1) − 10^a) < w'·10^a < 10^(n−1), so that

−10^(n−1−a) < w' < 10^(n−1−a).  (11.25)
k·y ≤ mk(y)·10^a ≤ (k + 1)·y − 10^a  (11.26)

has been defined. The interval [ky, (k + 1)y − 10^a] must include a multiple of
10^a. Thus, y must be greater than or equal to 2·10^a. Taking into account that
y ≥ 10^(n−1), the condition is satisfied if a ≤ n − 2.
The following property is a straightforward consequence of (11.24) and (11.26):

if w' ≥ m8(y) then q−(i+1) = 9;
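Stripped of the range-detection details, the underlying recurrence r(i+1) = 10·r(i) − q(i+1)·y can be exercised in Python (a sketch of the restoring-style selection with non-negative digits; dec_divide is our name):

```python
def dec_divide(x, y, p):
    """Radix-10 digit recurrence: p quotient digits of x/y with 0 <= x < y.

    Each step compares 10*r against the multiples k*y (the role of the
    range-detection block) and subtracts the selected one.
    """
    q_digits, r = [], x
    for _ in range(p):
        r *= 10
        q = r // y                # selects k with k*y <= 10*r < (k+1)*y
        q_digits.append(q)
        r -= q * y                # r(i+1) = 10*r(i) - q(i+1)*y
    return q_digits, r            # x/y = 0.q1 q2 ... qp + r/(y*10**p)

digs, rem = dec_divide(123, 456, 4)    # 123/456 = 0.2697...
```

The invariant x·10^p = (quotient digits)·y + remainder holds at every step, which is exactly the property the hardware remainder register maintains.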
[Figure: radix-10 divider datapath. The 3-digit truncations w = ri/10^(n−3) and y/10^(n−3) feed a range-detection block that selects q−(i+1); the multiples 1·y .. 9·y and an N-by-1 digit multiplier produce q−(i+1)·y; a decimal adder computes r(i+1) = 10·r(i) − q−(i+1)·y, stored in an (n+1)-digit register]
Assume that a set of integer-valued functions Mk(y), for k in {−10, −9, …, −1,
0, 1, …, 8, 9}, satisfying

k·y ≤ Mk(y)·10^a < (k + 1)·y − 2·10^a,  (11.31)

has been defined. The interval [k·y, (k + 1)·y − 2·10^a[ must include a multiple of
10^a. Thus y must be greater than or equal to 3·10^a. Taking into account that
This SRT-like algorithm generates a p-digit decimal quotient 0.q−1 q−2 … q−p and
a remainder rp satisfying (11.27); the quotient can be converted to a decimal number by
computing the difference between two decimal numbers, as in the non-restoring
algorithm.
It remains to define a set of integer-valued functions Mk(y) satisfying (11.31). In
order to simplify their computation, they should only depend on the most significant digits of y. Actually, the same definition as in Sect. 11.3.1 can be used, that is

Mk(y) = ⌊k·y'/10⌋ + bias, where y' = ⌊y/10^(a−1)⌋, if k ≥ 0, and
Mk(y) = ⌊k·y'/10⌋ − bias if k < 0.

In this case the range of bias is 3 ≤ bias ≤ 6. With a = n − 2, y' a 3-digit
natural, w' and mk(y) are 4-digit 10's complement numbers, and w'' is a 5-digit 10's
complement number. In the following Algorithm 11.4, Mk(y) is computed without
adding up bias to ⌊k·y'/10⌋, and w'' is substituted by w'' − bias.
Algorithm 11.4: SRT-like algorithm for decimal numbers
[Figure: digit-recurrence divider datapath. A quotient-selection block driven by r(i−1) and y chooses between −y, 0 and y and between −ulp and ulp; a ×2 block forms two_r = 2·r(i−1); two adders produce r(i) and update the quotient; registers initialized to x, 0 and 2^(−1) are controlled by load and update]
11.5 FPGA Implementations
Table 11.1 Delays in ns for decimal and binary adders and adder-subtractors

 N (digits)  RpCy add  CyCh add  AddSub V1  AddSub V2  M (bits)  Binary add-sub
     8         12.4       3.5       3.5        3.4        27         2.1
    16         24.4       3.8       3.8        3.7        54         2.6
    32         48.5       4.5       4.6        4.8       107         3.8
    48         72.3       5.1       5.2        5.3       160         5.2
    64         95.9       5.2       5.5        5.5       213         6.6
    96          -         5.9       6.1        6.1       319         8.8
Table 11.2 Areas in 6-input LUTs

 Circuit     # LUTs
 RpCy add    7·N
 CyCh add    9·N + ⌈3.32·N⌉
 AddSub V1   9·N
 AddSub V2   13·N + ⌈3.32·N⌉
Table 11.1 exhibits the post placement and routing delays in ns for the decimal
adder implementations Ad-I and Ad-II of Sect. 7.1, and the delays in ns
for the decimal adder-subtractor implementations AS-I and AS-II of Sect. 7.2.
Table 11.2 lists the consumed areas expressed in terms of 6-input look-up tables
(6-input LUTs). The estimated area presented in Table 11.2 was empirically
confirmed.
Comments

1. Observe that for large operands, the decimal operations are faster than the
binary ones.
2. The delays of the carry-chain adder and the adder-subtractor are similar in
theory. The small difference is due to the placement and routing algorithm.
3. The area overhead with respect to binary computation is not negligible. In the
Xilinx 6-input LUT families an adder-subtractor is between 3 and 4 times bigger.
Table 11.3 Results of BCD N × 1 multipliers using LUT cells and BRAM cells

       LUT-based cells       BRAM-based cells
  N    T (ns)   # LUT        T (ns)   # LUT   # BRAM
  4     5.0      118          5.0       41       1
  8     5.1      242          5.1       81       2
 16     5.3      490          5.4      169       4
 32     6.1      986          6.1      345       8
Table 11.4 Results of sequential implementations of N × M multipliers using one-by-one digit
multiplication in LUTs

  N    M   T (ns)  # FF   # LUT   # cycles  Delay (ns)
  4    4    5.0     122     243       4        25.0
  8    4    5.1     186     451       5        25.5
  8    8    5.1     235     484       9        45.9
  8   16    5.1     332     553      17        86.7
 16    8    5.3     363     921       9        47.7
 16   16    5.3     460     986      17        90.1
 32   16    5.7     716   1,764      17        96.9
 16   32    5.3     653   1,120      33       174.9
 32   32    5.7     909   1,894      33       188.1
Table 11.5 Results of combinational N × M multipliers

  N    M   Delay (ns)   # LUT
  4    4     10.2          719
  8    4     10.7        1,368
  8    8     13.4        2,911
  8   16     15.7        6,020
 16    8     13.6        5,924
 16   16     16.3       12,165
based multipliers are synchronous. Partial products are inputs to an addition tree. For
all BCD additions the fast carry-chain adder of Sect. 11.1.4 has been used.
Input and output registers have been included in the design. Delays include FF
propagation and connections. The number of FFs actually used is greater than
8·(M + N) because the ISE tools [2] replicate the input register in order to reduce
fan-outs. The most useful data for area evaluation is the number of LUTs
(Table 11.5).

Comments

1. Observe that the computation time and the required area are different for N by M
than for M by N. That is mainly due to the use of carry logic in the adder tree.
[Figure: decimal divider with stored-carry remainder. The truncations st = si/10^(n−3) and ct = ci/10^(n−3) are added to obtain the 5-digit estimate w; a range-detection block selects q−(i+1); a carry-free multiplier produces p1 + p0 = q−(i+1)·y; a 4-to-2 counter reduces 10·si, 10·ci, p1 and p0 to the next stored-carry pair (si+1, ci+1), held in registers]
  n   Period (ns)  Latency (ns)  FFs    LUTs
  8      11.0          110.0      106   1,082
 16      11.3          203.4      203   1,589
 32      11.6          394.4      396   2,543
 48      12.1          605.0      589   3,552

  n   Period (ns)  Latency (ns)  FFs    LUTs
  8      10.9          109.0      233   1,445
 16      10.9          196.2      345   2,203
 32      10.9          370.6      571   3,475
 48      10.9          545.0      795   4,627
mod 10 reduction. On the other hand, the computation of the next quotient digit
is much more complex than in the binary case.
11.6 Exercises
1. Implement a decimal comparator. Base your design on the decimal adder-subtractor
architecture, modified to implement a comparison.
2. Implement a decimal greater-than circuit that returns 1 if A ≥ B, else 0.
Tip: base your design on the decimal adder-subtractor.
3. Implement an N × 2 digit multiplier. In order to speed up the computation, analyze the
use of a 4-to-2 decimal reducer.
4. Implement an N × 4 digit multiplier using an 8-to-2 decimal reducer and only one
carry-save adder.
5. Design an N by M digit multiplier using the N × 2 or the N × 4 digit multiplier.
Do you improve the multiplication time? What is the area penalty with respect
to the use of an N × 1 multiplier?
6. Implement the binary digit-recurrence algorithm for decimal division (Algorithm 11.5). The key point is an efficient implementation of radix-B doubling,
adding, subtracting and halving.
References

1. Altera Corp (2011) Advanced synthesis cookbook. http://www.altera.com
2. Xilinx Inc (2011b) ISE design suite software manuals (v 13.1). http://www.xilinx.com
3. Bioul G, Vazquez M, Deschamps J-P, Sutter G (2010) High speed FPGA 10's complement
adders-subtractors. Int J Reconfigurable Comput 2010:14, Article ID 219764
4. Jaberipur G, Kaivani A (2007) Binary-coded decimal digit multiplier. IET Comput Digit Tech
1(4):377–381
5. Sutter G, Todorovich E, Bioul G, Vazquez M, Deschamps J-P (2009) FPGA implementations
of BCD multipliers. V International Conference on ReConFigurable Computing and FPGAs
(ReConFig'09), Mexico
6. Xilinx Inc (2010) Virtex-5 FPGA data sheet, DS202 (v5.3). http://www.xilinx.com
7. Xilinx Inc (2011) XST user guide UG687 (v 13.1). http://www.xilinx.com
Chapter 12
There are many data processing applications (e.g. image and voice processing)
which use a large range of values and need a relatively high precision. In such
cases, instead of encoding the information in the form of integers or fixed-point
numbers, an alternative solution is a floating-point representation. In the first
section of this chapter, the IEEE standard for floating point is described. The next
section is devoted to the algorithms for executing the basic arithmetic operations.
The two following sections define the main rounding methods and introduce the
concept of guard digit. Finally, the last few sections propose basic implementations of the arithmetic operations, namely addition and subtraction, multiplication,
division and square root.
12.1.1 Formats
Formats in IEEE 754 describe sets of floating-point data and encodings for
interchanging them. This format allows representing a finite subset of real numbers. The floating-point numbers are represented using a triplet of natural numbers
(positive integers). The finite numbers may be expressed either in base 2 (binary)
or in base 10 (decimal). Each finite number is described by three integers: the sign
(zero or one), the significand s (also known as coefficient or mantissa), and the
exponent e. The numerical value of the represented number is (−1)^sign × s × B^e,
where B is the base (2 or 10).
For example, if sign = 1, s = 123456, e = −3 and B = 10, then the represented number is −123.456.
The format also allows the representation of infinite numbers (+? and -?),
and of special values, called Not a Number (NaN), to represent invalid values. In
fact there are two kinds of NaN: qNaN (quiet) and sNaN (signaling). The latter,
used for diagnostic purposes, indicates the source of the NaN.
The values that can be represented are determined by the base (B), the number
of digits of the significand (the precision p), and the maximum and minimum values
emax and emin of e. Hence, s is an integer belonging to the range 0 to B^p − 1, and e is
an integer such that emin ≤ e ≤ emax.
For example, if B = 10 and p = 7 then s lies between 0 and 9999999. If
emin = −96 and emax = 96, then the smallest non-zero positive number that can be
represented is 1 × 10^(−101), the largest is 9999999 × 10^90 (9.999999 × 10^96), and
the full range of numbers is from −9.999999 × 10^96 to 9.999999 × 10^96. The
numbers closest to the inverse of these bounds (−1 × 10^(−95) and 1 × 10^(−95)) are
considered to be the smallest (in magnitude) normal numbers. Non-zero numbers
between these smallest numbers are called subnormal (also denormalized)
numbers.
Zero values are finite values whose significand is 0. The sign bit specifies if a
zero is +0 (positive zero) or -0 (negative zero).
Table 12.1 Binary and decimal floating-point formats in IEEE 754-2008

Binary formats (B = 2):
 Parameter    Binary16   Binary32   Binary64   Binary128
 p, digits    10 + 1     23 + 1     52 + 1     112 + 1
 emax         +15        +127       +1023      +16383
 emin         −14        −126       −1022      −16382
 Common name  Half       Single     Double     Quadruple
              precision  precision  precision  precision

Decimal formats (B = 10):
 Parameter    Decimal32  Decimal64  Decimal128
 p, digits    7          16         34
 emax         +96        +384       +16,383
 emin         −95        −383       −16,382

Table 12.2 Binary interchange format parameters

 k (bits)         p     emax           bias             w                      t
 16               11    15             15               5                      10
 32               24    127            127              8                      23
 64               53    1,023          1,023            11                     52
 128              113   16,383         16,383           15                     112
 multiple of 32   k−w   2^(k−p−1) − 1  emax             round(4·log2 k) − 13   k−w−1
defined (Table 12.2). The 16-bit format is only for the exchange or storage of small
numbers.
The binary interchange encoding scheme is the same as in the IEEE 754-1985
standard. The k-bit strings are made up of three fields (Fig. 12.1):
a 1-bit sign S,
a w-bit biased exponent E = e + bias,
the p − 1 trailing bits of the significand; the missing bit is encoded in the
exponent (hidden first bit).
Each binary floating-point number has just one encoding. In the following
description, the significand s is expressed in scientific notation, with the radix point
immediately following the first digit. To make the encoding unique, the value of the
significand s is maximized by decreasing e until either e = emin or s ≥ 1 (normalization). After normalization, there are two possibilities regarding the significand:
If s ≥ 1 and e ≥ emin then a normalized number of the form 1.d1 d2 … d(p−1) is
obtained. The first 1 is not stored (implicit leading 1).
If e = emin and 0 < s < 1, the floating-point number is called subnormal. Subnormal numbers (and zero) are encoded with a reserved biased exponent value.
They have an implicit leading significand bit equal to 0 (Table 12.2).
The minimum exponent value is emin = 1 − emax. The range of the biased
exponent E is 1 to 2^w − 2 to encode normal numbers. The reserved value 0 is used
to encode 0 and subnormal numbers. The value 2^w − 1 is reserved to encode
±∞ and NaNs.
The value of a binary floating-point datum is inferred from the constituent fields
as follows:
If E = 2^w − 1 (all 1s in E), then the datum is a NaN or infinity. If T ≠ 0, then it
is a qNaN or an sNaN; if the first bit of T is 1, it is a qNaN. If T = 0, then the
value is (−1)^sign × ∞.
If E = 0 (all 0s in E), then the datum is 0 or a subnormal number. If T = 0 it is a
signed 0. Otherwise (T ≠ 0), the value of the corresponding floating-point
number is (−1)^sign × 2^emin × (0 + 2^(1−p)·T).
If 1 ≤ E ≤ 2^w − 2, then the datum is (−1)^sign × 2^(E−bias) × (1 + 2^(1−p)·T).
Remember that the significand of a normal number has an implicit leading 1.
Example 12.1
Convert the decimal number −9.6875 to its binary32 representation.
First convert the absolute value of the number to binary (Chap. 10):
9.6875₁₀ = 1001.1011₂.
Normalize: 1001.1011 = 1.0011011 × 2^3. Hence e = 3, s = 1.0011011.
Hide the first bit and complete with 0s up to 23 bits:
00110110000000000000000.
Add the bias to the exponent. In this case, w = 8, bias = 2^(8−1) − 1 = 127 and thus
E = e + bias = 3 + 127 = 130₁₀ = 10000010₂.
Compose the final 32-bit representation:
1 10000010 00110110000000000000000₂ = C11B0000₁₆.
Example 12.2
Convert the following binary32 numbers to their decimal representation.
7FC00000₁₆: sign = 0, E = FF₁₆, T ≠ 0; hence it is a NaN. Since the first bit
of T is 1, it is a quiet NaN.
FF800000₁₆: sign = 1, E = FF₁₆, T = 0; hence it is −∞.
6545AB78₁₆: sign = 0, E = CA₁₆ = 202₁₀, e = E − bias = 202 − 127 = 75₁₀,
T = 10001011010101101111000₂,
s = 1.10001011010101101111000₂ = 1.5442953₁₀.
The number is 1.10001011010101101111₂ × 2^75 ≈ 5.8341827 × 10^22.
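Both examples can be checked with a short Python routine that packs a float and splits the binary32 fields (a sketch using the standard struct module; the helper names are ours):

```python
import struct

def decode_binary32(word):
    """Split a binary32 word into (sign, E, T) and classify the datum."""
    sign = word >> 31
    E = (word >> 23) & 0xFF
    T = word & 0x7FFFFF
    if E == 0xFF:
        kind = 'inf' if T == 0 else ('qNaN' if T >> 22 else 'sNaN')
    elif E == 0:
        kind = 'zero' if T == 0 else 'subnormal'
    else:
        kind = 'normal'
    return sign, E, T, kind

def binary32_word(x):
    """IEEE 754 binary32 encoding of a Python float, as a 32-bit integer."""
    return struct.unpack('>I', struct.pack('>f', x))[0]
```

For instance, binary32_word(-9.6875) returns 0xC11B0000, and decode_binary32 classifies 7FC00000 as a quiet NaN and FF800000 as an infinity with sign 1.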
The significands s1 and s2 of the operands are multiples of ulp. If e1 is greater than
e2, the value of s could no longer be a multiple of ulp, and some rounding function
should be applied to s. Assume that s' < s < s'' = s' + ulp, s' and s'' being two
successive multiples of ulp. Then the rounding function associates to s either s' or s'',
according to some rounding strategy. According to (12.4) and to the fact that 1 and
B − ulp are multiples of ulp, it is obvious that 1 ≤ s' < s'' ≤ B − ulp. Nevertheless, if
the condition (12.3) does not hold, that is if 1 ≤ s < B, s could belong to the interval

B − ulp < s < B.  (12.5)
Examples 12.3
Assume that B = 2 and ulp = 2^(−5), so that the numbers are represented in the form
s × 2^e where 1 ≤ s ≤ 1.11111₂. For simplicity e is written in decimal (base 10).
1. Compute z = 1.10101 × 2^3 + 1.00010 × 2^(−1).
Alignment: z = (1.10101 + 0.000100010) × 2^3 = 1.101110010 × 2^3.
Rounding: s = 1.10111.
Final result: z = 1.10111 × 2^3.
3. Compute z = 1.10010 × 2^3 + 1.10101 × 2^1.
Alignment: z = (1.10010 + 0.0110101) × 2^3 = 1.1111101 × 2^3.
Rounding: s = 10.00000.
Normalization: s = 1.00000, e = 4.
Final result: z = 1.00000 × 2^4.
Comments 12.2
1. The addition of two positive numbers could produce an overflow as the final
value of e could be greater than emax.
2. Observe in the previous examples the lack of precision due to the small number
of bits (6 bits) used in the significand s.
Examples 12.4
Assume again that B = 2 and ulp = 2^(−5), so that the numbers are represented in
the form s × 2^e where 1 ≤ s ≤ 1.11111₂. For computing the difference, the 2's
complement representation is used (one extra bit is used).
1. Compute z = 1.10101 × 2^(−2) − 1.01010 × 2^1.
Alignment: z = (0.00110101 − 1.01010) × 2^1.
2's complement addition: (00.00110101 + 10.10101 + 00.00001) × 2^1 =
10.11100101 × 2^1.
Change of sign: s = 01.00011010 + 00.00000001 = 01.00011011.
Rounding: s = 1.00011.
Final result: z = −1.00011 × 2^1.
2. Compute z = 1.00010 × 2^3 − 1.10110 × 2^2.
Alignment: z = (1.00010 − 0.110110) × 2^3.
2's complement addition: (01.00010 + 11.001001 + 00.000001) × 2^3 =
00.001110 × 2^3.
Leading zeroes: k = 3; s = 1.11000, e = 0.
Final result: z = 1.11000 × 2^0.
3. Compute z = 1.01010 × 2^3 − 1.01001 × 2^1.
Alignment: z = (1.01010 − 0.0101001) × 2^3 = 0.1111111 × 2^3.
Leading zeroes: k = 1; s = 1.111111, e = 2.
Rounding: s = 10.00000.
Normalization: s = 1.00000, e = 3.
Final result: z = 1.00000 × 2^3.
Comment 12.3
The difference of two positive numbers could produce an underflow as the final
value of e could be smaller than emin.
Table 12.3 Actual operation

 Operation  Sign1  Sign2  Actual operation
    0         0      0      s1 + s2
    0         0      1      s1 − s2
    0         1      0      −(s1 − s2)
    0         1      1      −(s1 + s2)
    1         0      0      s1 − s2
    1         0      1      s1 + s2
    1         1      0      −(s1 + s2)
    1         1      1      −(s1 − s2)

where 1 ≤ s1 < B and 1 ≤ s2 < B; the effective sign of the second operand is sign2
if operation = 0, and the complement of sign2 if operation = 1.
Once the significands have been aligned, the actual operation (addition or
subtraction of the significands) depends on the values of operation, sign1 and sign2
(Table 12.3). The following algorithm computes z. The procedure swap(a, b)
interchanges a and b.
Algorithm 12.3: Addition and subtraction
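The case analysis of Table 12.3 reduces to two XOR-style rules; the following Python helper makes this explicit (our sketch, assuming the operands have already been ordered so that the first one has the larger magnitude):

```python
def effective_op(operation, sign1, sign2):
    """Actual significand operation and result sign (Table 12.3).

    operation: 0 = add, 1 = subtract. Returns (op, sign) where op ('add' or
    'sub') is applied to (s1, s2) and sign is the sign of the result,
    assuming s1*B^e1 has the larger magnitude (after a possible swap).
    """
    eff_sign2 = sign2 ^ operation        # subtraction flips the second sign
    op = 'add' if sign1 == eff_sign2 else 'sub'
    return op, sign1

# subtracting a negative operand is an effective addition:
# effective_op(1, 0, 1) gives ('add', 0)
```

This is exactly why Algorithm 12.3 only needs the swap procedure plus one adder/subtractor: after ordering the operands, the result sign is always sign1.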
12.2.4 Multiplication

Given two floating-point numbers (−1)^sign1 × s1 × B^e1 and (−1)^sign2 × s2 × B^e2, their
product (−1)^sign × s × B^e is computed as follows:

sign = sign1 xor sign2, s = s1·s2, e = e1 + e2.  (12.7)
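With numbers in the s × B^e form used in the examples below, (12.7) followed by normalization and rounding can be sketched as (an illustration with ulp as a parameter; fp_mul is our name):

```python
def fp_mul(sign1, s1, e1, sign2, s2, e2, B=2, ulp=2 ** -5):
    """(12.7): sign = xor, s = s1*s2, e = e1+e2, then normalize and round."""
    sign, s, e = sign1 ^ sign2, s1 * s2, e1 + e2
    if s >= B:                    # product of two significands in [1, B) is < B^2
        s, e = s / B, e + 1
    s = round(s / ulp) * ulp      # round to the nearest multiple of ulp
    if s >= B:                    # rounding may overflow: normalize again
        s, e = s / B, e + 1
    return sign, s, e
```

Running it on the operands of Example 12.5(2), s1 = 1.90625 (1.11101₂) with e1 = 3 and s2 = 1.09375 (1.00011₂) with e2 = −1, reproduces the result 1.03125 (1.00001₂) × 2^3.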
Examples 12.5
Assume again that B = 2 and ulp = 2^(−5), so that the numbers are represented in
the form s × 2^e where 1 ≤ s ≤ 1.11111₂. The exponent e is represented in decimal.
1. Compute z = (1.10110 × 2^2) × (1.00010 × 2^(−5)).
Multiplication: z = 01.1100101100 × 2^(−3).
Rounding: s = 1.11001.
Final result: z = 1.11001 × 2^(−3).
2. Compute z = (1.11101 × 2^3) × (1.00011 × 2^(−1)).
Multiplication: z = 10.0001010111 × 2^2.
Normalization: s = 1.00001010111, e = 3.
Rounding: s = 1.00001.
Final result: z = 1.00001 × 2^3.
3. Compute z = (1.01000 × 2^1) × (1.10011 × 2^2).
Multiplication: z = 01.1111111000 × 2^3.
Rounding: s = 10.00000.
Normalization: s = 1.00000, e = 4.
Final result: z = 1.00000 × 2^4.
Comment 12.4
The product of two real numbers could produce an overflow or an underflow as the
final value of e could be greater than emax or smaller than emin (addition of two
negative exponents).
12.2.5 Division

Given two floating-point numbers (−1)^sign1 × s1 × B^e1 and (−1)^sign2 × s2 × B^e2, their
quotient (−1)^sign × s × B^e is computed as follows:

sign = sign1 xor sign2, s = s1/s2, e = e1 − e2.  (12.9)

The value of s belongs to the interval 1/B < s ≤ B − ulp, and could be smaller
than 1. If that is the case, that is if s = s1/s2 < 1, then s1 < s2, so s1 ≤ s2 −
ulp, s1/s2 ≤ 1 − ulp/s2 < 1 − ulp/B, and 1/B < s < 1 − ulp/B.
Then (normalization) substitute s by s·B, and e by e − 1. The new value of
s satisfies 1 < s < B − ulp. It remains to round the significand.
Algorithm 12.5: Division
Examples 12.6
Assume again that B = 2 and ulp = 2^(−5), so that the numbers are represented in
the form s × 2^e where 1 ≤ s ≤ 1.11111₂. The exponent e is represented in
decimal.
1. Compute z = (1.11101 × 2^3) / (1.00011 × 2^(−1)).
Division: z = 1.1011111000 × 2^4.
Rounding: s = 1.10111.
Final result: z = 1.10111 × 2^4.
2. Compute z = (1.01000 × 2^1) / (1.10011 × 2^2).
Division and normalization: s = 1.10010001…, e = −2.
Rounding: s = 1.10010.
Final result: z = 1.10010 × 2^(−2).
Comment 12.5
The quotient of two real numbers could produce an underflow or an overflow, as the
final value of e could be smaller than emin or bigger than emax. Observe that a second
normalization is not necessary, unlike in the case of addition, subtraction and multiplication.
s = s1^(1/2), e = e1/2;  (12.10)

s = (s1/B)^(1/2), e = (e1 + 1)/2.  (12.11)

In the first case (12.10), 1 ≤ s ≤ (B − ulp)^(1/2) < B − ulp. In the second case,
(1/B)^(1/2) ≤ s < 1. Hence (normalization) s must be substituted by s·B and e by e − 1,
so that 1 ≤ s < B. It remains to round the significand and to normalize if necessary.
Algorithm 12.6: Square root

s = (s1·B)^(1/2), e = (e1 − 1)/2.  (12.12)
In this case B^(1/2) ≤ s ≤ (B² − ulp·B)^(1/2) < B, so the first normalization is not
necessary. Nevertheless, s could satisfy B − ulp < s < B, and then, depending on the
rounding strategy, normalization after rounding could still be necessary.
Algorithm 12.7: Square root, second version
Note that the round to nearest (the default rounding in IEEE 754-2008) and the
truncation rounding schemes allow avoiding the second normalization.
Examples 12.7
Assume again that B = 2 and ulp = 2^(−5), so that the numbers are represented in the
form s × 2^e where 1 ≤ s ≤ 1.11111₂. The exponent e is represented in decimal form.
1. Compute z = (1.11101 × 2^4)^(1/2).
Square rooting: z = 1.01100001… × 2^2.
Rounding: s = 1.01100.
Final result: z = 1.01100 × 2^2.
2. Compute z = (1.00101 × 2^1)^(1/2), with e = 2.
3. Compute z = (1.11111 × 2^3)^(1/2), with e = 2.
12.3 Rounding Schemes
The preceding schemes (round to the nearest) produce the smallest absolute
error, and the two last ones (tie to even, tie to odd) also produce the smallest average
absolute error (unbiased or 0-bias representation systems).
Assume now that the exact result of an operation, after normalization, is

s = 1.s−1 s−2 s−3 … s−p | s−(p+1) s−(p+2) s−(p+3) …

where ulp is equal to B^(−p) (the | symbol indicates the separation between the digit
which corresponds to the ulp and the following ones). Whatever the chosen rounding
scheme, it is not necessary to have previously computed all the digits s−(p+1),
s−(p+2), …; it is sufficient to know whether all the digits s−(p+1), s−(p+2), … are equal to
0 or not. For example, the following algorithm computes round(s) if the round to
the nearest, tie to even scheme is used.
Algorithm 12.8: Round to the nearest, tie to even
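The decision depends only on the first discarded digit and on whether the remaining ones are all zero (the sticky information); a Python rendition of this behaviour (our sketch, for any even base B):

```python
def round_nearest_even(significand, extra_digits, base=2):
    """Round to the nearest, tie to even (the behaviour of Algorithm 12.8).

    significand: integer value of the kept digits, counted in ulps;
    extra_digits: the discarded digits s_-(p+1), s_-(p+2), ...
    Only the first discarded digit and an 'all the rest zero?' flag matter.
    """
    if not extra_digits:
        return significand
    first = extra_digits[0]
    sticky = any(d != 0 for d in extra_digits[1:])
    half = base // 2
    if first > half or (first == half and sticky):
        return significand + 1                       # round up
    if first == half and not sticky:                 # tie: choose the even value
        return significand + (significand % 2)
    return significand                               # round down
```

For B = 2 and five kept bits, 1.111111 rounds up to 10.00000 (the tie resolves to the even value), while 1.1111101 rounds down to 1.11111.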
(12.14)

(12.15)

or

r_k r_(k−1) r_(k−2) … r_(−p) r_(−(p+1)) 0 … 0 | 0 0 …, multiplied by B^k where k > 1.  (12.16)

For executing a rounding operation, the worst case is (12.15). In particular, for
executing Algorithm 12.8, it is necessary to know
12.4 Guard Digits
The circuits support normalized binary IEEE 754-2008 operands. Regarding the
binary subnormals, the hardware needed to manage them is complex.
Some floating-point implementations handle operations with subnormals via software routines. In the FPGA arena, most cores do not support denormalized
numbers. The dynamic range can instead be increased using fewer resources by
increasing the size of the exponent (a 1-bit increase in the exponent roughly doubles
the dynamic range), and this is typically the solution adopted.
12.5.1 AdderSubtractor
An adder-subtractor based on Algorithm 12.3 will now be synthesized. The
operands are supposed to be in IEEE 754 binary encoding. It is made up of five
parts, namely unpacking, alignment, addition, normalization and rounding, and
packing. The following implementation does not support subnormal numbers; they
are interpreted as zero.
12.5.1.1 Unpacking
The unpacking separates the constitutive parts of the floating-point operands and
additionally detects the special numbers (infinities, zeros and NaNs). The special
number detection is implemented using simple comparators. The following VHDL
process defines the unpacking of a floating-point operand FP; k is the number of
bits of FP, w is the number of bits of the exponent, and p is the significand
precision.
The previous process is implemented using two w-bit comparators, one p-bit
comparator, and some additional gates for the rest of the conditions.
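As a hedged behavioural sketch of this block (a Python model, not the book's VHDL process; the field and flag names are illustrative), the unpacking reduces to slicing the word and comparing the exponent and fraction fields:

```python
def unpack(fp, w, p):
    """Split a k-bit word (k = 1 + w + (p-1)) into sign, exponent and
    significand, detecting zeros, infinities and NaNs.  Subnormals are
    treated as zero, as in the implementation described in the text."""
    t = p - 1                                  # stored fraction bits
    frac = fp & ((1 << t) - 1)
    e = (fp >> t) & ((1 << w) - 1)             # w-bit biased exponent
    sign = (fp >> (t + w)) & 1
    e_max = (1 << w) - 1
    is_zero = e == 0                           # zero or subnormal
    is_inf = e == e_max and frac == 0
    is_nan = e == e_max and frac != 0
    s = frac if is_zero else (1 << t) | frac   # append the hidden 1
    return sign, e, s, is_zero, is_inf, is_nan
```

For binary32 (w = 8, p = 24), `unpack(0x3F800000, 8, 24)` yields sign 0, exponent 127 and significand 1.000…0, with all flags cleared.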
12.5.1.2 Alignment
The alignment circuit implements the first three lines of Algorithm 12.3, i.e.
[Fig. 12.2 Alignment circuit: two subtractors compute e1−e2 and e2−e1; sign(e1−e2) selects the difference dif, which controls a right shifter (from 0 to p+3 positions) applied to the significand s2, producing aligned_s1, aligned_s2 and new_sign2.]
[Fig. 12.3 Addition/subtraction circuit: a (p+1)-bit subtractor compares aligned_s1 and aligned_s2 (alt_result, alt_result > 0); together with iszero1, iszero2, the signs and the operation, it drives the significand selection, and a (p+4)-bit adder/subtractor produces the result signif.]
The second rounding is avoided in the following way. The only case that could need a
second rounding is when, as the result of an addition, the significand is 1.1111111xx.
This situation is detected by a combinational block that generates the signal isTwo
and adds one to the exponent. After rounding, the resulting number is 10.000000, but
the two most significant bits are discarded and the hidden 1 is appended.
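The arithmetic behind this shortcut can be checked directly: when the p-bit significand is all ones, adding one ulp clears every fraction bit and produces only a carry-out, so the exponent increment decided in advance by isTwo is always the right one. A small Python check (illustrative, not the circuit):

```python
def needs_second_rounding(s, p):
    """True when a p-bit significand would overflow to 10.00...0 if the
    rounding adds one ulp -- the pattern the isTwo block detects."""
    return s == (1 << p) - 1      # significand is 1.111...1

p = 8
s = (1 << p) - 1                  # 1.1111111, the critical pattern
assert needs_second_rounding(s, p)
assert s + 1 == 1 << p            # rounding up leaves only the carry-out:
                                  # the result is exactly 10.000...0, so
                                  # discarding the two top bits and keeping
                                  # the hidden 1 is correct once the
                                  # exponent has been incremented
```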
12.5.1.5 Packing
The packing joins the constitutive parts of the floating-point result. Additionally,
depending on the special cases (infinities, zeros and NaNs), it generates the
corresponding codification.
Example 12.8 (Complete VHDL code available)
Generate the VHDL model of an IEEE binary floating-point adder-subtractor. It
is made up of the five previously described blocks. Fig. 12.5 summarizes the
interconnections. For clearness and reusability the code is written using parameters,
[Fig. 12.4 Normalization and rounding: the exponent is incremented when the significand is 111…111 (isTwo) or signif(p+3) is set, and decremented by k after the leading-0s detection (which also raises zero_flag); the sign computation combines sign, operation and alt_result > 0; the rounding decision, driven by the digits s3 s2 s1 s0, conditionally adds 1 (ulp) to s(p+2:3) through a p-bit half adder, producing new_e, new_sign and new_s.]
where K is the size of the floating point numbers (sign, exponent, significand), E is
the size of the exponent and P is the size of the significand (including the hidden
1). The entity declaration of the circuit is:
For simplicity the code is written as a single VHDL file, except for additional files
that describe the right shifter of Fig. 12.2 and the leading zero detection and
shifting of Fig. 12.4. The code is available at the book home page.
[Fig. 12.5 Adder-subtractor: the Unpack block splits FP1 and FP2 into sign1, sign2, e1, e2, s1 and s2, and raises the isInf, isZ and isNaN flags; the Alignment, Addition/Subtraction, Normalization and Rounding, and Pack blocks produce the result FP; a Conditions Detection block generates isInf, isZero, isNaN, overflow, underflow and zero_flag.]
A two-stage pipeline could be achieved by dividing the data path between the
addition/subtraction and the normalization and rounding stages (dotted line in
Fig. 12.5).
12.5.2 Multiplier
A basic multiplier deduced from Algorithm 12.4 is shown in Fig. 12.6. The
unpacking and packing circuits are the same as in the case of the adder-subtractor
(Fig. 12.5, Sects. 12.5.1.1 and 12.5.1.5) and, for simplicity, are not drawn. The
normalization and rounding block is a simplified version of Fig. 12.4, in which the
part related to the subtraction is not necessary.
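A rough behavioural model of this data path (Python, illustrative only: integer significands in [2^(p−1), 2^p) stand for the normalized values in [1, 2)) shows the exponent addition and the single normalization shift:

```python
def fp_mul_core(s1, s2, e1, e2, p, bias):
    """Multiply two p-bit normalized significands and add the exponents.

    Returns (e, s, low) where s is the renormalized p-bit significand and
    low holds the discarded bits that would feed rounding and the sticky."""
    prod = s1 * s2                       # 2p bits, value in [1, 4)
    e = e1 + e2 - bias
    if prod >= 1 << (2 * p - 1):         # product in [2, 4): one right shift
        e += 1
        prod >>= 1
    return e, prod >> (p - 1), prod & ((1 << (p - 1)) - 1)
```

With p = 4 and bias 10, 1.100 × 1.100 (i.e. 1.5 × 1.5) yields significand 1.001 with the exponent incremented, matching 2.25 = 1.125 · 2.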
[Fig. 12.6 Basic multiplier: an adder combines the exponents e1 and e2; a p-by-p-bit multiplier computes the 2p-bit product, whose low-order bits (p−4 .. 0) feed the sticky digit generation; prod(p+3 .. 0) then goes through normalization (exponent +1 when isTwo or prod(p+3) is set) and rounding (the rounding decision, driven by s3 s2 s1 s0, adds 1 (ulp) to s(p+2:3) through a p-bit half adder).]
The obvious method of computing the sticky bit is with a large fan-in OR gate
on the low-order bits of the product. Observe that, in this case, the critical path
includes the p-by-p-bit multiplication and the sticky digit generation.
An alternative method consists of determining the number of trailing zeros in
the two inputs of the multiplier. It is easy to demonstrate that the number of
trailing zeros in the product is equal to the sum of the number of trailing zeros in
each input operand. Notice that this method does not require the actual low order
product bits, just the input operands, so the computation can occur in parallel with
the actual multiply operation, removing the sticky computation from the critical
path.
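The underlying fact is that the number of trailing zeros is the 2-adic valuation, which is additive under multiplication. A quick Python check of both the claim and the resulting sticky computation (illustrative model):

```python
def trailing_zeros(x):
    """Trailing-zero count of a non-zero integer (x & -x isolates the
    lowest set bit)."""
    return (x & -x).bit_length() - 1

def sticky_from_operands(a, b, n):
    """Sticky bit over the n low-order product bits, computed without the
    product itself, hence off the multiplier's critical path."""
    return 1 if trailing_zeros(a) + trailing_zeros(b) < n else 0

# the trailing zeros of a product are the sum of the operands' counts
for a in range(1, 64):
    for b in range(1, 64):
        assert trailing_zeros(a * b) == trailing_zeros(a) + trailing_zeros(b)
        assert sticky_from_operands(a, b, 5) == (1 if (a * b) & 0x1F else 0)
```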
The drawback of this method is that significant extra hardware is required. This
hardware includes two long priority encoders to count the number of
[Fig. 12.7 Faster multiplier: the sticky bit generation and a conditions detection block (isZero, isInf, isNaN) work in parallel with the p-by-p-bit multiplier; a Normalization and Rounding block processes prod(p+3 .. 0) and adjusts the exponent, and the conditions block generates underflow, overflow and isZero.]
trailing zeros in the input operands, a small adder, and a small comparator. On the
other hand, some hardware is eliminated, since the actual low-order bits of the
product are no longer needed.
A faster floating-point multiplier architecture that computes the p-by-p multiplication and the sticky bit in parallel is presented in Fig. 12.7. The dotted lines
suggest a three-stage pipeline implementation using a two-stage p-by-p multiplication. Two extra blocks are shown to indicate the special condition detections.
In the second block, the range of the exponent is checked to detect
overflow and underflow conditions. In this figure the packing and unpacking
processes are omitted for simplicity.
Example 12.9 (complete VHDL code available)
Generate the VHDL model of a generic floating-point multiplier. It is made up of
the blocks depicted in Fig. 12.7 described in a single VHDL file. For clearness and
reusability the code is written using parameters, where K is the size of the floating
point numbers (sign, exponent, significand), E is the size of the exponent and P is
the size of the significand (including the hidden 1). The entity declaration of the
circuit is:
[Figure: basic divider. A subtractor combines the exponents e1 and e2; a p-by-p-bit divider produces the quotient q and the remainder r; the remainder feeds the sticky digit generation; div(p+3 .. 0) goes through normalization (significand multiplied by B and exponent decremented by 1 when div(p+2) = 0) and rounding, with the sign derived from quotient < 0.]
The code is available at the home page of this book. The combinational circuit
registers its inputs and outputs to ease synchronization. A two- or three-stage
pipeline is easily achievable by adding the intermediate registers suggested in
Fig. 12.7. In order to increase the clock frequency, more pipeline registers can be
inserted into the integer multiplier.
[Figure: divider data path. A subtractor computes the exponent e from e1 and e2; a p-by-p-bit divider and the sticky bit generation feed div(p+3 .. 0) into a combined Normalization and Rounding block that produces the significand s and the sign.]
12.5.3 Divider
A basic divider, deduced from Algorithm 12.6, is shown in Fig. 12.9. The
unpacking and packing circuits are similar to those of the adder-subtractor or
multiplier. The normalization and rounding block is a simplified version of
Fig. 12.4, in which the part related to the subtraction is not necessary.
The inputs of the p-bit divider are s1 and s2. The first operand s1 is internally
divided by B (s1/B, i.e. right shifted) so that the dividend is smaller than the
divisor. The precision is chosen equal to p + 3 digits. Thus, the outputs quotient
and remainder satisfy the relation (s1/B)·B^(p+3) = s2·q + r, where r < s2, that is,

s1/s2 = q·B^−(p+2) + (r/s2)·B^−(p+2), where (r/s2)·B^−(p+2) < B^−(p+2).

The sticky digit is equal to 1 if r > 0 and to 0 if r = 0. The final approximation
of the exact result is

quotient = q·B^−(p+2) + sticky_digit·B^−(p+3).
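These relations are easy to check numerically. A small Python model of the divider core (illustrative, with B = 2):

```python
def fp_div_core(s1, s2, p, B=2):
    """Divide two significands as described in the text: the dividend is
    first divided by B so that it is smaller than the divisor, and a
    (p+3)-digit quotient is computed; the sticky digit records r > 0."""
    q, r = divmod(s1 * B ** (p + 2), s2)   # (s1/B)*B^(p+3) = s2*q + r
    assert 0 <= r < s2                     # the remainder bound of the text
    return q, 1 if r > 0 else 0
```

For p = 4 and s1 = s2 the quotient is exactly B^(p+2) with sticky 0; for s1 = 8, s2 = 12 (i.e. 1.000/1.100) the non-zero remainder sets the sticky digit.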
12.5.4 Square Rooter

[Figure: square rooter. The exponent e1 is decremented and halved (its parity, e1 mod 2, selecting the alignment of the significand); a (p+2)-bit square rooter produces sqrt(p+1 .. 0) and a sticky bit, followed by normalization.]
[Tables: implementation results for FP_add, FP_mult, FP_mult_luts, FP_div and FP_sqrt. The columns report delay (ns), flip-flops and LUTs for the combinational versions, and period (ns), latency (cycles) and resources for the pipelined versions.]
12.6 Exercises
1. How many bits are there in the exponent and the significand of a 256-bit
binary floating point number? What are the ranges of the exponent and the
bias?
2. Convert the following decimal numbers to the binary32 and binary64 floating-point formats: (a) 123.45; (b) −1.0; (c) 673.498e10; (d) qNaN; (e) −1.345e129; (f) ∞; (g) 0.1; (h) 5.1e5.
3. Convert the following binary32 numbers to the corresponding decimal numbers:
(a) 08F05352; (b) 7FC00000; (c) AAD2CBC4; (d) FF800000; (e) 484B0173;
(f) E9E55838; (g) E9E55838.
4. Add, subtract, multiply and divide the following binary floating-point numbers
with B = 2 and ulp = 2^−5, so that the numbers are represented in the form s·2^e
where 1 ≤ s ≤ 1.11111 (base 2). For simplicity e is written in decimal (base 10).
(a) 1.10101·2^3 op 1.10101·2^1
(b) 1.00010·2^−1 op 1.00010·2^−1
(c) 1.00010·2^−3 op 1.10110·2^2
(d) 1.10101·2^3 op 1.00000·2^4
5. Add, subtract, multiply and divide the following decimal floating-point
numbers using B = 10 and ulp = 10^−4, so that the numbers are represented in
the form s·10^e where 1 ≤ s ≤ 9.9999 (normalized decimal numbers).
(a)
(b)
(c)
(d)
6. Analyze the consequences and implications of supporting denormalized (subnormal in IEEE 754-2008) numbers in the basic operations.
7. Analyze the hardware implications of dealing with non-normalized significands
(s) instead of normalized ones, as in the binary standard.
8. Generate VHDL models adding a pipeline stage to the binary floating point
adder of Sect. 12.5.1.
9. Add a pipeline stage to the binary floating point multiplier of Sect. 12.5.2.
10. Generate VHDL models adding two pipeline stages to the binary floating point
multiplier of Sect. 12.5.2.
11. Generate VHDL models adding several pipeline stages to the binary floating
point divider of Sect. 12.5.3.
12. Generate VHDL models adding several pipeline stages to the binary floating
point square root of Sect. 12.5.4.
Reference
1. IEEE (2008) IEEE standard for floating-point arithmetic, 29 Aug 2008
Chapter 13
Finite-Field Arithmetic
Finite fields are used in different types of computers and digital communication
systems. Two well-known examples are error-correction codes and cryptography.
The traditional way of implementing the corresponding algorithms is software,
running on general-purpose processors or on digital-signal processors. Nevertheless, in some cases the time constraints cannot be met with instruction-set processors, and specific hardware must be considered.
The operations over the finite ring Zm are described in Sect. 13.1. Two multiplication algorithms are considered: multiply and reduce (Sect. 13.1.2.1) and
interleaved multiplication (Sect. 13.1.2.2). The Montgomery multiplication,
and its application to exponentiation algorithms, are the topics of Sect. 13.1.2.3.
Section 13.2 is dedicated to the division over Zp, where p is a prime. The proposed
method is the binary algorithm, an extension of an algorithm that computes the
greatest common divisor of two naturals. The operations over the polynomial ring
Z2[x]/f(x) are described in Sect. 13.3. Two multiplication algorithms are considered: multiply and reduce (Sect. 13.3.2.1) and interleaved multiplication
(Sect. 13.3.2.2). Squaring is the topic of Sect. 13.3.2.3. Finally, Sect. 13.4 is
dedicated to the division over GF(2n).
As a matter of fact, only some of the most important algorithms have been
considered. According to the authors' experience, they generate efficient FPGA
implementations (Sect. 13.5). Furthermore, non-binary extension fields GF(p^n) are
not considered. A much more complete presentation of finite-field arithmetic can
be found in [1] and [2].
[Figure: mod m adder-subtractor. A first +/- stage combines s1 and s2 according to the operation input; a second +/- stage conditionally adds or subtracts m (or 0), under control of a condition computation (C.C.) block using sign1 and sign2.]
13.1 Operations Modulo m
[Figure: mod m reduction of a 256-bit operand. Partial sums s1 and s2 are formed from x(191 .. 0) and x(255 .. 192); three subtractors compare the result against m, 2m and 3m, and the sign bits sign1 .. sign3 select among z1 (1--), z2 (01-), z3 (001) and the unmodified sum (000).]
Fig. 13.3 Interleaved mod m multiplier. [Figure: a shift register (initially x) delivers the bit x(n−i−1); an adder computes 2·acc + x(n−i−1)·y; two subtractors compare the sum against m and 2m, and the sign bits sign1 and sign2 select among z1, z2 and the sum itself; the selected value is stored back into the accumulator register acc (initially 0).]
If ripple-carry adders are used, the minimum clock period is about k·T_FA, so that
the total computation time is approximately equal to k²·T_FA. In order to reduce the
computation time, the stored-carry encoding principle could be used [2]. For that,
Algorithm 13.3 is modified: the accumulator is represented under the form
acc_s + acc_c; the conditional sum (acc_s + acc_c)·2 + x(n−k−i)·y is computed in
stored-carry form, and every sum is followed by a division by m, also in stored-carry
form (Sect. 9.2.4), without on-the-fly conversion, as only the remainder must be
computed. The corresponding computation time is proportional to k instead of k².
The design of the circuit is left as an exercise.
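A behavioural Python model of the interleaved multiplier of Fig. 13.3 (illustrative; it mirrors the MSB-first accumulation and the conditional subtraction of m or 2m):

```python
def interleaved_mod_mult(x, y, m, k):
    """Compute x*y mod m, scanning the k bits of x from the MSB down.

    Invariant: 0 <= acc < m at the start of every step, so the sum
    2*acc + x_i*y is smaller than 3m and a single subtraction of 2m or m
    restores the range (the role of the two subtractors in Fig. 13.3)."""
    acc = 0
    for i in range(k - 1, -1, -1):
        acc = 2 * acc + ((x >> i) & 1) * y
        if acc >= 2 * m:
            acc -= 2 * m
        elif acc >= m:
            acc -= m
    return acc
```

For example, `interleaved_mod_mult(23, 19, 29, 5)` reproduces 23·19 mod 29.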
The data path corresponding to Algorithm 13.4 (without the final correction) is
shown in Fig. 13.4. It is described by the following VHDL model.
If ripple-carry adders are used, the total computation time is approximately equal
to k²·T_FA. In order to reduce the computation time, the stored-carry encoding
principle could be used [2, 4].
[Figure: data path of Algorithm 13.4. A shift register (initially x) delivers the bits x_i; the terms x_i·y_0 and q·p_0 drive an adder whose output is loaded into a register (initially 0).]
An MSB-first exponentiation algorithm could also be considered. For that, use the
following computation scheme:
x^e = (…((x^(e_{k−1}))^2 · x^(e_{k−2}))^2 …)^2 · x^(e_0).
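In other words, the exponent is scanned from its most significant bit, squaring at every step and multiplying by x when the bit is 1. A minimal Python sketch (an illustrative model, not the book's circuit):

```python
def msb_first_exp(x, e, m, k):
    """MSB-first modular exponentiation: y <- y^2, then y <- y*x when the
    current exponent bit e_i is 1, for i = k-1 down to 0."""
    y = 1
    for i in range(k - 1, -1, -1):
        y = (y * y) % m
        if (e >> i) & 1:
            y = (y * x) % m
    return y
```

The result matches Python's built-in `pow(x, e, m)` for any k covering the exponent width.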
13.2 Division Modulo p
The corresponding circuit is made up of adders, registers and connection resources. A data-flow VHDL description mod_p_division2.vhd is available at the
authors' web page.
An upper bound of the number of steps before a = 1 is 4k, k being the number
of bits of p. So, if ripple-carry adders are used, the computation time is shorter than
4k²·T_FA (a rather pessimistic estimation).
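As a behavioural reference (a Python model of the binary-GCD idea behind the circuit; the exact step ordering of the book's algorithm may differ), division mod p needs only halvings, additions and subtractions:

```python
def mod_p_div(y, x, p):
    """Compute y/x mod p (p an odd prime) with the binary algorithm.

    Invariants (mod p): u*x = a*y and v*x = b*y.  When a reaches 0,
    b = gcd(x, p) = 1, so v*x = y and v is the quotient."""
    def half(u):                       # u/2 mod p without a multiplier
        return u // 2 if u % 2 == 0 else (u + p) // 2

    a, b, u, v = x, p, y % p, 0
    while a != 0:
        while a % 2 == 0:
            a //= 2
            u = half(u)
        while b % 2 == 0:
            b //= 2
            v = half(v)
        if a >= b:
            a, u = a - b, (u - v) % p
        else:
            b, v = b - a, (v - u) % p
    return v
```

For example, `mod_p_div(7, 5, 13)` returns 4, since 4·5 ≡ 7 (mod 13).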
It remains to reduce d(x) modulo f(x). Assume that all coefficients r_{i,j}, such that

x^(m+j) mod f(x) = r_{m−1,j}·x^(m−1) + r_{m−2,j}·x^(m−2) + … + r_{1,j}·x + r_{0,j},

have been previously computed. Then the coefficients of c(x) = a(x)·b(x) mod
f(x) are the following:

c_j = d_j + Σ_{i=0..m−2} r_{j,i}·d_{m+i},  j = 0, 1, …, m−1.  (13.1)
A complete VHDL model classic_multiplier.vhd, including both the polynomial multiplier and the polynomial reducer, is available at the authors' web
page.
Every coefficient d_k is the sum of at most m products a_i·b_{k−i}, and every coefficient c_j is the sum of at most m coefficients d_k. Thus, if tree structures are used,
the computation time is proportional to log m. On the other hand, the cost is
proportional to m². Hence, this type of multiplier is suitable for small values of
m. For large values of m, the cost could be excessive and sequential multipliers
should be considered.
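The multiply-then-reduce scheme can be modelled in a few lines of Python (illustrative; polynomials over GF(2) are packed into integers, and the reduction cancels every term of degree ≥ m with a shifted copy of f(x)):

```python
def gf2m_mult(a, b, f, m):
    """Classic GF(2^m) multiplication: carry-less product d(x) = a(x)b(x),
    then reduction modulo f(x) (f is given with its leading term, e.g.
    0x11B for x^8 + x^4 + x^3 + x + 1)."""
    d = 0
    while b:                       # carry-less (XOR) multiplication
        if b & 1:
            d ^= a
        a <<= 1
        b >>= 1
    for k in range(2 * m - 2, m - 1, -1):
        if d & (1 << k):           # cancel the degree-k term of d(x)
            d ^= f << (k - m)
    return d
```

With the AES pentanomial x^8 + x^4 + x^3 + x + 1, the model reproduces known GF(2^8) products, e.g. x · 0x87 = 0x15.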
The first operation c(x) + b_i·a(x) is executed by the circuit of Fig. 13.5a. In order
to compute a(x)·x mod f(x), use the fact that x^m mod f(x) = f_{m−1}·x^(m−1) +
f_{m−2}·x^(m−2) + … + f_0. Thus,
Fig. 13.5 Interleaved multiplier: computation of c(x) + b_i·a(x) (a) and of a(x)·x mod f(x) (b). [Figure: in (a), the products b_i·a_j are added to the coefficients c_j; in (b), each cell combines a_{j−1} with the product a_{m−1}·f_j.]
a(x)·x mod f(x) = a_{m−1}·(f_{m−1}·x^(m−1) + f_{m−2}·x^(m−2) + … + f_0) + a_{m−2}·x^(m−1) + a_{m−3}·x^(m−2) + … + a_0·x.

The corresponding circuit is shown in Fig. 13.5b.
For fixed f(x), the products a_{m−1}·f_j are implemented by simple connections, as
every f_j is a constant equal to 0 or 1. In the case of squaring,

c_j = a_{j/2} + Σ r_{j,i}·a_{(m+i)/2}  (j even),  c_j = Σ r_{j,i}·a_{(m+i)/2}  (j = 1, 3, 5, …),

where both sums are taken over 0 ≤ i ≤ m−2 with m+i even.
The cost of the circuit depends on the chosen polynomial f(x), which, in turn,
defines the matrix [ri,j]. If f(x) has few non-zero coefficients, as is the case of
trinomials and pentanomials, then the matrix [ri,j] also has few non-zero coefficients, and the corresponding circuit is very fast and cost-effective. Examples of
implementations are given in Sect. 13.5.
13.4 Division over GF(2^m)
[Fig. 13.6 Mod f(x) divider data path: shift registers hold a, b, u and v (initially g(x), f(x), h(x) and 0), updated under the control of the enables ce_au and ce_bv; the cells combine the coefficients a_i, b_i, u_i, v_i and w_i with the coefficients f_i of f(x).]
A drawback of the proposed algorithm is that the degrees of a(x) and b(x) must be
computed at each step. A better option is to use upper bounds α and β of the
degrees of a(x) and b(x).
Algorithm 13.12: Mod f(x) division, binary algorithm, version 2
The data path is the same as before (Fig. 13.6). An upper bound of the number of
steps is 2m. As the operations are performed without carry propagation, the computation time is proportional to m. A data-flow VHDL description mod_f_division2.vhd is available at the authors' web page.
In the preceding binary algorithms, the number of steps is not fixed; it depends
on the input data values. This is an inconvenience when optimization methods,
such as digit-serial processing (Chap. 3), are considered. In the following algorithm [11], α and β are substituted by count = |α − β − 1| and a binary variable state,
equal to 0 if α > β and equal to 1 if α ≤ β.
Algorithm 13.13: Mod f(x) division, binary algorithm, version 3
The data path is still the same as before (Fig. 13.6), and the number of steps is 2m,
independently of the input data values. As the operations are performed without
carry propagation, the computation time is proportional to m. A data-flow VHDL
description mod_f_division3.vhd is available at the authors' web page. Furthermore, several digit-serial implementations, with different digit definitions, are
given in Sect. 13.5.
[Tables: implementation results of the finite-field circuits for operand sizes m = 8, 64, 163 and 233. The columns report flip-flops, LUTs, clock period (ns) and total computation time (ns).]
Several sequential mod f(x) dividers (Sect. 13.4) have been implemented
(Table 13.4).
13.6 Exercises
1.
2.
3.
4.
5.
References
1. Rodríguez-Henríquez F, Saqib N, Díaz-Pérez A, Koç Ç (2006) Cryptographic algorithms on reconfigurable hardware. Springer, Heidelberg
2. Deschamps JP, Imaña JL, Sutter G (2009) Hardware implementation of finite-field arithmetic. McGraw-Hill, New York
3. Montgomery PL (1985) Modular multiplication without trial division. Math Comput 44:519–521
4. Sutter G, Deschamps JP, Imaña JL (2011) Modular multiplication and exponentiation architectures for fast RSA cryptosystem based on digit serial computation. IEEE Trans Industr Electron 58(7):3101–3109
5. Hankerson D, Menezes A, Vanstone S (2004) Guide to elliptic curve cryptography. Springer, Heidelberg
6. Menezes A, van Oorschot PC, Vanstone S (1996) Handbook of applied cryptography. CRC Press, Boca Raton
7. Knuth DE (1981) The art of computer programming, Seminumerical algorithms, vol 2. Addison-Wesley, Reading
8. Deschamps JP, Bioul G, Sutter G (2006) Synthesis of arithmetic circuits. Wiley, New York
9. Meurice de Dormale G, Bulens Ph, Quisquater JJ (2004) Efficient modular division implementation. Lect Notes Comp Sci 3203:231–240
10. Takagi N (1998) A VLSI algorithm for modular division based on the binary GCD algorithm. IEICE Trans Fundam Electron Commun Comp Sci 81-A(5):724–728
11. Kim C, Hong C (2002) High-speed division architecture for GF(2m). Electron Lett 38:835–836
Chapter 14
Systems on Chip
14.3 Embedded Systems
The main advantages of embedded systems are their reduced size, cost and power
consumption when they are compared with general-purpose computers. An embedded
system is usually composed of a low-cost microprocessor, memory and some
peripherals. It may include a customized peripheral or coprocessor to speed up a
specific task or computation.
Cache logic. The on-chip memory provides high bandwidth, but very limited
capacity. By contrast, the external memory is slower, but it provides a higher
capacity due to its lower cost per byte. When the on-chip memory does not
provide enough capacity for the application, it can be used as cache memory.
The cache improves the execution of applications from external memory by
transparently storing the most frequently accessed instructions and data. The
microprocessor architecture must provide cache logic which implements a
policy that tries to maximize the number of cache hits.
Memory Management Unit (MMU). Most sophisticated OSs require virtual
memory support in order to provide memory protection and to extend the logical
capacity of the physical memory. The address space of the memory is divided into
pages, and the MMU translates the page numbers from the logical memory to the
physical memory. It relies on the Translation Lookaside Buffer (TLB), which stores
a page table, in order to provide fast translation. Many embedded microprocessors
do not implement an MMU, since the OS does not support virtual memory.
Floating-Point Unit (FPU). The unit provides arithmetic operations on real
numbers, usually in the IEEE-754 single or double-precision formats.
The FPU may not be justified in many low-cost applications, and it can be
emulated by software.
Parallel multiplier, divider, barrel shifter, or other additional units. Some applications require frequent operations on integer numbers that can be accelerated using
a dedicated hardware unit. For instance, a barrel shifter can perform an arbitrary
n-bit shift in a single clock cycle, which is useful in some applications.
Debug module. Most embedded processors can add logic to support an external
debugger tool. This way the external debugger can control the microprocessor
which is running an application, performing actions like stopping it, inspecting the
registers or setting breakpoints.
There is a very large collection of embedded microprocessors from different
device vendors, third-parties or free sources. Some FPGAs embed hard-core
microprocessors that provide the highest performance, but with a lower flexibility
since they are physically implemented in the device. The FPGA vendor provides
the tool chain (compiler, linker, debugger, etc.) to easily develop the software.
Many FPGA vendors provide their own soft-core microprocessors, in order to
facilitate the development of embedded systems on their devices. They are highly
optimized for the target device, and they can be configured to match the required
specifications. However, the flexibility is quite limited since the microprocessor
cannot be (easily) modified or implemented in a device from other vendors. The
vendor also provides the design flow, which permits the easily development of the
hardware and software. The open-source microprocessors usually occupy a larger
area and cannot run at the same clock frequencies when they are compared with
the previous alternatives. However, they provide the greater flexibility since they
can be modified or implemented in any FPGA that provides enough logic capacity.
The design flow is also more complicated since the designer uses tools from
different parties and they may lack support and warranty.
14.3.2 Peripherals
A general-purpose microprocessor is attached to other hardware components to
add functionalities to the computer system. Peripherals extend the capabilities of
the computer and they are dependent on it. The peripheral provides a bus interface
which transforms signals and the handshake from the core to the system bus, in
order to permit the microprocessor to control/monitor it. Many peripherals can
generate an interrupt request which triggers the execution of a service routine on
the microprocessor when the peripheral detects an event.
There are many types of peripherals. Some examples of typical peripherals are:
General Purpose Input/Output (GPIO), a peripheral which can be used to get or
set the state of I/O pins that are externally connected to other devices. Typical
applications are to drive LEDs or to read switches. However, they can control
more complex devices when the microprocessor is programmed accordingly.
Universal Asynchronous Receiver-Transmitter (UART). The data transported
on a serial communication interface is composed of character frames. The
frames on an asynchronous transmission are characterized by the start, stop and
parity bits, the character length, as well as the bit speed. The UART peripheral
generates and transmits a frame when it receives a new character from the
microprocessor. It also performs the reverse transformation when it receives a
frame from the serial interface. The transmitted and received characters are
temporarily stored in First-In First-Out buffers (FIFOs), so that the
microprocessor need not attend the UART immediately.
Counter/Timer. The peripheral implements a register which counts events
from an external signal or from an internal clock. In the second case the elapsed
time is deduced from the number of counts and the clock period. Another
typical application is to generate a Programmable Interval Timer (PIT) which
periodically generates an interrupt request.
Memory controllers. They assign an address space to the memory and they
control efficiently the read/write accesses from the microprocessor, in order to
maximize the memory bandwidth. They can also match the data width between
the memory and system bus. Depending on the memory they may permit burst
accesses and refresh the dynamic memory.
Interrupt controller. It combines interrupt requests from several sources onto the
microprocessor's interrupt input. It can mask interrupts to enable/disable
them or assign priorities.
Direct Memory Access (DMA). Some peripherals, such as video and disk
controllers, require very frequent accesses to memory. The microprocessor can
move data from memory to the peripheral (or vice versa) through the system
bus, although, it is quite inefficient. A DMA is connected as a master of the
system bus in order to permit to copy data between memory and peripherals
while the microprocessor performs any other action. The microprocessor task
can be reduced to program the DMA with the source and destination addresses,
the data length and the burst size. These systems require a bus arbiter, since there
are several bus masters that can concurrently initiate a new access. The DMA
can be a centralized peripheral or be a part of a peripheral.
Video controller. A peripheral which generates the video signals required to
display an image. They require a DMA in order to read pixels from the video
memory (frame buffer) to generate the images at the required frequency.
Peripherals implement a set of internal registers in order to control or monitor
them. Each register provides different functionalities that depend on the peripheral
type and designer. A basic classification of peripheral registers is:
14.3.3 Coprocessors
Some algorithms may require a set of specific computations that cannot be resolved
by a general-purpose microprocessor in a reasonable execution time. A coprocessor
is designed and optimized to quickly perform a specific set of computations in order
to accelerate the system performance. Coprocessors do not fetch instructions or data
from memory. The general-purpose microprocessor accesses memory to read
data and enables the coprocessor to accelerate a computation. Coprocessors allow
the customization of computer systems, since they can improve computational
performance without upgrading the general-purpose microprocessor.
There is a wide range of coprocessors and some of the most typical are:
FPU coprocessors to accelerate the computations on real numbers. Many
embedded microprocessors do not integrate a FPU in order to reduce cost.
Others may integrate a single-precision FPU which may be not adequate for
some applications. The single and double-precision IEEE-754 may be emulated
through software libraries, but the system performance may be greatly affected.
Cryptoprocessors are designed to accelerate the most time-consuming computations related to cryptography, such as the encryption and decryption tasks. Their
architecture usually implements several parallel processing units to improve the
speed-up factor.
DSPs perform digital signal processing on a data stream (voice, image,
etc.). The internal architecture may provide several pipeline stages to improve
the data throughput. The input and output streams are carried from/to the
microprocessor, which accesses memory or other components.
The coprocessors can attach to the microprocessor in several ways:
Data-path extension by custom instructions. For instance, the ALU of the soft-core
NIOS II [3] is prepared to be connected to custom logic which accepts
user machine instructions similar to the native ones. Another example
is the APU (Auxiliary Processor Unit) of the PowerPC-405/440 [10, 11], which
permits the execution of new instructions on a tightly-coupled coprocessor.
Dedicated point-to-point attachment, such as the FSL (Fast Simplex Link) for
MicroBlaze [13]. The machine instruction set provides some instructions able
to transmit/receive data to/from the links attached to the coprocessor.
System bus. A coprocessor can be also attached to the system bus, as any
peripheral. This alternative is not usually implemented since it increases the
latency to communicate data, due to the more sophisticated bus handshake.
14.3.4 Memory
A memory circuit is a digital storage device which is commonly used to allocate
the program executed by a microprocessor. The memory on embedded systems
can be classified in several ways. A first logical division is:
Instruction memory. Microprocessors execute software applications when
reading machine instructions from the instruction memory.
Data memory. Some of the machine instructions command the microprocessor
to read or write data from/to the data memory.
The instruction and data memories may be physically isolated or implemented
on the same devices. Moreover, they can be composed of several banks and types
of memories. According to the physical placement, memory can be classified as
the following:
Internal memory. It is integrated in the same device as the embedded system.
The memory is usually SRAM technology which features fast access time but
small capacity (some KB). The internal memory can store a small executable
(like a boot loader) or serve as cache memory. Many FPGAs integrate internal
blocks of memory. For instance, Xilinx FPGAs implement Block RAMs
(BRAMs), dual-port synchronous SRAM memories that permit simultaneous
access to instructions and data in each clock cycle.
External memory. It is not integrated into the device which implements the
embedded system. The external memory can be semiconductor chips or storage
devices.
The semiconductor technologies that are usually applied as external memory
are DRAM and FLASH. They provide larger capacity (MB or GB) but slower
access times when compared with SRAM memory. The DRAM can store
instructions and data; however, it is a volatile memory. By contrast, the
FLASH is non-volatile, but it cannot be used as data memory, since it is
read-only memory. Many embedded systems make use of both technologies,
first copying the program from FLASH to DRAM during boot-up, and
then executing the application from DRAM.
Storage devices, such as hard drives or FLASH cards, are more common in
general-purpose computers than in embedded systems. They are non-volatile
memories that feature very high capacity, but very slow access time.
Moreover, data is sequentially accessed in large blocks (some KB) making
it inefficient when randomly accessing data. The main purpose is to serve as
storage repository of programs and files, although some computer systems
can implement virtual memory on them if the microprocessor and OS
support it.
Another classification is based on the architecture of the bit cells. The most
common memories on embedded systems are as follows:
Static Random Access Memory (SRAM). The bit cell is larger than in the other
alternatives; therefore, the memory density is lower and the cost per capacity is
higher. The main advantages are the lower access time and the lower power
consumption in standby mode. Many embedded systems and FPGAs integrate
some SRAM blocks, since they are compatible with the conventional CMOS
technologies used to fabricate the integrated circuit.
Dynamic RAM (DRAM). Like SRAM, it is a read/write, volatile memory. Its
main advantage is the larger memory density, since the bit cell is very simple.
However, the access time is usually slower and the bit cells must be periodically
refreshed, which increases the power consumption. Moreover, DRAM cells cannot
be integrated efficiently in conventional CMOS technologies; therefore they are
usually connected as external memory.
FLASH. The bit cell is based on a floating-gate transistor which can be
electrically erased and programmed, providing non-volatile memory. There are
two important variants depending on the interconnections between cells: NAND
and NOR. The NAND FLASH memories are employed as storage devices, since
they provide higher capacity and bandwidth; the disadvantage is that data is read
sequentially in pages. By contrast, the NOR FLASH memories are usually
employed as Read-Only Memories (ROMs), since they provide random access to
data and an interface similar to that of SRAM chips.
There are more memory technologies, but most of them are not currently
implemented or widely used in commercial applications. Ferroelectric RAM
(FRAM) is an emerging technology in embedded systems. It provides
random-access, non-volatile memory which features much better access time,
power consumption, and write endurance than FLASH. However, the memory
density is much lower. For instance, the MSP430FR57xx family [8] of 16-bit
microcontrollers can embed up to 16 KB of FRAM.
Another classification is based on the synchronicity of accesses to data.
Asynchronous memory. Data is accessed by the address bus, independently of
any clock signal. Many external SRAM memories are asynchronous since they
provide a very simple interface and low latency when data is randomly accessed.
An interesting alternative to external SRAM is the Pseudo Static RAM
(PSRAM): DRAM surrounded with internal circuitry that permits controlling it as
an SRAM chip. It offers external memory characterized by a high capacity and a
simple interface, since its internal circuitry hides the specific DRAM
operations.
Synchronous memory. Data is synchronously accessed according to a clock
signal. These memories improve the data bandwidth since they implement a
pipeline scheme, in order to increase the clock frequency and data throughput.
However, the latency is increased due to the stage registers. Most present
Synchronous DRAMs (SDRAMs), such as DDR2 and DDR3, use Double
Data Rate (DDR) registers to provide very efficient burst accesses for replacing
cache pages. Another example of synchronous external memory is
the Zero-Bus Turn-around (ZBT) SRAM which improves the bandwidth of burst
accesses and eliminates the idle cycles between read and write operations.
Usually, the embedded memory blocks on FPGAs are dual-port synchronous
SRAMs that can efficiently implement the data and instruction caches of a
Harvard microprocessor.
Embedded systems usually provide a memory controller to assign an address
space to the memory and to adapt the handshake signals and the data width to the
system bus. The complexity of the external memory controller depends on the type
of memory. SRAM controllers are quite simple, since data can be accessed
randomly. DRAM controllers are more complex, since the address bus is
multiplexed and they control the periodic refresh operation. Moreover, data is
usually accessed synchronously on a burst basis; therefore the controller also
generates the clock signal and the logic to communicate data between clock
domains.
14.3.5 Busses
The interconnection between the microprocessor, peripherals and memory
controllers in an embedded system is done through busses. The system bus permits
the transfer of data between the building components. It is physically composed of
a set of connecting lines that are shared by the components. The components must
follow a common communication protocol which is implemented by the bus
handshake.
The bus master starts a new access cycle in order to read/write data from/to a
slave. An embedded system can provide several masters, but only one of them can
take control of the bus during a given time slot. The bus arbiter assigns the control
to one of the masters that are requesting it concurrently. For instance, a typical
embedded system is composed of a Harvard microprocessor which connects to an
external memory controller through the system bus. The external memory stores
the instructions and data of an application program. While executing the program
from the external memory, the bus arbiter assigns the bus control to the
microprocessor's instruction side, or to its data side, according to the arbitration
policy.
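The arbitration idea can be sketched in a few lines. This is an illustrative model, not any specific bus standard: a fixed-priority policy that grants the bus to the lowest-numbered requesting master.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of bus arbitration: several masters may request the bus, but only
// one is granted it per time slot. Bit i of `requests` is master i; the
// lowest-numbered requesting master has the highest priority here.
int grant(std::uint8_t requests) {
    for (int i = 0; i < 8; ++i)
        if ((requests >> i) & 1u)
            return i;   // grant the highest-priority pending request
    return -1;          // no master is requesting the bus
}
```

A real arbiter would typically add round-robin rotation so that a low-priority master (e.g. the data side of the microprocessor) cannot be starved.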
There is a wide variety of busses, depending on the embedded processor.
Moreover, many embedded systems provide several busses, depending on the
memory or peripherals that are connected. A typical bus can be divided into the
following:
The address bus is a unidirectional bus, from the masters to the slaves, that
carries the address of the memory (or a register in a memory-mapped peripheral)
that is going to be accessed.
The data bus is a bidirectional bus that carries the data which is read from
slaves, or written to them. A bidirectional bus requires tri-state buffers that are
not, usually, available in the internal FPGA fabric. In FPGAs, the data bus is
replaced by two unidirectional busses: the write data bus (from masters to
slaves) and the read data bus (from slaves to masters).
The control bus is the set of signals which implements the bus handshake, such
as the read/write signal, or signals that indicate the start of an access and
acknowledge its completion. Some advanced busses provide burst accesses, in
order to improve the throughput of sequential data transfers, and error signals,
which permit the microprocessor to detect error conditions on the bus.
Chapter 15
permits the building of a simulation model in order to check the hardware design
through a Hardware Description Language (HDL) simulator.
The EDK is frequently updated and upgraded. This chapter focuses on the
ISE Design Suite 13.1 for Windows, probably the most popular operating system
for PCs, but there are no significant differences with the versions for other
operating systems. Although the case studies included in this chapter can be
implemented on other upgraded EDK versions, they might require some small
changes.
The EDK is composed of a set of tools:
Xilinx Platform Studio (XPS). A graphical user interface that permits designing
the hardware of the embedded system from a set of interconnected IP cores. It is
the top-level tool which takes care of the files and steps needed to successfully
complete the hardware design. The XPS implements the design flow which runs
other low-level tools in order to compute the hardware synthesis and
implementation (Platgen), the generation of the bitstream (BitGen) and the
simulation model (Simgen).
Software Development Kit (SDK). An integrated environment based on the
Eclipse/CDT to manage the development of the software. It launches the C/C++
cross-compiler and linker to build the binary which is executed by the embedded
microprocessor. Moreover, it also provides a simple interface with the debugger
and profiler tools used by more advanced users. SDK also builds the BSP (Board
Support Package) through a low-level tool (Libgen). The BSP contains the set of
software drivers used to control the hardware from the executable.
IP cores. The library of configurable cores (microprocessors, peripherals, busses,
etc.) that are used as basic building blocks of the embedded system. Most of
these cores are licensed with the EDK, but there are also some cores that must
be licensed separately. Many cores include a set of programming functions and
drivers that can be used to facilitate the software development.
GNU tools chain. The set of tools that generate the software libraries and the
executable binaries. It includes the GNU C/C++ cross-compiler and linker. The
GNU tools are developed for the Linux operating system, but EDK includes
ported versions to the Windows OS.
Additionally, the EDK relies on other tools:
ISE tools. They synthesize and implement the hardware, generate the bitstream
and program the device. They also include other tools to generate the simulation
model, implement the internal memory, perform the timing analysis, and others.
Platform Studio calls the required ISE tools, simplifying the design flow since the
user is abstracted from many specific details.
Supported HDL simulator. It is recommended if the user designs a new IP, since
it permits simulating the hardware of the embedded system. Some IP cores
deployed with the EDK are protected and encrypted; therefore, they can be
simulated only on supported simulators. The ISE tools provide ISim, but the user
can choose a third-party simulator.
[Fig. 15.1: EDK design flow. The Microprocessor Hardware Specification (MHS
file), the user's hardware (VHDL files) and the EDK IP cores feed Platgen, XST,
Simgen (with the simulation libraries, towards ISim) and BitGen; the
Microprocessor Software Specification (MSS file) and the user's drivers (C files)
feed Libgen; the GNU tool chain (compiler/linker) builds the user's software
(C/C++), and Data2MEM merges the result into the bitstream downloaded to the
FPGA development board, where the debugger operates.]
A development board with a Xilinx FPGA. There is a wide range of boards with
different FPGAs (Spartan or Virtex series), memories, displays, communication
interfaces and other elements.
A Xilinx programmer, such as the Parallel Cable III/IV or the Platform Cable
USB I/II. Some development boards provide an embedded USB programmer. The
programmers can be used to program the FPGA and to debug the executable
through the Xilinx Microprocessor Debugger (XMD) low-level tool.
Two files play an important role in the design flow (see Fig. 15.1): the
Microprocessor Hardware Specification (MHS) and the Microprocessor Software
Specification (MSS). The XPS manages the hardware design flow using a Xilinx
Microprocessor Project (XMP) file. The XPS project relies on the MHS file,
which configures a set of instances of IP cores that are interconnected as building
blocks. The XPS can export the hardware design to SDK in order to generate the
Board Support Package (BSP) for the embedded system. The BSP generation
relies on the MSS file which configures the drivers and libraries that can be used by
the executable to control the hardware.
[Fig. 15.2: 7-segment display digit, with segments a to g and the four anode
inputs anode-3 to anode-0.]
[Fig. 15.3: Hardware of the case study. The MicroBlaze connects to the BRAM
through the ILMB/DLMB controllers and, through the PLB, to the MDM, the
interrupt controller, the timer, two GPIOs (switches, led7seg) and the rs232
UART. External connections: 50 MHz oscillator (system clock), reset button,
JTAG programmer, switches, 4-digit 7-segment LED display, and serial
communication.]
allocated into the microprocessor's memory. In order to program the FPGA, the
design flow executes the Data2MEM tool to generate a new bitstream file which
configures the FPGA, including the BRAM contents that store the executable
binary.
15.1.2 Hardware
The system is composed of the MicroBlaze, the internal BRAM and a set of
peripherals (see Fig. 15.3). The BRAM implements the microprocessor's local
memory and is connected through two Local Memory Busses (LMBs). The
peripherals are connected through a Processor Local Bus (PLB). The MicroBlaze
controls the display and the two switches through General Purpose Input Output
(GPIO) peripherals. The UART peripheral permits serial communication with an
external PC through the RS-232 interface. The timer and the UART request
interrupts from the microprocessor; therefore, the system includes an interrupt
controller. Finally, the Machine Debug Module (MDM) permits debugging the
application executed by the MicroBlaze through the FPGA programmer and the
XMD tool.
15.1.2.1 Specification
The first step is to specify the hardware of the embedded system in the MHS file.
The easiest way to start is by using the Base System Builder (BSB) wizard.
It creates the project file and a valid MHS file [14] which describes the system
composed of the microprocessor attached to local memory and peripherals. Open
the Xilinx Platform Studio (XPS) and follow the steps:
(1) XPS opens a dialog window. Choose the BSB wizard
(2) The next dialog window configures the project file and directory. Change the
path to C:\edk13.1\led7seg and the project name to system.xmp
(3) Next, a new dialog configures the system bus. The AXI is currently supported
only on the newer FPGA families (Spartan-6, Virtex-6). Therefore, select the
PLB [4] since it is supported by all the FPGA families.
(4) Select the option Create a new design in the dialog Welcome.
(5) The dialog can configure a pre-built system for a supported FPGA board.
Choose Create a system for a custom board to set up the system from scratch.
Then select the Spartan-3 xc3s200-ft256-4 device and any polarity for the reset
button. These parameters can be easily modified later.
(6) Choose a single-processor system, since it simplifies the design.
(7) The next dialog configures the frequencies of the reference and system clocks,
as well as the capacity of the microprocessor's local memory. Leave the
default parameters, since they are changed later.
(8) The BSB permits connecting a set of peripherals to the system. Just
continue until the end of the wizard, since they are added later. The BSB
creates the hardware specification of a basic embedded system.
The XPS permits displaying and modifying the system in a graphical way using
the tab System Assembly View, as shown in Fig. 15.4. The view Bus Interfaces
shows the system composed of instances of IP components that are interconnected
through busses. The microprocessor (microblaze_0) is connected to the BRAM
(lmb_bram) through the data and instruction LMBs (dlmb, ilmb) [17] and their
memory controllers (dlmb_cntrl, ilmb_cntrl). The peripherals attach to the system
through the PLB as slaves. The instruction and data PLB sides of the
microprocessor are the bus masters. The MDM (mdm_0) peripheral attaches to the
microprocessor through the PLB (mb_plb). The MicroBlaze is the master of the
busses, while the peripherals and memory controllers are the slaves. Finally, the
last two instances are the generators of the internal clock (clock_generator_0)
and reset (proc_sys_reset_0) that are required by the system.
The IPs provide a set of parameters that can be fixed, auto-computed or
configurable. Select an instance and open the contextual menu (click the right
button of the mouse) to configure the IP graphically (see Fig. 15.5). The HDL
name of a parameter is the same as it appears in the MHS file. The configurable
parameters C_BASEADDR and C_HIGHADDR of the LMB controllers set up the
address space of the microprocessor's local memory. Changing the C_HIGHADDR
of both LMB controllers to 0x3FFF increases the auto-computed parameter
C_MEMSIZE of the BRAM to 16 KB (0x4000). Every IP deployed with the EDK
provides a PDF file which details the parameters, the input/output ports and the
internal architecture.
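The address-range arithmetic behind C_MEMSIZE is simple enough to check mechanically. The helper below is an illustration, not an EDK function:

```cpp
#include <cassert>
#include <cstdint>

// The memory size follows from the base and high addresses of the LMB
// controllers: size = C_HIGHADDR - C_BASEADDR + 1.
constexpr std::uint32_t memsize(std::uint32_t base, std::uint32_t high) {
    return high - base + 1;
}

// 0x0000..0x3FFF gives C_MEMSIZE = 0x4000 bytes = 16 KB of local memory.
static_assert(memsize(0x0000, 0x3FFF) == 0x4000, "16 KB local memory");
```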
The MHS file is a text file which can be manually edited, as shown in Fig. 15.6.
It is synchronized with the System Assembly View. Therefore, each one updates
when the other one is modified. The MHS specifies the system's hardware as a set
of interconnected instances and external FPGA ports. Each instance contains
configurable parameters, interfaces to busses and other ports. The parameters that
are not declared in the MHS take a default or auto-computed value. The bus
interfaces or ports that are not declared in the MHS are left disconnected.
At the beginning of the MHS file there are the declarations of the external
FPGA ports used to input the clock and the reset. The external ports are connected
to the internal signals CLK_S and sys_rst_s that are used by the clock and reset
generators. The parameter CLK_FREQ declares the frequency of the external
oscillator and RST_POLARITY declares the logic level at which the reset input
asserts. Both parameters must be modified according to the FPGA board.
(1) Drag the XPS General Purpose IO from the IP Catalog to the System
Assembly View.
(2) XPS opens a dialog to configure it. Set the data width of the first channel
(parameter C_GPIO_WIDTH) to 11 in order to control the display.
(3) Click the created instance and change the name to led7seg.
(4) Click the PLB interface to connect the peripheral's SPLB (Slave PLB).
Go to the view Addresses and automatically configure the addresses of the
peripheral (see Fig. 15.8). The internal registers of the GPIO are accessible from
the microprocessor within this address range. By default, XPS assigns a 64 KB
(0x10000) address range starting at an address above 0x80000000.
Fig. 15.8 Automatic configuration of the addresses for the GPIO peripheral
The user can change the range, but it must not overlap with the other
memory-mapped peripherals.
Finally, the GPIO's 11-bit output port must be connected to the external
FPGA ports that drive the display. Figure 15.9 shows the view Ports, which
permits the external connection of the output port GPIO_IO_O. Change the default
name to fpga_0_led7seg_pin, in a similar fashion to the rest of the external FPGA
ports, and check that the direction is configured as output. Finally, set the range
order to [10:0] to declare the bits in descending order. The MSB and the LSB are
indexed as 10 and 0, respectively.
Open the MHS file to observe the previous changes. There is a new entry in the
section of the external FPGA ports. It also contains the new GPIO instance
including its configuration parameters and connections.
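The resulting MHS entry looks roughly like the following sketch. The base address and version parameter are illustrative (XPS assigns the actual values); the instance name, width, bus interface and port name match the steps above.

```
BEGIN xps_gpio
 PARAMETER INSTANCE = led7seg
 PARAMETER HW_VER = 1.00.a
 PARAMETER C_GPIO_WIDTH = 11
 PARAMETER C_BASEADDR = 0x81400000
 PARAMETER C_HIGHADDR = 0x8140ffff
 BUS_INTERFACE SPLB = mb_plb
 PORT GPIO_IO_O = fpga_0_led7seg_pin
END
```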
The next step adds a new GPIO to read the state of the two switches. The
hardware can also be modified by editing the MHS file. Copy the previous GPIO
instance and change the instance name to switches and the data width to 2. Set
the address range to 64 KB, without overlapping the other peripherals.
Connect the input port GPIO_IO_I to a signal which is connected to an external
FPGA port named fpga_0_switches_pin. Save the MHS file to update the
graphical view. The project will close if there is an error in the MHS file, which
must then be manually corrected.
The next step adds to the PLB the two peripherals that will request interrupts.
The timer [6] will be programmed to request a periodic interrupt. The parameter
C_ONE_TIMER_ONLY configures a single timer to minimize the size of the
peripheral.
The UART [7] will request an interrupt when it receives a new character from
the RS232. It will also transmit messages to the user through the serial
communication. Therefore, the UART's transmission (TX) and reception (RX)
ports are connected to external FPGA ports. The parameters C_BAUDRATE and
C_USE_PARITY configure the speed and parity of the communication.
Some displays provide an extra input to turn on a dot placed beside the digit.
This case study just turns off the dot by connecting net_gnd or net_vcc to its
associated external port.
The UART provides a receiving First In First Out (FIFO) memory which
temporarily stores the characters. Therefore, a newly received character can be
processed while the display is not being refreshed.
15.1.2.2 Synthesis
The XPS menu Hardware → Generate Netlist synthesizes the design to generate
a set of NGC netlist files. It calls the Platgen tool [13], which starts by performing
a Design Rule Check (DRC). Then it calls the Xilinx Synthesis Technology (XST)
[3] tool to synthesize the IP instances and get their NGC files. The embedded
system is finally synthesized and optimized to get the top-level netlist file.
A change in the MHS file will force Platgen to synthesize only the required
modules, to speed up the execution. If desired, the XPS can clean up the generated
files to start Platgen from scratch. Synthesis is dependent on the FPGA; therefore,
the user must select the correct device in the Project Options (see Fig. 15.10)
before proceeding.
Figure 15.11 shows the tab Design Summary, which displays the report files
generated by Platgen and XST, in order to check details about the design, such
as the occupied FPGA resources or the estimated maximum frequency of the
clock.
15.1.2.3 Implementation
The implementation computes the FPGA layout, which is stored in a Native
Circuit Description (NCD) file. The design flow executes three main tools:
NGDBUILD, MAP and PAR [15]. NGDBUILD translates the NGC files and
annotates constraints from a User Constraints File (UCF). The following tools
compute the layout based on the annotated constraints. The design flow continues
with the MAP and PAR tools to map the netlist into the FPGA resources and to
compute their placement and routing.
The BSB wizard generates the UCF for the selected prototyping board. The
XPS project refers to the UCF, which must be edited to specify the attachment of
the display and switches to the FPGA board. The UCF also specifies the clock
frequency of the external oscillator.
Fig. 15.11 Report files from the Platgen and synthesis
The XPS menu Hardware → Generate Bitstream launches the BitGen tool [15],
which generates the bitstream file from the FPGA layout. First, the XPS executes
the design flow to implement the FPGA layout, if necessary. Then it generates the
BIT (bitstream) file system.bit and the BlockRAM Memory Map (BMM) file
system_bd.bmm. The microprocessor's local memory is implemented on BRAMs,
but the generated BIT file does not initialize them, since the executable binary is
not available at this stage. The file system_bd.bmm annotates the physical
placement of the BRAMs that implement the microprocessor's local memory. This
file will be required later to update the BRAM contents of the bitstream. The tab
Design Summary shows the reports generated by the implementation tools.
15.1.2.4 Software
The XPS menu Project → Export Hardware opens a dialog window to export the
required files to SDK, as shown in Fig. 15.12. Select the option to export the BIT
and BMM files to permit SDK to program the FPGA. It creates a new directory
which is allocated in the XPS project folder.
SDK starts by opening a dialog to set the workspace folder. Write the path
c:\edk13.1\led7seg\SDK\workspace to create it inside the SDK folder which was
generated by XPS during the hardware export. The software development involves
two stages:
(1) The BSP generation. Creates a set of headers and libraries to control the
hardware from the microprocessor.
(2) The executable ELF file. It builds the file executed by the embedded
microprocessor.
Other OSes provide more advanced features, but they require external memory.
The BSP project standalone_bsp_0 is located by default in a folder contained in
the SDK workspace.
The wizard generates the BSP project, which is linked to an MSS file. The MSS
[14] is a text file which lists the drivers used by the peripherals and the OS for the
microprocessor. The Libgen [13] tool reads the MSS file to generate the BSP. As
with the MHS file, the MSS can be edited graphically or manually. Figure 15.14
shows the graphical view which configures the MSS. Change the standard input
(stdin) and output (stdout) to the instance rs232 in order to permit the console
functions to use the UART peripheral.
The MSS file can also be manually edited, and it reflects the configuration
changes done in the previous dialog.
The rest of the MSS file shows the drivers and peripherals. A peripheral driver
is a collection of declarations and functions that can be used to control it from the
executable. By default the BSP wizard sets a specific driver to every peripheral,
but the user can change it to set a generic driver or no driver. The generic driver
can control any peripheral, but the user must have a deeper knowledge about its
internal architecture. The system's hardware provides two GPIO peripherals: the
switches and the led7seg instances. Select the generic driver for them in order to
better understand the role of the internal registers of the peripherals.
The SDK automatically calls the Libgen tool [13] when the MSS is changed, to
build the BSP. The user may disable the Build Automatically behaviour in order to
clean or build the BSP using the commands under the menu Project. The Libgen
tool compiles the source files of the peripheral drivers and the OS, and it stores
them into the A (archive) library files. It also generates the H (header) files that
declare the functions contained in the libraries. The library and header files are
stored in the folders lib and include of the instance microblaze_0. The SDK can
display the contents of both folders and open the header files.
An important header file is xparameters.h, which declares a set of parameters
about the hardware. Every peripheral has its own parameters, obtained from the
exported hardware, such as the address ranges of the GPIOs. The declarations can
be used by the C/C++ source files to control the peripherals.
15.1.2.6 Executable
The SDK builds the ELF executable from the source C++ files, which are
compiled and linked with the functions stored in the BSP libraries. Click the menu
File → New → Xilinx New C++ Project, which opens a wizard to create a C++
project for the BSP. Change the default project name to app1 and select the
previously generated BSP standalone_bsp_0, as depicted in Fig. 15.15. The wizard
creates a nested folder app1/src in the SDK workspace to store the source files that
will be compiled.
Using the Windows Explorer, delete the template file main.cc which was
created by the wizard, and copy the new source files: led7seg.cc, led7seg.h and
application.cc. Go to SDK and click the menu Refresh of the contextual menu
(right button of the mouse) of the project app1, in order to update the list of
source files. The SDK can open and display the source files in its integrated
editor (see Fig. 15.16).
The source files led7seg.h and led7seg.cc declare and implement a C++ class
named CLed7Seg which controls the display through the GPIO. The EDK
peripherals implement a set of 32-bit registers that are used to control them.
The peripherals' registers are memory-mapped; therefore, the MicroBlaze can
access them by executing read/write instructions on the content of a C/C++
pointer. The application can directly control the peripheral, although it requires
a deeper knowledge of the internal architecture. The first register of a GPIO [5]
is GPIO_DATA, which is mapped at the base address of the peripheral. The
register retrieves or sets the state of the input/output ports, depending on whether
the microprocessor reads or writes it.
The class constructor assigns the input argument to the integer (32-bit) pointer
GPIO_Data. Any pointer used to access a peripheral's register should be declared
volatile. If not, the compiler may optimize a set of sequential memory accesses
through the pointer, changing their order or deleting some of them. The GPIO
method concatenates the anodes and segments to write them into the GPIO_DATA
register through its pointer. The header file of the class declares the two
parameters that configure the active state of the anodes and segments of the
display; therefore, they can be easily changed to adapt it to another prototyping
board.
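The memory-mapped write through a volatile pointer can be sketched as follows. This is a host-side illustration, not the book's CLed7Seg code: the function name and the 4-bit-anodes/7-bit-segments layout are assumptions, and a plain variable stands in for the hardware register so the sketch runs anywhere. On the board, the pointer would hold the GPIO base address taken from xparameters.h.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the memory-mapped GPIO_DATA register (illustration only).
static std::uint32_t fake_register = 0;

// Concatenate the 4 anode bits and the 7 segment bits into the 11-bit
// value written to GPIO_DATA, in the spirit of the class's GPIO method.
// The volatile qualifier forbids the compiler from reordering or removing
// the accesses through the pointer, which is essential when the "memory"
// is really a hardware register.
void gpio_write(volatile std::uint32_t *gpio_data,
                std::uint8_t anodes, std::uint8_t segments) {
    *gpio_data = (static_cast<std::uint32_t>(anodes & 0xF) << 7)
               | (segments & 0x7F);
}
```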
The class declares two member variables: Data and Config. Data is a 16-bit
variable which stores the number to be displayed. Config is an 8-bit variable
which uses its two LSBs to turn on/off the display and to show/hide the left-side
zeros of the number. The method Refresh is periodically executed, since the
timer's ISR calls it. It reads the member variables and calls the Digit method to
display one of the digits, starting at the left side. The Digit method first computes
the segments and anodes of a digit and then calls the GPIO method to display it.
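The Refresh/Digit multiplexing can be sketched as below. All names, the anode polarity and the segment encoding are illustrative assumptions, not the book's exact class: each call drives one digit, cycling from the leftmost one.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of time-multiplexing a 4-digit, 7-segment display.
class Led7SegSketch {
public:
    void set_data(std::uint16_t value) { data_ = value; }

    // Returns the anodes|segments word for the next digit: bits [10:7]
    // select one anode, bits [6:0] hold the gfedcba segment pattern.
    std::uint16_t refresh() {
        int pos = 3 - digit_;                      // leftmost digit first
        std::uint8_t nibble = (data_ >> (4 * pos)) & 0xF;
        std::uint16_t word =
            static_cast<std::uint16_t>(((1u << pos) << 7) | kSegments[nibble]);
        digit_ = (digit_ + 1) % 4;                 // next digit on next call
        return word;
    }

private:
    // Common 7-segment hexadecimal font, gfedcba order, active high.
    static constexpr std::uint8_t kSegments[16] = {
        0x3F, 0x06, 0x5B, 0x4F, 0x66, 0x6D, 0x7D, 0x07,
        0x7F, 0x6F, 0x77, 0x7C, 0x39, 0x5E, 0x79, 0x71};
    std::uint16_t data_ = 0;
    int digit_ = 0;                                // 0 selects the leftmost digit
};
```

Calling refresh() from a periodic ISR, as the text describes, makes all four digits appear lit simultaneously.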
The application C++ file is composed of four sections. The first section includes
the required header files. The application controls the GPIOs directly, but the rest
of the peripherals are controlled through their drivers. Therefore, it includes the
header files that declare the functions stored in the BSP libraries. The file
xparameters.h declares the base addresses that are necessary to use the driver
functions.
The second section initializes the object Display of the class CLed7Seg. The
object's constructor gets the base address of the GPIO which drives the display.
The section also declares the global variables data and endloop that are managed
by the ISR of the UART.
The third section is composed of the two ISRs. The timer's ISR periodically
reads the state of the two external switches and refreshes the display. First, it reads
the register GPIO_DATA of the peripheral which attaches to the two external
switches. Then, it compares the state of the switches against that of the previous
call. A change in one of the switches will swap one of the two bits that configure
the display, using the bitwise XOR operator. Finally, it refreshes the display. The
other ISR is executed when the UART receives a new character, which is read
using its driver function [9]. Depending on the received character, it changes the
data or the configuration of the display, or it quits the application.
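The toggle-on-change logic of the timer's ISR can be sketched in a few lines. The function and variable names are illustrative, not the book's exact ISR code:

```cpp
#include <cassert>
#include <cstdint>

// A change on switch i since the previous call toggles configuration
// bit i, using the bitwise XOR operator as described in the text.
std::uint8_t update_config(std::uint8_t config, std::uint8_t switches,
                           std::uint8_t *prev_switches) {
    std::uint8_t changed = switches ^ *prev_switches;  // edges since last call
    *prev_switches = switches;
    return config ^ changed;                           // toggle those bits
}
```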
The last section is the main function of the application. It configures and
enables the interrupt sources, and then it executes a loop until the application
quits. The loop can execute any computation without affecting the control of the
display.
The timer peripheral [6] implements two registers to periodically generate an
interrupt request: Timer Control/Status Register 0 (TCSR0) and Timer Load
Register 0 (TLR0). The configuration of both registers asserts the interrupt signal
every 5 ms (250,000 counts with a 50 MHz clock). The constants and functions of
the timer driver are declared in the header file tmrctr_l.h [10], which was
generated by the BSP. They are low-level API functions, which assume that the
programmer knows the functionality of the registers. These functions compute the
address of the registers and write data into them through a pointer.
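The period arithmetic above is worth making explicit. The helper below is an illustration (not the driver API); depending on the timer mode, the hardware may expect a small offset on the loaded value, so treat it as the count computation only:

```cpp
#include <cassert>
#include <cstdint>

// Counts to load for a given period: clock_hz * period_us / 1,000,000.
// 5 ms at 50 MHz gives the 250,000 counts mentioned in the text.
constexpr std::uint32_t tlr0_counts(std::uint32_t clock_hz,
                                    std::uint32_t period_us) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(clock_hz) * period_us) / 1000000u);
}

static_assert(tlr0_counts(50000000u, 5000u) == 250000u,
              "5 ms at 50 MHz is 250,000 counts");
```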
The driver of the interrupt controller provides functions [11] to register ISRs and
to enable the interrupt sources. Finally, the application enables the MicroBlaze's
interrupt input through an OS function [18].
By default, the wizard of the C++ project generates two targets: Debug and
Release. They differ in the flags passed to the GNU compiler [13]. The target
Debug compiles the source files without optimizations and with debug symbols
enabled. The target Release compiles the source files with optimizations to build
smaller and faster code, which is not suitable for debugging. The targets configure
other compiler flags, derived from the MicroBlaze's configuration, in order to use
the optional machine instructions. The menu Project → Build Project builds the
active target, which can be changed at any time using Project → Build
Configurations → Set Active. Then, SDK runs the GNU tool chain, which
compiles the source files and links the resulting object code with the BSP
libraries. The executable ELF file is stored in a nested folder named after the
target.
Fig. 15.17 Bitstream configuration to program the FPGA (selecting c:\edk13.1\led7seg\SDK\workspace\hw_platform_0\system.bit, c:\edk13.1\led7seg\SDK\workspace\hw_platform_0\system_bd.bmm, and the executable c:\edk13.1\led7seg\SDK\workspace\app1\Release\app1.elf)
purpose (see Fig. 15.18). Press the reset button of the FPGA board and play with the terminal.
The Xilinx Microprocessor Debugger (XMD) [13] is a low-level tool which manages the programming and debugging of the embedded system through the MDM peripheral and the JTAG programming cable. The user can interact with XMD by clicking the SDK menu Xilinx Tools → XMD console.
Fig. 15.19 Configuration of the debugger
In a similar way the user can place another breakpoint at the assignment of the
variable rs232_char in the ISR of the UART. The application will suspend when it
receives a new character from the PC. Then, the ISR updates the data or the
configuration of the display.
(Fig. 15.21: hierarchical schematic of the peripheral — the PLB signals PLB_*/Sl_* reach the IPIF in plb_led7seg.vhd, which exposes the Bus2IP_*/IP2Bus_* signals to user_logic.vhd; the user logic contains the counter, the registers reg_data and reg_control, and the led7seg.vhd core, whose anodes, segments, switch_off and switch_zeros ports pass through the glue logic)
drive the 7 segments and the 4 anodes of the display. Finally, it declares two input ports that attach to the external switches that configure the display.
Figure 15.21 shows a hierarchical schematic of the peripheral and related files. The wizard created two VHDL files, stored in the folder hdl\vhdl nested in the hardware repository. plb_led7seg.vhd is the top-level file which connects an instance of the user logic to the PLB. user_logic.vhd is a dummy peripheral, therefore the file must be modified to perform the desired functionality. The tasks related to the timer's ISR are now described as hardware in this file. The computation related to the CLed7Seg class is now described in the new hardware file led7seg.vhd.
The Peripheral Analyze Order (PAO) file [14] is the ordered list (bottom to top
level) of libraries and files required to synthesize the IP. The first two entries refer
to EDK libraries due to the selected IPIF. Then, it continues with the list of VHDL
files that will be synthesized into the target library. The target library must be named after the repository folder of the peripheral.
The top-level file declares the entity plb_dec7seg and its architecture.
The template of the entity leaves space to add new generics and ports, therefore,
the user must complete it. The VHDL generics and ports of the entity must be
declared in the same way as in the MPD file.
The architecture declares an instance of the user logic. It leaves space to map the new generics and ports. The two switches are directly connected from the top-level
input ports to the user logic. The glue logic attaches the output ports from the user
logic to the top-level ports, in order to drive the display. The synthesizer will
infer inverters between these ports if the configuration parameters are set to true.
The user logic will provide its internal timer, therefore, the instance requires a new
generic which configures the number of clock cycles to refresh the display. It is
computed from the MPD parameters that define the refreshing period (microseconds) and the clock period (picoseconds). Finally, the user logic also requires a
generic which configures the number of internal registers.
The architecture also contains the IPIF instance, which eases the PLB connection. It adapts the PLB handshake to/from IP signals named ipif_Bus2IP_*/ipif_IP2Bus_*. The IPIF also decodes the PLB address bus to enable one of the peripheral's internal registers. The architecture declares the number of internal registers in the user logic, which is two in this case. This number affects the width of the signals Bus2IP_RdCE (Read Chip Enable) and Bus2IP_WrCE (Write Chip Enable) that arrive at the user logic. When MicroBlaze executes a read/write instruction to a peripheral's address range, the IPIF decoder sets one of the enable bits, in order to read/write one of the internal registers of the user logic. The IPIF maps registers on 32-bit boundaries from the peripheral's base address. The first register is mapped on C_BASEADDR, the second register on C_BASEADDR+4, and so on. In order to simplify the IPIF, the C_HIGHADDR parameter is usually set to cover a larger address range than the registers strictly require, which leaves the address decoder incomplete.
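The 32-bit-boundary mapping and the one-hot chip-enable decoding can be sketched in software. This is an illustrative model of the IPIF behaviour, not Xilinx source code:

```cpp
#include <cassert>
#include <cstdint>

// Each user register occupies a 32-bit word: register i lives at
// C_BASEADDR + 4*i.
static inline uint32_t reg_address(uint32_t c_baseaddr, unsigned index) {
    return c_baseaddr + 4u * index;
}

// The IPIF decoder turns an access address into a one-hot chip-enable
// vector (Bus2IP_RdCE / Bus2IP_WrCE), one bit per internal register.
static inline uint32_t one_hot_ce(uint32_t c_baseaddr, uint32_t addr,
                                  unsigned num_regs) {
    unsigned index = (addr - c_baseaddr) / 4u;
    return (index < num_regs) ? (1u << index) : 0u;
}
```

With the base address used later in this case study, the two registers answer at C_BASEADDR and C_BASEADDR+4, and an access to the second one raises only the second CE bit.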
The template of the user logic is a dummy design which must be modified. The
entity template declares an important generic which is the number of registers. The
entity must be completed according to the top-level file. Therefore, the designer must
add the new generics to configure the refreshing counts and the new ports to drive the
display and to read the switches. The default values of the generics are overwritten
since they are passed from the instance at the top-level file.
The architecture of the user logic declares two registers to control the display
that are accessible from the PLB. The first one (reg_data) is a 16-bit data register
which sets the 4 digits to display. The second one (reg_control) is a 2-bit register
which controls the display configuration: one bit to turn on/off the display and the
other bit to show/hide the left-hand zeros. The input ports Bus2IP_RdCE and
Bus2IP_WrCE provide a single bit for each register to read or write them. The data
comes from the port Bus2IP_Data during a write access to one of the registers.
During a read access, one of the registers drives the port IP2Bus_Data.
The architecture uses the signals bus_data/ip_data to get/set the data bus, since they are declared in descending bit order, like the registers. The PLB must
acknowledge the access completion before MicroBlaze can continue with a new
access. The user logic asserts the ports IP2Bus_RdAck/IP2Bus_WrAck when it
completes the access to the registers. In this case, the user logic does not require
wait states to read/write the registers.
The rest of the user logic performs the tasks executed in the timer's ISR of the software application. It implements a counter which periodically asserts the signal refresh. This signal is used to refresh the display and to capture the state of the switches in order to modify the control register. There is an instance, named core, of the IP led7seg, which drives the display from the registers' contents.
The file led7seg.vhd implements the functionalities described in the C++ class CLed7Seg of the software application. It generates the signals that drive the display, updating the displayed digit when the port refresh is asserted. The architecture is composed of two processes. The first process updates the index of the digit to refresh, starting at the left side. The second process computes the display's ports based on the input ports off, zeros and data, which are driven from the peripheral's registers.
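The behaviour of the two processes can be modelled in software. The segment encoding and the leading-zero policy below are assumptions of the sketch (the actual values depend on the board wiring and the book's VHDL), but the structure mirrors the description above:

```cpp
#include <cassert>
#include <cstdint>

// Process 1: advance the index of the digit to refresh, leftmost first.
static inline unsigned next_digit(unsigned idx) { return (idx + 1) % 4; }

// Hexadecimal-to-7-segment table (gfedcba order, segment active = 1);
// the actual encoding depends on the display wiring.
static const uint8_t SEG_TABLE[16] = {
    0x3F, 0x06, 0x5B, 0x4F, 0x66, 0x6D, 0x7D, 0x07,
    0x7F, 0x6F, 0x77, 0x7C, 0x39, 0x5E, 0x79, 0x71};

// Process 2: compute the segment outputs for digit position idx
// (0 = leftmost) from the 16-bit data register and the control bits.
static uint8_t segments(uint16_t data, unsigned idx, bool off, bool hide_zeros) {
    if (off) return 0;  // display switched off
    unsigned nibble = (data >> (4 * (3 - idx))) & 0xF;
    if (hide_zeros && nibble == 0) {
        // Blank a zero only if every digit to its left is also zero.
        bool leading = true;
        for (unsigned i = 0; i < idx; ++i)
            if ((data >> (4 * (3 - i))) & 0xF) leading = false;
        if (leading && idx < 3) return 0;  // keep the rightmost digit visible
    }
    return SEG_TABLE[nibble];
}
```

For the value 0x0042 with zero-hiding enabled, the two leftmost digits blank while the 4 and the 2 remain lit; switching the display off blanks everything.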
(Figure: block diagram of the embedded system — an external 50 MHz oscillator and reset button drive the system clock and reset; MicroBlaze attaches to the BRAM through the DLMB/ILMB controllers, and the PLB connects the MDM with the JTAG programmer, the interrupt controller, the plb_led7seg peripheral driving the 4-digit, 7-segment led display and reading the switches, and the rs232 UART for serial communication)
The C file of the driver is compiled during the BSP generation. The library file stores the functions, but not the C macros. The library may store other internal functions that are not visible to the programmer, since they are not declared in the H file. The function which swaps one of the bits of the control register executes two accesses to the register. First, MicroBlaze reads the register and changes one bit according to a mask. Next, it writes the resulting value into the same register.
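The read-modify-write sequence can be sketched as follows. This is a model of the driver's behaviour, not its actual code:

```cpp
#include <cassert>
#include <cstdint>

// Swap (toggle) one bit of a memory-mapped register: read, flip, write back.
// Two bus accesses are needed, exactly as the driver performs them.
static inline void toggle_bit(volatile uint32_t *reg, uint32_t mask) {
    uint32_t value = *reg;  // first access: read the register
    value ^= mask;          // change the bit selected by the mask
    *reg = value;           // second access: write the result back
}
```

Toggling the same bit twice restores the original register value, which is a convenient sanity check of the mask logic.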
The names of the external FPGA ports connected to the display and switches have changed, therefore the UCF must be updated.
At this point, the designer can easily test the peripheral's hardware using the XMD tool. Using XPS, implement the new hardware. Then, click Device Configuration → Download Bitstream to create the download.bit file and program the FPGA. If the XPS project has no ELF file, the bitstream configures the BRAM to store the default executable bootloop, which runs a dummy loop. In order to program the FPGA, XPS calls the iMPACT tool with the commands of the batch file download.cmd. Check the number of the -p flag, which configures the FPGA position in the JTAG chain.
Once the FPGA is successfully programmed, click Debug → Launch XMD to open the command shell. The XMD shell initially shows the microprocessor configuration when it connects to the MDM. Then, the user can test the peripheral by running the XMD commands mwr (memory write) and mrd (memory read). Write the peripheral's base address (0x84C00000) to change the displayed number. Write the next register (0x84C00004) to change the configuration of the display. The peripheral's registers are accessible from different addresses due to the incomplete IPIF decoder.
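The aliasing can be modelled as a decoder that examines only the one address bit needed to tell the two registers apart, ignoring the higher offset bits inside the peripheral's range. This is an illustrative model, not the IPIF source:

```cpp
#include <cassert>
#include <cstdint>

// An incomplete decoder for two registers only examines address bit 2;
// higher offset bits within the peripheral's address range are ignored,
// so the registers answer at many aliased addresses.
static inline unsigned decoded_register(uint32_t c_baseaddr, uint32_t addr) {
    return ((addr - c_baseaddr) >> 2) & 0x1;  // 0 -> reg_data, 1 -> reg_control
}
```

Under this model, offsets 0x0 and 0x8 both select the data register, and offsets 0x4 and 0xC both select the control register.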
The Libgen tool generates the BSP according to the information described in
the MSS file. It gets the source files from the folder plb_led7seg_v1_00_a which
must be stored in the driver repositories. Then, it copies the header files and builds
the libraries. The file libgen.options must be modified to configure the local
repository to the XPS directory.
The C++ code related to the GPIOs and the timer's ISR is removed. Therefore, the files that declare and implement the class CLed7Seg are deleted from the project. The modified file application.cc calls the driver's functions to control the display.
The application can be programmed and debugged following the same steps as
in the previous case study. Check that the BIT and BMM files are the imported
ones from the current XPS project, before proceeding.
Create a new Xilinx C++ project in SDK named simulation. Modify the C++ template file main.cc to add a sequence which just writes the peripheral's registers, as desired. The function delay is called to ease the observation of the simulation results. Build the target Debug to get the file simulation.elf.
The XPS project must import the ELF file in order to generate the simulation model. Click the menu Project → Select Elf file and choose the file simulation.elf which is stored in the SDK project (see Fig. 15.23).
XPS launches the Simgen tool [13] to generate the set of files required to simulate the embedded system running the selected executable. Click the XPS menu Project → Project Options, which opens the dialog window shown in Fig. 15.24. Select the options VHDL, generate testbench template, and behavioural model. This is the recommended model since it simulates the peripheral's source files. The other models are more complicated, since they simulate the output files from the synthesis or implementation stages. Click the XPS menu Simulation → Generate Simulation HDL Files to call the Simgen tool. Simgen does not support some FPGA families, such as the Spartan-3, but there is a workaround: choose any supported device, since this does not affect the behavioural simulation of the peripheral, which does not require synthesizing its source VHDL files. Finally, click the menu Simulation → Launch HDL Simulator, which takes some time to compile the simulation files before starting ISim [19].
Simgen creates the testbench file system_tb.vhd. It declares the instance dut
(Device Under Test) of the system and the stimulus applied to its inputs. ISim
shows the hierarchical view of the instances to simulate. Open the contextual menu
of the top-level instance system_tb and click Go To Source Code in order to edit
the testbench template, as shown in Fig. 15.25.
The template file drives the clock and reset signals, and it provides a user's section to write more stimulus. In order to simulate the external switches, change the state of the switch which turns the display off/on at 250 and 260 µs.
The simulation will take a huge number of clock cycles to refresh the display. The display's ports are updated every 200,000 clock cycles when the refresh cycle is set to 4,000 µs (50 MHz clock frequency). Each instance of a peripheral is simulated from a wrapper file which configures the IP according to the MHS file. Go through the hierarchical view, and select the instance led7seg. Open the wrapper file led7seg_wrapper.vhd, and change the generic C_REFRESH_PERIOD_US to shorten the simulation of the refresh. This change does not affect the synthesis or the implementation since it does not modify the MHS file.
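The cycle count above follows directly from the two generics. A quick check of the arithmetic (1 µs = 1,000,000 ps, so 4,000 µs at a 20,000 ps clock period gives 200,000 cycles):

```cpp
#include <cassert>
#include <cstdint>

// Clock cycles between display refreshes, as the top-level file derives
// them from the MPD parameters: refresh period in microseconds and clock
// period in picoseconds (1 us = 1,000,000 ps).
static inline uint64_t refresh_cycles(uint64_t refresh_us,
                                      uint64_t clk_period_ps) {
    return refresh_us * 1000000u / clk_period_ps;
}
```

Shrinking C_REFRESH_PERIOD_US in the wrapper shortens this count proportionally, which is exactly why the simulation speeds up.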
Any change in the VHDL files requires the compilation of the model before starting the simulation. Therefore, click the ISim menu Simulation → Relaunch. In order to display the external ports of the system, select the instance system_tb and drag the desired signals into the waveform window. Add a divider, named USER LOGIC, in order to display a new group of waveforms separately from the previous ones. Go to the hierarchy of the instances system_tb → dut → led7seg → led7seg → USER_LOGIC_I and drag the IPIF signals and the peripheral's registers to the waveform. Repeat the previous steps to display the signals of the instance core, as shown in Fig. 15.26.
Fig. 15.30 Block encryption (left) and decryption (right) in the AES-128 cipher (both combine the plain or encrypted block and the expanded key through AddExpKey steps; encryption applies SubBytes, ShiftRows and MixColumns, while decryption applies InvSubBytes, InvShiftRows and InvMixColumns)
microprocessor. The microprocessor writes commands and data to the coprocessor's registers in order to perform a computation. When the coprocessor completes, the microprocessor retrieves the computed data in order to continue the algorithm.
This case study presents an embedded system which communicates with an
external PC in order to set the state of some led diodes and to read the state of some
switches. The system is connected through the serial port, but the communications
are ciphered using the Advanced Encryption Standard (AES-128). The system
decrypts the commands received from the PC and encrypts the answer messages.
The MicroBlaze cannot decrypt the messages at the desired speed, therefore, a
coprocessor is developed to accelerate the AES-128 computation.
offline mode computes the expanded key before starting the rounds, therefore, it is
computed only when the cipher key changes.
The transmitter divides a message into blocks that are encrypted and transmitted. The receiver decrypts the received blocks to rebuild the message. There are several operation modes and padding schemes that permit block ciphers to work with messages of any length [22]. This case study chooses the Electronic Codebook (ECB) mode and null padding, due to its simplicity. The other modes require an IV (Initialization Vector) generator and a different block partitioning, but this fact does not affect the AES block cipher.
microprocessor, therefore, the MHS connects their interrupt ports. The system removes the interrupt controller since there is a single interrupt source.
The UCF file is modified to attach the external ports to the switches and leds according to the FPGA board. Implement the hardware and export it to SDK.
The file app.cc implements the application. It waits for commands launched
from a PC in order to set the leds or to report the state of the switches. Any
command produces an answer message which is transmitted to the PC. Commands and answer messages are encrypted during their transmission on the serial
interface. The function send_rs232_cipher implements the ECB operation mode
and null padding. It divides an answer message into 128-bit blocks that are
individually encrypted and transmitted. In a similar way, the function
get_rs232_cipher builds a command message from the received and decrypted
blocks.
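The block partitioning with null padding can be sketched as follows. This is an illustrative model of what send_rs232_cipher does before encrypting, not the book's exact code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// ECB with null padding: split a message into 16-byte (128-bit) blocks,
// filling the tail of the last block with zero bytes. Each block is then
// encrypted and transmitted independently.
static std::vector<std::vector<uint8_t> > ecb_blocks(const uint8_t *msg,
                                                     size_t len) {
    std::vector<std::vector<uint8_t> > blocks;
    for (size_t pos = 0; pos < len; pos += 16) {
        std::vector<uint8_t> block(16, 0);  // null padding by default
        size_t n = (len - pos < 16) ? len - pos : 16;
        std::memcpy(block.data(), msg + pos, n);
        blocks.push_back(block);
    }
    return blocks;
}
```

A 20-byte message produces two blocks, with the last 12 bytes of the second block zeroed; the receiving side simply concatenates the decrypted blocks to rebuild the message.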
In order to display the time taken to encrypt/decrypt blocks, debug the ELF of
the target Release using SDK. Set a breakpoint at the last instruction of the C++
file to pause the execution when the user launches the quit command. Then resume
the application and launch several commands to the embedded system through the
Linux console. The variables EncryptMaxCycles and DecryptMaxCycles contain
422
15
the maximum number of clock cycles required to encrypt and decrypt a block. The
decryption of a block takes 1.76 ms (87,338 clock cycles, 50 MHz clock
frequency), but the serial communication requires 1.39 ms to transmit it (115,200
bps). The FIFO of the UART may overrun during the reception of large messages
since the MicroBlaze cannot decrypt blocks at the required speed. To solve the
problem, the transmission speed can be lowered or the system can implement a
flow control. A better solution is to accelerate the most time-spending
computations.
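A back-of-the-envelope check of the throughput mismatch, assuming an 8N1 serial frame (10 bits per transmitted byte, an assumption of this sketch):

```cpp
#include <cassert>

// Time to transmit a block over the UART: 16 bytes x 10 bits per 8N1
// frame = 160 bits, i.e. 160 / 115,200 bps = 1.39 ms.
static double tx_ms(unsigned bytes, unsigned baud) {
    return bytes * 10.0 / baud * 1000.0;
}

// Time to decrypt a block in software: cycles / clock frequency.
static double decrypt_ms(unsigned cycles, unsigned clk_hz) {
    return cycles * 1000.0 / clk_hz;
}
```

Since the decryption of a block takes longer than its transmission, blocks arrive faster than they can be consumed, which is exactly why the UART FIFO can overrun.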
15.3.3 Profiling
The profiler [12] is an intrusive tool which is used to measure the application's performance. It reports the number of calls and the execution time of every function. The profiling requires a dedicated timer able to interrupt the microprocessor in order to sample the program counter. The source files of the application must be compiled to add the profiling ISR and code (compiler switch -pg). The linker links the profiling library into the executable.
The server application is not adequate for profiling, since most of its execution time is devoted to waiting for user messages from the serial port. A better approach is to execute a new application which continuously encrypts and decrypts messages. Therefore, create a new C++ project named profiling. The application file, named profiling_app.cc, uses the class CAES128 to encrypt/decrypt messages in a loop. Change the compiler switches of the target Release to enable the profiler and the same optimization level as the server application (switches -pg -O2). Next, clean the project in order to build the application from scratch.
The profiler collects data and stores it in memory during the execution. Once the application completes, the collected data is downloaded to the PC in order to analyze it. The SDK must set the profiler memory, which cannot overlap the application memory. Use the SDK to program the FPGA with the imported bitstream and BMM, and set the ELF to bootloop. Open the XMD console to launch some commands. The first command changes the working directory to the application's folder. The second command establishes the connection to the MicroBlaze's MDM. The last command tries to download the ELF file into memory. It fails since the profiler memory is not defined, but it reports the allocated memory of the ELF. The hardware implements the local memory from address 0x0000 to 0x3FFF, and XMD reported that there is free memory from address 0x2248.
The profiling data can be stored into any free space of memory. Click the menu Run → Run Configurations to open the dialog depicted in Fig. 15.34. Then add the application to profile and set the profile memory from the 0x3000 address. Finally, run the application and wait until it ends. The PC downloads the collected data, which is stored in the file gmon.out. It is a binary file which is interpreted by the GNU gprof tool.
Double click on the file gmon.out to display the results (see Fig. 15.35). The SDK shows a graphical view of the collected data, which can be arranged in several ways, such as by the percentage of execution time devoted to each function. Two methods of the CAES128 class take 88% of the processing time: X and Multiply. They are child functions called from the function InvMixColumn, which is one of the steps executed during the decryption. The child function X is also called from the step MixColumn during the encryption.
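These hot spots are the GF(2^8) primitives of AES: X presumably corresponds to the xtime operation of FIPS 197 [1] (multiplication by {02}), and Multiply to the general field multiplication built on it. A standard formulation, not the book's exact code:

```cpp
#include <cassert>
#include <cstdint>

// xtime: multiply by {02} in GF(2^8) modulo the AES polynomial
// x^8 + x^4 + x^3 + x + 1 (0x11B).
static inline uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

// General GF(2^8) multiplication by repeated xtime and conditional xor.
static uint8_t gf_multiply(uint8_t a, uint8_t b) {
    uint8_t result = 0;
    while (b) {
        if (b & 1) result ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return result;
}
```

The worked example of FIPS 197, {57} * {13} = {fe}, is a convenient correctness check. Each decrypted byte triggers several such multiplications, which explains the profiler result.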
The profiling information can be used to re-implement the most time-consuming functions and improve the execution time. However, a coprocessor can greatly accelerate a specific computation.
Fig. 15.36 Attachment of the MicroBlaze and coprocessor (top), and FSL schematic (bottom)
when data is retrieved (FSL_S_Read). Similarly, in order to write data, the FIFO
signals when there is no free space (FSL_M_Full), and the master requests to write
data (FSL_M_Write).
XPS can create the template files for coprocessors through the same wizard used for peripherals. Launch the wizard by clicking the menu Hardware → Create or Import Peripheral.
The coprocessor does not require additional parameters, ports or VHDL files, therefore the MPD and PAO files are not modified. The template file fsl_mixcolumns.vhd is modified to design the coprocessor. It implements a 4x4 matrix of 8-bit registers (reg_state) to compute and store the AES state. Additionally, a 1-bit register (reg_mode) configures the computation mode as MixColumns or InvMixColumns. MicroBlaze will execute a control-type write instruction to the coprocessor's slave FSL to set the mode register. Next follow four data-type write instructions to set the registers of the AES state. Matrices are stored row by row in the microprocessor's memory, and each 32-bit datum sets the 4 registers of a row in the coprocessor. The coprocessor starts the computation when the registers reg_state and reg_mode have been written. The coprocessor must acknowledge the read of data to the slave FSL in order to remove it from the FIFO.
The coprocessor computes and stores the result in the registers reg_state. It writes the resulting data to its master FSL in a quite similar way: it writes a new row to the FSL when the computation is completed and the FIFO is not full. MicroBlaze executes four read instructions to the FSL in order to retrieve the resulting AES state.
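The row transfer can be sketched on the software side: each 32-bit FSL write carries one 4-byte row of the state matrix. MicroBlaze is big-endian, so the most-significant-byte-first packing below is a plausible but assumed convention of this sketch:

```cpp
#include <cassert>
#include <cstdint>

// Pack one row of the 4x4 AES state into a 32-bit word for an FSL write
// (most significant byte first), and unpack a word read back from the FSL.
static inline uint32_t pack_row(const uint8_t row[4]) {
    return ((uint32_t)row[0] << 24) | ((uint32_t)row[1] << 16) |
           ((uint32_t)row[2] << 8) | (uint32_t)row[3];
}

static inline void unpack_row(uint32_t word, uint8_t row[4]) {
    row[0] = (uint8_t)(word >> 24);
    row[1] = (uint8_t)(word >> 16);
    row[2] = (uint8_t)(word >> 8);
    row[3] = (uint8_t)word;
}
```

Four packed words set the whole state, and four reads retrieve it, matching the five-write/four-read exchange described above (the fifth write being the control-type mode access).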
A key idea to accelerate a computation is parallel processing. A full-parallel implementation could compute the entire array of the AES state in a single clock cycle, although it may occupy a large area. However, MicroBlaze would not take full advantage of this architecture, since it takes several more clock cycles to execute the FSL instructions that set and retrieve the AES state. A semi-parallel architecture offers a good trade-off between speed and area. The coprocessor computes the left-side column of reg_state in a clock cycle. The state register shifts one column and the computation is repeated for the next 3 columns. Therefore, it takes 4 clock cycles to complete.
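The per-column work that the semi-parallel datapath performs each cycle is the MixColumns transform of one 4-byte column, sketched here in software (the hardware computes the same arithmetic combinationally; this standard formulation follows FIPS 197, not the book's VHDL):

```cpp
#include <cassert>
#include <cstdint>

static inline uint8_t xt(uint8_t a) {  // multiply by {02} in GF(2^8)
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

// MixColumns applied to a single column [a0 a1 a2 a3]^T:
// b0 = {02}a0 ^ {03}a1 ^ a2 ^ a3, and cyclic shifts thereof.
static void mix_column(const uint8_t a[4], uint8_t b[4]) {
    for (int i = 0; i < 4; ++i) {
        uint8_t x = a[i];
        uint8_t y = a[(i + 1) % 4];
        b[i] = (uint8_t)(xt(x) ^ (xt(y) ^ y) ^ a[(i + 2) % 4] ^ a[(i + 3) % 4]);
    }
}
```

The well-known test column [db 13 53 45] maps to [8e 4d a1 bc]; repeating this for the four columns, one per clock cycle, is the semi-parallel schedule described above.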
Clean and build the BSP project to generate the new BSP from scratch. The software applications use the C++ class CAES128, which implements the block cipher. The rest of the source files are not modified, since the class CAES128 carries out the encryption/decryption of blocks. Build the C++ projects to generate the new executables.
15.3.6 Simulation
The coprocessor can be simulated in a very similar way as described for the
peripheral. The application profiling is used to build the simulation model since it
continuously encrypts/decrypts blocks using the coprocessor.
Search for the moment when the coprocessor asserts the signal start, which launches the computation, as shown in Fig. 15.38. Earlier, the waveform shows the execution of five FSL write instructions by MicroBlaze. The coprocessor reads the data from its slave FSL. The first FSL instruction is a control-type access which writes the mode register (reg_mode). The next four FSL instructions are data-type accesses which write the rows of the state register (reg_state). The coprocessor starts the computation after the completion of the five write accesses.
Figure 15.39 shows that the coprocessor takes 4 clock cycles to compute the state register. Then, it writes the four rows of the resulting reg_state to its master FSL, so that MicroBlaze can read them. The coprocessor does not have to wait for MicroBlaze to read the resulting data, since the FIFO of the FSL is not full.
Fig. 15.39 Simulation of the coprocessor computation and the master FSL
References
1. Nist (2002) NIST Advanced Encryption Standard (AES) FIPS PUB 197
2. Xilinx (2005) Spartan-3 Starter Kit Board User Guide (UG130)
3. Xilinx (2010a) XST User Guide for Virtex-4, Virtex-5, Spartan-3, and Newer CPLD Devices (UG627)
4. Xilinx (2010b) LogiCORE IP Processor Local Bus PLB v4.6 (DS531)
5. Xilinx (2010c) XPS General Purpose Input/Output (GPIO) v2.00.a (DS569)
6. Xilinx (2010d) LogiCORE IP XPS Timer/Counter v1.02.a (DS573)
7. Xilinx (2010e) LogiCORE UART Lite v1.01.a (DS571)
8. Xilinx (2010f) LogiCORE IP XPS Interrupt Controller v2.01.a (DS572)
9. Xilinx (2010g) Xilinx Processor IP Library. Software Drivers. uartlite v2.00.a
10. Xilinx (2010h) Xilinx Processor IP Library. Software Drivers. tmrctr v2.03.a
11. Xilinx (2010i) Xilinx Processor IP Library. Software Drivers. intc v2.02.a
12. Xilinx (2010j) EDK Profiling User Guide (UG448)
13. Xilinx (2011a) Embedded System Tools Reference Manual (UG111)
14. Xilinx (2011b) Platform Specification Format Reference Manual (UG642)
15. Xilinx (2011c) Command Line Tools User Guide (UG628)
16. Xilinx (2011d) MicroBlaze Processor Reference Guide (UG081)
17. Xilinx (2011e) LogiCORE IP Local Memory Bus (LMB) V10 (DS445)
18. Xilinx (2011f) Standalone (v.3.0.1.a) (UG647)
19. Xilinx (2011g) ISim User Guide v13.1 (UG660)
20. Xilinx (2011h) Data2MEM User Guide (UG658)
21. Xilinx (2011i) LogiCORE IP Fast Simplex Link (FSL) V20 Bus v2.11d (DS449)
22. Menezes AJ, van Oorschot PC, Vanstone SA (1996) Handbook of applied cryptography. CRC Press, Boca Raton
Chapter 16
(Figure: partial reconfiguration of the FPGA — reconfigurable routing lines cross between the static partition and the reconfigurable partition through proxy LUTs, while static routing lines are unaffected by the partial reconfiguration)
Decoupling logic. Hardware resources that are changing their configuration bits
can drive unexpected transient values on crossing nets that may cause a malfunction in the static partition. The static partition can deassert the nets from the
proxy logic during the reconfiguration. Another approach is to reset the affected
circuits of the static partition when the reconfiguration completes.
(Figure: block diagram of the reconfigurable system — an external 100 MHz oscillator and reset button drive the system clock and reset; MicroBlaze attaches to the BRAM through the DLMB/ILMB controllers, and to the reconfigurable coprocessor (the RP, holding the RMs dummy, adder, etc.) through the FSLs mb_to_copro and copro_to_mb, guarded by decoupling logic driven by the signal decoupling_rst from a GPIO; the PLB connects the HWICAP, the rs232 UART for serial communication, the MDM with the JTAG programmer, and the flash_emc MCH_EMC controller of the external FLASH memory)
the static partition and the RMs on the reconfigurable partition. The output is
the set of partial bitstreams required to perform the partial reconfiguration.
3. The software development on SDK. The partial bitstreams are programmed into
a FLASH memory. The application provides a C++ class which performs the
partial reconfiguration controlling the HWICAP according to the bitstreams
retrieved from the FLASH.
The system declares the instance rcopro, which is the reconfigurable coprocessor. The source files of the coprocessor fsl_rcopro are stored in the default local repository pcores created by the XPS wizard. The coprocessor's instance attaches to MicroBlaze through FSLs. There is decoupling logic which can reset the FSLs and the coprocessor. MicroBlaze will control a GPIO (instance decoupling_gpio) in order to assert the 1-bit signal decoupling_rst when the reconfiguration is completed. Otherwise, the FIFOs of the FSLs would store unexpected data, and the state machine of the coprocessor would start in an undesired state.
will provide several clock domains if the ICAP sets a clock frequency different from the system's clock; however, this will complicate the design.
MicroBlaze must read a partial bitstream from the external FLASH to feed the ICAP, in order to reconfigure the coprocessor. EDK provides an external memory controller (EMC) [6] which permits access to asynchronous memories such as SRAM and FLASH. It can control several banks of different memories that share the same input/output ports. Each bank is accessible from a configurable address range and provides configuration parameters to set the width of the data bus and the timing specifications. The EMC permits access to 8-bit/16-bit data width memories, and it can match them to the 32-bit data bus of the PLB. The displayed configuration is valid for the 28F128J3D [7] device, the 16 MB FLASH embedded in the Virtex-5 development board [3]. MicroBlaze can access the FLASH from address 0xA0000000 to 0xA0FFFFFF. The configuration can be complicated, but the board designer usually provides a reference design from which the parameters can be taken. The input/output ports of the EMC are externally connected to the address, data and control busses of the FLASH. The address bus of the FLASH is 24 bits wide, but the EMC generates 32-bit addresses. Therefore, the MHS slices out the 24 lower bits, which are externally connected.
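The address split can be sketched as follows (an illustrative model of the mapping, matching the 0xA0000000 base and 16 MB range stated above):

```cpp
#include <cassert>
#include <cstdint>

// The EMC maps the 16 MB FLASH into the PLB address space: the 24 lower
// address bits select the FLASH location, the upper bits select the bank.
static inline uint32_t flash_offset(uint32_t plb_addr) {
    return plb_addr & 0x00FFFFFFu;
}
```

The first and last addresses of the bank map to the first and last FLASH locations, and the bank spans exactly 2^24 bytes = 16 MB.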
The system must be synthesized in order to generate the netlist files required by the PR flow. The system netlist contains the module dummy as the initial configuration of the coprocessor. The module dummy just reads the slave FSL and writes the master FSL with zeros, but it does not perform any computation. This way, MicroBlaze will not hang when it executes read/write instructions to the coprocessor's FSLs. XPS can then implement the FPGA layout, generating the bitstream system.bit. Beforehand, the UCF file must be modified to map the external ports according to the development board. The implemented system is not valid for performing the PR, but it can be exported to SDK in order to build the software. Moreover, the system provides an initial configuration of the coprocessor which can be tested and simulated, as described in the previous chapter. Finally, the system is able to program the set of partial bitstreams into the FLASH, as is detailed later.
16.3.1.2 Synthesis of the RMs
The coprocessor will map several reconfigurable modules. All of them are connected to MicroBlaze through the same FSL ports. The master and slave FSL connections are quite similar to the case study presented in the previous chapter. The VHDL source files that contain the different configurations of the coprocessor must be synthesized to obtain their netlist files. This can be done in several ways, e.g. with the Xilinx Project Navigator or with a third-party synthesizer.
This case study executes the XST synthesizer [8] from a command shell. The directory c:\edk13.1\reconfig\XST holds the files required by XST. Open a Xilinx console by clicking the XPS menu Project → Launch Xilinx Shell, and type the following commands:
The batch file fsl_rcopro.scr contains the XST commands to synthesize a set of source VHDL files. The NGC file stores the netlist of a single module for the target device.
The batch file relies on the file fsl_rcopro.prj, which lists the files to be synthesized, as in the PAO file. The VHDL file fsl_rcopro.vhd declares the top-level entity of the coprocessor, and it contains the instance of the RM which is synthesized. The rest of the VHDL files (fsl_rcopro_dummy.vhd, fsl_rcopro_adder.vhd, etc.) are the source files of the different RMs.
XST must be executed for every RM in order to get the set of netlist files. Change the parameter C_CONFIG_IDX and the name of the NGC output file to synthesize a new RM. The XST folder will store the netlist files such as fsl_rcopro_dummy.ngc, fsl_rcopro_adder.ngc, and so on.
Fig. 16.3 Wizard to create a new PlanAhead PR project
Fig. 16.4 Set the RP and the initial RM to the coprocessor's instance
Select the instance rcopro and click the button Set Pblock Size in the device view, as shown in Fig. 16.6. The number of hardware resources in the RP must be higher than that required by any of the synthesized RMs. This case study selects a region which starts from the top-left edge of the FPGA. Set the partition height to 20 CLBs (a Virtex-5 configuration frame) and the width to 28 CLB columns. The RP occupies a whole clock region of the FPGA, which contains 560 SLICEs, 4 RAMB36 and 8 DSP48 resources. Select all the resources in the next window to permit them to be reconfigured, with the exception of the RAMB36, since no RM requires them. The partial bitstream will reconfigure the hardware resources contained in the RP, with the exception of the BRAM contents. This way, the size of the partial bitstreams is reduced and the reconfiguration time improves.
Figure 16.7 displays the tab Statistics in the Properties of the Pblock pblock_rcopro. It shows the size of a partial bitstream and the detailed number of reconfigurable hardware resources. It also reports that 100% of the RP is contained in a single clock region, as desired. The tab Attributes shows the range of resources as SLICE_X0Y100:SLICE_X27Y119 and DSP48_X0Y40:DSP48_X0Y47. The numbers denote the x-y coordinates starting from the bottom-left edge of the FPGA. Therefore, the SLICES are arranged in a 28 × 20 matrix and the DSP48 are in a single column. These parameters are annotated in a new UCF in order to implement the layout of the reconfigurable system.
It is recommended to check the design rules for the PR flow before continuing. Press Run DRC and unselect all rules except those related to partial reconfiguration, as shown in Fig. 16.8. It should only report a warning for each RM, since no RM is implemented by a configuration yet.
Fig. 16.7 Properties of the RP
Fig. 16.9 Configuring the implementation run for the first layout
3. Display the implementation options to add -bm ..\..\system.bmm to the NGDBuild tool. This option will load the BMM file that was copied from the XPS project, in order to generate the annotated BMM file that will be required later by the Data2MEM tool [9].
4. Display the general properties to change the run name to config_dummy.
5. Start the implementation run by pressing the button Launch Selected Runs.
The implementation of the layout takes some minutes to complete. PlanAhead opens a dialog window when it finishes. Promote the implemented partitions, since this is necessary in the next step.
16.3.2.4 Layout Implementation for the Next RMs
The next step implements the layouts for the rest of the reconfigurable modules. The PR flow starts by importing the layout of the static partition from the first implementation. Then, it computes the layout of a new RM implemented into the RP.
Fig. 16.11 The first implemented layout, config_dummy (left), and two other layouts (config_adder, config_xxx) for other RMs
from XPS can program the FPGA with a static embedded system which cannot be reconfigured, since it does not provide a reconfigurable partition. However, SDK can build the BSP and the ELF file for the imported system. Moreover, this static system can be tested, debugged and simulated, with the initial configuration of the coprocessor, in a way similar to the previous chapter. The static system can also be used to program the external FLASH with the partial bitstreams. The reconfigurable system will read the partial bitstreams to program the coprocessor.
Repeat the process to program the rest of the partial bitstreams into the next FLASH blocks (0xA0F20000, 0xA0F40000, etc.). Therefore, the offset must be incremented by 128 KB (0x20000) to point to the next block.
16.3.3.2 BSP Generation
The BSP generation is similar to the case studies of the previous chapter. The SDK provides a wizard to create the BSP project. The MSS file is modified to assign the specific driver to the reconfigurable coprocessor in order to build the BSP with it. Moreover, the file libgen.options must add the path of the local repository which contains the source files. They are stored in the default folder drivers, located in the XPS root directory.
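As a hypothetical sketch of that file (the key names follow the usual SDK libgen.options layout, and the processor name and repository path are assumptions), the relevant lines could look like:

```text
PROCESSOR=microblaze_0
MSS_FILE=system.mss
REPOSITORIES=../../../
```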
16.3.3.3 Executable
Create a new C++ project named app1 to build the executable ELF file for the generated BSP. The source file app1.cc contains the main application functions, and it relies on the C++ class CBitStream. The class implements methods to read the bitstream headers from FLASH and to perform the partial reconfiguration. The application declares an array of objects of this class, which is initialized at the beginning of the application. Three parameters define the settings used to initialize the array by reading the bitstream headers. Then, the application communicates with the user through the serial port in order to load the desired RM. It also performs a test of the active reconfigurable module, which compares the data produced by the coprocessor with the data computed by MicroBlaze. As commented previously, MicroBlaze must command the GPIO to reset the FSLs and the reconfigurable coprocessor after the partial reconfiguration completes.
The class CBitStream contains the method ReadHeader to initialize its member data from a bitstream file. The member data will take invalid values if the header reading fails. The BIT files stored in the FLASH are composed of a header and raw data. The header contains several fields [11] that provide information about the layout: target FPGA, date, time and name of the NCD file. The last field contains the address and the size (in bytes) of the raw data.
The method Reconfig performs the partial reconfiguration after checking the data collected from the header. The raw data of a bitstream embeds the set of ICAP commands [2] that permit the partial reconfiguration of the FPGA. Therefore, the task performed by the C++ class during the reconfiguration reduces to writing into the HWICAP the 32-bit words retrieved from the raw data.
The HWICAP peripheral [5] embeds the ICAP and provides two FIFOs and a set of registers to perform the partial reconfiguration or to read back the configuration memory. The example uses three peripheral registers to perform the partial reconfiguration: the control, status, and write FIFO registers. MicroBlaze accesses the registers through C/C++ pointers, as commented in the previous chapter.
The C++ class implements the method Hwicap_WordWrite to write a configuration word to the HWICAP. First, the word is temporarily stored in the write FIFO. Then, it sets the control register to start the data transfer from the FIFO to the ICAP. The status register indicates whether the write to the HWICAP completed, was aborted or got an error. The method returns unsuccessfully in the last two cases. The method waits until the write of the configuration word completes before returning successfully.
The method Hwicap_InitWrite aborts and restarts the HWICAP before starting, in order to ensure that the peripheral holds no data from a previous reconfiguration.
16.3.3.4 Testing
In order to test the reconfigurable embedded system, the FPGA must be programmed with the bitstream generated by PlanAhead together with the executable file. Therefore, select the bitstream system_dummy.bit and the BMM system_bd.bmm that were imported from PlanAhead (see Fig. 16.13). Then, select the executable ELF file app1.elf. SDK calls the Data2MEM tool [9] to generate the bitstream file download.bit, which configures the FPGA and the BRAM contents.
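The merge performed by SDK is equivalent to a Data2MEM command line of roughly the following form (the processor instance name microblaze_0 is an assumption):

```text
data2mem -bm system_bd.bmm -bt system_dummy.bit -bd app1.elf tag microblaze_0 -o b download.bit
```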
The application can also be debugged in the reconfigurable system as described
in the previous chapter.
References
1.
2.
3.
4.
5.
6.
7.
8.
Index
A
Activity interval, 41
Adder
base-B carry-chain, 278
decimal carry-chain, 279
decimal ripple-carry, 277
Adders, 153
2's complement, 175
addition array, 166
addition tree, 167, 180
binary, 155, 175
Brent-Kung, 160
B's complement, 174
carry select, 158
carry-lookahead, 160
carry-save, 165, 179
carry-select, 173
carry-skip, 158
combinational multioperand, 167
delay, 155, 165–167, 172
FPGA implementation, 175
Ladner-Fischer, 160
Ling, 160
Logarithmic, 160
long operands, 162, 178
overflow, 174
radix-2^k, 155, 158, 175
sequential multioperand, 163
stored-carry encoding, 165
Adder-subtractors, 173
Addition
B
BCD (binary coded decimal). See Decimal
BIT, 452
download.bit, 409, 459
partial_adder.bit, 453
partial_dummy.bit, 453
system.bit, 386, 453
system_dummy.bit, 459
Bit extension, 175
BitGen, 386, 452
BMM, 445
system.bmm, 445
system_bd.bmm, 387, 453, 459
Boolean equations, 1
BRAM, 366, 374, 437, 439
BSB, 376
BSP, 373, 384, 388, 454, 455
Bus, 368
master, 368
slave, 368
Bus-holder, 98
Bus-keeper, 98
C
Cache, 361
Carry computation, 155
Carry logic, 9
Carry-chain adder, 278, 279
Carry-chain adder-subtractor, 283
Carry-save adder, 34
Carry-save tree, 170
CISC, 359
Clock
distribution network, 108
gating, 113
jitter, 113
manager, 113
skew, 110
tree, 108
CMOS circuit, 95
Coloring
graph, 35
Command decoder, 87
Computation primitive, 37
Computation resource, 31
programmable, 27
Computation scheme, 37
Computation width, 41
Conditional assignments, 5, 7
Configuration frame, 436
Configuration granularity, 435
Connections
programmable, 31
Connectivity, 77
maximum, 77
minimum, 77, 79
Constraint
area, 113
timing, 113
Control register, 364
Control unit, 31
Control units
command encoding, 83
hierarchical decomposition, 83, 86
variable latency, 91
Controllability, 138
Conversion
binary to radix-B, 251
radix-B to binary, 252
Converters
binary to decimal, 268
binary to radix-B, 252
decimal to binary, 253, 272
Coprocessor, 365, 425
driver, 429
hardware, 426
CORDIC, 267
Counter
3-to-2, 34
7-to-3, 34
Counters, 11, 167
3-to-2, 167
6-to-3, 171, 179
m-to-2, 170
parallel, 170
up/down, 12
Critical path, 139
Cryptoprocessor, 365
Cygwin, 421
D
Data bus, 7, 8
Data flow graph, 37
Data introduction interval, 58
Data path, 25, 29, 31
Data path connectivity, 77
Data register, 364
Data2MEM, 394, 450, 459
Data-flow model
DCM, 379
DDR, 367
Decimal
addition, 277
addition-subtraction, 282
division, 286
multiplication, 286
non-restoring, 291
SRT-like algorithm, 295
Decoupling logic, 437, 440
Delay
clock-to-output delay, 104
derating factors, 101
fall time, 99
intrinsic delays, 99
propagation delay, 98
rise time, 99
slew rate, 99
Delay Locked Loop (DLL), 113
Demultiplexers, 5
Derating factors, 135
Digit extension, 174
Digit serial implementations
divider, 75
Digit serial processing, 72
Digital Clock Managers (DCM), 113
Dividers
binary SRT, 229, 232, 247
binary SRT with CSA, 232, 247
delay, 227, 234
digit selection, 236
digit-recurrence, 241, 247
non-restoring, 224, 247
on-the-fly conversion, 229, 236, 237
quotient selection, 227, 232, 235
radix-2^k SRT, 234
radix-4 SRT, 238, 239
radix-B, 239
restoring, 227, 247
Division
binary digit-recurrence algorithm, 240
binary SRT algorithm, 229, 232, 247
convergence algorithms
digit-recurrence algorithm, 221, 223, 247
Goldschmidt algorithm, 239
P-D diagram, 223, 224, 236
quotient selection, 222
radix-2, 224
Robertson diagram, 222, 224
Division over GF(2m), 351
D-latch, 13
DMA, 363
Double-clocking, 111
Doubling circuit, 242
DRAM, 367, 368
Drive capabilities, 97
Driving
drive strength, 97
fan-in, 96
fan-out, 96
DSP, 215, 358, 360, 365
Dual port RAM, 119
E
Edge sensitive register, 104
EDIF, 130
EDK, 371
Electronic Design Automation, 1
ELF, 374, 453, 455, 459
Elliptic curve, 38
projective coordinates, 47
scalar product, 47
Embedded microprocessor, 359
Embedded system, 359
EMC, 439, 442
Encoding
commands, 83
redundant, 70
Endian, 359
big-endian, 360
little-endian, 359
Executable, 390, 455
Exponential function, 263
2^x, 264, 273
accuracy, 264
e^jz, 266
F
False paths, 113
Fan-out, 131
Fast or slow path, 113
FIFO, 363, 384
Finite fields, 337
Finite state machine
Mealy, 26, 30, 32
Moore, 32
registered, 33
Finite state machines, 13
Mealy, 13
Moore, 13
FLASH, 439, 443, 454
Flip-flops, 10, 104
temporal parameters, 104
Floating-point numbers
adder, 322
adder-subtractor, 326, 330
addition, 309, 313
arithmetic operations, 309
division, 315
multiplication, 309
normalization, 309, 316
packing, 325
rounding, 309
truncation, 318
underflow, 318
unpacking, 322
Floating-point numeration system
unbiased, 319
FPU, 361, 365
FRAM
Frequency synthesis, 113
FSL, 426, 437, 440
FIFO, 367, 426
slot, 428
FSM Encoding, 131
IP cores, 372
ISE, 372
ISim, 372, 410
ISR, 374, 393
J
JTAG, 138
G
Generate function, 157, 278
Genetic algorithms, 144
Glitchless reconfiguration, 436
Glitch, 102
GNU compiler, 394
GNU gprof, 424
GNU tools chain, 372
Goldschmidt algorithm, 239, 245
GPIO, 363, 375, 392
Guard digits, 320
H
Halving circuit, 243
Handshaking protocol, 67, 69
Hard-core, 358, 360
Hardware Description Language, 1
Harvard architecture, 359
Hierarchical control units, 86, 88
Hierarchy preservation, 130
Hold time, 104
Hold time violation, 139
HWICAP, 436, 439, 441, 458
I
ICAP, 435, 442, 458
Implementations
dividers, 246
finite field arithmetic, 348
multipliers, 215
other operations, 268
Incompatibility graph, 48
Incompatibility relation, 48
Input buffer, 20
Input/output buffer, 130
Integers
B's complement, 173
Interrupt controller, 363, 383
Inversion
Newton-Raphson algorithm, 245
IO-port components, 19
IP, 357
L
Latch, 104
Latency, 56, 58
average, 67
LatticeMico32, 362
Layout implementation, 445, 449, 450
Level sensitive registers, 104
Libgen, 389
Lifetime, 45, 46
List scheduling, 44
LMB, 376, 439
Logarithm computation, 262
base-2, 252, 262, 273
Logic analyzer, 137
Long-path fault, 110
Look up tables, 2
Loop unrolling, 55, 72
LSB, 364
M
MAC, 358
MDD, 404
MDM, 375, 384, 439
Mean time between failures, 107
Memory, 366
Memory blocks, 17
Memory controller, 363
Mesosynchronous designs, 119
Metastability, 104, 116
recovery time, 107
setup-hold windows, 106
MHS, 373, 376
MicroBlaze, 362, 374, 437, 440
Minimum clock pulse, 104
MMU, 361
Modulo f(x) dividers, 350
binary algorithm, 347
Modulo f(x) division
binary algorithm, 347, 351
Modulo f(x) multiplication, 347
Modulo f(x) multipliers
interleaved multiplier, 347, 348
multiply and reduce, 347, 348
Modulo f(x) operations, 347
addition and subtraction, 347
division, 351
squaring, 350
Modulo f(x) squaring, 348
Modulo m adder-subtractor, 339
Modulo m exponentiation
LSB-first, 346
MSB-first, 347
Modulo m multipliers
double, add and reduce, 341
interleaved multiplier, 340, 342
multiply and reduce
Modulo m operations, 337
addition, 338
Montgomery multiplication, 343
multiplication, 338
subtraction, 338
Modulo m reducer
Modulo p division, 347
binary algorithm, 347
Montgomery product, 343
carry-save addition, 344
MPD, 398
MSB, 364
MSS, 373
MTBF, 107
Multi-core, 360
Multi-cycle paths, 113
Multioperand adders
combinational, 166, 179
sequential, 163, 179
Multiplexers, 15
Multiplication
basic algorithm, 183
mod B^(n+m), 199, 218
post correction, 206, 218
Multipliers
1-digit by 1-digit, 184
Booth, 204, 218
Booth-2, 210
carry-save, 187, 188, 215–218
carry-save for integers, 201
constant, 211, 212, 215
Dadda, 189
delay, 185, 199, 201, 215
integers, 199, 205
mixed-radix, 191
multioperand adders, 189
n-digit by 1-digit, 185
parallel, 185, 214
post correction, 218
radix-2^k, 217
radix-4, 210
right to left, 185
ripple-carry, 185
shift and add, 195, 201, 218
shift and add with CSA, 218
Wallace, 189
N
NCD, 385
Netlist, 130
Newton-Raphson algorithm, 245
Nios II, 362
Non-restoring algorithm, 291
Normalization
circuit, 324
O
Observability, 138
On-chip debugging, 137
Operations over Z2[x]/f(x), 347
addition and subtraction, 347
multiplication, 348
OS, 358, 361, 388, 408
Overflow, 174
P
Packing, 325
PAO, 400
Partial bitstream, 436, 453
Partial reconfiguration (PR), 435
Partition pin, 436
Peripheral, 362, 364
memory-mapped, 364
port-mapped, 364
Phase shifter, 113
Phase-locked loops (PLL), 113
Pipeline, 55, 67, 360
bandwidth, 58
period, 58
production, 58
rate, 58
speed, 58
Pipelined circuit, 60, 63
adder, 66
deskewing registers, 67
skewing registers, 67
PlanAhead, 437, 445
Platgen, 385
PLB, 375, 398, 439
Power consumption, 119
dynamic, 112
sources, 120
static, 120
PowerPC, 358, 362, 365
Precedence graph, 37
scheduled, 42
Precedence relation, 34
Profiling, 423
Propagate function, 153, 278
Propagation delay, 99
Propagation delay of a register, 104
Proxy logic, 436
PSRAM, 367
Pull-down, 97, 135
Pull-up, 97, 135
R
Race-through, 110, 111
Random access memory, 18
Read only memory, 17
Reconfigurable coprocessor, 435, 438
Reconfigurable module (RM), 436, 438
Reconfigurable modules, 446
Reconfigurable partition, 446
Reconfigurable partition (RP), 436
Register balancing, 131
Register duplication, 131
Register transfer level (RTL), 103
Registers, 11
duplicate, 1, 131
parallel, 11
right shift, 11
Required time, 139
Resource assignment, 44
Response time, 58
Retiming, 131
Ripple-carry adder, 277
RISC, 359
Rounding schemes, 319, 320
RPM, 358
RTL, 358
RTL-level, 135
Runt pulse, 103
S
Scalar product, 38
Schedule
admissible, 37, 38
ALAP, 37, 40
ASAP, 37, 40
Scheduling
operation scheduling, 34
SDK, 372, 388
SDRAM, 367
Segmentation, 58
admissible, 58, 60
Self-timed circuit, 67, 68, 69
adder, 70
Self-timing, 55
Setup time, 104
Setup time violation, 139
Sign bit, 174
Signed, 9
Simgen, 410
Simulated annealing, 44
Simulation, 135
in-circuit, 137
post place & route, 136
post-synthesis, 136
Single-core, 360
Skew, 135
Slack, 139
Slew rate, 135
SoC, 357
Soft-core, 360
Speedup factor, 58
Spikes, 103
Square root
floating-point, 316
Square rooters, 254
convergence methods, 260
fractional numbers, 259
Newton-Raphson method, 260, 273
non-restoring algorithm, 258, 272
restoring algorithm, 272
SRAM, 367
SRT-like algorithm, 295
Start signal, 67
Static partition, 436, 438
Static timing analysis, 139
Status register, 364
STD_LOGIC type, 20
Sticky bit, 322
Stuck-at fault model, 138
Subtractors, 173
2's complement, 175
B's complement, 175
Superscalar, 360
Synchronization
asynchronous FIFO, 117
failures, 110
handshake signaling, 116
hold violation, 111
mesosynchronous, 119
registers, 108
setup violation, 110
synchronizer, 113, 116
true single phase clock (TSPC), 103
Synchronous design, 103
Synthesis, 130
constraints, 131
optimizations, 130
reports, 132
T
Tables, 3
TCL, 404
Ten's complement, 282
carry-chain adder-subtractor, 283
numeration system, 282
sign change, 283
Testbench, 410, 414
Throughput, 58
Timer, 393
Timing
analysis, 139
constraints, 113
false paths, 113
fast or slow paths, 113
multi-cycle paths, 113
TLB, 361
Transition time, 98
Tree
multioperand addition, 167
Trigonometric functions, 268
CORDIC, 267
pseudo-rotations, 267
Tri-state buffers, 5
Tri-states, 98
True single phase clock (TSPC), 103
U
UART, 363, 374
UCF, 385
Unpacking, 322
Unrolled loop implementations, 73
divider, 75
Unsigned, 9
V
Variable latency operations, 91
Video controller
Von Neumann architecture, 359
W
Wrapper, 414
X
XMD, 373, 396, 408
XMP, 373
XPS, 372, 376
Z
ZBT, 368
Zero-clocking, 110