A Simple Self-Timed Implementation of a Priority Queue for Dictionary Search Problems




Ali Muhtaroglu, Senior Member, IEEE, Omer Berat Sezer

Abstract: This paper describes a sparse priority queue
suitable for reporting the results from a sequence database
search, using a self-timed protocol. The prioritization is simplified
through an insertion sort scheme with no greater/less than logic.
The resulting implementation promises to be compact, fast, and
suitable for the specified application area. The architectural
design has been validated on a prototype platform with Altera
Cyclone II Field Programmable Gate Array (FPGA).

Index Terms: Hardware sorter, insertion sort, priority queue,
self-timed queue.

I. INTRODUCTION
Sequence comparison hardware implementations for applications
including dictionary or genetic database searches are based on
quantifying the similarity between a source string (or sub-string)
and a target string (or sub-string). One of the compared strings can
be arbitrarily long, e.g. a database. In previous work [1], a
superscalar Cellular Automaton Processor (CAP) implementation was
described, based on Motomura's building block [2], for high
performance sub-string alignment. The output of this machine was a
temporally sparse set of integer pairs, the first number representing
the location of the alignment or the matched sub-string, and the
second representing the significance score. It is of high interest in
such an application to efficiently collect the most significant
outputs and report them through a priority queue.

A. Requirements
The desirable features of the priority queue can be summarized as
follows for the described application area:
1) Simple, compact design: The priority queue typically services
other processors with demanding hardware costs. Regardless of
whether the final solution is implemented on a fully custom chip or
a Field Programmable Gate Array (FPGA), it needs to have minimal
size and low complexity by taking advantage of the fact that the
sorted integers are unsigned and have a relatively narrow range of
interest. It is worthwhile to note that some similarity scoring
schemes allow negative scores, but such schemes can easily be
converted to a scheme with positive scores only.
2) Request driven insertion for low energy operation: Since the
entries to the queue are temporally sparse, it is undesirable to
have a synchronous implementation with a constantly toggling clock.
3) Recording of repeats: It is typical of the application that many
results get reported with equivalent significance. The priority
queue needs to favor high scoring repeats over low scoring unique
alignments.
4) Finite queue size determined based on desired performance:
Searches against large databases can occasionally result in many
hits depending on how the search hit threshold was tuned. It is
impractical and slow to implement a very long priority queue to
accommodate such circumstances. Instead, it is desirable to have
integrated features to drop the lowest priority (lowest
significance) entries in order to make space for the more
significant ones.

B. Sparse Priority Queues
Common binary tree based priority queue algorithms are complex, and
have a high processing and storage hardware cost. Linear array
implementations [3,4] offer reduced complexity, but are far from
simple due to their emphasis on removing global control routes for a
fully systolic implementation. Sparse linear array [5]
implementations significantly simplify the hardware if the queue size
can be managed, satisfying requirement (A.1). They are also suitable
for asynchronous self-timed implementations with temporally sparse
characteristics, as will be discussed further in the next section. A
sparse priority queue was therefore picked as the backbone of the
implementation in this work.

The architectural description of the self-timed priority queue is
provided next. Section III provides simulations and prototype
validation results. Finally, conclusions are shared.

A. Muhtaroglu is with Middle East Technical University Northern Cyprus
Campus, T.R.N.C. via Mersin 10, Turkey (phone: 90-392-661-2933; fax:
90-392-661-2949; e-mail: [email protected]).
O. B. Sezer was an undergraduate research assistant with Middle East
Technical University Northern Cyprus Campus, T.R.N.C. via Mersin 10,
Turkey (e-mail: [email protected]).

978-1-4244-3523-4/09/$25.00 ©2009 IEEE

II. ARCHITECTURAL DESCRIPTION

A. Self-Timed Asynchronous Handshake Protocol
Self-timed asynchronous design has the desirable characteristic that
each data transfer in the circuit happens at its maximum handshaking
speed. The fast circuit paths are not penalized by the worst case
speed path, i.e. they do not have to wait for the next clock edge
that is set to accommodate the slowest circuit path, as is done in
synchronous design methodologies. In addition, omitting the global
clock reduces power by significantly lowering the CV²f dynamic power,
where C is the effective switched capacitance, V is the supply
voltage, and f is the clock switching frequency. Asynchronous
architectures have the advantage of switching control signals only
when there is data to be transferred between two points, and by
definition satisfy requirement (A.2) in the previous section.
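As a rough illustration of the CV²f argument above, the following Python sketch compares the dynamic power of a line driven by a free-running clock with that of a line that toggles only when a request arrives. The function name and all numeric values are hypothetical and chosen only to make the arithmetic visible; they are not measurements from this work.

```python
def dynamic_power(c_eff, v_supply, f_switch):
    """Dynamic switching power C * V^2 * f in watts.

    c_eff    : effective switched capacitance in farads
    v_supply : supply voltage in volts
    f_switch : switching frequency in hertz
    """
    return c_eff * v_supply ** 2 * f_switch


# Hypothetical values for illustration only (not taken from the paper).
C_EFF = 10e-12       # 10 pF of switched capacitance
V_SUPPLY = 1.2       # 1.2 V supply
F_CLOCK = 50e6       # free-running 50 MHz clock
F_REQUESTS = 50e3    # sparse requests, about 50 thousand per second

clocked = dynamic_power(C_EFF, V_SUPPLY, F_CLOCK)
request_driven = dynamic_power(C_EFF, V_SUPPLY, F_REQUESTS)
print(f"clocked:        {clocked:.3e} W")
print(f"request driven: {request_driven:.3e} W")
print(f"ratio:          {clocked / request_driven:.0f}x")
```

With C and V held equal, the ratio depends only on how often the signal actually switches, which is why temporally sparse inputs favor a request driven design.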
Sutherland [6] first introduced transition signaling based
micropipelines, which are elastic, i.e. they move the data through
handshaking of the neighboring processing elements (PEs) until two
PEs with valid entries are queued back-to-back with no invalid data
(bubble) between them. This characteristic fits the operation of a
sparse priority queue perfectly. Once all the entries are entered
into the queue and sorted, a global signal can compress the sparse
array in this elastic operation mode to one end of the queue.
Our asynchronous implementation is based on a 4-phase (level)
signaling scheme instead of Sutherland's transition signaling. It has
been shown in general [7] that a level signaling scheme has the
advantage of requiring much simpler circuits (requirement A.1 in
Section I) compared to transition signaling. In addition, it does not
have high noise sensitivity, unlike some other handshake protocols.
4-phase signaling between two processing elements is depicted in
Fig. 1. Assuming all control signals start low, signal logic levels
and cause-effect relationships are summarized in the figure for data
transfer in one direction. The handshake starts with an RO
(Request-Out) assertion from the transmitter. The signal is
interpreted as an RI (Request-In) at the receiver end, and is used to
latch in new data available from the transmitter if the receiver PE
is unoccupied. Each PE stores a one-bit flag to indicate that it is
occupied. The same bit also initiates the requests (ROs) from the
transmitter to the neighboring receiver PEs. If the receiver PE is
not already occupied, then once it correctly registers the new flag,
it asserts AO (Acknowledge-Out) back, which becomes AI
(Acknowledge-In) at the transmitter and removes the outstanding
request, which is the same as deleting the existing flag at the
transmitter. The deassertion of RO turns off the RI input of the
receiver, which in turn deactivates AO and AI.
Fig. 1 4-phase level signaling for PE to PE handshake
[Signals at the transmitter: RO (Request-Out), AI (Acknowledge-In).
Signals at the receiver: RI (Request-In), AO (Acknowledge-Out).]
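The cause-effect chain of the 4-phase handshake can be sketched in software as follows. This is a minimal sequential model under stated assumptions, not the circuit itself: the `PE` class, the `four_phase_transfer` function, and the representation of the link as (request, acknowledge) level pairs are hypothetical names introduced for illustration, and real PEs operate concurrently rather than step by step.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PE:
    """One handshake processing element (hypothetical software model)."""
    flag: bool = False            # occupancy flag; the same bit drives RO
    data: Optional[int] = None    # content of the associated data register


def four_phase_transfer(tx: PE, rx: PE) -> List[Tuple[int, int]]:
    """Move one entry from tx to rx.

    Returns the (request, acknowledge) levels seen on the link, where
    request stands for RO/RI and acknowledge stands for AO/AI.
    """
    if not tx.flag:
        return [(0, 0)]           # nothing to send, link stays idle
    trace = [(1, 0)]              # phase 1: tx flag raises RO, seen as RI
    if rx.flag:
        return trace              # receiver occupied: request stays pending
    rx.data, rx.flag = tx.data, True
    trace.append((1, 1))          # phase 2: rx latched the data and raised AO
    tx.flag = False
    trace.append((0, 1))          # phase 3: AI cleared the tx flag, RO drops
    trace.append((0, 0))          # phase 4: RI low makes rx drop AO (and AI)
    return trace


tx, rx = PE(flag=True, data=5), PE()
print(four_phase_transfer(tx, rx))
print(tx, rx)
```

Note that the transmitter keeps its old data value after the transfer; only its flag is cleared, so the stale copy is simply treated as invalid.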

Fig. 2 Self-timed priority queue organization
[Blocks shown: handshake PEs, data array with enable (en) inputs,
data to be inserted, comparison logic (=?), valid data indicators,
Wired-OR / OR, and logic to drop the lowest priority entry to
accommodate duplicates.]

B. Self-Timed Priority Queue
A bidirectional micropipeline organization with 4-phase signaling
has been used to implement the priority queue. The building blocks
of the queue are shown in Fig. 2. The bidirectional handshake PEs
store a flag to indicate that the corresponding data entry should
not be overwritten. The PEs facilitate leftward movements during
data entry and sorting, and rightward movements while reading the
prioritized list. This sequence is further explained below. The data
array is initialized with an ordered and relevant set of integer
keys, which significantly speeds up the execution of the rest of the
insertion sort operations. When a new data entry arrives, it is
compared in parallel to the initialized values of the data array
registers, and when there is a hit, the corresponding valid bit is
set to commit a data array register to the final priority list. If a
newly arriving entry is a repeat, the valid bit of the corresponding
entry has already been set, which triggers a global signal to shift
the entries to the left by one position through the 4-phase
handshake. A duplicate of the matched data entry and its valid bit
is generated during this shifting. Thus the elastic nature of the
queue allows the lowest priority entries to be dropped in favor of
the repeats (requirement A.3 in Section I). Due to this feature, the
queue can be kept short, with a few global signals accommodated for
simple design (requirement A.4 in Section I).
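The insertion behavior described above can be captured in a short behavioral sketch. It is a software stand-in under stated assumptions, not the hardware: the function name `insert`, the returned status strings, and the choice to ignore keys that match no register are hypothetical, and the parallel comparators and the self-timed shift are modeled with ordinary sequential code. The four keys 3 to 6 are used only as a small example.

```python
from typing import List


def insert(data: List[int], valid: List[bool], key: int) -> str:
    """Insert key into the sparse queue in place and report what happened."""
    # Software stand-in for the parallel "=?" comparators.
    hits = [i for i, stored in enumerate(data) if stored == key]
    if not hits:
        return "miss"             # assumption: unmatched keys are ignored
    m = hits[-1]                  # rightmost matching position
    if not valid[m]:
        valid[m] = True           # first occurrence: commit the register
        return "first"
    # Repeat: drop the leftmost entry and copy everything up to the
    # match one position to the left, valid bits included. Entries to
    # the right of the match stay where they are.
    for i in range(m):
        data[i], valid[i] = data[i + 1], valid[i + 1]
    return "repeat"


# Four-entry example with keys 3..6, highest priority on the right.
data = [3, 4, 5, 6]
valid = [False] * 4
print(insert(data, valid, 5), data, valid)
print(insert(data, valid, 5), data, valid)
```

In this model the second insertion of key 5 discards key 3 from the left end and leaves two committed copies of 5, with no greater-than or less-than comparison involved.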
C. Modes of Operation
Circular operation has been assumed for the priority queue,
controlled by a single global control signal named readout.
When readout=1, the queue is in report mode. When
readout=0, it is in insertion sort mode, which may last for a
long time, depending on how long it takes for the dictionary
search to be completed by a separate processor. The queue
requires initialization at the beginning of the two main modes.
This is achieved using the rising and falling edges of the
readout signal.
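A minimal sketch of the edge-triggered mode initialization follows. It assumes the readout signal is observed as a sequence of sampled levels and that it starts high; the function name `mode_events` and the event labels are hypothetical and serve only to show which edge starts which mode.

```python
from typing import Iterable, List, Tuple


def mode_events(readout_levels: Iterable[int],
                initial: int = 1) -> List[Tuple[int, str]]:
    """Return (sample index, event) pairs for every edge of readout."""
    events = []
    previous = initial            # assumption: readout starts high
    for index, level in enumerate(readout_levels):
        if previous == 1 and level == 0:
            # Falling edge: the queue enters insertion sort mode.
            events.append((index, "init insertion sort mode"))
        elif previous == 0 and level == 1:
            # Rising edge: the queue enters report mode.
            events.append((index, "init report mode"))
        previous = level
    return events


print(mode_events([1, 0, 0, 0, 1, 1, 0]))
```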
Insertion Sort Mode:
The falling edge of the readout signal initializes the data array to
a predetermined set of keys, ordered from highest priority on the
right to lowest priority on the left (largest to smallest
significance score in the application). The same falling edge
initializes the handshake PEs to logic 1, as depicted in Fig. 3,
indicating that all initialized entries should be kept in the array
until a later time when deletion is necessary to allocate space for
additional copies of the data entries. A queue with four entries is
shown in the figure as an example, where the relevant keys to be
tracked are between 3 and 6.

Fig. 3 Initialization of Insertion Sort Mode

2009 2nd International Conference on Adaptive Science & Technology
While in the insertion sort mode, the availability of a new entry is
signaled through an active low sort request (sortreq#) signal. If
the new entry matches one of the available keys in the array, the
corresponding valid signal is turned on, as in the example of
Fig. 4. The insertion sort of a new entry is fast for the first
occurrence of any key in the array. It is therefore unlikely that a
new entry will arrive before the previous one is fully processed by
the priority queue. However, a global signal is still made available
to communicate the completion of sort operations in the queue. This
signal is named sortready, and is turned off while the queue is
processing. The dictionary search engines can monitor this signal to
ensure it is high before sending a new sort request (sortreq#).
Since sortreq# is utilized by level sensitive logic, it should
remain asserted as long as sortready=0.
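The two rules a search engine has to follow on the sortreq#/sortready pair can be written as a small trace checker, for example for use in a testbench. This is a sketch under stated assumptions: the signals are treated as sampled (sortreq#, sortready) level pairs even though the real interface is asynchronous, sortreq# is assumed to idle high, and the function name `protocol_violations` is hypothetical.

```python
from typing import List, Sequence, Tuple


def protocol_violations(samples: Sequence[Tuple[int, int]]) -> List[str]:
    """Check successive (sortreq#, sortready) levels; sortreq# is active low."""
    problems = []
    prev_req = 1                  # assumption: sortreq# idles high
    for t, (req_n, ready) in enumerate(samples):
        if prev_req == 1 and req_n == 0 and ready == 0:
            # A new request may only be issued when the queue is ready.
            problems.append(f"t={t}: sortreq# asserted while sortready=0")
        if prev_req == 0 and req_n == 1 and ready == 0:
            # The request must be held until the queue finishes.
            problems.append(f"t={t}: sortreq# released while sortready=0")
        prev_req = req_n
    return problems


good = [(1, 1), (0, 1), (0, 0), (0, 0), (0, 1), (1, 1)]
early_release = [(1, 1), (0, 1), (0, 0), (1, 0)]
print(protocol_violations(good))
print(protocol_violations(early_release))
```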
In the repeated occurrence of an entry, the global drop-one signal
triggers an acknowledge to the leftmost handshake PE, which erases
its flag (Fig. 5(a)). This causes the micropipeline to shift (copy)
entries to the left by one. The entries to the right of the matching
key are blocked from receiving an acknowledge and thus do not take
part in the copying (Fig. 5(b)). The sortready signal is turned off
during the self-timed shift operation, preventing a new request from
showing up during this time. Note that even though the insertion of
the repeated key has been described in two phases, the full
operation takes place asynchronously with a single sortreq#
assertion. The request and acknowledge signals depicted in Fig. 5(b)
are not all asserted simultaneously, but follow the 4-phase protocol
timing at each stage as previously shown in Fig. 1.

Fig. 4 Insertion sort for the first occurrence of a key
[Example shown: key 5 inserted with readout=0 and sortreq#=0.]

Fig. 5a Insertion sort for the repeated key occurrence, 1st phase

Fig. 5b Insertion sort for the repeated key occurrence, 2nd phase
Report Mode:
The rising edge of the readout signal copies the contents of the
valid data indicators to the handshake PEs as depicted in Fig. 6.
The micropipeline of PEs switches its direction of movement from
leftward to rightward when readout=1, which means the data
corresponding to the PEs with active flags will accumulate at the
right end of the data array by initiating the proper requests shown
in Fig. 7. When data entries get shifted to the right along with the
request-acknowledge pair assertions, the old copies do not get
deleted. Thus, the key 5 appears at three entries of the data array
in our example at the end of the processing, but only the rightmost
two entries with active PE flags are valid. The data can be read out
through the 4-phase handshake by observing the request signals from
the rightmost PE. Alternatively, the data can also be captured
serially by a synchronous receiver using a constant frequency clock
pulse as the acknowledge signal going into the rightmost PE once the
micropipeline is done with the accumulation process. The frequency
of the clock has to be set low enough in the latter method to allow
the queue entries to asynchronously shift by one internally between
two clock pulses.

Fig. 6 Initialization of Report Mode

Fig. 7 Execution of Report Mode
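The report mode can be sketched in software as follows. The model is a sequential approximation under stated assumptions: the names `accumulate_right` and `report` are hypothetical, the concurrent handshakes are replaced by repeated right-to-left sweeps, and the starting state is the four-entry example after key 5 has been inserted twice.

```python
from typing import List, Tuple


def accumulate_right(data: List[int], flags: List[bool]) -> None:
    """Shift flagged entries rightwards until no flagged PE has a free
    right neighbour. Old copies of the data are left in place."""
    moved = True
    while moved:
        moved = False
        for i in range(len(data) - 2, -1, -1):
            if flags[i] and not flags[i + 1]:
                data[i + 1] = data[i]                 # copy, do not erase
                flags[i + 1], flags[i] = True, False  # flag moves right
                moved = True


def report(data: List[int],
           valid: List[bool]) -> Tuple[Tuple[List[int], List[bool]], List[int]]:
    """Return the state after accumulation and the values read out."""
    data = list(data)
    flags = list(valid)           # rising readout edge: valid bits -> PE flags
    accumulate_right(data, flags)
    snapshot = (list(data), list(flags))
    out = []
    while flags and flags[-1]:
        out.append(data[-1])      # receiver observes the rightmost request
        flags[-1] = False         # acknowledge into the rightmost PE
        accumulate_right(data, flags)
    return snapshot, out


snapshot, values = report([4, 5, 5, 6], [False, True, True, False])
print(snapshot)
print(values)
```

In this model the accumulated data array holds key 5 in three positions while only the two rightmost flags are active, which agrees with the example in the text.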

III. IMPLEMENTATION AND VALIDATION


The described self-timed priority queue has been prototyped on a
DE2-70 development platform hosting an Altera Cyclone II FPGA. Fully
parameterized VHDL code has been used for design entry. The AND and
OR logic for the global sortready and drop-one signal assertions,
respectively, has been implemented using regular gates for
simplicity, instead of tri-state or open-drain type approaches.
Alternate implementations can be investigated for such global
signals with large fan-in if performance is to be enhanced further.
Due to the application area of interest, the priority queue is not
expected to get excessively long for the management of the few
global signals in this design. The timing simulations and sample
validation results are included in this section to demonstrate the
operation of the system.
A. Simulation
A sample simulation is depicted in Fig. 8 with a queue length of 16.
The simulation duration is kept relatively short for clarity.
Consecutive integers 0 to 15 are stored in the data array during the
initialization at the low-going readout edge. Only the rightmost 8
entries of the data array are shown as signal q at the very bottom.
Three numbers (shown in hex format in the figure) are then inserted
into the queue in the insertion sort mode when readout=0: 10, 14,
and 10. Three valid indicators (dvq in the figure) are copied to the
handshake PEs at the high-going edge of the readout signal, followed
by asynchronous micropipeline operations to move valid data to the
right end. The three valid data entries are then consecutively read
out by asserting the active low acknowledge signal (ackir_b) going
to the rightmost handshake PE.

Fig. 8 Simulation of data entry, sorting, and reporting

B. Prototype Validation
The simulation of Fig. 8 has been replicated on the functional
validation prototype as depicted in Fig. 9(a-h). Only the highest
priority eight entries in the data array are shown on the 7-segment
displays. The bottom middle LEDs, lighting up one by one in
Fig. 9(c-e), represent the rightmost eight data valid indicators.
The bottom right LEDs, lighting up in Fig. 9f, are connected to the
reqoutr signals in the simulation, and show the position of the PE
flags at the rightmost end of the priority queue. The results match
the simulations, and successfully demonstrate the correct
functionality of the platform.
Another test run, depicted in Fig. 10 on the same prototype board,
shows eight numbers entered into the same queue size of 16. The left
hand picture shows the queue status after the number sequence has
been entered in the random order 2, C, 8, F, 5, 8, 3, 5, followed by
setting readout=1. The picture on the right shows the status of the
queue after the three highest priority numbers have been read out by
asserting the ackir_b signal three times.
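The two test sequences above can be replayed on a compact behavioral model to see which readout order the described scheme implies. This is a sketch only, not the authors' VHDL: the class `SparseQueueModel` and its methods are hypothetical, unmatched keys are assumed to be ignored, the readout is modeled as a non-destructive list of committed entries taken from right to left, and the sequence 2, C, 8, F, 5, 8, 3, 5 is assumed to be written in hex digits.

```python
from typing import Iterable, List, Optional


class SparseQueueModel:
    """Behavioural sketch of the sparse priority queue (hypothetical)."""

    def __init__(self, keys: Iterable[int]):
        self.data = sorted(keys)               # highest priority on the right
        self.valid = [False] * len(self.data)

    def insert(self, key: int) -> None:
        if key not in self.data:
            return                             # assumption: unmatched keys are ignored
        # Index of the rightmost register that holds this key.
        m = len(self.data) - 1 - self.data[::-1].index(key)
        if not self.valid[m]:
            self.valid[m] = True               # first occurrence
        else:
            # Repeat: drop the leftmost entry, copy up to the match.
            self.data[:m] = self.data[1:m + 1]
            self.valid[:m] = self.valid[1:m + 1]

    def read_out(self, count: Optional[int] = None) -> List[int]:
        """Committed entries, highest priority first (non-destructive)."""
        ordered = [k for k, v in zip(self.data, self.valid) if v][::-1]
        return ordered if count is None else ordered[:count]


# Sequence used for Fig. 8 and Fig. 9: 10, 14, 10.
run_a = SparseQueueModel(range(16))
for key in (10, 14, 10):
    run_a.insert(key)
print(run_a.read_out())

# Sequence used for Fig. 10, read as hex digits: 2, C, 8, F, 5, 8, 3, 5.
run_b = SparseQueueModel(range(16))
for key in (0x2, 0xC, 0x8, 0xF, 0x5, 0x8, 0x3, 0x5):
    run_b.insert(key)
print([format(k, "X") for k in run_b.read_out(3)])
```

Under these assumptions the first run reports 14, 10, 10 and the first three values read from the second run are F, C, 8. These are predictions of the sketch, not readings taken from the figures.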
IV. CONCLUSION
An original and simple self-timed priority queue implementation
based on insertion sort has been designed, and implemented using
VHDL on an FPGA prototype. The size of the priority queue is fully
scalable through a few global variables. Validation results, a
subset of which has been included in this report, have demonstrated
correct operation. The design meets the requirements of the
dictionary search applications at a minimum hardware cost.

Fig. 9a readout=1 (rightmost switch) after CLR

Fig. 9b readout switches to 0; data array is initialized

Fig. 9c Number 10 entered

Fig. 9d Number 14 entered

Fig. 9e Second occurrence of number 10 entered

Fig. 9f readout=1; execution of Report Mode

Fig. 9g Data array and PE flags (reqoutr) after the first right acknowledge

Fig. 9h Data array and valid indicators after the last right acknowledge

Fig. 10 Data queue and PE flags after entering 8 decimal numbers, sorting
(left), and reading out 3 highest priority numbers from the queue (right)

ACKNOWLEDGMENT
Some of the ideas behind this work were conceived at Cornell
University while working with John V. Oldfield. Many thanks for his
enthusiastic feedback.

REFERENCES
[1] A. Muhtaroglu, "A Subsequence Alignment Implementation using
Superscalar Cellular Automaton Processor System," Proceedings of 5th
International Symposium on Electrical and Computer Systems, pp.
155-158, Nov. 2008.
[2] M. Motomura, H. Yamada, T. Enomoto, "A 2K-Word Dictionary Search
Processor (DISP) LSI with an Approximate Word Search Capability,"
IEEE Journal of Solid State Circuits, Vol. 27, No. 6, pp. 883-91,
June 1992.
[3] D.T. Lee, H. Chang, C.K. Wong, "An On-Chip Compare/Steer Bubble
Sorter," IEEE Transactions on Computers, Vol. C-30, No. 6, June
1981.
[4] B. Parhami, D. Kwai, "Data-Driven Control Scheme for Linear
Arrays: Application of a Stable Insertion Sorter," IEEE Transactions
on Parallel and Distributed Systems, Vol. 10, No. 1, pp. 23-28,
January 1999.
[5] A. Itai, A.G. Konheim, M. Rodeh, "A Sparse Table Implementation
of Priority Queues," Automata, Languages, and Programming, pp.
417-431, 1981.
[6] I. Sutherland, "Micropipelines," Communications of the ACM, Vol.
32, No. 6, pp. 720-738, June 1989.
[7] J. Ebergen, S. Furber, A. Saifhashemi, "Notes on Pulse
Signaling," Proceedings of 13th IEEE Symposium on Asynchronous
Circuits and Systems, pp. 15-24, March 2007.
[8] A. Muhtaroglu, "Cellular Automaton Processor Based Systems for
Genetic Sequence Comparison/Database Searching," M.S. Thesis,
Cornell University, 1996.
[9] Altera, http://www.altera.com
