Scan Primitives for Vector Computers

Siddhartha Chatterjee    Guy E. Blelloch    Marco Zagha

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3890
1 Introduction

Vector supercomputers have been used to supply the high computing power needed for many applications. However, the performance obtained from these machines critically depends on the ability to produce code that vectorizes well. Two distinct approaches have been taken to meet this goal: vectorization of dusty decks [18], and language support for vector intrinsics, as seen in the proposed Fortran 8x standard [1]. In both cases, the focus of the work has been on speeding up scientific computations, characterized by regular and static data structures and predominantly regular access patterns within these data structures. These alternatives are not very effective for problems that involve irregular or dynamically varying data structures.
CH2916-5/90/0000/0666/$01.00 © 1990 IEEE

Routine                          t_∞ (clks/elt)
                               Scan version   Scalar version
plus scan (int)                    2.5             7.1 (*)
segmented plus scan (int)          3.7            74.8 (*)
parallel radix sort (64 bits)    896.8         11730.0 (*)
branch-sums                       12.2           206.5 (†)
root-sums                          9.8           208.6 (†)
delete-vertices                   19.2           276.2 (†)

Table 1: t_∞ (clks/elt) of the scan-based and scalar versions. ((*) scalar version in Fortran; (†) scalar version in C.)
Abstract

This paper describes an optimized implementation of a set of scan (also called all-prefix-sums) primitives on a single processor of a CRAY Y-MP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers and is applicable with minor modifications to any register-based vector computer. On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the unsegmented versions. These segmented versions can be used to execute a scan on multiple data sets without having to pay the vector startup cost (n_1/2) for each set.

The paper describes a radix sorting routine based on the scans that is 13 times faster than a Fortran version and within 20% of a highly optimized library sort routine, three operations on trees that are between 10 and 20 times faster than the corresponding C versions, and a connectionist learning algorithm that is 10 times faster than the corresponding C version for sparse and irregular networks.
2 Scan Primitives

A scan operation takes a binary associative operator ⊕ with identity i_⊕, and a vector [a_0, a_1, ..., a_{n-1}] of n elements, and returns the vector [i_⊕, a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-2})]. In this paper we use addition and maximum as operators; these two operators are adequate for all the applications we discuss. We henceforth refer to these scans as +-scan and max-scan respectively. For example:

    A           = [5  1  3  4  9  2]
    +-scan(A)   = [0  5  6  9 13 22]
    max-scan(A) = [-∞ 5  5  5  5  9]

A segmented scan executes an independent scan within each segment of a vector. A vector divided into segments, such as [[s_00 s_01 s_02] [s_10 s_11] [s_20 s_21]], can be represented by head-flags (a 1 marking the first element of each segment), by segment lengths, or by head-pointers. For example:

    A             = [5 1 3  4 3  9 2]
    head-flags    = [1 0 0  1 0  1 0]
    lengths       = [3 2 2]
    head-pointers = [0 3 5]
    seg-+-scan(A) = [0 5 6  0 4  0 9]

Scans have many uses, including processing images [16], solving problems in computational geometry [4] (such as computing the points on a grid given its endpoints), and efficiently implementing languages that support nested parallel structures [5], such as CM-LISP [23], on flat parallel machines.
There has been significant research on parallel and vector algorithms for unsegmented scans. A parallel implementation of scans on a perfect shuffle network was suggested by Stone [19] for polynomial evaluation. The problem with this algorithm is that it required O(log n) time with O(n) processors. Thus it required O(n log n) arithmetic operations as opposed to O(n) for the serial algorithm. Stone later showed an optimal version of the algorithm in the context of solving tridiagonal equations [20]. This algorithm can be implemented in O(log n) time with O(n / log n) processors. When implemented on the CRAY Y-MP, this algorithm results in running times that are longer than those for the algorithm presented in this paper, since it requires intermediate results to be written out to main memory and then read back in again. Other approaches to implementing scans on a vector machine are based on cyclic reduction [8] and unrolling scalar loops [21], which again have larger constant factors than the implementation presented in this paper.

The frequent use of linear recurrences in scientific code has motivated research in finding efficient and vectorizable methods for scans. For instance, Livermore loops 5, 6, 11, 19 and 23 involve recurrences that can be expressed in terms of scans. State-of-the-art vectorizing compilers can recognize such linear recurrences and use one of the above algorithms to optimize such loops [15]. However, segmented scans cannot be handled by such techniques. This is the reason that the performance of the unsegmented scalar plus scan in Table 1 is significantly better than that of the segmented plus scan.
Figure 2: Executing a +-scan on four processors. Each processor sums the elements of its section (Sum), the per-processor sums are +-scanned (+-scan(Sum)), and each processor then scans its own section starting from the scanned sum.
3 Implementing Scan Primitives

In this section, we present vectorizable algorithms for the unsegmented and segmented scan primitives. We only consider the +-scan, as the algorithms can be easily modified for max-scan by changing the addition operation to a maximum.

We first describe a parallel algorithm for plus-scanning a vector of length n on p processors [13]. It has three phases:

1. Each processor sums the elements of its section of the vector.
2. The p per-processor sums are plus-scanned.
3. Each processor plus-scans its own section, using the result of phase 2 as the starting value.

On a vector register machine, the elements of a vector register play the role of the processors. The reason for organizing the data thus is that all memory accesses are then constant-stride, and bank conflicts are avoided.
3.1 Algorithm

To adapt the multiprocessor algorithm to a vector register machine, the source vector of length n = l·s is divided into l contiguous sections of s elements each, where l is the vector register length; element j of a vector register plays the role of processor j and processes section j. A vector fetch with stride s then loads the i'th element of every section in a single instruction.

Phase 1 computes the sum of each section in the register Vl:

    Vl = 0;
    k = (l-1)*s;
    for (i = 0; i < s; i++) {
        /* vector fetch with stride s */
        V2 = <v[i], v[s+i], ..., v[k+i]>;
        /* vector add */
        Vl = Vl + V2;
    }

Phase 2 plus-scans the l section sums within the register Vl; this short loop runs in scalar mode:

    sum = 0;
    for (i = 0; i < l; i++) {
        /* Vl[i] is the i'th elt of register Vl */
        t = Vl[i];
        Vl[i] = sum;
        sum = sum + t;
    }

Phase 3 rescans each section, using the scanned section sums in Vl as starting values. Note that the store precedes the add, so the result is an exclusive scan:

    k = (l-1)*s;
    for (i = 0; i < s; i++) {
        /* vector fetch with stride s */
        V2 = <v[i], v[s+i], ..., v[k+i]>;
        /* vector store with stride s */
        <d[i], d[s+i], ..., d[k+i]> = Vl;
        /* vector add */
        Vl = Vl + V2;
    }

Figure 4: Stages in the execution of the unsegmented scan operation. The diagram shows the organization of the source vector (Src), the intermediate states of Vl during phase 1 of the scan, the values in Vl after phase 2, and the values written out to the destination vector (Dst) during phase 3.
3.2 Algorithm

The segmented scan uses the same organization of the source vector. In addition, the head-flags are compressed into words of bits, so that the flags for one column of sections can be fetched as a single scalar word and used as a vector mask (VM).

Figure 5: Organization of the source vector (Src) and the compressed flags (Seg) for the segmented scan.
Figure 6: Stages in the execution of the segmented scan operation. The diagram shows the organizations of the source vector
(Src) and the flags (Seg), the intermediate states of Vl and Sl
during phase 1 of the scan, the values in Vl after phase 2, and the
values written out to the destination vector (Dst) during phase
3. The flags are bits, and all the other vectors are integers. The
padding on words of the compressed flags is omitted for clarity.
The segmented algorithm has the same three phases. Phase 1 computes the sum of the final segment piece of each section, and records in Sl which sections contain a segment boundary:

    Vl = 0;
    Sl = 0;
    k = (l-1)*s;
    for (i = 0; i < s; i++) {
        /* vector fetch with stride s */
        V2 = <v[i], v[s+i], ..., v[k+i]>;
        /* flag fetch (scalar) */
        VM = F[i];
        /* reset Vl if F is 1 (vector) */
        Vl = Vl merge 0 with VM;
        /* OR flag bits (scalar) */
        Sl = Sl | VM;
        /* vector add */
        Vl = Vl + V2;
    }

Phase 2 scans the section sums, clearing the running sum whenever a section contains a segment boundary:

    sum = 0;
    for (i = 0; i < l; i++) {
        /* Vl[i] is the i'th elt of register Vl */
        t = Vl[i];
        Vl[i] = sum;
        /* Sl<i> is the i'th bit of Sl */
        if (Sl<i> != 0) sum = 0;
        sum = sum + t;
    }

This loop runs in scalar mode, but the cost is small and fixed, and it involves no memory transfers.

Phase 3 rescans the sections, again resetting at segment boundaries:

    k = (l-1)*s;
    for (i = 0; i < s; i++) {
        /* vector fetch with stride s */
        V2 = <v[i], v[s+i], ..., v[k+i]>;
        /* flag fetch (scalar) */
        VM = F[i];
        /* reset Vl if F is 1 (vector) */
        Vl = Vl merge 0 with VM;
        /* vector store with stride s */
        <d[i], d[s+i], ..., d[k+i]> = Vl;
        /* vector add */
        Vl = Vl + V2;
    }
Other useful segmented scan operations include max- and copy-scans. A segmented copy-scan copies values placed at the starting positions of segments across the segment, and can be implemented in a manner similar to the segmented plus-scan³. The copy-scan copies either the running value or the new value based on the flag at that point, in the same way that the plus-scan either adds the value to the running sum or clears the sum. The details are similar to the segmented plus-scan and are therefore not repeated here.

Using superscripts s and u to indicate, respectively, the segmented and unsegmented versions of the scan operation, the time to scan a vector of n elements in m segments with the segmented primitive, versus the time to run an unsegmented scan over each segment separately, is approximately:

    T_seg   ≈ t^s (n + n^s_1/2)
    T_unseg ≈ t^u (n + m · n^u_1/2)

The segmented primitive pays the vector startup cost only once, rather than once per segment.
3.3 Results

Table 2 presents timing results for our implementations of the scan primitives. We characterize the performance of each operation with two numbers, t_∞ (= 1/r_∞) and n_1/2 [9], that measure, respectively, the incremental cost of processing one element (in clock cycles) and the vector half-performance length. Our efforts have been mostly concerned with reducing t_∞, as this is (asymptotically) the dominant factor in the running time.

The primitives are implemented in Cray Assembler Language (CAL) [6] as a set of routines callable from the C programming language. The loops are unrolled twice to avoid register

³This is not a scan as defined in Section 2, but is so called because it can be implemented in a similar manner.
4 Applications

In this section we discuss three applications using these primitives: sorting, tree operations, and a connectionist learning algorithm called back-propagation. These applications were chosen
Operation                      Fortran equivalent

Elementwise:
  p-add-i(C, A, B, L)          C(I) = A(I) + B(I)
  p-sub-i(C, A, B, L)          C(I) = A(I) - B(I)
  p-neg-i(B, A, L)             B(I) = -A(I)
  p-max-i(C, A, B, L)          IF (A(I) .GT. B(I)) THEN C(I) = A(I)
                               ELSE C(I) = B(I)
  p-inv-i(B, A, L)             IF (A(I) .EQ. 0) THEN B(I) = 1
                               ELSE B(I) = 0
  ext-bit(C, A, N, L)          C(I) = gbit(A(I), 64-N)
  p-sel-i(C, A, B, F, L)       IF (F(I) .EQ. 0) THEN C(I) = A(I)
                               ELSE C(I) = B(I)
  p-distr-i(A, N, L)           A(I) = N

Permutation-based:
  perm-i(C, A, B, L)           C(B(I)) = A(I)
  bperm-i(C, A, B, L)          C(I) = A(B(I))
  sel-perm-i(C, A, F, B, L)    IF (F(I) .EQ. 1) C(B(I)) = A(I)

Table 3: The elementwise and permutation-based primitives and their Fortran equivalents.

Operation                  t_∞ (clks/elt)   n_1/2 (elts)
p-add-i                         1.14            483
p-sub-i                         1.16            767
p-neg-i                         1.11            710
p-max-i                         2.12            398
p-inv-i                         1.11            700
ext-bit                         1.11            709
p-sel-i                         2.13            483
p-distr-i                       1.09            668
perm-i (identity)               1.24            663
perm-i (reverse)                1.60            499
perm-i (unstructured)           1.82            419
bperm-i (identity)              1.56            557
bperm-i (unstructured)          1.87            446
sel-perm-i (identity)           3.39            332

Table 4: t_∞ and n_1/2 values for the elementwise and permutation-based primitives.

4.1 Sorting
The sort is based on a split operation, which packs the keys with a 0 in the current bit to the bottom of a vector, and packs the keys with a 1 in the bit to the top of the same vector. It maintains the order within both groups. The sort works because each split operation sorts the keys with respect to the current bit (0 down, 1 up) and maintains the sorted order of all the lower bits; remember that we iterate upwards from the least significant bit. Figure 7 shows an example of the sort along with code to implement it.
We now consider how the split operation can be implemented using the scan primitives (see Figure 8). The basic idea is to determine a new index for each element and permute the elements to these new indices. This can be done with a pair of scans as follows. We plus-scan the inverse of the Flags into a vector I-down. This gives the correct offset for all the elements with a 0 flag. We now generate the offsets for the elements with a 1 flag and put them in I-up, by executing a plus-scan on the flags (we must first add to the initial element the number of 0 elements). We then select the indices from the two vectors I-up and I-down based on the original flags vector. Figure 8 shows our code that implements the split operation. Figure 9 shows a Fortran implementation of the split operation. The available Fortran compiler did not vectorize this code.
The performance of the parallel radix sort on 32- and 64-bit inputs is shown in Table 5. This is within 20% of a highly optimized library sort routine and 13 times faster than a sort written in Fortran.

On the Cray Y-MP, the split operation can be implemented with the compress-index vector instruction with about the same performance (t_∞) as our implementation. It is interesting that we can get almost the same performance without ever using this instruction, which is designed specifically for split-type operations.
Figure 7: An example of the split radix sort on a vector containing three-bit values. The V⟨b⟩ notation signifies extracting the b'th bit of each element.

      subroutine split(dst, src, fl, len)
      integer dst(*), src(*), fl(*)
      integer len, index, j
      index = 1
      do 10 j = 1, len
         if (fl(j) .EQ. 0) then
            dst(index) = src(j)
            index = index + 1
         endif
 10   continue
      do 20 j = 1, len
         if (fl(j) .EQ. 1) then
            dst(index) = src(j)
            index = index + 1
         endif
 20   continue
      return
      end

Figure 9: The split operation implemented in Fortran. The Fortran compiler did not vectorize this code.
p-inv-b(NotF, Flags, len);
plus-scan-i(I_down, NotF, len);
/* number of 0s in Flags */
sum = NotF[len-1] + I_down[len-1];
/* set up correct offset */
Flags[0] += sum;
plus-scan-i(I_up, Flags, len);
/* restore flags */
Flags[0] -= sum;
/* fix initial element of I_up */
I_up[0] = sum;
p-sel-i(Index, I_up, I_down, Flags, len);
perm-i(Result, V, Index, len);

Figure 8: The split operation implemented with the scan and permutation primitives.

Routine                          t_∞ (clks/elt)   n_1/2 (elts)
split operation                      10.07            665
Parallel radix sort (32 bits)       421.35            560
Parallel radix sort (64 bits)       896.83            523

Table 5: t_∞ and n_1/2 values for the split operation and the parallel radix sort.
Routine            Vector Version                   Scalar Version
                   t_∞ (clks/elt)  n_1/2 (elts)     t_∞ (clks/elt)  n_1/2 (elts)
branch-sums             12.2           489               206.5           2
root-sums                9.8           526               208.6           2
delete-vertices         19.2           483               276.2

Table 6: t_∞ (= 1/r_∞) and n_1/2 values for the three tree operations. The numbers are for one processor of a Cray Y-MP.

Figure 10: Illustration of the tree operations (Original Tree, Branch Sums, Root Sums). For delete-vertices, the x marks the nodes to be deleted.
The running times for the vector and serial versions of the operations are shown in Table 6. The trees need not be binary or balanced for these operations.

These basic operations can be used to execute the following useful operations on a tree:

- Passing a flag down from a set of vertices to all their descendants. This is executed with a root-sums operation on a tree with a 1 in vertices that want their descendants marked. Any vertex that has a number greater than 0 as a result is marked.
5 Conclusions

This paper has presented efficient implementations of both unsegmented and segmented scan primitives on vector machines, and shown how this leads to greatly improved performance on a representative set of applications. Each of the primitives has almost optimal asymptotic performance, but since they are made available to the application as a library of function calls, we cannot optimize across the primitives. Such optimizations could reduce temporary storage, allocate resources better, and reduce
References

[1] American National Standards Institute. American National Standard for Information Systems Programming Language Fortran: S8(X3.9-198x), March 1989.

[12] Peter M. Kogge and Harold S. Stone. A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations. IEEE Transactions on Computers, C-22(8):786-793, August 1973.

[13] Richard E. Ladner and Michael J. Fischer. Parallel Prefix Computation. Journal of the Association for Computing Machinery, 27(4):831-838, October 1980.

[16] James J. Little, Guy E. Blelloch, and Todd A. Cass. Algorithmic Techniques for Computer Vision on a Fine-Grained Parallel Machine. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(3):244-257, March 1989.

[19] Harold S. Stone. Parallel Processing with the Perfect Shuffle. IEEE Transactions on Computers, C-20(2):153-161, 1971.

[20] Harold S. Stone. Parallel Tridiagonal Equation Solvers. ACM Transactions on Mathematical Software, 1(4):289-307, 1975.

W. Daniel Hillis and Guy L. Steele Jr. Data Parallel Algorithms. Communications of the ACM, 29(12), December 1986.

R. W. Hockney. A Fast Direct Solution of Poisson's Equation Using Fourier Analysis. Journal of the Association for Computing Machinery, 12(1):95-113, January 1965.