Performance Analysis of The Alpha 21364-Based HP GS1280 Multiprocessor
Zarka Cvetanovic
Hewlett-Packard Corporation
[email protected]
Abstract
This paper evaluates performance characteristics of the HP GS1280 shared memory multiprocessor system. The GS1280 system contains up to 64 Alpha 21364 CPUs connected via a torus-based interconnect. We describe the architectural features of the GS1280 system and compare and contrast the GS1280 to the previous-generation Alpha systems: the AlphaServer GS320 and ES45/SC45. We further quantitatively show the performance effects of these features using application results and profiling data based on the built-in performance counters. We find that the HP GS1280 often provides 2 to 3 times the performance of the AlphaServer GS320 at similar clock frequencies, and that the key reasons for such performance gains are advances in the memory, inter-processor, and I/O subsystem designs.

1. Introduction

The HP AlphaServer GS1280 is a shared memory multiprocessor containing up to 64 fourth-generation Alpha 21364 microprocessors [1]. Figure 1 compares the performance of the GS1280 to other systems using SPECfp_rate2000, a standard multiprocessor throughput benchmark [8]. We show the published SPECfp_rate2000 results as of March 2003, with the exception of the 32P GS1280 result, which was measured on an engineering prototype but not yet published. We use floating-point rather than integer SPEC benchmarks for this comparison since several of the floating-point benchmarks stress memory bandwidth, while all of the integer benchmarks fit well in the MB-size caches and are thus not a good indicator of memory system performance. The results in Figure 1 indicate that the GS1280 scales well in memory-bandwidth-intensive workloads and has a substantial performance advantage over the previous-generation Alpha platforms despite a disadvantage in processor clock frequency. We analyze key performance characteristics of the GS1280 in this paper to expose the key design features that allowed the GS1280 to reach such performance levels.

[Figure 1. SPECfp_rate2000 (Peak) vs. number of CPUs (1-32) for HP GS1280/1.15 GHz (1-16P published, 32P estimated), HP SC45/1.25 GHz (1-4P published, >4P estimated), HP GS320/1.2 GHz (published), IBM p690/650 Turbo/1.3/1.45 GHz (published), SGI Altix 3K/1 GHz (published), and SUN Fire V480/V880/0.9 GHz (published).]

The GS1280 system contains many architectural advances – both in the microprocessor and in the surrounding memory system – that contribute to its performance. The 21364 processor [1][16] uses the same core as the previous-generation 21264 processor [4]. However, the 21364 includes three additional components: (1) an on-chip L2 cache, (2) two on-chip Direct Rambus (RDRAM) memory controllers, and (3) a router. The combination of these components helped achieve improved access time to the L2 cache and to local/remote memory. These improvements enhanced single-CPU performance and contributed to excellent multiprocessor scaling. We describe and analyze these architectural advances and present key results and profiling data to clarify the benefits of these design features. We contrast the GS1280 to two previous-generation Alpha systems, both based on the 21264 processor: the GS320 – a 32-CPU SMP NUMA system with a switch-based interconnect [2] – and the SC45 – 4-CPU ES45 systems connected in a cluster configuration via a fast Quadrics switch [4][5][6][7].

We include results from kernels that exercise the memory subsystem [9][10]. We include profiling results for standard benchmarks (SPEC CPU2000 [8]). In addition, we analyze the characteristics of representatives from three application classes that impose various levels of stress on the memory subsystem and processor interconnect. We use profiles based on the built-in non-intrusive CPU hardware monitors [3]. These monitors are useful tools for analyzing system behavior with various workloads. In addition, we use tools based on the EV7-specific performance counters: Xmesh [11]. Xmesh is a graphical tool that displays run-time information on the utilization of CPUs, memory controllers, inter-processor (IP) links, and I/O ports.

The remainder of this paper is organized as follows: Section 2 describes the architecture of the GS1280 system. Section 3 describes the memory system improvements in the GS1280. Section 4 describes the inter-processor performance characteristics. Section 5 discusses application performance. Section 6 shows tradeoffs associated with memory striping. Section 7 summarizes comparisons. Section 8 concludes.
2. GS1280 System Overview

The Alpha 21364 (EV7) microprocessor [1], shown in Figure 2, integrates the following components on a single chip: (1) a second-level (L2) cache, (2) a router, (3) two memory controllers (Zboxes), and (4) a 21264 (EV68) microprocessor core. The processor frequency is 1.15 GHz. The memory controllers and inter-processor links operate at 767 MHz (data rate). The L2 cache is 1.75 MB in size and 7-way set-associative. The load-to-use L2 cache latency is 12 cycles (10.4 ns). The data path to the cache is 16 bytes wide, resulting in a peak bandwidth of 18.4 GB/s. There are 16 victim buffers from L1 to L2 and from L2 to memory. The two integrated memory controllers connect the processor directly to the RDRAM memory. The peak memory bandwidth is 12.3 GB/s (8 channels, 2 bytes each). Up to 2048 pages can be open simultaneously. The optional 5th channel is provided as a redundant channel. The four interprocessor links are capable of 6.2 GB/s each (2 unidirectional links with 3.1 GB/s each). The I/O chip is connected to the EV7 via a full-duplex link capable of 3.1 GB/s.
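As a quick sanity check on the bandwidth figures quoted above, the short C sketch below (our illustration, not part of the original study) recomputes the peak L2 and memory bandwidths from the stated path widths and clock rates:

    /* Recompute the peak-bandwidth figures quoted in Section 2 from the
       stated widths and clocks. Illustrative only. */
    #include <stdio.h>

    int main(void) {
        double l2_clock_ghz   = 1.15;  /* L2 runs at the core clock     */
        double l2_width_bytes = 16.0;  /* data path to the L2 cache     */
        double mem_rate_ghz   = 0.767; /* RDRAM data rate (767 MHz)     */
        double mem_channels   = 8.0;   /* two Zboxes, 8 channels total  */
        double mem_width_b    = 2.0;   /* 2 bytes per channel           */

        /* 16 B x 1.15 GHz = 18.4 GB/s peak L2 bandwidth */
        printf("L2 peak:     %.1f GB/s\n", l2_width_bytes * l2_clock_ghz);
        /* 8 ch x 2 B x 0.767 GHz = 12.3 GB/s peak memory bandwidth */
        printf("memory peak: %.1f GB/s\n",
               mem_channels * mem_width_b * mem_rate_ghz);
        return 0;
    }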
[Figure 2. 21364 block diagram: the EV68 core with L1 cache, the L2 tag and data arrays with the L2 cache controller, data buffers, the router (four IP links and one I/O port), and two RDRAM memory controllers, connected by address/control and data paths.]

[Figure 3. A 12-processor 21364-based multiprocessor: a two-dimensional torus of 21364 nodes, each with local memory (M) and an I/O port.]

The router [16] connects multiple 21364s in a two-dimensional, adaptive, torus network (Figure 3). The router connects to 4 links that connect to 4 neighbors in the torus: North, South, East, and West. Each router routes packets arriving from several input ports (L2 cache, Zboxes, I/O, and other routers) to several output ports (i.e., L2 cache, Zboxes, I/O, and other routers). To avoid deadlocks in the coherence protocol and the network, the router multiplexes a physical link among several virtual channels. Each input port has two first-level arbiters, called the local arbiters, each of which selects a candidate packet among those waiting at the input port. Each output port has a second-level arbiter, called the global arbiter, which selects a packet from those nominated for it by the local arbiters.

The global directory protocol is a forwarding protocol [16]. There are 3 types of messages: Requests, Forwards, and Responses. A requesting processor sends a Request message to the directory. If the block is local, the directory is updated and a Response is sent back. If the block is in the Exclusive state, a Forward message is sent to the owner of the block, who sends the Response to the requestor and the directory. If the block is in the Shared state (and the request is to modify the block), Forward/invalidates are sent to each of the shared copies, and a Response is sent to the requestor.
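To make the three directory cases concrete, the following C sketch (a simplification we constructed from the description above, not the 21364 hardware logic) shows the messages the directory would emit for a request to modify a block:

    /* Simplified sketch of the forwarding directory protocol: one
       directory entry, three outcomes for a request to modify a block.
       The 64-CPU sharer vector and the message "sends" (printfs) are
       illustrative assumptions. */
    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } state_t;

    typedef struct {
        state_t state;
        int owner;        /* valid when state == EXCLUSIVE             */
        int sharers[64];  /* per-CPU flags, valid when state == SHARED */
    } dir_entry_t;

    void handle_modify_request(dir_entry_t *e, int requestor) {
        switch (e->state) {
        case INVALID:    /* no remote copies: update directory, respond */
            printf("Response -> CPU %d\n", requestor);
            break;
        case EXCLUSIVE:  /* forward to the owner, who responds to the
                            requestor (and to the directory) */
            printf("Forward -> CPU %d (Response goes to CPU %d)\n",
                   e->owner, requestor);
            break;
        case SHARED:     /* invalidate every shared copy, then respond */
            for (int c = 0; c < 64; c++)
                if (e->sharers[c])
                    printf("Forward/invalidate -> CPU %d\n", c);
            printf("Response -> CPU %d\n", requestor);
            break;
        }
        e->state = EXCLUSIVE;   /* the requestor becomes the new owner */
        e->owner = requestor;
    }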
To optimize network buffer and link utilization, the 21364 routing protocol uses a minimal adaptive routing algorithm. Only a path with the minimum number of hops from source to destination is used. However, a message can choose the less congested minimal path (adaptive protocol). Both the coherence and adaptive routing protocols can introduce deadlocks in the 21364 network. The coherence protocol can introduce deadlocks due to cyclic dependence between different packet classes. For example, Request packets can fill up the network and prevent Response packets from ever reaching their destinations. The 21364 breaks this cyclic dependence by creating virtual channels for each class of coherence packets and prioritizing the dependence among these classes. By creating separate virtual channels for each class of packets, the router guarantees that each class of packets can be drained independent of the other classes. Thus, a Response packet can never block behind a Request packet. A Request can generate a Block Response, but a Block Response cannot generate a Request.

Adaptive routing can generate two types of deadlocks: intra-dimensional (because the network is a torus, not a mesh) and inter-dimensional (which arises in any square portion of the mesh). The intra-dimensional deadlock is solved with two virtual channels: VC0 and VC1. The inter-dimensional deadlocks are solved by allowing a message to route in one dimension (e.g., East-West) before routing in the next dimension (e.g., North-South) [12]. Additionally, to facilitate adaptive routing, the 21364 provides a separate virtual channel, called the Adaptive channel, for each class. Any message (other than I/O packets) can route through the Adaptive channel. However, if the Adaptive channels fill up, packets can enter the deadlock-free channels.
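The routing rules above – minimal paths only, the Adaptive channel preferred, dimension-ordered deadlock-free channels as the fallback – can be summarized in a few lines of C. This is our sketch of the policy, with an assumed torus size and credit interface, not the actual router arbitration:

    /* Sketch of minimal adaptive route selection on an N x N torus:
       only minimal directions are considered; a free Adaptive channel
       is preferred; otherwise the packet falls back to the
       dimension-ordered (X before Y) deadlock-free channels. */
    #define N 4   /* torus dimension, e.g. 4 x 4 = 16 CPUs (assumed) */

    typedef enum { EAST, WEST, NORTH, SOUTH } dir_t;

    /* Signed minimal offset from a to b along one torus ring. */
    static int min_offset(int a, int b) {
        int d = (b - a + N) % N;
        return (d <= N / 2) ? d : d - N;  /* wrap if that way is shorter */
    }

    /* Pick the output direction at (x, y) for a packet headed to
       (dx, dy). adaptive_free[dir] is nonzero when that direction's
       Adaptive channel has buffer credits. The caller guarantees the
       packet has not yet reached its destination. */
    dir_t route(int x, int y, int dx, int dy, const int adaptive_free[4]) {
        int ox = min_offset(x, dx), oy = min_offset(y, dy);
        dir_t want_x = (ox > 0) ? EAST : WEST;
        dir_t want_y = (oy > 0) ? NORTH : SOUTH;

        if (ox != 0 && adaptive_free[want_x]) return want_x; /* adaptive */
        if (oy != 0 && adaptive_free[want_y]) return want_y; /* adaptive */
        return (ox != 0) ? want_x : want_y;  /* deadlock-free, X then Y  */
    }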
The previous-generation GS320 system uses a switch to connect four processors to four memory modules in a single Quad Building Block (QBB) and then a hierarchical switch to connect QBBs into the larger-scale multiprocessor (up to 32 CPUs) [2].
3.1 Memory Latency

Figure 4 compares dependent-load memory latency across the three systems; the lower axis varies the referenced data size to fit in different levels of the memory system hierarchy, and data is accessed in a stride of 64 bytes (one cache block). The results in Figure 4 show that the GS1280 has 3.8 times lower "dependent-load" memory latency (at the 32 MB size) than the previous-generation GS320. This indicates that large applications that are not blocked to take advantage of a 16 MB cache will run substantially faster on the GS1280 than on the 21264-based platforms. For data sizes between 1.75 MB and 16 MB, latency is higher on the GS1280 than on the GS320 and ES45, since the block is fetched from memory on the GS1280 vs. from the 16 MB L2 cache on the GS320/ES45. This indicates that applications whose sizes fall in this range are likely to run slower on the GS1280 than on the previous-generation platforms. For data sizes between 64 KB and 1.75 MB, latency is again much lower on the GS1280 than on the GS320/ES45. That is because the L2 cache in the GS1280 is on-chip, thus providing much lower access time than the off-chip caches in the GS320/ES45.

[Figure 4. Dependent-load latency (ns) vs. dataset size (4 KB to 128 MB) for GS1280/1.15 GHz, ES45/1.25 GHz, and GS320/1.22 GHz.]

[Figure 5. GS1280 dependent-load latency for various strides: latency (ns) vs. dataset size (4 KB to 16 MB) and stride (4 bytes to 4 KB).]
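The "dependent-load" latency reported in Figures 4 and 5 is typically measured with a pointer-chasing loop, in which each load's address comes from the previous load, so no overlap can hide the memory latency. Below is a minimal C sketch of such a kernel (our illustration of the technique used by lmbench [9], not the benchmark's actual source; it assumes the region size is a multiple of the stride):

    /* Dependent-load ("pointer chasing") latency kernel: each load's
       address is produced by the previous load, serializing the misses.
       Time the while loop and divide by iters to get ns per load. */
    #include <stdlib.h>

    static void *sink;   /* keeps the compiler from removing the loop */

    void chase(size_t bytes, size_t stride, long iters) {
        size_t n = bytes / sizeof(void *);
        size_t step = stride / sizeof(void *);
        void **ring = malloc(n * sizeof(void *));

        /* Link every stride-th element into a circular chain. */
        for (size_t i = 0; i < n; i += step)
            ring[i] = &ring[(i + step) % n];

        void **p = ring;
        while (iters--)
            p = (void **)*p;   /* the dependent load */

        sink = p;
        free(ring);
    }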
3.2 Memory Bandwidth

The STREAM benchmark [10] measures sustainable memory bandwidth in megabytes per second (MB/s) across four vector kernels: Copy, Scale, Sum, and SAXPY. We show only the results for the Triad kernel in Figure 6 (the other kernels have similar characteristics). This data indicates that the memory bandwidth on the GS1280 is substantially higher than on the previous-generation GS320 and all of the other systems shown.

[Figure 6. McCalpin STREAM (Triad) bandwidth comparison: bandwidth vs. number of CPUs (up to 64).]
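For reference, the Triad kernel whose results appear in Figure 6 is the simple vector operation a(i) = b(i) + q*c(i). A minimal C sketch follows (the official benchmark adds timing, validation, and parallelization for multi-CPU runs):

    /* STREAM Triad kernel: a[i] = b[i] + q * c[i]. STREAM counts
       24 bytes of traffic per iteration (two 8-byte reads, one 8-byte
       write); sustained bandwidth = 24 * STREAM_N / loop time. */
    #define STREAM_N 20000000L   /* ~160 MB per array, far larger than
                                    any cache on these systems */

    static double a[STREAM_N], b[STREAM_N], c[STREAM_N];

    void triad(double q) {
        for (long i = 0; i < STREAM_N; i++)
            a[i] = b[i] + q * c[i];
    }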
[Figure 7. McCalpin STREAM (Triad) bandwidth (GB/s) for 1 and 4 CPUs on GS1280/1.15 GHz, ES45/1.25 GHz, and GS320/1.2 GHz.]

Figure 7 indicates that the GS1280 exhibits not only a 1-CPU advantage in memory bandwidth (due to the high-bandwidth memory controller design provided by the 21364 processor), but also linear scaling in bandwidth as the number of CPUs increases. This is due to the GS1280 memory design, where each CPU has its own local memory, thus avoiding the contention for memory that is the case on the ES45 and GS320, where four CPUs contend for the shared memory. This contention is likely to be even more pronounced as the number of CPUs increases. One such example is the SPEC throughput benchmarks shown in Figure 1.
On average, the GS1280 shows an advantage over both the GS320 and the ES45 in SPECfp2000, and comparable performance in SPECint2000. Note that some benchmarks demonstrate a substantial advantage on the GS1280 over the ES45/GS320. For example, swim shows a 2.3 times advantage on the GS1280 vs. the ES45 and a 4 times advantage vs. the GS320. However, many other benchmarks show comparable performance (e.g., most integer benchmarks). Yet, there are cases where the GS320 and ES45 outperform the GS1280 (e.g., facerec and ammp). In order to better understand the causes of such differences, we generated profiles that show memory controller utilization for all benchmarks (Figures 10 and 11).

[Figure 9. IPC comparison for SPECint2000 (gap, parser, perlbmk, vortex, bzip2, twolf, ...) on GS320/1.22 GHz, ES45/1.25 GHz, and GS1280/1.15 GHz.]
[Figure 10. GS1280 memory controller utilization (%) over time in SPECfp2000 (applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi).]

Figure 12 compares local and remote memory latency on a 16-CPU system. Note that the GS320 has two levels of latency: local (within a set of 4 CPUs called a QBB) and remote (outside that QBB). The GS1280 system has many levels of remote latency, depending on how many hops need to be traversed from source to destination.

Figure 12 indicates that the GS1280 shows a 4 times advantage in average memory latency on 16 CPUs. The advantage is even higher (6.6 times) when Read-Dirty instead of Read-Clean latencies are compared. Note that in the case of Read-Dirty, a cache block is read from another processor's cache rather than from memory.
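The number of distinct remote-latency levels on the GS1280 follows directly from hop distance on the torus. The C sketch below computes the minimal hop count between two CPUs; the 4 x 4 layout and the id-to-coordinate mapping are our assumptions for illustration:

    /* Minimal hop count between two nodes of an NX x NY torus. On a
       4 x 4 (16-CPU) torus the farthest node is 2 + 2 = 4 hops away,
       which is why Figure 12 shows several distinct remote latencies. */
    #define NX 4
    #define NY 4

    static int ring_dist(int a, int b, int n) {
        int d = (b - a + n) % n;          /* distance going one way...  */
        return d < n - d ? d : n - d;     /* ...or wrapping the other   */
    }

    int torus_hops(int cpu_a, int cpu_b) {
        return ring_dist(cpu_a % NX, cpu_b % NX, NX)   /* X dimension */
             + ring_dist(cpu_a / NX, cpu_b / NX, NY);  /* Y dimension */
    }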
[Figure 11. GS1280 memory controller utilization (%) over time in SPECint2000 (gzip, vpr, gcc, mcf, crafty, parser, eon, gap, perlbmk, vortex, bzip2, twolf).]

[Figure 12. Local/remote latency (ns) on 16 CPUs: latency from CPU 0 to each of CPUs 0 through 14, for GS320/1.2 GHz and GS1280/1.15 GHz.]
[Figure 14. Remote memory latency (ns) for 4-64 CPUs on GS1280/1.15 GHz and GS320/1.2 GHz.]

4. Interprocessor Bandwidth

The memory latency in Figure 14 is measured on an idle system with only 2 CPUs exchanging messages. In this section, we evaluate the interprocessor network response as the load increases. This is needed in order to characterize applications that require all CPUs to communicate simultaneously (which is more closely related to the real application environment).

Although interesting from the theoretical point of view, this phenomenon has no implications for the performance of real applications, as we have not observed any applications that operate even close to this point.

4.1 Shuffle Interconnect

We discovered that the performance of an 8-CPU GS1280 configuration could be improved by a simple swap of the cables (which we call a "shuffle") [12]. Figures 16 and 17 show how the connections are changed from the standard "torus" interconnect (Figure 16) to the "shuffle" (Figure 17) in order to improve performance. The redundant North-South connections in an 8-CPU torus are instead used to connect the furthest nodes, creating a shuffle interconnect.
The 2-hop shuffle provides a smaller additional gain (2-5%).

[Figure: Shuffle improvements – latency (ns) vs. bandwidth (MB/sec) for the current, shuffle, and shuffle_2hop interconnects.]

5. Application Performance

[Figure 19. Fluent performance: rating vs. number of CPUs (1-32).]
[Figure: average IP-link utilization (%) vs. number of CPUs.]

In order to explain this advantage, we show the memory and IP-link utilization in Figure 22. Figure 22 shows that memory bandwidth utilization is high in SP (26%). The GS1280 has substantially higher memory bandwidth than the ES45 and GS320 (Figures 6 and 7), hence the advantage in SP. Since the ES45 has higher memory bandwidth than the GS320 (Figure 7), the GS1280 shows an even higher advantage vs. the GS320 than vs. the ES45. The IP-link utilization in these MPI kernels is low (Figure 22). We also observed that IP-link utilization is low in many other MPI applications. The GS1280 provides very high IP-link bandwidth that in many ...

[Figure 22. Memory controller utilization (%) over time in SP.]

[Figure: GUPS memory and IP-link utilization on a 32P GS1280 – memory controller, average North/South, and average East/West link utilization (%) over time.]
6. Memory Striping

The results of our evaluation of memory striping are presented in Figures 25 and 26. Figure 25 shows that striping degrades performance 10-30% in throughput applications due to increased inter-processor traffic. We observed degradation as high as 70% in some applications.

[Figure 25. Degradation from striping in SPECfp_rate2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi).]

[Figure 26. Improvement from striping: latency (ns) vs. bandwidth (MB/sec).]
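Striping interleaves consecutive cache blocks across the memory controllers of multiple nodes, so even data that one CPU allocates and uses privately generates inter-processor traffic. The toy C mapping below contrasts the two placements; the block and node granularities are our assumptions for illustration:

    /* Toy address-to-node mappings contrasting local placement with
       striping. Under striping, consecutive 64-byte blocks rotate
       across the nodes, so one CPU's reference stream touches every
       memory controller over the IP links; this is the extra traffic
       behind the degradations in Figure 25. Granularities assumed. */
    #define BLOCK 64UL    /* cache block size (bytes)         */
    #define NODES  8UL    /* nodes sharing the striped region */

    /* Local placement: the whole region lives on its home node. */
    unsigned long node_local(unsigned long addr, unsigned long home) {
        (void)addr;       /* address does not matter; data stays home */
        return home;
    }

    /* Striped placement: block i of the region lives on node i % NODES. */
    unsigned long node_striped(unsigned long addr) {
        return (addr / BLOCK) % NODES;
    }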
8. Conclusions

We have relied heavily on profiling analysis based on the built-in performance counters (Xmesh) throughout this study. Such tools are crucial for understanding system behavior. We have used profiles to explain why some workloads perform exceptionally well on the GS1280, while others show comparable (or even worse) performance than the GS320 and ES45. In addition, these tools are crucial for identifying areas for improving performance on the GS1280: e.g., Xmesh can detect hot-spots, heavy traffic on the IP links (indicating poor memory locality), etc. Once such bottlenecks are recognized, various techniques can be used to improve performance.

The GS1280 system is the last-generation Alpha server. In our future work, we plan to extend our analysis to non-Alpha-based large-scale multiprocessor platforms. We will also place more emphasis on characterizing real I/O-intensive applications.

Acknowledgments

The author would like to thank Jason Campoli, Peter Gilbert, and Andrew Feld for profiling data collection. Special thanks to Darrel Donaldson for his guidance and help with the GS1280 system, and to Steve Jenkins, Sas Durvasula, and Jack Zemcik for supporting this work.

References

[7] Z. Cvetanovic and D. D. Donaldson, "AlphaServer 4100 Performance Characterization", Digital Technical Journal, Vol. 8, No. 4, 1996, pp. 3-20.

[8] SPEC CPU2000 benchmark results, available at http://www.spec.org/osg/cpu2000/results. SPEC® and SPEC CPU2000® are registered trademarks of the Standard Performance Evaluation Corporation.

[9] Information about lmbench, available at http://www.bitmover.com/lmbench/

[10] STREAM benchmark information, available at http://www.cs.virginia.edu/stream

[11] Z. Cvetanovic, P. Gilbert, and J. Campoli, "Xmesh: a graphical performance monitoring tool for large-scale multiprocessors", HP internal report, 2002.

[12] J. Duato, S. Yalamanchili, and L. Ni, "Interconnection Networks: An Engineering Approach", IEEE Computer Society, Los Alamitos, California.

[13] Fluent data, available at http://www.fluent.com/software/fluent/fl5bench/fullres.htm

[14] NAS Parallel Benchmarks data, available at http://www.nas.nasa.gov/Software/NPB/