Performance Analysis of the Alpha 21364-based HP GS1280 Multiprocessor

Zarka Cvetanovic
Hewlett-Packard Corporation
[email protected]

Abstract
This paper evaluates the performance characteristics of the HP GS1280 shared-memory multiprocessor system. The GS1280 system contains up to 64 Alpha 21364 CPUs connected via a torus-based interconnect. We describe the architectural features of the GS1280 system and compare and contrast the GS1280 to the previous-generation Alpha systems: the AlphaServer GS320 and ES45/SC45. We further quantitatively show the performance effects of these features using application results and profiling data based on the built-in performance counters. We find that the HP GS1280 often provides 2 to 3 times the performance of the AlphaServer GS320 at similar clock frequencies. We find that the key reasons for such performance gains are advances in the memory, inter-processor, and I/O subsystem designs.

1. Introduction
The HP AlphaServer GS1280 is a shared-memory multiprocessor containing up to 64 fourth-generation Alpha 21364 microprocessors [1]. Figure 1 compares the performance of the GS1280 to other systems using SPECfp_rate2000, a standard multiprocessor throughput benchmark [8]. We show the published SPECfp_rate2000 results as of March 2003, with the exception of the 32P GS1280, for which the data was measured on an engineering prototype but has not yet been published. We use floating-point rather than integer SPEC benchmarks for this comparison since several of the floating-point benchmarks stress memory bandwidth, while all integer benchmarks fit well in the MB-size caches and thus are not a good indicator of memory system performance. The results in Figure 1 indicate that the GS1280 scales well in memory-bandwidth-intensive workloads and has a substantial performance advantage over the previous-generation Alpha platforms despite a disadvantage in processor clock frequency. We analyze key performance characteristics of the GS1280 in this paper to expose the design features that allowed the GS1280 to reach such performance levels.

The GS1280 system contains many architectural advances, both in the microprocessor and in the surrounding memory system, that contribute to its performance. The 21364 processor [1][16] uses the same core as the previous-generation 21264 processor [4]. However, the 21364 includes three additional components: (1) an on-chip L2 cache, (2) two on-chip Direct Rambus (RDRAM) memory controllers, and (3) a router. The combination of these components helped achieve improved access time to the L2 cache and to local/remote memory. These improvements enhanced single-CPU performance and contributed to excellent multiprocessor scaling. We describe and analyze these architectural advances and present key results and profiling data to clarify the benefits of these design features. We contrast the GS1280 to two previous-generation Alpha systems, both based on the 21264 processor: the GS320, a 32-CPU SMP NUMA system with a switch-based interconnect [2], and the SC45, 4-CPU ES45 systems connected in a cluster configuration via a fast Quadrics switch [4][5][6][7].

[Figure 1 plots SPECfp_rate2000 (peak) versus number of CPUs (up to 32) for HP GS1280/1.15GHz (1-16P published, 32P estimated), HP SC45/1.25GHz (1-4P published, >4P estimated), HP GS320/1.2GHz (published), IBM p690/650 Turbo/1.3/1.45GHz (published), SGI Altix 3K/1GHz (published), and SUN Fire V480/V880/0.9GHz (published).]
Figure 1. SPECfp_rate2000 comparison.

We include results from kernels that exercise the memory subsystem [9][10]. We include profiling results for standard benchmarks (SPEC CPU2000 [8]). In addition, we analyze characteristics of representatives from three application classes that impose various levels of stress on the memory subsystem and processor interconnect. We use profiles based on the built-in non-intrusive CPU hardware monitors [3]. These monitors are useful tools for analyzing system behavior with various workloads. In addition, we use tools based on the EV7-specific performance counters: Xmesh [11]. Xmesh is a graphical tool that displays run-time information on the utilization of CPUs, memory controllers, inter-processor (IP) links, and I/O ports.

The remainder of this paper is organized as follows: Section 2 describes the architecture of the GS1280 system. Section 3 describes the memory system improvements in the GS1280. Section 4 describes the inter-processor performance characteristics. Section 5 discusses application performance. Section 6 shows tradeoffs associated with memory striping. Section 7 summarizes comparisons. Section 8 concludes.
2. GS1280 System Overview
The Alpha 21364 (EV7) microprocessor [1] shown in Figure 2 integrates the following components on a single chip: (1) a second-level (L2) cache, (2) a router, (3) two memory controllers (Zboxes), and (4) a 21264 (EV68) microprocessor core. The processor frequency is 1.15 GHz. The memory controllers and inter-processor links operate at 767 MHz (data rate). The L2 cache is 1.75 MB in size and 7-way set-associative. The load-to-use L2 cache latency is 12 cycles (10.4 ns). The data path to the cache is 16 bytes wide, resulting in a peak bandwidth of 18.4 GB/s. There are 16 victim buffers from L1 to L2 and from L2 to memory. The two integrated memory controllers connect the processor directly to the RDRAM memory. The peak memory bandwidth is 12.3 GB/s (8 channels, 2 bytes each). Up to 2048 pages can be open simultaneously. An optional 5th channel is provided as a redundant channel. The four inter-processor links are capable of 6.2 GB/s each (2 unidirectional links with 3.1 GB/s each). The I/O chip is connected to the EV7 via a full-duplex link capable of 3.1 GB/s.

[Figure 2 is a chip block diagram showing the L2 tag and data arrays, the L2 cache controller, data buffers, the router with four IP links and an I/O port, the two RDRAM memory controllers, and the EV68 core with its L1 cache, address, and control paths.]
Figure 2. 21364 block diagram.
[Figure 3 shows twelve 21364 nodes, each with local memory and an I/O port, connected in a two-dimensional torus.]
Figure 3. A 12-processor 21364-based multiprocessor.
The router [16] connects multiple 21364s in a two-dimensional, adaptive, torus network (Figure 3). The router connects to 4 links that connect to 4 neighbors in the torus: North, South, East, and West. Each router routes packets arriving from several input ports (L2 cache, Zboxes, I/O, and other routers) to several output ports (L2 cache, Zboxes, I/O, and other routers). To avoid deadlocks in the coherence protocol and the network, the router multiplexes a physical link among several virtual channels. Each input port has two first-level arbiters, called the local arbiters, each of which selects a candidate packet among those waiting at the input port. Each output port has a second-level arbiter, called the global arbiter, which selects a packet from those nominated for it by the local arbiters.

The global directory protocol is a forwarding protocol [16]. There are 3 types of messages: Requests, Forwards, and Responses. A requesting processor sends a Request message to the directory. If the block is local, the directory is updated and a Response is sent back. If the block is in the Exclusive state, a Forward message is sent to the owner of the block, which sends the Response to the requestor and to the directory. If the block is in the Shared state (and the request is to modify the block), Forward/invalidate messages are sent to each of the shared copies, and a Response is sent to the requestor.
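For illustration only, the following C sketch models the Request handling just described for a single directory entry; the state names, fields, and message hooks are simplified assumptions for the sketch and not the actual EV7 structures.

    /* Simplified sketch of the forwarding directory protocol described above.
     * States, fields, and message names are illustrative, not the EV7 design. */
    #include <stdio.h>

    enum dir_state { LOCAL, SHARED, EXCLUSIVE };

    struct dir_entry {
        enum dir_state state;
        int owner;              /* valid when state == EXCLUSIVE          */
        unsigned sharers;       /* bit mask of sharers when state == SHARED */
    };

    /* Hypothetical message hooks: a real router would inject these packets
     * into the appropriate virtual channel. */
    static void send_response(int dest)         { printf("Response      -> CPU %d\n", dest); }
    static void send_forward(int dest, int req) { printf("Forward       -> CPU %d (requestor %d)\n", dest, req); }
    static void send_invalidate(int dest)       { printf("Forward/Inval -> CPU %d\n", dest); }

    /* Handle a Request arriving at the home directory. */
    void handle_request(struct dir_entry *e, int requestor, int wants_modify, int ncpus)
    {
        switch (e->state) {
        case LOCAL:                     /* block is local: update directory, reply */
            send_response(requestor);
            break;
        case EXCLUSIVE:                 /* forward to the owner; the owner responds
                                           to the requestor and to the directory    */
            send_forward(e->owner, requestor);
            break;
        case SHARED:
            if (wants_modify)           /* invalidate every shared copy, then reply */
                for (int cpu = 0; cpu < ncpus; cpu++)
                    if (e->sharers & (1u << cpu))
                        send_invalidate(cpu);
            send_response(requestor);
            break;
        }
        /* Directory state update (owner/sharer bookkeeping) omitted for brevity. */
    }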
To optimize network buffer and link utilization, the 21364 routing protocol uses a minimal adaptive routing algorithm. Only a path with the minimum number of hops from source to destination is used. However, a message can choose the less congested minimal path (adaptive protocol). Both the coherence and adaptive routing protocols can introduce deadlocks in the 21364 network. The coherence protocol can introduce deadlocks due to cyclic dependence between different packet classes. For example, Request packets can fill up the network and prevent Response packets from ever reaching their destinations. The 21364 breaks this cyclic dependence by creating virtual channels for each class of coherence packets and prioritizing the dependence among these classes. By creating separate virtual channels for each class of packets, the router guarantees that each class of packets can be drained independently of the other classes. Thus, a Response packet can never block behind a Request packet. A Request can generate a Block Response, but a Block Response cannot generate a Request.

Adaptive routing can generate two types of deadlocks: intra-dimensional (because the network is a torus, not a mesh) and inter-dimensional (which arises in any square portion of the mesh). The intra-dimensional deadlock is solved with two virtual channels: VC0 and VC1. The inter-dimensional deadlocks are solved by allowing a message to route in one dimension (e.g., East-West) before routing in the next dimension (e.g., North-South) [12]. Additionally, to facilitate adaptive routing, the 21364 provides a separate virtual channel called the Adaptive channel for each class. Any message (other than I/O packets) can route through the Adaptive channel. However, if the Adaptive channels fill up, packets can enter the deadlock-free channels.
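The following C fragment is a highly simplified illustration of the minimal adaptive choice and the escape to the deadlock-free channels described above; the congestion metric, channel names, and fallback rule are assumptions made for the sketch, not the router's actual logic.

    /* Toy model of minimal adaptive routing on a 2D torus: a packet only moves
     * closer to its destination; when both dimensions still need progress it
     * takes the less congested direction via the Adaptive channel, and when the
     * Adaptive channel is full it falls back to the deadlock-free channels,
     * which route East-West before North-South. All names are illustrative. */
    #include <stdio.h>

    enum dir { EAST_WEST, NORTH_SOUTH };
    enum vc  { VC_ADAPTIVE, VC_DETERMINISTIC };

    struct hop { enum dir dir; enum vc vc; };

    /* dx, dy: remaining minimal hops in each dimension.
     * occ_x, occ_y: buffer occupancy of the two candidate output ports.
     * adaptive_full: nonzero when the Adaptive channel cannot accept the packet. */
    struct hop next_hop(int dx, int dy, int occ_x, int occ_y, int adaptive_full)
    {
        struct hop h;
        if (adaptive_full) {                  /* escape path: deterministic,   */
            h.vc  = VC_DETERMINISTIC;         /* dimension-ordered routing     */
            h.dir = (dx != 0) ? EAST_WEST : NORTH_SOUTH;
            return h;
        }
        h.vc = VC_ADAPTIVE;
        if (dx != 0 && dy != 0)               /* both minimal directions open:  */
            h.dir = (occ_x <= occ_y) ? EAST_WEST : NORTH_SOUTH; /* less congested */
        else
            h.dir = (dx != 0) ? EAST_WEST : NORTH_SOUTH;
        return h;
    }

    int main(void)
    {
        struct hop h = next_hop(2, 1, /*occ_x=*/3, /*occ_y=*/1, /*adaptive_full=*/0);
        printf("dir=%s vc=%s\n", h.dir == EAST_WEST ? "E/W" : "N/S",
               h.vc == VC_ADAPTIVE ? "adaptive" : "deterministic");
        return 0;
    }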
The previous-generation GS320 system uses a switch to connect four processors to four memory modules in a single Quad Building Block (QBB) and then a hierarchical switch to connect QBBs into the larger-scale multiprocessor (up to 32 CPUs) [2].

3. Memory Subsystem
In this section we characterize the memory subsystem of the GS1280 and compare it to the previous-generation Alpha platforms. The section includes the analysis of local memory latency, memory bandwidth, single-CPU performance, and remote memory latency.

3.1 Local memory latency for dependent loads
The 21364 processor provides two RDRAM memory controllers with 12.3 GB/s peak memory bandwidth. Each processor can be configured with 0, 1, or 2 memory controllers. The 1.75MB on-chip L2 cache is 7-way set-associative. The L2 cache on ES45 and GS320 is 16MB, off-chip, and direct-mapped.

[Figure 4 plots dependent load latency (ns) versus dataset size (4KB to 128MB) for GS1280/1.15GHz, ES45/1.25GHz, and GS320/1.22GHz.]
Figure 4. Dependent load latency comparison.

Figure 4 compares the dependent-load latency [9]. The dependent-load latency measures the load-to-use latency where each load depends on the result of the previous load. The lower axis varies the referenced data size to fit in different levels of the memory system hierarchy. Data is accessed with a stride of 64 bytes (one cache block). The results in Figure 4 show that the GS1280 has 3.8 times lower dependent-load memory latency (at the 32MB size) than the previous-generation GS320. This indicates that large applications that are not blocked to take advantage of the 16MB cache will run substantially faster on GS1280 than on the 21264-based platforms. For the data range between 1.75MB and 16MB, the latency is higher on GS1280 than on GS320 and ES45, since the block is fetched from memory on GS1280 vs. from the 16MB L2 cache on GS320/ES45. This indicates that applications whose working sets fall in this range are likely to run slower on GS1280 than on the previous-generation platforms. For data sizes between 64KB and 1.75MB, the latency is again much lower on GS1280 than on GS320/ES45. That is because the L2 cache in GS1280 is on-chip, thus providing much lower access time than the off-chip caches in GS320/ES45.

Figure 5 shows the dependent load latency on GS1280 as both dataset size and stride increase. This data indicates that the latency increases from ~80 ns for open-page access to ~130 ns for closed-page access (larger-stride access).

[Figure 5 is a surface plot of GS1280 dependent load latency (0-140 ns) versus dataset size (4KB to 16MB) and stride (4 bytes to 16KB).]
Figure 5. GS1280 dependent load latency for various strides.
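For reference, a dependent-load measurement of the kind plotted in Figures 4 and 5 can be sketched as follows; this is a minimal illustration in the spirit of lmbench [9], not the code used for our measurements, and the working-set size, stride, and iteration count are arbitrary choices.

    /* Minimal pointer-chase sketch: each load depends on the previous one, so
     * the time per iteration approximates the load-to-use latency of the level
     * of the memory hierarchy the working set falls into. Production tools
     * randomize the chain to defeat hardware prefetching; this sketch uses a
     * simple fixed stride for clarity. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        size_t bytes  = 32u << 20;              /* working-set size, e.g. 32 MB   */
        size_t stride = 64;                     /* one cache block, as in Figure 4 */
        size_t n      = bytes / sizeof(void *);
        size_t step   = stride / sizeof(void *);
        void **ring   = malloc(n * sizeof(void *));

        /* Build a circular chain with the chosen stride. */
        for (size_t i = 0; i < n; i++)
            ring[i] = &ring[(i + step) % n];

        void **p = ring;
        size_t iters = 10 * 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            p = (void **)*p;                    /* dependent load: the next address
                                                   comes from the previous load    */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Print p so the compiler cannot optimize the chase away. */
        printf("%.1f ns per dependent load (%p)\n", ns / iters, (void *)p);
        free(ring);
        return 0;
    }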
3.2 Memory Bandwidth
The STREAM benchmark [10] measures sustainable memory bandwidth in megabytes per second (MB/s) across four vector kernels: Copy, Scale, Sum, and Triad (a SAXPY-style update).
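The Triad kernel reported below in Figure 6 is, at its core, the following loop; this is a simplified sketch of the STREAM kernel [10] with illustrative array length and timing, not the benchmark source.

    /* Simplified STREAM Triad sketch: a[i] = b[i] + scalar * c[i].
     * Sustained bandwidth counts the three arrays touched per iteration
     * (two reads plus one write), following the STREAM reporting convention. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (20 * 1000 * 1000)    /* large enough to exceed the caches */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        const double scalar = 3.0;

        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];        /* the Triad kernel */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double mb  = 3.0 * N * sizeof(double) / 1e6;
        printf("Triad: %.0f MB/s (a[0]=%g)\n", mb / sec, a[0]);
        free(a); free(b); free(c);
        return 0;
    }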
[Figure 6 plots McCalpin STREAM Triad bandwidth (GB/s) versus number of CPUs (up to 64) for HP GS1280/1.15GHz (published, >16P estimated), HP GS320/1.2GHz (published), HP SC45/1.25GHz (estimated >4P), and IBM p690 Power4/1.3GHz (published).]
Figure 6. McCalpin STREAM bandwidth comparison.

We show only the results for the Triad kernel in Figure 6 (the other kernels have similar characteristics). This data indicates that the memory bandwidth on GS1280 is substantially higher than on the previous-generation GS320 and all other systems shown.
[Figure 7 shows STREAM Triad bandwidth (GB/s) for 1 and 4 CPUs on GS1280/1.15GHz, ES45/1.25GHz, and GS320/1.2GHz.]
Figure 7. STREAM bandwidth for 1-4 CPUs.

Figure 7 indicates that GS1280 exhibits not only a 1-CPU advantage in memory bandwidth (due to the high-bandwidth memory-controller design provided by the 21364 processor), but also linear scaling in bandwidth as the number of CPUs increases. This is due to the GS1280 memory design, where each CPU has its own local memory, thus avoiding contention for memory between jobs that run simultaneously on several CPUs. This is not the case on ES45 and GS320, where four CPUs contend for the same memory. Therefore, the bandwidth improvement from one to four CPUs on ES45/GS320 is less than linear (as indicated in Figure 7). The data in Figures 6 and 7 indicate that memory-bandwidth-intensive applications will run exceptionally well on GS1280. The advantage is likely to be even more pronounced as the number of CPUs increases. One such example is the SPEC throughput benchmarks shown in Figure 1.

3.3. Single-CPU performance: CPU2000
Figures 8 and 9 compare Instructions-per-Cycle (IPC) for the floating-point (fp) and integer SPEC CPU2000 benchmarks on GS1280 vs. GS320 and ES45 [8].

[Figure 8 shows per-benchmark IPC for SPECfp2000 on GS1280/1.15GHz, ES45/1.25GHz, and GS320/1.22GHz.]
Figure 8. IPC for SPECfp2000.

On average, GS1280 shows an advantage over both GS320 and ES45 in SPECfp2000, and comparable performance in SPECint2000. Note that some benchmarks demonstrate a substantial advantage on GS1280 over ES45/GS320. For example, swim shows a 2.3 times advantage on GS1280 vs. ES45 and a 4 times advantage vs. GS320. However, many other benchmarks show comparable performance (e.g. most integer benchmarks). Yet, there are cases where GS320 and ES45 outperform GS1280 (e.g. facerec and ammp). In order to better understand the causes of such differences, we generated profiles that show memory controller utilization for all benchmarks (Figures 10 and 11).

[Figure 9 shows per-benchmark IPC for SPECint2000 on GS1280/1.15GHz, ES45/1.25GHz, and GS320/1.22GHz.]
Figure 9. IPC for SPECint2000.

Figures 10 and 11 illustrate memory controller utilization profiling histograms for the SPEC CPU2000 benchmarks on GS1280. The profiles are collected using the 21364 built-in performance counters and are shown as a function of the elapsed time for the entire benchmark run. This data indicates that the benchmarks with high memory utilization are the same benchmarks that show a significant advantage on GS1280. Swim is the leader with 53% utilization, followed by applu, lucas, equake, and mgrid (20-30%), and fma3d, art, wupwise, and galgel (10-20%). Interestingly, facerec has only 8% utilization, yet GS1280 has lower IPC than the other systems. That is due to the smaller cache size on GS1280 (1.75MB vs. 16MB on GS320/ES45). Simulation results show that the facerec dataset fits in an 8MB cache, but not in the 1.75MB cache. Therefore, facerec accesses memory on GS1280, while it fetches data mostly from the 16MB cache on GS320 and ES45. Figure 4 illustrates that the cache access on GS320 is faster than the memory access on GS1280.


[Figure 10 plots memory controller utilization (percentage) over time for each SPECfp2000 benchmark on GS1280.]
Figure 10. GS1280 memory controller utilization in SPECfp2000.

[Figure 11 plots memory controller utilization (percentage) over time for each SPECint2000 benchmark on GS1280.]
Figure 11. GS1280 memory controller utilization in SPECint2000.

3.4. Remote memory latency
In Sections 3.1 and 3.2 we contrasted local memory latency and bandwidth on GS1280 and the previous-generation platforms. Local memory characteristics are important for single-CPU workloads and for multiprocessor workloads that fit well in a processor's local memory. However, in order to characterize applications that do not fit well in local memory, we need to understand how local latency compares to remote latency. Figure 12 compares local and remote memory latency on GS320 and GS1280. Latency is measured from CPU0 to all other CPUs in a 16-CPU system. Note that GS320 has two levels of latency: local (within a set of 4 CPUs called a QBB) and remote (outside that QBB). The GS1280 system has many levels of remote latency, depending on how many hops need to be traversed from source to destination.

Figure 12 indicates that GS1280 shows a 4 times advantage in average memory latency on 16 CPUs. The advantage is even higher (6.6 times) when Read-Dirty instead of Read-Clean latencies are compared. Note that in the case of Read-Dirty, a cache block is read from another processor's cache rather than from memory.

[Figure 12 shows latency (ns) from CPU0 to each of CPUs 0 through 15, plus the average, for GS1280/1.15GHz and GS320/1.2GHz.]
Figure 12. Local/remote latency on 16 CPUs.

     83   145   186   154
    139   175   221   182
    181   221   259   222
    154   191   235   195
Figure 13. Remote memory latencies (ns) on GS1280 (each square represents a CPU in a 16-CPU torus).

Figure 13 illustrates the measured latency from node 0 to all other nodes in the 16-CPU GS1280 system (each square is a CPU within a 4x4 torus). The local memory latency of 83 ns increases to 139-154 ns for the 1-hop neighbors. Note that the 1-hop latency is lowest for neighbors on the same module (139 ns) and highest for neighbors connected via a cable (154 ns). The 2-hop latency is 175-195 ns (6 nodes are 2 hops away). The 4-hop latency (the worst case for 16 CPUs) is 259 ns (1 node is 4 hops away).
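The hop counts underlying Figure 13 follow directly from minimal routing on a torus with wraparound in each dimension; the short sketch below (an illustration, not the system's routing code) reproduces the 1-, 2-, 3-, and 4-hop breakdown for a 4x4 configuration.

    /* Minimal hop count from node (0,0) to every node of an X-by-Y torus: in
     * each dimension the shorter of the direct and wraparound paths is taken.
     * For a 4x4 torus this yields four 1-hop, six 2-hop, four 3-hop, and one
     * 4-hop destination, matching the latency bands discussed with Figure 13. */
    #include <stdio.h>

    static int torus_dist(int d, int size)
    {
        return d < size - d ? d : size - d;     /* direct vs. wraparound */
    }

    int main(void)
    {
        const int X = 4, Y = 4;
        int count[9] = {0};

        for (int x = 0; x < X; x++)
            for (int y = 0; y < Y; y++)
                count[torus_dist(x, X) + torus_dist(y, Y)]++;

        for (int h = 0; h <= 4; h++)
            printf("%d node(s) at %d hop(s) from node 0\n", count[h], h);
        return 0;
    }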
Figure 14 shows the average load-to-use latency as the number of CPUs increases. Figures 12 and 14 show that GS1280 has a significant advantage over GS320 not only in local, but also in remote memory latency. This data indicates that applications that are not structured to fit well within a processor's local memory will run much more efficiently on GS1280 than on GS320. In addition, this advantage will be even more pronounced in applications that require a high amount of data sharing (parallel workloads), due to the efficient Read-Dirty implementation in GS1280.

[Figure 14 plots average load-to-use latency (ns) for GS1280/1.15GHz and GS320/1.2GHz at 4, 8, 16, 32, and 64 CPUs.]
Figure 14. Remote memory latency for 4-64 CPUs.

4. Interprocessor Bandwidth
The memory latency in Figure 14 is measured on an idle system with only 2 CPUs exchanging messages. In this section, we evaluate the interprocessor network response as the load increases. This is needed in order to characterize applications that require all CPUs to communicate simultaneously (more closely related to the real application environment).

Figure 15 compares bandwidth under increasing load on GS1280 and GS320. Each CPU randomly selects another CPU to send a Read request to. The test starts with a single outstanding load (leftmost point). For each additional point, one outstanding load is added (up to 30 outstanding requests). In the ideal case, bandwidth increases (the curve moves to the right) while latency does not change (the line stays low and flat).

[Figure 15 plots latency (ns) versus delivered bandwidth (MB/s) for GS1280 with 16, 32, and 64 CPUs and for GS320 with 16 and 32 CPUs, with up to 30 outstanding memory references per CPU.]
Figure 15. Load test comparison.

Figure 15 indicates that GS1280 shows an increase in latency, but it is not nearly as high as on GS320. GS1280 is much more resilient to the load: bandwidth increases with a much smaller latency increase. This is an important system feature for applications that require substantial inter-processor (IP) bandwidth. Figure 15 also indicates another interesting phenomenon: as the load is increased beyond the saturation point, the delivered bandwidth starts to decrease. Although interesting from the theoretical point of view, this phenomenon has no implications for the performance of real applications, as we have not observed any applications that operate even close to this point.

4.1 Shuffle Interconnect
We discovered that the performance of an 8-CPU GS1280 configuration could be improved by a simple swap of the cables (which we call "shuffle") [12]. Figures 16 and 17 show how the connections are changed from the standard "torus" interconnect (Figure 16) to the "shuffle" (Figure 17) in order to improve performance. The redundant North-South connections in an 8-CPU torus are used to connect the furthest nodes to create a shuffle interconnect.

[Figures 16 and 17 show an 8-CPU torus and the corresponding shuffle interconnect.]
Figure 16. Torus. Figure 17. Shuffle.

Table 1 shows the performance improvement from shuffle vs. torus using a simple analytical model. Note that shuffle is more beneficial in rectangular than in square-shaped interconnects (bisection width and worst-case latency). The benefit in average latency increases as the system size grows.

Table 1: Performance gains from shuffle.
             aver. latency   worst latency   bisection width
    4x2      1.200           1.500           2.000
    4x4      1.067           1.333           1.000
    8x4      1.171           1.500           2.000
    8x8      1.185           1.333           1.000
    16x8     1.371           1.500           2.000
    16x16    1.454           1.778           1.000
Figure 18 shows the performance gains from shuffle measured on an 8-CPU GS1280 prototype. We experimented with two shuffle routing approaches: (1) shuffle with 1 hop, where shuffle links are used as the initial (and only) hop, and (2) shuffle with 2 hops, where shuffle links are used for 1 and 2 hops (e.g. we use shuffle links to alleviate load on horizontal links). The performance data indicates that the 1-hop shuffle provides between 5% and 25% performance gain (depending on network load) vs. torus. The 2-hop shuffle provides a smaller additional gain (2-5%).

[Figure 18 plots latency (ns) versus bandwidth (MB/s) for the current torus routing, 1-hop shuffle, and 2-hop shuffle on an 8-CPU GS1280.]
Figure 18. Performance Improvement from Shuffle.

5. Application performance
In this section, we compare GS1280 to the other systems using 3 types of applications: (1) CPU-intensive applications that do not stress either the memory controllers or the inter-processor (IP) links, (2) memory-bandwidth-intensive applications that stress memory bandwidth, but not the IP links (many MPI applications belong to this category), and (3) applications that stress both the IP links and memory bandwidth. An example application representing each class is included. The utilization of memory controllers and IP links is measured using the 21364 built-in performance counters [11].

5.1. CPU-intensive application: Fluent (CFD)
Fluent is a standard Computational Fluid Dynamics application [13]. For comparison, we selected a large case (l1) that models flow around a fighter aircraft (Figure 19). The results in Figure 19 are published as of March 2003. This data indicates that GS1280 shows comparable performance to ES45. Examining Figure 20 with the measured utilization shows that the reason is that this application does not put significant stress on either memory controller or IP-link bandwidth. The large 16MB cache in ES45 often provides an advantage in applications that can be blocked for cache re-use (such as Fluent).

[Figure 19 plots the FLUENT 6 fl5l1 rating versus number of CPUs for HP GS1280/1.15GHz, HP SC45/1.25GHz, HP GS320/1.22GHz, SUN SUNFIRE6800 USIII/900MHz, IBM pSeries 690 Turbo/Power4/1.3GHz, and Dell PowerEdge 2650 Xeon/2.4GHz.]
Figure 19. Fluent Performance.

[Figure 20 plots average memory controller and IP-link utilization (percentage) over time during the Fluent run.]
Figure 20. Memory and IP-links utilization in Fluent.

5.2. Memory-Bandwidth Intensive application: NAS Parallel
The NAS Parallel benchmarks represent a collection of kernels that are important in many technical applications [14]. The kernels are decomposed using MPI and can run on either shared-memory or cluster systems. With the exception of EP (embarrassingly parallel), the majority of these kernels (solvers, FFT, grid, integer sort) put significant stress on memory bandwidth (when size C is used). Note that although these are small kernels, they provide the same level of stress on the memory subsystem as many large real applications.
[Figure 21 plots NAS Parallel SP performance (MOPS) versus number of CPUs for HP GS1280/1.15GHz, HP SC45/1.25GHz, and HP GS320/1.2GHz.]
Figure 21. SP Performance comparison.

Figure 21 compares GS1280 performance to the previous-generation Alpha platforms on the SP solver. This data shows a substantial advantage on GS1280 compared to the other systems.

[Figure 22 plots average memory controller and IP-link utilization (percentage) over time during the SP run.]
Figure 22. Memory Controller utilization in SP.

In order to explain this advantage, we show the memory and IP-link utilization in Figure 22. Figure 22 shows that memory bandwidth utilization is high in SP (26%). GS1280 has substantially higher memory bandwidth than ES45 and GS320 (Figures 6 and 7), hence the advantage in SP. Since ES45 has higher memory bandwidth than GS320 (Figure 7), GS1280 shows an even higher advantage vs. GS320 than vs. ES45. The IP-link utilization in these MPI kernels is low (Figure 22). We also observed that IP-link utilization is low in many other MPI applications. The GS1280 provides very high IP-link bandwidth that in many cases exceeds the needs of MPI applications (many of which are designed for cluster interconnects with much lower bandwidth requirements).

5.3. IP bandwidth intensive application: GUPS
GUPS is a multithreaded (OpenMP) application where each thread updates an item randomly picked from a large table [15]. Since the table is so large that it spans the entire memory in the system, this application puts substantial stress on the IP-link bandwidth (Figure 24). In this application, GS1280 shows the most substantial advantage over the other systems, as shown in Figure 23. This is because this application exploits the substantial IP-link bandwidth advantage of GS1280, as discussed in Section 4 (Figure 15). It is also interesting that the links show uneven utilization in Figure 24: East/West links show higher utilization than North/South links. This is because the link utilization is higher on horizontal than on vertical links in a 4x8 torus. This is also the reason for the bend in performance at 32 CPUs: the cross-sectional bandwidth is comparable in both the 16P and 32P torus configurations.

[Figure 23 plots GUPS performance (Mupdates/s) versus number of CPUs (up to 64) for GS1280/1.15GHz, GS320/1.2GHz, and ES45/1.25GHz.]
Figure 23. GUPS Performance comparison.

[Figure 24 plots memory controller, average North/South IP-link, and average East/West IP-link utilization (percentage) over time during the GUPS run on a 32P GS1280.]
Figure 24. Memory and IP-link utilization in GUPS.
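For reference, the core of a GUPS-style update loop of the kind described above can be sketched as follows; this is an OpenMP illustration with assumed table size, update count, and index hash, not the benchmark source [15].

    /* GUPS-style sketch: each thread applies updates to randomly chosen entries
     * of one large shared table. With the table spread across all CPUs' local
     * memories, most updates touch remote memory and stress the IP links.
     * Table size, update count, and the PRNG are illustrative choices; like the
     * real benchmark, concurrent updates may occasionally collide. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define LOG2_TABLE 26                       /* 2^26 words = 512 MB for the demo;
                                                   real runs span all system memory  */
    #define TABLE_SIZE (1ul << LOG2_TABLE)
    #define UPDATES    (4ul * TABLE_SIZE)

    int main(void)
    {
        unsigned long *table = malloc(TABLE_SIZE * sizeof *table);
        for (unsigned long i = 0; i < TABLE_SIZE; i++)
            table[i] = i;

        double t0 = omp_get_wtime();
        #pragma omp parallel
        {
            unsigned long x = 0x9E3779B97F4A7C15ul * (omp_get_thread_num() + 1);
            #pragma omp for
            for (unsigned long i = 0; i < UPDATES; i++) {
                x ^= x << 13; x ^= x >> 7; x ^= x << 17;    /* xorshift PRNG       */
                table[x & (TABLE_SIZE - 1)] ^= x;           /* random table update */
            }
        }
        double sec = omp_get_wtime() - t0;
        printf("%.1f Mupdates/s\n", UPDATES / sec / 1e6);
        free(table);
        return 0;
    }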
6. Memory Striping
Memory striping interleaves 4 consecutive cache lines across two CPUs, starting with CPU0/controller0, then CPU0/controller1, then CPU1/controller0, and finally CPU1/controller1. The CPUs chosen to participate in striping are the closest neighbors (CPUs on the same module). Striping provides a performance benefit by alleviating hot spots, since hot-spot traffic is spread across 2 CPUs (instead of one). The disadvantage of memory striping is that it puts an additional burden on the IP links between the pairs of CPUs.
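The following toy mapping illustrates the 4-line interleaving pattern just described; the function, granularity, and CPU/controller numbering are assumptions made for the sketch, not the actual EV7 address decode.

    /* Illustration of the striping pattern described above: consecutive 64-byte
     * cache lines rotate through CPU0/controller0, CPU0/controller1,
     * CPU1/controller0, CPU1/controller1. This is a toy model of the mapping,
     * not the hardware's address decoding logic. */
    #include <stdio.h>
    #include <stdint.h>

    #define CACHE_LINE 64u

    struct target { unsigned cpu; unsigned controller; };

    static struct target stripe_target(uint64_t addr)
    {
        uint64_t line = addr / CACHE_LINE;       /* cache-line index                 */
        unsigned slot = line & 3;                /* position within the 4-line stripe */
        struct target t = { slot >> 1, slot & 1 };  /* 0:cpu0/z0 1:cpu0/z1 2:cpu1/z0 3:cpu1/z1 */
        return t;
    }

    int main(void)
    {
        for (uint64_t addr = 0; addr < 8 * CACHE_LINE; addr += CACHE_LINE) {
            struct target t = stripe_target(addr);
            printf("line at 0x%03llx -> CPU%u / controller %u\n",
                   (unsigned long long)addr, t.cpu, t.controller);
        }
        return 0;
    }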

The results of our evaluation of memory striping are presented in Figures 25 and 26. Figure 25 shows that striping degrades performance by 10-30% in throughput applications due to the increased inter-processor traffic. We observed degradation as high as 70% in some applications. A more extensive study over a variety of applications indicated that only a small portion of applications benefit from striping, while most others lose performance.

[Figure 25 shows the per-benchmark performance degradation from striping (0-35%) for SPECfp_rate2000.]
Figure 25. Degradation from striping.

Figure 26 shows that striping improves the performance of a hot-spot traffic pattern (all CPUs read data from CPU0) by up to 80%. We use the Xmesh tool, based on the built-in performance counters, to recognize the hot-spot traffic (Figure 27). The tool indicates that the IP-link and memory traffic on the links to/from CPU0 (left corner) is higher than on any other CPU, and that the Zbox utilization on that CPU is 53% (much higher than on any other CPU). We observed a 30% improvement in real applications that generate hot-spot traffic.

[Figure 26 plots latency (ns) versus bandwidth (MB/s) for the hot-spot pattern with striped and non-striped memory.]
Figure 26. Improvement from striping.

[Figure 27 shows an Xmesh screen capture of the hot-spot experiment, with elevated IP-link and Zbox utilization at CPU0.]
Figure 27. Xmesh with a hot-spot.
7. Summary Comparisons
In Section 5 we analyzed the performance of representatives of three application classes. In this section, we compare GS1280 to GS320 across a wider range of applications (Figure 28). The data in Figure 28 is shown as the ratio of GS1280 improvement vs. GS320. The data is grouped into the following categories: system components (CPU, memory, inter-processor, I/O), integer and commercial benchmarks (SPECint_rate2000, SAP Transaction Processing, Decision Support [8][17]), HPTC standard benchmarks (SPECfp_rate2000, NAS Parallel, and SPEComp2001 [8][14]), HPTC applications (CFD, chemistry, weather prediction, structural modeling) [13][18][19][20][21], and two cases where GS1280 shows the highest improvement (GUPS and swim) [8][15].

[Figure 28 shows GS1280/1.15GHz vs. GS320/1.2GHz performance ratios (roughly 1 to 11) for: CPU speed, memory copy bandwidth (1P, 32P), local and dirty-remote memory latency, inter-processor bandwidth (32P), I/O bandwidth (32P), SPECint_rate2000 (16P), SAP SD Transaction Processing (32P), Decision Support (32P), NAS Parallel (16P), SPECfp_rate2000 (16P), SPEComp2001 (16P), Nastran xlem (4P), Fluent (32P), StarCD (32P), Dyna/Neon crash (16P), MM5 weather (32P), NWchem SiOSi3 (32P), Gaussian98 chemistry (32P), GUPS (32P), and swim from SPEComp2001 (32P).]
Figure 28. GS1280 vs. GS320 summary comparisons.

The data in Figure 28 indicates that GS1280 shows the most significant improvement in IP bandwidth (over 10 times), and in I/O and memory bandwidth (8 times). The application comparisons show that although GS1280 and GS320 have comparable processor clock speeds, the majority of applications run faster on GS1280 than on GS320. The exceptions are the small integer benchmarks (SPECint2000) that fit well in the on-chip caches. The commercial workloads show a 1.3-1.6 times advantage on GS1280 vs. GS320. The standard HPTC benchmarks (SPEC and NAS Parallel) show a 1.7-2.6 times gain, mainly due to the memory-bandwidth advantage of GS1280. The GS1280 advantage in ISV applications ranges from 1.2 to 2.1 times. The swim and GUPS applications benefit from the memory and IP-link bandwidth advantage of GS1280.

This data indicates that the designs of the memory interface, I/O subsystem, and interprocessor interconnect have a profound effect on application performance. Often, the emphasis is placed on processor design, and other system components take lower priority. Our study indicates that the key factor for achieving high application performance is a balanced design that includes not only a high-performance processor, but also matching high-performance designs of the memory, interconnect, and I/O subsystems.

8. Conclusions
We evaluated the architecture and performance characteristics of the HP AlphaServer GS1280, based on the Alpha 21364 processor. The Alpha 21364 represents a substantial departure in processor design compared to the previous-generation Alpha processors. It incorporates (1) an on-chip cache (smaller than the off-chip cache in the previous-generation 21264), (2) two memory controllers that provide exceptional memory bandwidth, and (3) a router that allows an efficient glue-less large-scale multiprocessor design. The 21364 processor places on a single chip all components that previously required an entire CPU module.

The results from our analysis show that this is a superior design for building large-scale multiprocessors. The exceptional memory bandwidth that GS1280 provides is important for applications that cannot be structured to allow for cache reuse. We observed a 2-4 times advantage of GS1280 vs. the previous-generation AlphaServer GS320 in this type of application (e.g. NAS Parallel). The low latency and exceptional bandwidth of the IP links allow very good scaling in applications that cannot be blocked to fit in the local memory of each processor. We observed an even higher advantage of GS1280 vs. the previous-generation AlphaServer GS320 in this type of application (e.g. over 10 times in GUPS). Since the Alpha 21364 preserved the same core as the previous-generation Alpha 21264 (and the CPU clock speeds are comparable), applications that are blocked to fit well in the on-chip caches perform comparably on GS1280 and GS320 (e.g. SPECint2000). Some applications take advantage of the large 16MB cache, and therefore run faster on GS320 than on GS1280 (e.g. facerec from SPECfp2000). However, most applications benefit from the GS1280 design, indicating that the architecture of the memory interface, interprocessor interconnect, and I/O subsystem is as important as the processor design.
We proposed a simple change in routing (called shuffle) that provides substantial performance improvements on an 8-CPU torus interconnect. We also determined that striping memory across two processors is beneficial only in applications that generate hot-spot traffic, while it was detrimental for the majority of applications due to the increased nearest-neighbor traffic.

We have relied heavily on profiling analysis based on the built-in performance counters (Xmesh) throughout this study. Such tools are crucial for understanding system behavior. We have used profiles to explain why some workloads perform exceptionally well on GS1280, while others show comparable (or even worse) performance than GS320 and ES45. In addition, these tools are crucial for identifying areas for improving performance on GS1280: e.g. Xmesh can detect hot-spots, heavy traffic on the IP links (indicating poor memory locality), etc. Once such bottlenecks are recognized, various techniques can be used to improve performance.

The GS1280 system is the last-generation Alpha server. In our future work, we plan to extend our analysis to non-Alpha-based large-scale multiprocessor platforms. We will also place more emphasis on characterizing real I/O-intensive applications.

Acknowledgments
The author would like to thank Jason Campoli, Peter Gilbert, and Andrew Feld for profiling data collection. Special thanks to Darrel Donaldson for his guidance and help with the GS1280 system and to Steve Jenkins, Sas Durvasula, and Jack Zemcik for supporting this work.

References
[1] Peter Bannon, "EV7", Microprocessor Forum, Oct. 2001.
[2] K. Gharachorloo, M. Sharma, S. Steely, S. Van Doren, "Architecture and Design of AlphaServer GS320", ASPLOS 2000.
[3] DCPI and ProfileMe External Release page: http://www.research.digital.com/SRC/dcpi/release.html
[4] Z. Cvetanovic and R. Kessler, "Performance Analysis of the Alpha 21264-based Compaq ES40 System", The 27th Annual International Symposium on Computer Architecture, June 10-14, 2000, pp. 60-70.
[5] Z. Cvetanovic and D. Bhandarkar, "Characterization of Alpha AXP Performance Using TP and SPEC Workloads", The 21st Annual International Symposium on Computer Architecture, April 1994, pp. 60-70.
[6] Z. Cvetanovic and D. Bhandarkar, "Performance Characterization of the Alpha 21164 Microprocessor Using TP and SPEC Workloads", The Second International Symposium on High-Performance Computer Architecture, February 1996, pp. 270-280.
[7] Z. Cvetanovic and D. D. Donaldson, "AlphaServer 4100 Performance Characterization", Digital Technical Journal, Vol. 8, No. 4, 1996, pp. 3-20.
[8] SPEC CPU2000 benchmark results available at http://www.spec.org/osg/cpu2000/results. SPEC and SPEC CPU2000 are registered trademarks of the Standard Performance Evaluation Corporation.
[9] Information about lmbench available at http://www.bitmover.com/lmbench/
[10] The STREAM benchmark information available at http://www.cs.virginia.edu/stream
[11] Z. Cvetanovic, P. Gilbert, J. Campoli, "Xmesh: a graphical performance monitoring tool for large-scale multiprocessors", HP internal report, 2002.
[12] J. Duato, S. Yalamanchili, L. Ni, "Interconnection Networks: An Engineering Approach", IEEE Computer Society, Los Alamitos, California.
[13] Fluent data available at http://www.fluent.com/software/fluent/fl5bench/fullres.htm
[14] NAS Parallel data available at http://www.nas.nasa.gov/Software/NPB/
[15] GUPS information available at http://iram.cs.berkeley.edu/~brg/dis/gups/
[16] "EV7 System Design Specification", internal HP document, August 2002.
[17] SAP benchmark results available at http://www.sap.com/benchmark/index.asp?content=http://www.sap.com/benchmark/sd2tier.asp
[18] StarCD data available at http://www.cd-adapco.com/support/bench/315/aclass.htm
[19] LS-Dyna information available at http://www.arup.com/dyna/applications/crash/crash.htm
[20] NWchem data available at http://www.emsl.pnl.gov:2080/docs/nwchem/nwchem.html
[21] MM5 data available at http://www.mmm.ucar.edu/mm5/mm5-home.html
