Distributed Shared Memory
Chapter Review
o Introduction
o What is Shared Memory
o Bus-Based Multiprocessors
o Ring Based Multiprocessors
o Switched Multiprocessors
o NUMA
o Comparison of Shared Memory Systems
Bus-Based Multiprocessors
A bus-based multiprocessor consists of some number of CPUs all connected to a common bus, along with a memory module. A simple configuration is to have a high-speed backplane or motherboard into which CPU or memory cards can be inserted. A typical bus has 32 or 64 address lines, 32 or 64 data lines, and perhaps 32 or more control lines, all of which operate in parallel. The problem with this scheme is that with as few as 4 or 5 CPUs, the bus will usually be overloaded and performance will drop drastically. The solution is to add high-speed cache memory between the CPUs and the bus, as shown in Fig. 1-1. The cache holds the most recently accessed words. All memory requests go through the cache. If the word requested is in the cache, the cache itself responds to the CPU and no bus request is made. If the cache is large enough, the probability of success, called the hit rate, will be high and the amount of bus traffic per CPU will drop dramatically, allowing many more CPUs in the system.

However, the introduction of caches also brings a serious problem with it. Suppose that two CPUs, A and B, each read the same word into their respective caches. Then A overwrites the word. When B next reads that word, it gets the old value from its cache, not the value A just wrote. The memory is now incoherent, and the system is difficult to program. One solution is to combine a write-through cache with bus snooping. In a write-through cache, cache memories are designed so that whenever a word is written to the cache, it is written through to memory as well. In addition, all caches constantly monitor the bus. Whenever a cache sees a write occurring to a memory address present in its cache, it either removes that entry from its cache or updates the cache entry with the new value. Such a cache is called a snoopy cache, because it is always snooping on the bus.

One particularly simple and common protocol is called write through. When a CPU first reads a word from memory, that word is fetched over the bus and is stored in the cache of the CPU making the request. If that word is needed again later, the CPU can take it from the cache without making a memory request, thus reducing bus traffic. These two cases, read miss (word not cached) and read hit (word cached), are shown in Fig. 1-2 as the first two lines of the table. In simple systems, only the word requested is cached, but in most, a block of, say, 16 or 32 words is transferred and cached on the initial access and kept for possible future use.
Event      | Action taken by a cache in response to its own CPU's operation | Action taken by a cache in response to a remote CPU's operation
Read miss  | Fetch data from memory and store in cache                      | (No action)

Fig. 1-2. The write-through cache consistency protocol. The entries for hit in the third column mean that the snooping CPU has the word in its cache, not that the requesting CPU has it.
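To make the write-through protocol of Fig. 1-2 concrete, here is a minimal software sketch of caches snooping on a shared bus. The class names and the word-granularity caching are assumptions made only for illustration; real controllers work on blocks and entirely in hardware.

    # Minimal sketch of write-through caches with bus snooping (illustration only).

    class Memory:
        def __init__(self):
            self.words = {}                     # address -> value

    class Bus:
        def __init__(self):
            self.caches = []

        def attach(self, cache):
            self.caches.append(cache)

        def broadcast_write(self, writer, addr, value):
            # Every other cache snoops the write and updates (or could invalidate) its copy.
            for cache in self.caches:
                if cache is not writer:
                    cache.snoop_write(addr, value)

    class SnoopyCache:
        def __init__(self, memory, bus):
            self.memory, self.bus = memory, bus
            self.entries = {}                   # address -> cached value
            bus.attach(self)

        def read(self, addr):
            if addr in self.entries:            # read hit: no bus traffic
                return self.entries[addr]
            value = self.memory.words.get(addr, 0)   # read miss: fetch over the bus
            self.entries[addr] = value
            return value

        def write(self, addr, value):
            self.entries[addr] = value
            self.memory.words[addr] = value     # write through to memory
            self.bus.broadcast_write(self, addr, value)

        def snoop_write(self, addr, value):
            if addr in self.entries:
                self.entries[addr] = value

    # The incoherence scenario from the text no longer arises:
    mem, bus = Memory(), Bus()
    A, B = SnoopyCache(mem, bus), SnoopyCache(mem, bus)
    B.read(0x10)                # B caches the word
    A.write(0x10, 42)           # A overwrites it; B's copy is updated by snooping
    assert B.read(0x10) == 42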
Our protocol manages cache blocks, each of which can be in one of the following three states:

1. INVALID — This cache block does not contain valid data.

2. CLEAN — Memory is up-to-date; the block may be in other caches.

3. DIRTY — Memory is incorrect; no other cache holds the block.

The basic idea is that a word that is being read by multiple CPUs is allowed to be present in all their caches. A word that is being heavily written by only one machine is kept in its cache and not written back to memory on every write, to reduce bus traffic.

The operation of the protocol can best be illustrated by an example. For simplicity in this example, we will assume that each cache block consists of a single word. Initially, B has a cached copy of the word at address W, as illustrated in Fig. 1-3(a). The value is W1. The memory also has a valid copy. In Fig. 1-3(b), A requests and gets a copy of W from the memory. Although B sees the read request go by, it does not respond to it.

Now A writes a new value, W2, to W. B sees the write request and responds by invalidating its cache entry. A's state is changed to DIRTY, as shown in Fig. 1-3(c). The DIRTY state means that A has the only cached copy of W and that memory is out-of-date for W.

At this point, A overwrites the word again, as shown in Fig. 1-3(d). The write is done locally, in the cache, with no bus traffic. All subsequent writes also avoid updating memory.

Sooner or later, some other CPU, C in Fig. 1-3(e), accesses the word. A sees the request on the bus and asserts a signal that inhibits memory from responding. Instead, A provides the needed word and invalidates its own entry. C sees that the word is coming from another cache, not from memory, and that it is in DIRTY state, so it marks the entry accordingly. C is now the owner, which means that it can now read and write the word without making bus requests. However, it also has the responsibility of watching out for other CPUs that request the word and servicing them itself. The word remains in DIRTY state until it is purged from the cache it is currently residing in for lack of space. At that time, it disappears from all caches and is written back to memory.
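The example of Fig. 1-3 can be replayed in a few lines of code. The sketch below is a simplified model of the three-state ownership protocol (single-word blocks, no write-back of a purged block), with invented names; it only traces the state changes described above.

    # Replaying the ownership protocol of Fig. 1-3 in software (single-word blocks).

    INVALID, CLEAN, DIRTY = "INVALID", "CLEAN", "DIRTY"

    class Bus:
        def __init__(self, memory_value):
            self.memory, self.caches = memory_value, []

        def fetch(self, requester):
            for c in self.caches:                    # a DIRTY cache supplies the word
                if c is not requester and c.state == DIRTY:
                    value = c.value
                    c.state = INVALID                # ownership moves to the requester
                    return value, True
            return self.memory, False                # otherwise memory responds

        def invalidate_others(self, writer):
            for c in self.caches:
                if c is not writer and c.state != INVALID:
                    c.state = INVALID

    class Cache:
        def __init__(self, bus):
            self.bus, self.state, self.value = bus, INVALID, None
            bus.caches.append(self)

        def read(self):
            if self.state == INVALID:                # read miss: use the bus
                self.value, from_cache = self.bus.fetch(self)
                self.state = DIRTY if from_cache else CLEAN
            return self.value                        # read hit: no bus traffic

        def write(self, value):
            if self.state != DIRTY:                  # first write: invalidate other copies
                self.bus.invalidate_others(self)
                self.state = DIRTY
            self.value = value                       # later writes stay purely local

    bus = Bus(memory_value="W1")
    A, B, C = Cache(bus), Cache(bus), Cache(bus)
    B.read()              # (a) B holds a CLEAN copy of W1
    A.read()              # (b) A also gets a CLEAN copy
    A.write("W2")         # (c) B is invalidated; A becomes DIRTY
    A.write("W3")         # (d) purely local write, no bus traffic
    print(C.read())       # (e) C gets W3 from A's cache and becomes the owner -> W3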
Many small multiprocessors use a cache consistency protocol similar to this one, often with small
variations. It has three important properties:
Ring-Based Multiprocessors
The next step along the path toward distributed shared memory systems is the ring-based multiprocessor, exemplified by Memnet (Delp, 1988; Delp et al., 1991; and Tam et al., 1990). In Memnet, a single address space is divided into a private part and a shared part. The private part is divided up into regions so that each machine has a piece for its stacks and other unshared data and code. The shared part is common to all machines (and distributed over them) and is kept consistent by a hardware protocol roughly similar to those used on bus-based multiprocessors. Shared memory is divided into 32-byte blocks, which is the unit in which transfers between machines take place.

All the machines in Memnet are connected together in a modified token-passing ring. The ring consists of 20 parallel wires, which together allow 16 data bits and 4 control bits to be sent every 100 nsec, for a data rate of 160 Mbps. The ring is illustrated in Fig. 1-4(a). The ring interface, MMU (Memory Management Unit), cache, and part of the memory are integrated together in the Memnet device, which is shown in the top third of Fig. 1-4(b).
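The quoted 160-Mbps figure follows directly from the ring parameters (16 data bits transferred every 100 nsec); the short check below just reproduces that arithmetic.

    # Checking the Memnet ring data rate: 16 data bits every 100 nsec.
    data_bits_per_transfer = 16
    transfer_time_s = 100e-9
    print(data_bits_per_transfer / transfer_time_s / 1e6, "Mbps")   # -> 160.0 Mbps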
Fig. 1-4. (a) The Memnet ring. (b) A single machine. (c) The block table.
Unlike the bus-based multiprocessors of Fig. 1-1, in Memnet there is no centralized global memory. Instead, each 32-byte block in the shared address space has a home machine on which physical memory is always reserved for it, in the Home memory field of Fig. 1-4(b). A block may be cached on a machine other than its home machine. (The cache and home memory areas share the same buffer pool, but since they are used slightly differently, we treat them here as separate entities.) A read-only block may be present on multiple machines; a read-write block may be present on only one machine. In both cases, a block need not be present on its home machine. All the home machine does is provide a guaranteed place to store the block if no other machine wants to cache it. This feature is needed because there is no global memory. In effect, the global memory has been spread out over all the machines.

The Memnet device on each machine contains a table, shown in Fig. 1-4(c), which contains an entry for each block in the shared address space, indexed by block number. Each entry contains a Valid bit telling whether the block is present in the cache and up to date.
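The per-machine block table of Fig. 1-4(c) can be pictured as an array indexed by block number. Only the Valid bit and the notion of a home machine are taken from the text; the other fields below are hypothetical, added just to make the sketch self-contained.

    # Hypothetical sketch of a Memnet block table: one entry per 32-byte shared block.

    BLOCK_SIZE = 32                      # bytes; the unit of transfer between machines

    class BlockTableEntry:
        def __init__(self):
            self.valid = False           # is the block currently present on this machine?
            self.home = False            # does this machine reserve storage for the block?
            self.location = None         # local buffer slot holding the block, if present

    def block_number(shared_address):
        """Map an address in the shared part of the address space to a table index."""
        return shared_address // BLOCK_SIZE

    table = [BlockTableEntry() for _ in range(1024)]    # small example table
    entry = table[block_number(0x00A0)]                 # address 0x00A0 -> block 5
    entry.valid, entry.location = True, 17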
Switched Multiprocessors
Although bus-based multiprocessors and ring-based multiprocessors work fine for small systems (up to around 64 CPUs), they do not scale well to systems with hundreds or thousands of CPUs. As CPUs are added, at some point the bus or ring bandwidth saturates. Adding additional CPUs does not improve the system performance.

Two approaches can be taken to attack the problem of not enough bandwidth:

1. Reduce the amount of communication.

2. Increase the communication capacity.

We have already seen an example of an attempt to reduce the amount of communication by using caching. Additional work in this area might center on improving the caching protocol, optimizing the block size, reorganizing the program to increase locality of memory references, and so on.

Nevertheless, eventually there comes a time when every trick in the book has been used, but the insatiable designers still want to add more CPUs and there is no bus bandwidth left. The only way out is to add more bus bandwidth. One approach is to change the topology, going, for example, from one bus to two buses or to a tree or grid. By changing the topology of the interconnection network, it is possible to add additional communication capacity.

A different method is to build the system as a hierarchy. Continue to put some number of CPUs on a single bus, but now regard this entire unit (CPUs plus bus) as a cluster. Build the system as multiple clusters and connect the clusters using an intercluster bus, as shown in Fig. 1-5(a). As long as most CPUs communicate primarily within their own cluster, there will be relatively little intercluster traffic. If one intercluster bus proves to be inadequate, add a second intercluster bus, or arrange the clusters in a tree or grid. If still more bandwidth is needed, collect a bus, tree, or grid of clusters together into a supercluster, and break the system into multiple superclusters. The superclusters can be connected by a bus, tree, or grid, and so on. Fig. 1-5(b) shows a system with three levels of buses.
Fig. 1-5. (a) Three clusters connected by an intercluster bus to form one supercluster. (b) Two
superclusters connected by a supercluster bus.
In this section we will look at a hierarchical design based on a grid of clusters. The machine, called Dash, was built as a research project at Stanford University (Lenoski et al., 1992). Although many other researchers are doing similar work, this one is a typical example. In the remainder of this section we will focus on the 64-CPU prototype that was actually constructed, but the design principles have been chosen carefully so that one could equally well build a much larger version. The description given below has been simplified slightly in a few places to avoid going into unnecessary detail.

A simplified diagram of the Dash prototype is presented in Fig. 1-6(a). It consists of 16 clusters, each cluster containing a bus, four CPUs, 16M of the global memory, and some I/O equipment (disks, etc.). To avoid clutter in the figure, the I/O equipment and two of the CPUs have been omitted from each cluster. Each CPU is able to snoop on its local bus, as in Fig. 1-1(b), but not on other buses.

The total address space available in the prototype is 256M, divided up into 16 regions of 16M each. The global memory of cluster 0 holds addresses 0 to 16M. The global memory of cluster 1 holds addresses 16M to 32M, and so on. Memory is cached and transferred in units of 16-byte blocks, so each cluster has 1M memory blocks within its address space.

Directories

Each cluster has a directory that keeps track of which clusters currently have copies of its blocks. Since each cluster owns 1M memory blocks, it has 1M entries in its directory, one per block. Each entry holds a bit map with one bit per cluster telling whether or not that cluster has the block currently cached. The entry also has a 2-bit field telling the state of the block. The directories are essential to the operation of Dash, as we shall see. In fact, the name Dash comes from "Directory Architecture for Shared memory."
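A directory entry as just described, a bit map with one bit per cluster plus a 2-bit state field, might be modeled as in the following sketch; the helper names are invented for illustration.

    # Sketch of a Dash-style directory entry: a 16-bit cluster bit map plus a 2-bit state.

    NUM_CLUSTERS = 16
    UNCACHED, CLEAN, DIRTY = 0, 1, 2          # the 2-bit state field

    class DirectoryEntry:
        def __init__(self):
            self.cluster_bits = 0             # bit i set => cluster i has the block cached
            self.state = UNCACHED

        def mark_cached(self, cluster):
            self.cluster_bits |= 1 << cluster

        def clear_cached(self, cluster):
            self.cluster_bits &= ~(1 << cluster)

        def holders(self):
            return [c for c in range(NUM_CLUSTERS) if (self.cluster_bits >> c) & 1]

    entry = DirectoryEntry()
    entry.mark_cached(3)
    entry.state = CLEAN
    print(entry.holders())                    # -> [3]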
Having 1M entries of 18 bits each means that the total size of each directory is over 2M bytes. With 16 clusters, the total directory memory is just over 36M, or about 14 percent of the 256M. If the number of CPUs per cluster is increased, the amount of directory memory is not changed. Thus having more CPUs per cluster allows the directory cost to be amortized over a larger number of CPUs, reducing the cost per CPU. Also, the cost of the directory and bus controllers per CPU is reduced. In theory, the design works fine with one CPU per cluster, but the cost of the directory and bus hardware per CPU then becomes larger.

A bit map is not the only way to keep track of which cluster holds which cache block. An alternative approach is to organize each directory entry as an explicit list telling which clusters hold the corresponding cache block. If there is little sharing, the list approach will require fewer bits, but if there is substantial sharing, it will require more bits. Lists also have the disadvantage of being variable-length data structures, but these problems can be solved. The M.I.T. Alewife multiprocessor (Agarwal et al., 1991; and Kranz et al., 1993), for example, is similar to Dash in many respects, although it uses lists instead of bit maps in its directories and handles directory overflows in software.

Caching

Caching is done on two levels: a first-level cache and a larger second-level cache. The first-level cache is a subset of the second-level cache, so only the latter will concern us here. Each (second-level) cache monitors the local bus using a protocol somewhat similar to the cache ownership protocol of Fig. 1-3.

Each cache block can be in one of the following three states:

1. UNCACHED — The only copy of the block is in this memory.

2. CLEAN — Memory is up-to-date; the block may be in several caches.

3. DIRTY — Memory is incorrect; only one cache holds the block.

The state of each cache block is stored in the State field of its directory entry, as shown in Fig. 1-6(b).
When a read request reaches the block's home cluster, the directory hardware there examines its entry to see what state the block is in. If it is UNCACHED or CLEAN, the hardware fetches the block from its global memory and sends it back to the requesting cluster. It then updates its directory, marking the block as cached in the requester's cluster (if it was not already so marked).

If, however, the needed block is DIRTY, the directory hardware looks up the identity of the cluster holding the block and forwards the request there. The cluster holding the dirty block then sends it to the requesting cluster and marks its own copy as CLEAN because it is now shared. It also sends a copy back to the home cluster so that memory can be updated and the block state changed to CLEAN. All these cases are summarized in Fig. 1-7(a). Where a block is marked as being in a new state, it is the home directory that is changed, as it is the home directory that keeps track of the state.

Writes work differently. Before a write can be done, the CPU doing the write must be sure that it is the owner of the only copy of the cache block in the system. If it already has the block in its on-board cache and the block is dirty, the write can proceed immediately. If it has the block but it is clean, a packet is first sent to the home cluster requesting that all other copies be tracked down and invalidated.

If the requesting CPU does not have the cache block, it issues a request on the local bus to see if any of the neighbors have it. If so, a cache-to-cache (or memory-to-cache) transfer is done. If the block is CLEAN, all other copies, if any, must be invalidated by the home cluster.

If the local broadcast fails to turn up a copy and the block is homed elsewhere, a packet is sent to the home cluster. Three cases can be distinguished here. If the block is UNCACHED, it is marked dirty and sent to the requester. If it is CLEAN, all copies are invalidated and then the procedure for UNCACHED is followed. If it is DIRTY, the request is forwarded to the remote cluster currently owning the block (if needed). This cluster invalidates its own copy and satisfies the request. The various cases are shown in Fig. 1-7(b).
(Table columns in both parts of Fig. 1-7: Block state; R's cache; Neighbor's cache; Home cluster's memory; Some cluster's cache.)
Fig. 1-7. Dash protocols. The columns show where the block was found. The rows show the
state it was in. The contents of the boxes show the action taken. R refers to the requesting
CPU. An empty box indicates an impossible situation. (a) Reads. (b) Writes.
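As a rough software rendering of the read path summarized in Fig. 1-7(a), the sketch below shows how a home cluster's directory might serve a read request. It follows the prose above in spirit only; the function names and message-printing stubs are invented, and the real Dash logic lives in hardware.

    # Simplified sketch of a home cluster's directory serving a read request.

    UNCACHED, CLEAN, DIRTY = "UNCACHED", "CLEAN", "DIRTY"

    class DirEntry:
        def __init__(self):
            self.state = UNCACHED
            self.holders = set()            # clusters currently caching the block

    def send_block(block, to):
        print(f"home memory sends block {block} to cluster {to}")

    def forward_to(owner, block, requester):
        print(f"request for block {block} forwarded to owner cluster {owner}; "
              f"owner sends the block to cluster {requester} and a copy back to memory")

    def handle_read(directory, block, requester):
        entry = directory[block]
        if entry.state in (UNCACHED, CLEAN):        # home memory is up to date
            send_block(block, to=requester)
        else:                                       # DIRTY: one cluster owns the only copy
            owner = next(iter(entry.holders))
            forward_to(owner, block, requester)     # memory gets updated as a side effect
        entry.holders.add(requester)                # the home directory records the holder
        entry.state = CLEAN

    directory = {7: DirEntry()}
    handle_read(directory, block=7, requester=3)    # UNCACHED: served from home memory
    directory[7].state, directory[7].holders = DIRTY, {5}
    handle_read(directory, block=7, requester=3)    # DIRTY: forwarded to cluster 5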
NUMA Multiprocessors
If nothing else, it should be abundantly clear by now that hardware caching in large multiprocessors is not simple. Complex data structures must be maintained by the hardware, and intricate protocols, such as those of Fig. 1-7, must be built into the cache controller or MMU. The inevitable consequence is that large multiprocessors are expensive and not in widespread use.

However, researchers have spent a considerable amount of effort looking at alternative designs that do not require elaborate caching schemes. One such architecture is the NUMA (NonUniform Memory Access) multiprocessor. Like a traditional UMA (Uniform Memory Access) multiprocessor, a NUMA machine has a single virtual address space that is visible to all CPUs. When any CPU writes a value to location a, a subsequent read of a by a different processor will return the value just written.

The difference between UMA and NUMA machines lies not in the semantics but in the performance. On a NUMA machine, access to a remote memory is much slower than access to a local memory, and no attempt is made to hide this fact by hardware caching. The ratio of a remote access to a local access is typically 10:1, with a factor of two variation either way not being unusual. Thus, a CPU can directly execute a program that resides in a remote memory, but the program may run an order of magnitude slower than it would have had it been in local memory.

Examples of NUMA Multiprocessors

To make the concept of a NUMA machine clearer, consider the example of Fig. 1-8(a), Cm*, the first NUMA machine (Jones et al., 1977). The machine consisted of a number of clusters, each consisting of a CPU, a microprogrammable MMU, a memory module, and possibly some I/O devices, all connected by a bus. No caches were present, and no bus snooping occurred. The clusters were connected by intercluster buses, one of which is shown in the figure.

When a CPU made a memory reference, the request went to the CPU's MMU, which then examined the upper bits of the address to see which memory was needed. If the address was local, the MMU just issued a request on the local bus. If it was to a distant memory, the MMU built a request packet containing the address (and, for a write, the data word to be written), and sent it to the destination cluster over an intercluster bus. Upon receiving the packet, the destination MMU carried out the operation and returned the word (for a read) or an acknowledgement (for a write). Although it was possible for a CPU to run entirely from a remote memory, sending a packet for each word read and each word written slowed down operation by an order of magnitude.
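The address-splitting behavior of the Cm* MMU can be illustrated with a tiny routing function. The particular bit layout (a 4-bit cluster number in the top of a 20-bit address) is an assumption chosen only to make the example concrete, not the real Cm* format.

    # Illustrative NUMA routing in the style of Cm*: the MMU inspects the upper address
    # bits and chooses between a local bus cycle and an intercluster request packet.

    CLUSTER_BITS = 4                 # assumed field width, for illustration only
    ADDRESS_BITS = 20

    def route_reference(addr, my_cluster, write=False, data=None):
        cluster = addr >> (ADDRESS_BITS - CLUSTER_BITS)      # upper bits select the memory
        if cluster == my_cluster:
            return ("local bus cycle", addr)
        packet = {"dest": cluster, "addr": addr, "write": write, "data": data}
        return ("intercluster packet", packet)               # remote: send it over a bus

    print(route_reference(0x21234, my_cluster=2))                        # local reference
    print(route_reference(0x70040, my_cluster=2, write=True, data=99))   # remote write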
Fig. 1-8(b). The BBN Butterfly. The CPUs on the right are the same as those on the left (i.e., the architecture is really a cylinder).
Figure 1-8(b) shows another NUMA machine, the BBN Butterfly. In this design, each CPU is coupled directly to one memory. Each of the small squares in Fig. 1-8(b) represents a CPU plus memory pair. The CPUs on the right-hand side of the figure are the same as those on the left. The CPUs are wired up via eight switches, each having four input ports and four output ports. Local memory requests are handled directly; remote requests are turned into request packets and sent to the appropriate memory via the switching network. Here, too, programs can run remotely, but at a tremendous penalty in performance.

Although neither of these examples has any global memory, NUMA machines can be equipped with memory that is not attached to any CPU.

Bolosky et al. (1989), for example, describe a bus-based NUMA machine that has a global memory that does not belong to any CPU but can be accessed by all of them (in addition to the local memories).

Properties of NUMA Multiprocessors

NUMA machines have three key properties that are of concern to us:

1. Access to remote memory is possible.

2. Accessing remote memory is slower than accessing local memory.

3. Remote access times are not hidden by caching.

The first two points are self-explanatory. The third may require some clarification. In Dash and most other modern UMA multiprocessors, remote access is slower than local access as well. What makes this property bearable is the presence of caching. When a remote word is touched, a block of memory around it is fetched to the requesting processor's cache, so that subsequent references go at full speed. Although there is a slight delay to handle the cache fault, running out of remote memory can be only fractionally more expensive than running out of local memory. The consequence of this observation is that it does not matter so much which pages live in which memory: code and data are automatically moved by the hardware to wherever they are needed (although a bad choice of the home cluster for each page in Dash adds extra overhead).

NUMA machines do not have this property, so it matters a great deal which page is located in which memory (i.e., on which machine). The key issue in NUMA software is the decision of where to place each page to maximize performance. Below we will briefly summarize some ideas due to LaRowe and Ellis (1991). Other work is described in (Cox and Fowler, 1989; LaRowe et al., 1991; and Ramanathan and Ni, 1991).

When a program on a NUMA machine starts up, pages may or may not be manually prepositioned on certain processors' machines (their home processors). In either case, when a CPU tries to access a page that is not currently mapped into its address space, it causes a page fault. The operating system catches the fault and has to make a decision. If the page is read-only, the choice is to replicate the page (i.e., make a local copy without disturbing the original) or to map the virtual page onto the remote memory, thus forcing a remote access for all addresses on that page. If the page is read-write, the choice is to migrate the page to the faulting processor (invalidating the original page) or to map the virtual page onto the remote memory.

The trade-offs involved here are simple. If a local copy is made (replication or migration) and the page is not reused much, considerable time will have been wasted fetching it for nothing. On the other hand, if no copy is made, the page is mapped remote, and many accesses follow, they will all be slow. In essence, the operating system has to guess if the page will be heavily used in the future.
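The placement choices just described (replicate, migrate, or map the page remotely) amount to a small decision procedure. In the sketch below, the expected-reuse predicate stands in for whatever guess the operating system actually makes; it is not a real policy.

    # Sketch of the NUMA page-fault decision described above.

    class Page:
        def __init__(self, read_only):
            self.read_only = read_only

    def handle_numa_page_fault(page, expect_heavy_reuse):
        if not expect_heavy_reuse(page):
            return "map remote"        # leave the page where it is; every access is remote
        if page.read_only:
            return "replicate"         # make a local copy, leave the original undisturbed
        return "migrate"               # move the page here and invalidate the original

    always_reused = lambda page: True
    print(handle_numa_page_fault(Page(read_only=True), always_reused))    # -> replicate
    print(handle_numa_page_fault(Page(read_only=False), always_reused))   # -> migrate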
Comparison of Shared Memory Systems

Shared memory systems cover a broad spectrum, from systems that maintain consistency entirely in hardware to those that do it entirely in software. We have studied the hardware end of the spectrum in some detail and have given a brief summary of the software end (page-based distributed shared memory and object-based distributed shared memory). In Fig. 1-9 the spectrum is shown explicitly.

Fig. 1-9. The spectrum of shared memory machines.

On the left-hand side of Fig. 1-9 we have the single-bus multiprocessors that have hardware caches and keep them consistent by snooping on the bus. These are the simplest shared-memory machines and operate entirely in hardware. Various machines made by Sequent and other vendors and the experimental DEC Firefly workstation (five VAXes on a common bus) fall into this category. This design works fine for a small or medium number of CPUs, but degrades rapidly when the bus saturates.

Next come the switched multiprocessors, such as the Stanford Dash machine and the M.I.T. Alewife machine. These also have hardware caching but use directories and other data structures to keep track of which CPUs or clusters have which cache blocks. Complex algorithms are used to maintain consistency, but since they are stored primarily in MMU microcode (with exceptions potentially handled in software), they count as mostly "hardware" implementations.

Next come the NUMA machines. These are hybrids between hardware and software control. As in a multiprocessor, each NUMA CPU can access each word of the common virtual address space just by reading or writing it. Unlike in a multiprocessor, however, caching (i.e., page placement and migration) is controlled by software (the operating system), not by hardware (the MMUs). Cm* (Jones et al., 1977) and the BBN Butterfly are examples of NUMA machines.

Continuing along the spectrum, we come to the machines running a page-based distributed shared memory system such as IVY (Li, 1986) and Mirage (Fleisch and Popek, 1989).
Each of the CPUs in such a system has its own private memory and, unlike the NUMA machines and UMA multiprocessors, cannot reference remote memory directly. When a CPU addresses a word in the address space that is backed by a page currently located on a different machine, a trap to the operating system occurs and the required page must be fetched by software. The operating system acquires the necessary page by sending a message to the machine where the page is currently residing and asking for it. Thus both placement and access are done in software here.

Then we come to machines that share only a selected portion of their address spaces, namely shared variables and other data structures. The Munin (Bennett et al., 1990) and Midway (Bershad et al., 1990) systems work this way. User-supplied information is required to determine which variables are shared and which are not. In these systems, the focus changes from trying to pretend that there is a single common memory to how to maintain a set of replicated distributed data structures consistent in the face of updates, potentially from all the machines using the shared data. In some cases the paging hardware detects writes, which may help maintain consistency efficiently. In other cases, the paging hardware is not used for consistency management.

Finally, we have systems running object-based distributed shared memory. Unlike all the others, programs here cannot just access the shared data. They have to go through protected methods, which means that the runtime system can always get control on every access to help maintain consistency. Everything is done in software here, with no hardware support at all. Orca (Bal, 1991) is an example of this design, and Linda (Carriero and Gelernter, 1989) is similar to it in some important ways.

The differences between these six types of systems are summarized in Fig. 1-10, which shows them from tightly coupled hardware on the left to loosely coupled software on the right. The first four types offer a memory model consisting of a standard, paged, linear virtual address space. The first two are regular multiprocessors and the next two do their best to simulate them. Since the first four types act like multiprocessors, the only operations possible are reading and writing memory words. In the fifth column, the shared variables are special, but they are still accessed only by normal reads and writes. The object-based systems, with their encapsulated data and methods, can offer more general operations and represent a higher level of abstraction than raw memory.
(Fig. 1-10: the columns on the left cover the multiprocessors, those on the right the DSM systems.)
The real difference between the multiprocessors and the DSM systems is whether or not remote data can be accessed just by referring to their addresses. On all the multiprocessors the answer is yes. On the DSM systems it is no: software intervention is always needed. Similarly, unattached global memory, that is, memory not associated with any particular CPU, is possible on the multiprocessors but not on the DSM systems (because the latter are collections of separate computers connected by a network).

In the multiprocessors, when a remote access is detected, a message is sent to the remote memory by the cache controller or MMU. In the DSM systems it is sent by the operating system or runtime system. The medium used is also different, being a high-speed bus (or collection of buses) for the multiprocessors and a conventional LAN (usually) for the DSM systems (although sometimes the difference between a "bus" and a "network" is arguable, having mainly to do with the number of wires).

The next point relates to who does data migration when it is needed. Here the NUMA machines are like the DSM systems: in both cases it is the software, not the hardware, which is responsible for moving data around from machine to machine. Finally, the unit of data transfer differs for the six systems, being a cache block for the UMA multiprocessors, a page for the NUMA machines and page-based DSM systems, and a variable or object for the last two.