Distributed Shared Memory
Chapter Review
o Introduction
o What is Shared Memory
o Bus-Based Multiprocessors
o Ring Based Multiprocessors
o Switched Multiprocessors
o NUMA
o Comparison of Shared Memory Systems
Bus-Based Multiprocessors
A bus-based multiprocessor consists of some number of CPUs all connected to a common bus, along with a memory module. A simple configuration is to have a high-speed backplane or motherboard into which CPU or memory cards can be inserted. A typical bus has 32 or 64 address lines, 32 or 64 data lines, and perhaps 32 or more control lines, all of which operate in parallel. The problem with this scheme is that with as few as 4 or 5 CPUs, the bus will usually be overloaded and performance will drop drastically. The solution is to add high-speed cache memory between the CPUs and the bus, as shown in Fig. 1-1. The cache holds the most recently accessed words. All memory requests go through the cache. If the word requested is in the cache, the cache itself responds to the CPU and no bus request is made. If the cache is large enough, the probability of success, called the hit rate, will be high and the amount of bus traffic per CPU will drop dramatically, allowing many more CPUs in the system.

However, the introduction of caches also brings a serious problem with it. Suppose that two CPUs, A and B, each read the same word into their respective caches. Then A overwrites the word. When B next reads that word, it gets the old value from its cache, not the value A just wrote. The memory is now incoherent, and the system is difficult to program. One solution is to combine a write-through cache with bus snooping. In a write-through cache, cache memories are designed so that whenever a word is written to the cache, it is written through to memory as well. In addition, all caches constantly monitor the bus. Whenever a cache sees a write occurring to a memory address present in its cache, it either removes that entry from its cache or updates the cache entry with the new value. Such a cache is called a snoopy cache, because it is always snooping on the bus.

One particularly simple and common protocol is called write through. When a CPU first reads a word from memory, that word is fetched over the bus and is stored in the cache of the CPU making the request. If that word is needed again later, the CPU can take it from the cache without making a memory request, thus reducing bus traffic. These two cases, read miss (word not cached) and read hit (word cached), are shown in Fig. 1-2 as the first two lines of the table. In simple systems, only the word requested is cached, but in most, a block of, say, 16 or 32 words is transferred and cached on the initial access and kept for possible future use.
Event      | Action taken by a cache in response to its own CPU's operation | Action taken by a cache in response to a remote CPU's operation
Read miss  | Fetch data from memory and store in cache                      | (No action)

Fig. 1-2. The write-through cache consistency protocol. The entries for hit in the third column mean that the snooping CPU has the word in its cache, not that the requesting CPU has it.
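To make the write-through protocol of Fig. 1-2 concrete, here is a minimal software sketch of caches snooping on a shared bus. The class names and the word-granularity caching are assumptions made only for illustration; real controllers work on blocks and entirely in hardware.

    # Minimal sketch of write-through caches with bus snooping (illustration only).

    class Memory:
        def __init__(self):
            self.words = {}                     # address -> value

    class Bus:
        def __init__(self):
            self.caches = []

        def attach(self, cache):
            self.caches.append(cache)

        def broadcast_write(self, writer, addr, value):
            # Every other cache snoops the write and updates (or could invalidate) its copy.
            for cache in self.caches:
                if cache is not writer:
                    cache.snoop_write(addr, value)

    class SnoopyCache:
        def __init__(self, memory, bus):
            self.memory, self.bus = memory, bus
            self.entries = {}                   # address -> cached value
            bus.attach(self)

        def read(self, addr):
            if addr in self.entries:            # read hit: no bus traffic
                return self.entries[addr]
            value = self.memory.words.get(addr, 0)   # read miss: fetch over the bus
            self.entries[addr] = value
            return value

        def write(self, addr, value):
            self.entries[addr] = value
            self.memory.words[addr] = value     # write through to memory
            self.bus.broadcast_write(self, addr, value)

        def snoop_write(self, addr, value):
            if addr in self.entries:
                self.entries[addr] = value

    # The incoherence scenario from the text no longer arises:
    mem, bus = Memory(), Bus()
    A, B = SnoopyCache(mem, bus), SnoopyCache(mem, bus)
    B.read(0x10)                # B caches the word
    A.write(0x10, 42)           # A overwrites it; B's copy is updated by snooping
    assert B.read(0x10) == 42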
Our protocol manages cache blocks, each of which can be in one of the following three states:

1. INVALID — This cache block does not contain valid data.

2. CLEAN — Memory is up-to-date; the block may be in other caches.

3. DIRTY — Memory is incorrect; no other cache holds the block.

The basic idea is that a word that is being read by multiple CPUs is allowed to be present in all their caches. A word that is being heavily written by only one machine is kept in its cache and not written back to memory on every write, to reduce bus traffic.

The operation of the protocol can best be illustrated by an example. For simplicity in this example, we will assume that each cache block consists of a single word. Initially, B has a cached copy of the word at address W, as illustrated in Fig. 1-3(a). The value is W1. The memory also has a valid copy. In Fig. 1-3(b), A requests and gets a copy of W from the memory. Although B sees the read request go by, it does not respond to it.

Now A writes a new value, W2, to W. B sees the write request and responds by invalidating its cache entry. A's state is changed to DIRTY, as shown in Fig. 1-3(c). The DIRTY state means that A has the only cached copy of W and that memory is out-of-date for W.

At this point, A overwrites the word again, as shown in Fig. 1-3(d). The write is done locally, in the cache, with no bus traffic. All subsequent writes also avoid updating memory.

Sooner or later, some other CPU, C in Fig. 1-3(e), accesses the word. A sees the request on the bus and asserts a signal that inhibits memory from responding. Instead, A provides the needed word and invalidates its own entry. C sees that the word is coming from another cache, not from memory, and that it is in DIRTY state, so it marks the entry accordingly. C is now the owner, which means that it can now read and write the word without making bus requests. However, it also has the responsibility of watching out for other CPUs that request the word and servicing them itself. The word remains in DIRTY state until it is purged from the cache it is currently residing in for lack of space. At that time, it disappears from all caches and is written back to memory.
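The example of Fig. 1-3 can be replayed in a few lines of code. The sketch below is a simplified model of the three-state ownership protocol (single-word blocks, no write-back of a purged block), with invented names; it only traces the state changes described above.

    # Replaying the ownership protocol of Fig. 1-3 in software (single-word blocks).

    INVALID, CLEAN, DIRTY = "INVALID", "CLEAN", "DIRTY"

    class Bus:
        def __init__(self, memory_value):
            self.memory, self.caches = memory_value, []

        def fetch(self, requester):
            for c in self.caches:                    # a DIRTY cache supplies the word
                if c is not requester and c.state == DIRTY:
                    value = c.value
                    c.state = INVALID                # ownership moves to the requester
                    return value, True
            return self.memory, False                # otherwise memory responds

        def invalidate_others(self, writer):
            for c in self.caches:
                if c is not writer and c.state != INVALID:
                    c.state = INVALID

    class Cache:
        def __init__(self, bus):
            self.bus, self.state, self.value = bus, INVALID, None
            bus.caches.append(self)

        def read(self):
            if self.state == INVALID:                # read miss: use the bus
                self.value, from_cache = self.bus.fetch(self)
                self.state = DIRTY if from_cache else CLEAN
            return self.value                        # read hit: no bus traffic

        def write(self, value):
            if self.state != DIRTY:                  # first write: invalidate other copies
                self.bus.invalidate_others(self)
                self.state = DIRTY
            self.value = value                       # later writes stay purely local

    bus = Bus(memory_value="W1")
    A, B, C = Cache(bus), Cache(bus), Cache(bus)
    B.read()              # (a) B holds a CLEAN copy of W1
    A.read()              # (b) A also gets a CLEAN copy
    A.write("W2")         # (c) B is invalidated; A becomes DIRTY
    A.write("W3")         # (d) purely local write, no bus traffic
    print(C.read())       # (e) C gets W3 from A's cache and becomes the owner -> W3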
Many small multiprocessors use a cache consistency protocol similar to this one, often with small
variations. It has three important properties:
Ring-Based Multiprocessors
The next step along the path toward distributed shared memory systems is the ring-based multiprocessor, exemplified by Memnet (Delp, 1988; Delp et al., 1991; and Tam et al., 1990). In Memnet, a single address space is divided into a private part and a shared part. The private part is divided up into regions so that each machine has a piece for its stacks and other unshared data and code. The shared part is common to all machines (and distributed over them) and is kept consistent by a hardware protocol roughly similar to those used on bus-based multiprocessors. Shared memory is divided into 32-byte blocks, which is the unit in which transfers between machines take place.

All the machines in Memnet are connected together in a modified token-passing ring. The ring consists of 20 parallel wires, which together allow 16 data bits and 4 control bits to be sent every 100 nsec, for a data rate of 160 Mbps. The ring is illustrated in Fig. 1-4(a). The ring interface, MMU (Memory Management Unit), cache, and part of the memory are integrated together in the Memnet device, which is shown in the top third of Fig. 1-4(b).
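The quoted 160-Mbps figure follows directly from the ring parameters (16 data bits transferred every 100 nsec); the short check below just reproduces that arithmetic.

    # Checking the Memnet ring data rate: 16 data bits every 100 nsec.
    data_bits_per_transfer = 16
    transfer_time_s = 100e-9
    print(data_bits_per_transfer / transfer_time_s / 1e6, "Mbps")   # -> 160.0 Mbps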
Fig. 1-4. (a) The Memnet ring. (b) A single machine. (c) The block table.
Unlike the bus-based multiprocessors of Fig. 1-1, in Memnet there is no centralized global memory. Instead, each 32-byte block in the shared address space has a home machine on which physical memory is always reserved for it, in the Home memory field of Fig. 1-4(b). A block may be cached on a machine other than its home machine. (The cache and home memory areas share the same buffer pool, but since they are used slightly differently, we treat them here as separate entities.) A read-only block may be present on multiple machines; a read-write block may be present on only one machine. In both cases, a block need not be present on its home machine. All the home machine does is provide a guaranteed place to store the block if no other machine wants to cache it. This feature is needed because there is no global memory. In effect, the global memory has been spread out over all the machines.

The Memnet device on each machine contains a table, shown in Fig. 1-4(c), which contains an entry for each block in the shared address space, indexed by block number. Each entry contains a Valid bit telling whether the block is present in the cache and up to date.
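The per-machine block table of Fig. 1-4(c) can be pictured as an array indexed by block number. Only the Valid bit and the notion of a home machine are taken from the text; the other fields below are hypothetical, added just to make the sketch self-contained.

    # Hypothetical sketch of a Memnet block table: one entry per 32-byte shared block.

    BLOCK_SIZE = 32                      # bytes; the unit of transfer between machines

    class BlockTableEntry:
        def __init__(self):
            self.valid = False           # is the block currently present on this machine?
            self.home = False            # does this machine reserve storage for the block?
            self.location = None         # local buffer slot holding the block, if present

    def block_number(shared_address):
        """Map an address in the shared part of the address space to a table index."""
        return shared_address // BLOCK_SIZE

    table = [BlockTableEntry() for _ in range(1024)]    # small example table
    entry = table[block_number(0x00A0)]                 # address 0x00A0 -> block 5
    entry.valid, entry.location = True, 17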
Switched Multiprocessors
Although bus-based multiprocessors and ring-based multiprocessors work fine for small systems (up to around 64 CPUs), they do not scale well to systems with hundreds or thousands of CPUs. As CPUs are added, at some point the bus or ring bandwidth saturates. Adding additional CPUs does not improve the system performance.

Two approaches can be taken to attack the problem of not enough bandwidth:

1. Reduce the amount of communication.

2. Increase the communication capacity.

We have already seen an example of an attempt to reduce the amount of communication by using caching. Additional work in this area might center on improving the caching protocol, optimizing the block size, reorganizing the program to increase locality of memory references, and so on.

Nevertheless, eventually there comes a time when every trick in the book has been used, but the insatiable designers still want to add more CPUs and there is no bus bandwidth left. The only way out is to add more bus bandwidth. One approach is to change the topology, going, for example, from one bus to two buses or to a tree or grid. By changing the topology of the interconnection network, it is possible to add additional communication capacity.

A different method is to build the system as a hierarchy. Continue to put some number of CPUs on a single bus, but now regard this entire unit (CPUs plus bus) as a cluster. Build the system as multiple clusters and connect the clusters using an intercluster bus, as shown in Fig. 1-5(a). As long as most CPUs communicate primarily within their own cluster, there will be relatively little intercluster traffic. If one intercluster bus proves to be inadequate, add a second intercluster bus, or arrange the clusters in a tree or grid. If still more bandwidth is needed, collect a bus, tree, or grid of clusters together into a supercluster, and break the system into multiple superclusters. The superclusters can be connected by a bus, tree, or grid, and so on. Fig. 1-5(b) shows a system with three levels of buses.
Fig. 1-5. (a) Three clusters connected by an intercluster bus to form one supercluster. (b) Two
superclusters connected by a supercluster bus.
In this section we will look at a hierarchical design based on a grid of clusters. The machine, called Dash, was built as a research project at Stanford University (Lenoski et al., 1992). Although many other researchers are doing similar work, this one is a typical example. In the remainder of this section we will focus on the 64-CPU prototype that was actually constructed, but the design principles have been chosen carefully so that one could equally well build a much larger version. The description given below has been simplified slightly in a few places to avoid going into unnecessary detail.

A simplified diagram of the Dash prototype is presented in Fig. 1-6(a). It consists of 16 clusters, each cluster containing a bus, four CPUs, 16M of the global memory, and some I/O equipment (disks, etc.). To avoid clutter in the figure, the I/O equipment and two of the CPUs have been omitted from each cluster. Each CPU is able to snoop on its local bus, as in Fig. 1-1(b), but not on other buses.

The total address space available in the prototype is 256M, divided up into 16 regions of 16M each. The global memory of cluster 0 holds addresses 0 to 16M. The global memory of cluster 1 holds addresses 16M to 32M, and so on. Memory is cached and transferred in units of 16-byte blocks, so each cluster has 1M memory blocks within its address space.

Directories

Each cluster has a directory that keeps track of which clusters currently have copies of its blocks. Since each cluster owns 1M memory blocks, it has 1M entries in its directory, one per block. Each entry holds a bit map with one bit per cluster telling whether or not that cluster has the block currently cached. The entry also has a 2-bit field telling the state of the block. The directories are essential to the operation of Dash, as we shall see. In fact, the name Dash comes from "Directory Architecture for Shared memory."
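A directory entry as just described, a bit map with one bit per cluster plus a 2-bit state field, might be modeled as in the following sketch; the helper names are invented for illustration.

    # Sketch of a Dash-style directory entry: a 16-bit cluster bit map plus a 2-bit state.

    NUM_CLUSTERS = 16
    UNCACHED, CLEAN, DIRTY = 0, 1, 2          # the 2-bit state field

    class DirectoryEntry:
        def __init__(self):
            self.cluster_bits = 0             # bit i set => cluster i has the block cached
            self.state = UNCACHED

        def mark_cached(self, cluster):
            self.cluster_bits |= 1 << cluster

        def clear_cached(self, cluster):
            self.cluster_bits &= ~(1 << cluster)

        def holders(self):
            return [c for c in range(NUM_CLUSTERS) if (self.cluster_bits >> c) & 1]

    entry = DirectoryEntry()
    entry.mark_cached(3)
    entry.state = CLEAN
    print(entry.holders())                    # -> [3]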
Having 1M entries of 18 bits each means that the total size of each directory is over 2M bytes. With 16 clusters, the total directory memory is just over 36M, or about 14 percent of the 256M. If the number of CPUs per cluster is increased, the amount of directory memory is not changed. Thus having more CPUs per cluster allows the directory cost to be amortized over a larger number of CPUs, reducing the cost per CPU. Also, the cost of the directory and bus controllers per CPU is reduced. In theory, the design works fine with one CPU per cluster, but the cost of the directory and bus hardware per CPU then becomes larger.

A bit map is not the only way to keep track of which cluster holds which cache block. An alternative approach is to organize each directory entry as an explicit list telling which clusters hold the corresponding cache block. If there is little sharing, the list approach will require fewer bits, but if there is substantial sharing, it will require more bits. Lists also have the disadvantage of being variable-length data structures, but these problems can be solved. The M.I.T. Alewife multiprocessor (Agarwal et al., 1991; and Kranz et al., 1993), for example, is similar to Dash in many respects, although it uses lists instead of bit maps in its directories and handles directory overflows in software.

Caching

Caching is done on two levels: a first-level cache and a larger second-level cache. The first-level cache is a subset of the second-level cache, so only the latter will concern us here. Each (second-level) cache monitors the local bus using a protocol somewhat similar to the cache ownership protocol of Fig. 1-3.

Each cache block can be in one of the following three states:

1. UNCACHED — The only copy of the block is in this memory.

2. CLEAN — Memory is up-to-date; the block may be in several caches.

3. DIRTY — Memory is incorrect; only one cache holds the block.

The state of each cache block is stored in the State field of its directory entry, as shown in Fig. 1-6(b).
When a read request reaches the block's home cluster, the directory hardware there examines its entry to see what state the block is in. If it is UNCACHED or CLEAN, the hardware fetches the block from its global memory and sends it back to the requesting cluster. It then updates its directory, marking the block as cached in the requester's cluster (if it was not already so marked).

If, however, the needed block is DIRTY, the directory hardware looks up the identity of the cluster holding the block and forwards the request there. The cluster holding the dirty block then sends it to the requesting cluster and marks its own copy as CLEAN because it is now shared. It also sends a copy back to the home cluster so that memory can be updated and the block state changed to CLEAN. All these cases are summarized in Fig. 1-7(a). Where a block is marked as being in a new state, it is the home directory that is changed, as it is the home directory that keeps track of the state.

Writes work differently. Before a write can be done, the CPU doing the write must be sure that it is the owner of the only copy of the cache block in the system. If it already has the block in its on-board cache and the block is dirty, the write can proceed immediately. If it has the block but it is clean, a packet is first sent to the home cluster requesting that all other copies be tracked down and invalidated.

If the requesting CPU does not have the cache block, it issues a request on the local bus to see if any of the neighbors have it. If so, a cache-to-cache (or memory-to-cache) transfer is done. If the block is CLEAN, all other copies, if any, must be invalidated by the home cluster.

If the local broadcast fails to turn up a copy and the block is homed elsewhere, a packet is sent to the home cluster. Three cases can be distinguished here. If the block is UNCACHED, it is marked dirty and sent to the requester. If it is CLEAN, all copies are invalidated and then the procedure for UNCACHED is followed. If it is DIRTY, the request is forwarded to the remote cluster currently owning the block (if needed). This cluster invalidates its own copy and satisfies the request. The various cases are shown in Fig. 1-7(b).
(Table columns in both parts of Fig. 1-7: Block state; R's cache; Neighbor's cache; Home cluster's memory; Some cluster's cache.)
Fig. 1-7. Dash protocols. The columns show where the block was found. The rows show the
state it was in. The contents of the boxes show the action taken. R refers to the requesting
CPU. An empty box indicates an impossible situation. (a) Reads. (b) Writes.
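As a rough software rendering of the read path summarized in Fig. 1-7(a), the sketch below shows how a home cluster's directory might serve a read request. It follows the prose above in spirit only; the function names and message-printing stubs are invented, and the real Dash logic lives in hardware.

    # Simplified sketch of a home cluster's directory serving a read request.

    UNCACHED, CLEAN, DIRTY = "UNCACHED", "CLEAN", "DIRTY"

    class DirEntry:
        def __init__(self):
            self.state = UNCACHED
            self.holders = set()            # clusters currently caching the block

    def send_block(block, to):
        print(f"home memory sends block {block} to cluster {to}")

    def forward_to(owner, block, requester):
        print(f"request for block {block} forwarded to owner cluster {owner}; "
              f"owner sends the block to cluster {requester} and a copy back to memory")

    def handle_read(directory, block, requester):
        entry = directory[block]
        if entry.state in (UNCACHED, CLEAN):        # home memory is up to date
            send_block(block, to=requester)
        else:                                       # DIRTY: one cluster owns the only copy
            owner = next(iter(entry.holders))
            forward_to(owner, block, requester)     # memory gets updated as a side effect
        entry.holders.add(requester)                # the home directory records the holder
        entry.state = CLEAN

    directory = {7: DirEntry()}
    handle_read(directory, block=7, requester=3)    # UNCACHED: served from home memory
    directory[7].state, directory[7].holders = DIRTY, {5}
    handle_read(directory, block=7, requester=3)    # DIRTY: forwarded to cluster 5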
NUMA Multiprocessors
If nothing else, it should be abundantly clear by now that hardware caching in large multiprocessors is not simple. Complex data structures must be maintained by the hardware, and intricate protocols, such as those of Fig. 1-7, must be built into the cache controller or MMU. The inevitable consequence is that large multiprocessors are expensive and not in widespread use.

However, researchers have spent a considerable amount of effort looking at alternative designs that do not require elaborate caching schemes. One such architecture is the NUMA (NonUniform Memory Access) multiprocessor. Like a traditional UMA (Uniform Memory Access) multiprocessor, a NUMA machine has a single virtual address space that is visible to all CPUs. When any CPU writes a value to location a, a subsequent read of a by a different processor will return the value just written.

The difference between UMA and NUMA machines lies not in the semantics but in the performance. On a NUMA machine, access to a remote memory is much slower than access to a local memory, and no attempt is made to hide this fact by hardware caching. The ratio of a remote access to a local access is typically 10:1, with a factor of two variation either way not being unusual. Thus, a CPU can directly execute a program that resides in a remote memory, but the program may run an order of magnitude slower than it would have had it been in local memory.

Examples of NUMA Multiprocessors

To make the concept of a NUMA machine clearer, consider the example of Fig. 1-8(a), Cm*, the first NUMA machine (Jones et al., 1977). The machine consisted of a number of clusters, each consisting of a CPU, a microprogrammable MMU, a memory module, and possibly some I/O devices, all connected by a bus. No caches were present, and no bus snooping occurred. The clusters were connected by intercluster buses, one of which is shown in the figure.

When a CPU made a memory reference, the request went to the CPU's MMU, which then examined the upper bits of the address to see which memory was needed. If the address was local, the MMU just issued a request on the local bus. If it was to a distant memory, the MMU built a request packet containing the address (and, for a write, the data word to be written), and sent it to the destination cluster over an intercluster bus. Upon receiving the packet, the destination MMU carried out the operation and returned the word (for a read) or an acknowledgement (for a write). Although it was possible for a CPU to run entirely from a remote memory, sending a packet for each word read and each word written slowed down operation by an order of magnitude.
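The address-splitting behavior of the Cm* MMU can be illustrated with a tiny routing function. The particular bit layout (a 4-bit cluster number in the top of a 20-bit address) is an assumption chosen only to make the example concrete, not the real Cm* format.

    # Illustrative NUMA routing in the style of Cm*: the MMU inspects the upper address
    # bits and chooses between a local bus cycle and an intercluster request packet.

    CLUSTER_BITS = 4                 # assumed field width, for illustration only
    ADDRESS_BITS = 20

    def route_reference(addr, my_cluster, write=False, data=None):
        cluster = addr >> (ADDRESS_BITS - CLUSTER_BITS)      # upper bits select the memory
        if cluster == my_cluster:
            return ("local bus cycle", addr)
        packet = {"dest": cluster, "addr": addr, "write": write, "data": data}
        return ("intercluster packet", packet)               # remote: send it over a bus

    print(route_reference(0x21234, my_cluster=2))                        # local reference
    print(route_reference(0x70040, my_cluster=2, write=True, data=99))   # remote write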
Fig. 1-8(b). The BBN Butterfly. The CPUs on the right are the same as those on the left (i.e., the architecture is really a cylinder).
Figure 1-8(b) shows another NUMA machine, the BBN Butterfly. In this design, each CPU is coupled directly to one memory. Each of the small squares in Fig. 1-8(b) represents a CPU plus memory pair. The CPUs on the right-hand side of the figure are the same as those on the left. The CPUs are wired up via eight switches, each having four input ports and four output ports. Local memory requests are handled directly; remote requests are turned into request packets and sent to the appropriate memory via the switching network. Here, too, programs can run remotely, but at a tremendous penalty in performance.

Although neither of these examples has any global memory, NUMA machines can be equipped with memory that is not attached to any CPU.

Bolosky et al. (1989), for example, describe a bus-based NUMA machine that has a global memory that does not belong to any CPU but can be accessed by all of them (in addition to the local memories).

Properties of NUMA Multiprocessors

NUMA machines have three key properties that are of concern to us:

1. Access to remote memory is possible.

2. Accessing remote memory is slower than accessing local memory.

3. Remote access times are not hidden by caching.

The first two points are self-explanatory. The third may require some clarification. In Dash and most other modern UMA multiprocessors, remote access is slower than local access as well. What makes this property bearable is the presence of caching. When a remote word is touched, a block of memory around it is fetched to the requesting processor's cache, so that subsequent references go at full speed. Although there is a slight delay to handle the cache fault, running out of remote memory can be only fractionally more expensive than running out of local memory. The consequence of this observation is that it does not matter so much which pages live in which memory: code and data are automatically moved by the hardware to wherever they are needed (although a bad choice of the home cluster for each page in Dash adds extra overhead).

NUMA machines do not have this property, so it matters a great deal which page is located in which memory (i.e., on which machine). The key issue in NUMA software is the decision of where to place each page to maximize performance. Below we will briefly summarize some ideas due to LaRowe and Ellis (1991). Other work is described in (Cox and Fowler, 1989; LaRowe et al., 1991; and Ramanathan and Ni, 1991).

When a program on a NUMA machine starts up, pages may or may not be manually prepositioned on certain processors' machines (their home processors). In either case, when a CPU tries to access a page that is not currently mapped into its address space, it causes a page fault. The operating system catches the fault and has to make a decision. If the page is read-only, the choice is to replicate the page (i.e., make a local copy without disturbing the original) or to map the virtual page onto the remote memory, thus forcing a remote access for all addresses on that page. If the page is read-write, the choice is to migrate the page to the faulting processor (invalidating the original page) or to map the virtual page onto the remote memory.

The trade-offs involved here are simple. If a local copy is made (replication or migration) and the page is not reused much, considerable time will have been wasted fetching it for nothing. On the other hand, if no copy is made, the page is mapped remote, and many accesses follow, they will all be slow. In essence, the operating system has to guess if the page will be heavily used in the future.
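The placement choices just described (replicate, migrate, or map the page remotely) amount to a small decision procedure. In the sketch below, the expected-reuse predicate stands in for whatever guess the operating system actually makes; it is not a real policy.

    # Sketch of the NUMA page-fault decision described above.

    class Page:
        def __init__(self, read_only):
            self.read_only = read_only

    def handle_numa_page_fault(page, expect_heavy_reuse):
        if not expect_heavy_reuse(page):
            return "map remote"        # leave the page where it is; every access is remote
        if page.read_only:
            return "replicate"         # make a local copy, leave the original undisturbed
        return "migrate"               # move the page here and invalidate the original

    always_reused = lambda page: True
    print(handle_numa_page_fault(Page(read_only=True), always_reused))    # -> replicate
    print(handle_numa_page_fault(Page(read_only=False), always_reused))   # -> migrate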
Comparison of Shared Memory Systems

Shared memory systems cover a broad spectrum, from systems that maintain consistency entirely in hardware to those that do it entirely in software. We have studied the hardware end of the spectrum in some detail and have given a brief summary of the software end (page-based distributed shared memory and object-based distributed shared memory). In Fig. 1-9 the spectrum is shown explicitly.

Fig. 1-9. The spectrum of shared memory machines.

On the left-hand side of Fig. 1-9 we have the single-bus multiprocessors that have hardware caches and keep them consistent by snooping on the bus. These are the simplest shared-memory machines and operate entirely in hardware. Various machines made by Sequent and other vendors and the experimental DEC Firefly workstation (five VAXes on a common bus) fall into this category. This design works fine for a small or medium number of CPUs, but degrades rapidly when the bus saturates.

Next come the switched multiprocessors, such as the Stanford Dash machine and the M.I.T. Alewife machine. These also have hardware caching but use directories and other data structures to keep track of which CPUs or clusters have which cache blocks. Complex algorithms are used to maintain consistency, but since they are stored primarily in MMU microcode (with exceptions potentially handled in software), they count as mostly "hardware" implementations.

Next come the NUMA machines. These are hybrids between hardware and software control. As in a multiprocessor, each NUMA CPU can access each word of the common virtual address space just by reading or writing it. Unlike in a multiprocessor, however, caching (i.e., page placement and migration) is controlled by software (the operating system), not by hardware (the MMUs). Cm* (Jones et al., 1977) and the BBN Butterfly are examples of NUMA machines.

Continuing along the spectrum, we come to the machines running a page-based distributed shared memory system such as IVY (Li, 1986) and Mirage (Fleisch and Popek, 1989).
Each of the CPUs in such a system has its own private memory and, unlike the NUMA machines and UMA multiprocessors, cannot reference remote memory directly. When a CPU addresses a word in the address space that is backed by a page currently located on a different machine, a trap to the operating system occurs and the required page must be fetched by software. The operating system acquires the necessary page by sending a message to the machine where the page is currently residing and asking for it. Thus both placement and access are done in software here.

Then we come to machines that share only a selected portion of their address spaces, namely shared variables and other data structures. The Munin (Bennett et al., 1990) and Midway (Bershad et al., 1990) systems work this way. User-supplied information is required to determine which variables are shared and which are not. In these systems, the focus changes from trying to pretend that there is a single common memory to how to maintain a set of replicated distributed data structures consistent in the face of updates, potentially from all the machines using the shared data. In some cases the paging hardware detects writes, which may help maintain consistency efficiently. In other cases, the paging hardware is not used for consistency management.

Finally, we have systems running object-based distributed shared memory. Unlike all the others, programs here cannot just access the shared data. They have to go through protected methods, which means that the runtime system can always get control on every access to help maintain consistency. Everything is done in software here, with no hardware support at all. Orca (Bal, 1991) is an example of this design, and Linda (Carriero and Gelernter, 1989) is similar to it in some important ways.

The differences between these six types of systems are summarized in Fig. 1-10, which shows them from tightly coupled hardware on the left to loosely coupled software on the right. The first four types offer a memory model consisting of a standard, paged, linear virtual address space. The first two are regular multiprocessors and the next two do their best to simulate them. Since the first four types act like multiprocessors, the only operations possible are reading and writing memory words. In the fifth column, the shared variables are special, but they are still accessed only by normal reads and writes. The object-based systems, with their encapsulated data and methods, can offer more general operations and represent a higher level of abstraction than raw memory.
(Fig. 1-10: the columns on the left cover the multiprocessors, those on the right the DSM systems.)
The real difference between the multiprocessors and the DSM systems is whether or not remote data can be accessed just by referring to their addresses. On all the multiprocessors the answer is yes. On the DSM systems it is no: software intervention is always needed. Similarly, unattached global memory, that is, memory not associated with any particular CPU, is possible on the multiprocessors but not on the DSM systems (because the latter are collections of separate computers connected by a network).

In the multiprocessors, when a remote access is detected, a message is sent to the remote memory by the cache controller or MMU. In the DSM systems it is sent by the operating system or runtime system. The medium used is also different, being a high-speed bus (or collection of buses) for the multiprocessors and a conventional LAN (usually) for the DSM systems (although sometimes the difference between a "bus" and a "network" is arguable, having mainly to do with the number of wires).

The next point relates to who does data migration when it is needed. Here the NUMA machines are like the DSM systems: in both cases it is the software, not the hardware, which is responsible for moving data around from machine to machine. Finally, the unit of data transfer differs for the six systems, being a cache block for the UMA multiprocessors, a page for the NUMA machines and page-based DSM systems, and a variable or object for the last two.