Cache Coherence Dubois Briggs 1982


EFFECTS OF CACHE COHERENCY IN MULTIPROCESSORS†

by

Michel Dubois and Fayé A. Briggs

School of Electrical Engineering


Purdue University
West Lafayette, IN 47907

Abstract

In many commercial multiprocessor systems, each processor accesses the memory through a private cache. One problem that could limit the extensibility of the system and its performance is the enforcement of cache coherence. A mechanism must exist which prevents the existence of several different copies of the same data block in different private caches. In this paper, we present an in-depth analysis of the effect of cache coherency in multiprocessors. A novel analytical model for the program behavior of a multitasked system is introduced. The model includes the behavior of each process and the interactions between processes with regard to the sharing of data blocks. An approximation is developed to derive the main effects of the cache coherency contributing to degradations in system performance.

1. Introduction

Caches have proved to be highly effective in increasing the throughput of small and large uniprocessor systems [CON69, BEL74, STR76, BAD79]. This success has prompted their use in commercial [SAT80] and scientific [WID80] tightly coupled multiprocessor systems. Many issues related to caches are common to both uniprocessor and multiprocessor systems. An overview of these issues is presented in [KAP73], and the reader is assumed to be familiar with them in the following discussion.

There are two basic configurations of caches in multiprocessors: the shared cache or the private cache. A shared cache is referenced by all the processors, as shown in Fig. 1.a. It is usually a high-speed buffer for a fixed section of memory (i.e., one or several modules). A private cache is dedicated to one single processor but buffers blocks from several main memory sections. A system with private caches is depicted in Fig. 1.b. Cache organization, block size, cache size and replacement algorithms are not restricted by either of the configurations. Generally a set-associative cache with LRU replacement policy within each set is adopted.

The cache coherence problem affects the system design at various levels. Inconsistencies may occur between cache and memory and also between different caches. The side effects on I/O processing caused by data inconsistencies between the caches and the main memory are usually resolved by connecting the I/O devices to the processors, with some slight degradations. Each I/O processor and its associated CPU compete for the cache. Moreover, cache space is also occupied by I/O processing. This solution is, however, usually preferred, as in the C.mmp. Since a shared cache contains blocks that cannot be present in any other cache, the coherence problem between cache and memory is the only potential problem for shared caches.

In a system with private caches, coherence problems also exist between the different caches. Therefore, private cache systems are more complex to design than shared cache systems. However, the private cache concept has the potential for high performance. While each reference to a shared cache must traverse an interconnection network, a private cache is next to the processor. Hence, most references can be completed at the cache speed, if the hit ratio is high enough. Consequently, higher speedups are expected from private cache systems. The performance of the system with shared caches has been investigated in [YEH81]. Many commercial multiprocessor systems, such as the IBM 370/168 MP, IBM 3081 and UNIVAC 1100/80 MP, are systems with private caches [SAT80]. In [DUB81], we have described a hybrid system with both shared and private caches.

In this paper, we consider only the system with private caches, and study the effect of the cache coherency on the system performance. For analytical tractability we restrict this study to fully associative caches.

2. Cache Coherence Solution

The problem of cache coherence has a detrimental impact on system performance. It has two causes. The first, and most obvious, is the result of data sharing between processes participating in a multitasked algorithm or between independent operating system processes running on different processors. A second and somewhat subtler cause is the migration of processes. Assume, for example, that process A executes on processor 1, then is allowed to migrate to processor 2 and finally migrates back to processor 1. If the cache is not swept on a context switch, then some stale data could still be present in the cache of processor 1 when process A is scheduled on processor 1 for the second time.

To solve the coherence problem, static and dynamic solutions have been proposed. In the static solution, sometimes called software coherence check, pages in shared memory are tagged at compile time as cacheable or non-cacheable. Examples of non-cacheable items are semaphores and

†This research was supported by NSF grant ECS 80-1650.

0149-7111/82/0000/0299$00.75 © 1982 IEEE
system tables. All pages containing shared writeable data must be non-cacheable. References to non-cacheable data are sent directly to the shared memory. If the degree of sharing is large, the resulting performance degradation may be significant, since the memory is typically 5 to 10 times slower than the cache. Moreover, the static solution is not applicable to systems in which processes can migrate, unless the cacheable blocks of a migrating process are invalidated selectively in the cache.

In the dynamic solution, there are many approaches to solving the cache coherence problem. One method is to connect each pair of caches through a high-speed link. The write-through policy is used, and on each write all the other caches are signaled simultaneously to perform the memory update. When a cache controller receives a signal, it must service it by searching the cache content for the received block address. If w_t is the probability of a write for each processor and there are P processors in the system, the average number of signals received per processor reference is (P-1)w_t. This number can become quite large as P increases beyond 2 or 4. Note also that the total number of one-way links between the caches is P(P-1). A simple implementation would be to tie the caches on a bus or on several busses. But the traffic on these busses would be such that the response time may be unacceptable. Note that to ensure correctness of execution, a processor must send a message and wait for an acknowledge signal from all other processors before it can complete a write operation. Furthermore, buffering may be required to accommodate peak loads of invalidation signals. Proposals to improve this scheme involve filtering most of the write signals by recording information about the status of each memory block in a global table. The global table can be interrogated by any processor. To limit the accesses to the global table, local flags are also distributed among the cache directories. Depending on the status of the local flags and the type of request, the processor is allowed to proceed or has to consult the global table.

The philosophy behind the cache coherence check is that an arbitrary number of caches can have a copy of a block, provided that all the copies are identical. They are identical if the processor associated with each of the caches has not attempted to modify its copy since the copy was brought into its cache. We refer to such a copy as a read-only copy (RO). To be able to modify a block copy in its cache, a processor must own the block copy with read and write access right. We refer to such a copy as a read-write copy (RW). Only one processor can, at any time, own a read-write copy of a block.

To enforce the cache consistency rule, global as well as local information is constantly updated by the cache and memory controllers. When a processor k loads a data block i on a read miss, the processor obtains a read-only copy of the block in its cache. The state of this block copy is recorded in the global table and in the cache directory (by resetting a local RW flag in the cache directory). As long as the copy of block i remains present in the cache, processor k can read it without any consistency check. If processor k wants to write into its copy of block i, it must make sure that all other copies of block i in other caches are invalidated. To do this, the global table is consulted. It should indicate the set of caches that own a copy of block i. The global table is updated to record the fact that processor k owns block i with read-write access right. Finally, the local RW flag is set to indicate that the block is being modified.

In this implementation, a block copy in a cache is invalidated whenever the cache receives a signal from some other processor attempting to write into it. Moreover, a cache owning a read-write copy of a block may receive a signal from another cache controller requesting to own a read-only copy. In this case, the read-write copy becomes read-only. The high-level flowcharts of Fig. 2 describe the mechanisms enforcing the cache coherence checks on a read (Fig. 2.a) and on a write (Fig. 2.b) operation, for a system with write-back policy. Due to the high traffic intensity generated by write-throughs to shared memory, the write-through policy may not be suited to high-performance multiprocessors, especially when the number of processors is large. Hence, the write policy adopted in this paper is write-back [KAP73]. There are three sources of performance degradations in the algorithms of Figs. 2.a and 2.b. These sources are

1. Degradation of the average hit ratio, due to block invalidation,
2. Traffic between caches to enforce the coherence rules,
3. Accesses to the global table resulting in conflicts.

Consider a block i present in the cache. If the coherence mechanism is disabled, the next time block i is referenced - if ever - a miss occurs only if block i has been replaced. When consistency is enforced, there are more chances that the next reference to block i results in a miss, since block i may be invalidated before it is replaced. The net effect is a degradation of the hit ratio, in direct relation with the invalidation traffic intensity between caches.

The degradations caused by the traffic between caches come from the overhead incurred by processors requesting to own data block copies and by processors receiving an injunction to modify the status of one block in their caches. When a processor requests ownership of a copy of a block of data, there are two cases in which it may have to wait. The first case occurs when the request is for an RW copy and at least one other processor owns an RO copy. The other case occurs when the request is for a copy when some other processor owns an RW copy. The waiting period is longer in the latter case, since one write-back to memory has to be completed before the processor can read the requested block. Received signals necessitate invalidation and updates of RW flags in the cache and also result in an increase of the write-backs to memory. This effect increases the traffic between the processor and the memory. A write-back is completed "in the background" if a buffer is provided ("Flagged Register Swap") or "on line" otherwise ("Flagged Swap") [KAP73]. In the implementation shown in Figs. 2.a and 2.b, main memory is used as the means of communication between the

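The protocol just described (a read miss loads an RO copy and records it in the global table; a write invalidates all other copies and acquires RW ownership; a remote read downgrades an RW copy to RO) can be sketched as a toy simulation. This is a minimal illustrative sketch, not the paper's hardware mechanism: all names (`GlobalTable`, `Cache`, `read`, `write`) are invented, and replacement, write-backs and timing are ignored.

```python
# Toy sketch of the global-table (directory) coherence protocol described
# above. Names are illustrative, not from the paper; replacement and
# write-back timing are ignored.

RO, RW = "RO", "RW"

class GlobalTable:
    """One entry per block: the set of caches holding a copy, plus a
    'modified' flag meaning some cache holds an RW copy."""
    def __init__(self):
        self.entry = {}               # block -> (holders: set, modified: bool)

    def holders(self, block):
        return self.entry.get(block, (set(), False))[0]

    def set(self, block, holders, modified):
        self.entry[block] = (set(holders), modified)

class Cache:
    def __init__(self, pid, table, peers):
        self.pid, self.table, self.peers = pid, table, peers
        self.flags = {}               # block -> RO or RW (local RW flag)

    def invalidate(self, block):
        self.flags.pop(block, None)

    def downgrade(self, block):       # remote read turns an RW copy into RO
        if self.flags.get(block) == RW:
            self.flags[block] = RO

    def read(self, block):
        if block in self.flags:
            return                    # hit: no consistency check needed
        holders, modified = self.table.entry.get(block, (set(), False))
        if modified:                  # a peer owns an RW copy: downgrade it
            for p in holders:
                self.peers[p].downgrade(block)
        self.flags[block] = RO        # load a read-only copy
        self.table.set(block, holders | {self.pid}, False)

    def write(self, block):
        others = self.table.holders(block) - {self.pid}
        for p in others:              # invalidate all other copies
            self.peers[p].invalidate(block)
        self.flags[block] = RW        # local RW flag set: block being modified
        self.table.set(block, {self.pid}, True)
```

For example, after caches 0 and 1 both read block 7, a write by cache 1 invalidates the copy in cache 0, and a later read by cache 0 downgrades cache 1's RW copy back to RO.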
processors. Another implementation consists in exchanging block copies on a dedicated interprocessor communication system, such as a high-speed bus. In this latter case, a given block copy is written back to memory only when it is RW and it is replaced. We consider only the first implementation in our analysis. The model developed in this paper could be utilized to analyze the second implementation.

The accesses to the global table result in conflicts. The degradations are implementation dependent. Tang has proposed to replicate all the cache directories in the global table [TAN76]. This implementation is expensive, because it requires associative addressing. Moreover, interrogations of the table must be accomplished by searching all the directory copies in parallel, which reduces the concurrency in accessing the global table. In [CEN78], a different organization was proposed, in which more concurrency is possible. For each block in main memory there is an entry in the directory. Each entry contains P "presence" flags and one "modified" flag. The modified flag is set if some processor owns an RW copy of the block; in this case, the copy in main memory is stale. For block i, presence flag k is raised if and only if processor cache k has a copy of block i. These presence flags limit the number of caches that are signaled on an invalidation request to the actual number of caches having a copy of the block. The modified flag, when it is reset, permits read-only accesses. In these implementations there is a cooperation between the cache controllers and the memory controller in the processing of invalidation signals. The organization of the cache coherence mechanism given in [TAN76] and [CEN78] can be distributed. To improve the concurrency in the coherence mechanism, the organization of the global table may be a two-dimensional array of directory modules as in an L-M memory [BRI77], with L small, of the order of 1 or 2. Details of this implementation are given in [DUB82].

The model we develop in the following section allows us to determine the impact of all the performance degradations except for the access contentions to the global table. The emphasis in this paper is to evaluate the impact of the enforcement of data consistency on the hit ratio of the cache and the invalidation traffic between caches. The amount of blocking, which results from a request for an RW copy when RO copies exist in other caches, is very difficult to obtain and is not attempted here.

3. Workload Model for a Multiprocessor

A workload model for MIMD systems is, in general, much more complex than for uniprocessor systems. It must include a description of each stream as well as a specification of the interactions between streams. Good parsimonious models are thus particularly difficult to develop. The problem of validation is compounded by the fact that traces of multitasked MIMD programs are difficult to obtain. The cooperation between processors requires special instructions and mechanisms not present in multiple SISD systems (e.g., user level synchronization mechanisms). Few compilers exist to produce the object code running on multitasked systems. Moreover, the reference strings may be different from one run to the other, depending on the timing of events in the system. Algorithms running in MIMD mode may be either synchronized or asynchronous [KUN80]. In particular, asynchronous algorithms may have random traces for different runs of the same programs. In this regard, the stochastic modeling approach in studying MIMD processes is all the more appropriate.

In this section, we propose a simple analytical model of program behavior to analyze the effect of the cache coherence problem on the performance of multiprocessors with private caches. We assume first that the MIMD algorithm is decomposed into P processes, each running on one processor. The P processes are stochastically homogeneous; i.e., their traces are realizations of the same stochastic process. When the cache block size is fixed, the reference string corresponding to each realization of a process trace is the merging of two streams: one for private and read-only shared block (referred to as P-block) references and the other for writeable shared block (referred to as S-block) references. With probability q_s, the next reference is for an S-block, and with probability (1-q_s), it is for a P-block. References to P-blocks do not affect cache coherence. However, they contribute to the accesses to the global table for miss references and also for hit references involving a write on an RO copy. Also, they do not create invalidation traffic, nor do they degrade the hit ratio of the other caches.

For the references to P-data, any model could be chosen. However, in the examples we ran, the model for the P-block reference stream is a Least-Recently-Used Stack Model (LRUSM) with coefficients derived from the linear paging model [DEN80, SPI77, SAL74]. Besides analytical tractability, this model has the advantage that by modifying one parameter the locality of the references to P-blocks is tuned at will. The references to S-blocks are drawn as in the Independent-Reference Model (IRM) [SPI77]. The cache is managed as a fully associative cache with LRU replacement. The overall program behavior is observed in Fig. 3. There are N_p P-blocks and N_s S-blocks. The P-block references are characterized by a stack distance probability distribution a_i, such that

    a_i = 1 / [b i(i-1)]   for i ≠ 1 ,  and  a_1 = 1 - 1/b   for i = 1   [SPI77].

Of course, these probabilities are normalized so that

    Σ_{i=1}^{N_p} a_i = 1 .

The references to S-blocks are made according to a fixed probability distribution p_i, i = 1,...,N_s. The probability that a reference to a given S-block i is a write reference is w_i, for i = 1,...,N_s. The probability that a given reference is to read an S-block i is q_s p_i (1 - w_i). Similarly, the probability that a reference is a write on S-block i is q_s p_i w_i.

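The two-stream workload model of this section can be sketched as a toy reference-string generator: with probability q_s the next reference is to an S-block drawn from the fixed distribution p_i (IRM), a write with probability w_i; otherwise it is a P-block drawn through a stack-distance distribution a_i, after which the LRU stack is updated. This is an illustrative sketch only: the function name is invented, and the distribution `a` passed in is a free parameter standing in for the linear-paging-model coefficients of the paper.

```python
# Illustrative generator for the two-stream workload model of section 3.
# The stack-distance weights `a` are supplied by the caller; the paper
# derives them from the linear paging model.
import random

def make_workload(Np, Ns, qs, p, w, a, seed=0):
    rng = random.Random(seed)
    stack = list(range(Np))              # LRU stack of P-blocks, top first
    while True:
        if rng.random() < qs:            # S-block reference (IRM)
            i = rng.choices(range(Ns), weights=p)[0]
            is_write = rng.random() < w[i]
            yield ("S", i, is_write)
        else:                            # P-block reference (LRUSM)
            d = rng.choices(range(Np), weights=a)[0]
            blk = stack.pop(d)           # block at that stack level
            stack.insert(0, blk)         # move it to the top of the stack
            yield ("P", blk, False)
```

With a uniform p_i = 1/N_s and constant w_i, as in the examples reported later in the paper, the generator reproduces the mix of S-block reads, S-block writes and P-block references described above.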
In this model, the effects of process migration on the cache coherence are neglected. This approximation is tantamount to neglecting transients after a context switch. The idea that S-blocks are accessed according to the IRM reflects the fact that shared writeable data are usually in different memory locations than P-blocks and that accesses to them exhibit less locality than accesses to P-blocks. As for the models of program behavior for uniprocesses, the model in Fig. 3 may describe a process behavior during a phase of a program [DEN80]. Phases are separated by abrupt transitions during which the parameters of the model may change. The model developed here assumes recurring events. In reality, a block may be written into and never referenced thereafter. The model implies, however, that a block is always referenced again eventually.

4. Performance Degradations Due to Coherence Check

Since the model is recurrent, we can consider two successive accesses to a given S-block i in any one processor. We will refer to these accesses as the previous and the next accesses. According to the model, the previous and the next accesses are separated by a geometrically distributed number of references with mean 1/(p_i q_s) - 1.

There are two types of S-block i that may reside in the cache after a previous reference. In the first case, the previous reference terminates with an RO copy of S-block i in the cache. At the time of the next reference to it, the RO copy may not be present in the cache for two reasons: either it has been chosen for replacement, or it has been invalidated following a write request to S-block i from another processor. Both cases result in a miss at the next reference. The invalidation of an RO copy does not require any block transfer to memory. A hit at the next access occurs only if S-block i has not been replaced and if no other processor has attempted to write into S-block i since the previous reference.

In the other case, the previous reference terminates with an RW copy of S-block i in the cache. After the completion of the previous access to S-block i, an RW copy of S-block i is present in the cache. The possible events that occur (with respect to S-block i) before the next reference to it are more numerous than for an RO copy. A miss on S-block i occurs at the next reference either because the block copy was replaced or because it was invalidated after another processor attempted to write into it. Before an RW copy is replaced, invalidated or referenced again, the RW copy may have been changed to RO, if another processor has obtained an RO copy since the last reference to it. Table 1.a summarizes the various events that can occur on a given S-block i in between two references to it when the cache coherence mechanism is enabled. I, R and S represent "invalidate," "replace" or "switch from RW to RO," respectively. Table 1.b shows the case in which the coherence mechanism is disabled. To evaluate the effects of the cache coherence check, both cases must be studied.

4.1 Cache Without Coherence Check

To evaluate the effect of the cache coherence check, we first examine the system without cache coherence enforcement, which can be observed by running one of the processes on a uniprocessor. We assume that the curve of the hit ratio as a function of the cache size is known for a given block size. This curve can be obtained by simulation from a trace or by analytical methods. Let h(K) denote this function, with K as the cache size. The determination of h(K) is a classical problem and will not be addressed here. The hypotheses of the program behavior were stated in section 3.

We first evaluate the probability of a write-back on each S-block i. In the following, we designate by h_i the probability that a reference to S-block i results in a hit (i.e., S-block i is present in the cache at the time of a reference to it). The Markov graph of Fig. 4 describes the reference pattern to S-block i. An S-block i can assume one of two states, RO and RW. State transitions occur right after a reference to S-block i. Let us denote these instants by t+_k,i, where t_k,i is the time of the kth reference to S-block i. At time t+_k,i (for any k), S-block i is in state RW if it was not replaced in the cache since the latest write to it or if the kth reference to it was a write. In this case, the "modified bit" is raised in the cache for S-block i, at time t+_k,i. At time t+_k,i, for any k, block i is in state RO in all other cases, i.e., the cases when block i has been replaced since the latest write to it and the kth reference is a read; the "modified bit" for block i is then reset at time t+_k,i.

The steady-state probability that S-block i is written back between two references to it is

    W_b,i = P_RW,i (1 - h_i) ,    (1)

where P_RW,i is the long term probability that the state of S-block i is RW at time t+_k,i, and (1 - h_i) is the probability that S-block i is replaced at the next reference to it.

Solving the Markov chain of Fig. 4 yields

    P_RW,i = w_i / [w_i + (1 - w_i)(1 - h_i)] ,  so that

    W_b,i = w_i (1 - h_i) / [w_i + (1 - w_i)(1 - h_i)] .    (2)

In between two successive references, the S-block i may be replaced. For an S-block i in state RO, a write-back is not required when the block is being replaced, because it was not modified. If the previous reference was in state RW, the block is not present at the next reference to it in the cache, with a probability (1 - h_i).

The parameters (1 - h_i) for i = 1,...,N_s in Eq. 2 depend on the program behavior. Because a uniprocessor system is assumed, the physical timing of

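Eqs. 1 and 2 are straightforward to evaluate numerically. The sketch below transcribes them directly; the function names are ours, and the values of h_i fed in would in practice come from the program-behavior model (they are derived in the sequel).

```python
# Numeric transcription of Eqs. (1)-(2): steady-state probability that
# S-block i is in state RW just after a reference, and the resulting
# write-back probability between two successive references to it.

def p_rw(w_i, h_i):
    """P_RW,i = w_i / [w_i + (1 - w_i)(1 - h_i)]  (Markov chain of Fig. 4)."""
    return w_i / (w_i + (1.0 - w_i) * (1.0 - h_i))

def write_back_prob(w_i, h_i):
    """W_b,i = P_RW,i * (1 - h_i)  (Eq. 1), which expands into Eq. 2."""
    return p_rw(w_i, h_i) * (1.0 - h_i)
```

Two sanity checks: if h_i = 1 the block is never replaced and W_b,i = 0, and if w_i = 1 every reference leaves the block in state RW, so W_b,i = 1 - h_i.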
events is irrelevant to the hit ratio estimation. The only useful timing consideration is the virtual time [SPI77]. The virtual time measures the number of references the processor has made since the beginning of the program execution. The determination of (1 - h_i) is a very complex problem, even with the simplified model hypotheses. A solution is possible under the following approximation. To understand the approximation, the reader is referred to Fig. 5 in which a snapshot of the cache LRU stack is displayed at any virtual time. In this configuration of the stack, S-block i is at level k. The approximation assumes that the probability that the next reference is for a block contained in any level higher than k, given that the referenced block is not S-block i, is simply h(k-1) and is independent of the previous references. Similarly, the probability that the next reference is at a level lower than or equal to k, given that S-block i is not referenced, is [1 - h(k-1)]. In summary, the probabilities that the next reference in the stack configuration of Fig. 5 is for S-block i, or for another block at a level higher than k, or for another block at a level lower than k are q_s p_i or (1 - q_s p_i) h(k-1) or (1 - q_s p_i)[1 - h(k-1)], respectively. In this approximation, the probability of accessing the cache stack at a given level is independent of previous references. Thus, we call this approximation the "instantaneous hit ratio" approximation.

The instantaneous hit ratio approximation results in the simple Markov graph description of Fig. 6, in which the state is the position of S-block i in the LRU stack. At virtual time 0, when the Markov process is started, S-block i is in position (or state) 1. We thus examine the stack just after a reference to S-block i. Whenever S-block i is referenced (with probability q_s p_i), the next state is state 1. Whenever a block is referenced at a level lower than the level k of S-block i [with probability (1 - q_s p_i)(1 - h(k-1))], S-block i is moved down to level k+1. With probability (1 - q_s p_i) h(k-1), a block at a level higher than k is referenced, and thus the position of S-block i does not change. The probability (1 - h_i) that S-block i is replaced before the next reference is equal to the probability that the Markov process of Fig. 6, started in state 1, is absorbed in the state "Replaced" before it visits state 1 again:

    1 - h_i = Π_{k=0}^{K-1} [1 - h(k)](1 - q_s p_i) / [1 - h(k)(1 - q_s p_i)] .    (3)

A combination of Eqs. 2 and 3 yields the probability of a write-back of any S-block i for arbitrary probability distributions of S-block accesses and writes, and hit ratio curves.

4.2 Cache with Coherence Check

When the cache coherence enforcement mechanism is enabled, signals are received from other processors' caches. The problem of incorporating these external signals in the local model is very complex, because, contrary to the single processor case, it requires keeping track of the physical timing of each discrete event occurring in the P processors. Since this complexity is probably not tractable, it is desirable to develop an approximation. The basic assumption is that, between two successive references of the processor under study, all the other processors complete one and only one reference. This hypothesis is not verified in practice. However, it is adequate, because the processors are assumed to be homogeneous. This approximation is made only to evaluate the mean intercache traffic and hit ratio degradations. It is not made to study the overall system performance. Synchronization at every reference would indeed be a very gross approximation in this context. We can introduce the concept of virtual time, as in single processor systems [SPI77]. To each processor is associated a virtual clock which is incremented whenever a reference is made to the cache. The correspondence between physical and virtual times is shown in Fig. 7. The approximation assumes that the physical times corresponding to a given virtual time are of the same order in all the processors.

Given this hypothesis, we can assume that, between two references in the cache under study, any one of the other processors accesses S-block i with probability q_s p_i and writes into S-block i with probability q_s p_i w_i. We note that in the model, several references to S-block i may be made in between two references to the cache under study. Consider two successive references to the cache under study; we refer to them as the previous and the next reference. Assume that after the completion of the previous reference a copy of S-block i is present in the cache under study. In our model, the following events may occur to S-block i before the next reference.

First of all, no other processor may have referenced it, in which case the copy is unchanged. If some other processors have referenced it, the copy may be modified. If the copy was RO at the previous reference and all the references from other processors were reads, then the copy is unchanged. If the copy was RO and at least one of the references from the other processors was a write, then the copy is invalidated. If the copy was RW, three cases can occur if other processors have referenced the block. First, if all the references from the other processors were reads, then the copy has become RO by the time of the next reference. Second, if the first reference from other processors signaled to the cache was a write, the copy is invalidated. Third, if the first reference from other processors to reach the cache was a read and it was then followed by at least a write, then the copy has been first changed to RO, then invalidated by the time of the next reference. We assume that the order in which references from other processors are signaled is independent of their being a read or a write. From the hypotheses of the model, the probability of these events can be evaluated, as in Table 2. In this table, P_S,i is the probability that no other processor references S-block i. Similarly, P_W,i is the probability that no other processor attempts to write into S-block i. Hence,

    P_W,i = (1 - q_s p_i w_i)^(P-1) ,  and  P_S,i = (1 - q_s p_i)^(P-1) .

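Eq. 3 and the probabilities P_S,i and P_W,i can be checked numerically. The sketch below transcribes them directly; the hit-ratio curve h(K) passed in is a free input (the paper assumes it known from simulation or analysis, with h(0) = 0), and the function names are ours.

```python
# Eq. (3) under the "instantaneous hit ratio" approximation, plus the
# no-reference and no-write probabilities P_S,i and P_W,i of section 4.2.
# The h() curve supplied by the caller is an assumed input, not the paper's.

def miss_prob(h, K, qs, pi):
    """1 - h_i = prod_{k=0}^{K-1} [1 - h(k)]*q / (1 - h(k)*q),
    with q = 1 - qs*pi and K the cache size in blocks."""
    q = 1.0 - qs * pi
    prob = 1.0
    for k in range(K):
        prob *= (1.0 - h(k)) * q / (1.0 - h(k) * q)
    return prob

def p_no_ref(qs, pi, P):
    """P_S,i = (1 - qs*pi)**(P-1): no other processor references block i."""
    return (1.0 - qs * pi) ** (P - 1)

def p_no_write(qs, pi, wi, P):
    """P_W,i = (1 - qs*pi*wi)**(P-1): no other processor writes block i."""
    return (1.0 - qs * pi * wi) ** (P - 1)
```

Each factor of the product is less than 1, so the miss probability decreases as the cache size K grows, as expected; with a degenerate curve h ≡ 0 the product collapses to (1 - q_s p_i)^K, the probability that S-block i is simply never referenced in K steps.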
The next step is to compute the effect of the cache coherence check on the hit ratio. To achieve this, we compute the individual miss ratios (1 - h*_i) of Eq. 3 under invalidations. h*_i is the probability that S-block i is present in the cache at the time of a reference to it when the cache coherence mechanism is enabled. We note from Table 2 that the probability of invalidation of a copy of S-block i at the next reference is the same, whether the copy after the previous reference was RO or RW, and is (1 - P_W,i). Invalidations are the only events affecting the hit ratio.

For the processor under study and right after a reference to S-block i (at t = 0), a copy of S-block i is present in the cache. Three events may have occurred right after each virtual time unit. First, the block copy may have been invalidated, with probability (1 - P_W,i). Second, if it has not been invalidated, it may have been referenced with probability q_s p_i. Finally, if it has not been invalidated nor referenced, it may have been replaced. The Markov graph is depicted in Fig. 8. It is more complex than the Markov graph of Fig. 6 because of the added possibility of an invalidation. This Markov graph can be solved by successive reductions.

The probability of absorption in the state "Replaced" is the probability of replacement and is equal to

    P_R,i = Π_{k=0}^{K-1} [1 - h*(k)](1 - Q) / [1 - h*(k)(1 - Q)]

          = Π_{k=0}^{K-1} [1 - h*(k)] P_W,i (1 - q_s p_i) / [1 - h*(k) P_W,i (1 - q_s p_i)] ,    (4)

where 1 - Q = P_W,i (1 - q_s p_i) and h* is the hit ratio when the cache coherence mechanism is enabled.

The probability of invalidation is

    P_I,i = (1 - P_W,i) / [1 - P_W,i (1 - q_s p_i)] (1 - P_R,i) ,    (5)

and the probability of a reference while the block copy is still present in the cache is

    h*_i = q_s p_i P_W,i / [1 - P_W,i (1 - q_s p_i)] (1 - P_R,i) = q_s p_i P_W,i / (1 - P_W,i) P_I,i .    (6)

From Eqs. 3 and 4, the effect of invalidations on the hit ratio can be evaluated.

As was mentioned in section 2, the deterioration of the hit ratio comes mainly from the increased probability of misses on S-blocks. The probabilities of a miss on a given S-block i are q_s p_i (1 - h*_i) if the coherence is enforced, and q_s p_i (1 - h_i) if it is not. We have thus

    h(K) - h*(K) = q_s Σ_{i=1}^{N_s} p_i (h_i - h*_i) ,  or  h*(K) = h(K) - q_s Σ_{i=1}^{N_s} p_i (h_i - h*_i) .    (7)

Figure 9 displays the hit ratio curves as a function of the cache size. These curves were verified with simulation to check the instantaneous hit ratio approximation and the approximation of Eq. 7. In all the results we show in this paper, the distribution p_i of the accesses to S-blocks is a uniform distribution (p_i = 1/N_s), and w_i is also a constant, independent of i. Values of w_i = 0.1 and N_p = 1024 are also adopted throughout. When the probability q_s is 0.01, the effect on the hit ratio is negligible. For a high degree of sharing (q_s = 0.1), the hit ratio may deteriorate considerably, even for P = 16 (up to 10%). The curves are different for N_s = 64 or N_s = 16, but the overall effect is similar.

Note that h*_i ≤ h_i because of the invalidations. Also, the hit ratio degradation, h(K) - h*(K), has q_s as an upper bound. This bound may be achieved when the cache is so large that h_i = 1 and the invalidation traffic is so intense that h*_i = 0. This bound is visible in Fig. 9. As an approximation, h_i can be substituted for h*_i in Eqs. 4-6 in order to estimate h* from Eq. 7. Alternately, an iterative scheme can be used to estimate h* within a given tolerance.

From Eq. 5, the total invalidation traffic reaching the cache is thus estimated as

    T_I = q_s Σ_{i=1}^{N_s} p_i P_I,i ≤ q_s .    (8)

All components of this invalidation traffic do not result in the same overhead. Since we assumed a write-back policy for cache management, the invalidation of an RW copy requires a write-back of the block to memory, whereas the invalidation of an RO copy does not. Moreover, the change from RW to RO also requires a write-back of the block to memory. Considering invalidation alone, however, is sufficient to estimate the hit ratio degradation. The traffic between caches is also important in order to estimate the overhead. Overhead is incurred both by the processor sending a request (because it has to wait for the completion of a request) and also by the processor receiving it (because a cache cycle is stolen to update the copy in its cache). In Table 1, we have indicated the possible changes of state of a given S-block i in between two successive references to it. The probability of each of these transitions is also given in the table. The derivations of these probabilities are involved and are omitted here. The
N
reader is referred to [DUB82].
1-h*(K) = 1-h(K) + ~E~ q s P i 1-h ) - (1-h i
i=1
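Replacement, invalidation, and a new reference are the only three ways the interval after a reference to S-block i can end, so the probabilities of Eqs. 4-6 must sum to one. The sketch below checks this numerically; the function name, the parameter values, and the instantaneous hit-ratio profile `hstar` are our own illustrative assumptions, not values taken from the paper.

```python
# Numerical sketch of Eqs. 4-6. All names and the h*(k) profile are
# illustrative assumptions; only the formulas come from the text.

def coherence_probs(hstar, p_i, qs, w_i, P):
    """Return (P_R, P_I, h_star_i) for one S-block, per Eqs. 4-6."""
    P_wi = (1.0 - qs * p_i * w_i) ** (P - 1)  # no other processor writes the block
    s = P_wi * (1.0 - qs * p_i)               # per-step survival factor, (1-Q)
    # Eq. 4: probability the copy is eventually replaced
    P_R = 1.0
    for hk in hstar:                          # one factor per stack position k
        P_R *= (1.0 - hk) * s / (1.0 - hk * s)
    # Eq. 5: probability the copy is invalidated before being re-referenced
    P_I = (1.0 - P_wi) / (1.0 - s) * (1.0 - P_R)
    # Eq. 6: probability the block is re-referenced while still present
    h_star_i = qs * p_i * P_wi / (1.0 - s) * (1.0 - P_R)
    return P_R, P_I, h_star_i

# Assumed instantaneous hit-ratio profile h*(k) for a K = 8 position stack.
hstar = [0.5, 0.3, 0.2, 0.15, 0.1, 0.08, 0.05, 0.03]
P_R, P_I, h_star_i = coherence_probs(hstar, p_i=1/64, qs=0.1, w_i=0.1, P=16)
assert abs(P_R + P_I + h_star_i - 1.0) < 1e-12  # the three outcomes are exhaustive
```

The final assertion mirrors the structure of the Markov graph of Fig. 8: "Replaced", "Invalidated", and "Referenced" are its only absorbing states.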

5. Comparison with Simulation Results

The first approximation made in this paper was to assume that, between two consecutive references of the processor under study, each other processor makes one and only one reference. This assumption permitted the definition of a virtual time, unique for all processors. Simulation of the whole multiprocessor system would be required in order to take into account the effect of the exact physical timing of events in the cache. This simulation was not attempted, and thus the first approximation is still to be verified. Within the context of this first assumption, the second approximation -- the instantaneous hit ratio approximation -- is easier to evaluate. The definition of the virtual time permits the simulation of the behavior of one cache in isolation. We have developed a methodology through which the hit ratio and the various components of the traffic between caches can be estimated in one simulation run, for all cache sizes, under invalidations caused by the cache coherency. The simulation results for the hit ratio are displayed in Fig. 10 in solid lines. The simulations were run for P = 16, N_p = 1024, q_s = .1, w_i = .1 and p_i = 1/N_s for all i. The total number of cache references generated was 1,000,000 for each case. The theoretical approximations appear as dashed lines. The simulations compare very well with the curves obtained analytically, demonstrating the validity of the instantaneous hit ratio approximation.

6. Conclusion

In this paper, we have studied the effects of the cache coherency on the hit ratio of a cache and on the traffic between the caches. An analytical model for multitasked program behavior has been introduced. This analytical model includes the sharing of blocks between processors. Locality properties of P-blocks and S-blocks are reflected by a different model for each. To obtain a tractable solution, it is assumed that between two references of a given processor, any other processor makes one and only one reference. This assumption is only made for the study of coherency and, in this context, is reasonable. Simulations are possible at this level of abstraction. We have, however, derived approximate analytical formulas for the effects of coherency. These derivations are based on the instantaneous hit ratio approximation. Simulations have shown that such an approximation was very good.

In general, for processes with a low degree of sharing (q_s = .01), the effects of coherency are very low, even with a large number of processors. When the degree of sharing is large (q_s = .1), the main degradations are in the hit ratio and in the blocking time resulting from accesses to blocks owned by other processors with RW right.

The model does not evaluate the effects of contention for the global table. To limit these contentions, we have organized the global table in an L-M configuration. The effects of access conflicts can be studied through other models [BRI77].

The cache coherency in multiprocessors is a very interesting and very complex problem to analyze. Measurements on a real system are practically excluded. Simulation of an MIMD system based on P individual traces is very expensive and complex, as well as very difficult to verify. Analytical models are an attractive alternative. Once the model has been developed, the gathering of results is easy and inexpensive. Predictions of the performance degradations due to coherency are possible within the limitation of a reasonable set of approximations. For example, the model for the effects of coherency could be included in a global model of a cache-based multiprocessor, like the one derived in [BRI81], and could contribute to the understanding of the cost-performance trade-offs in such systems. In interpreting these models, one should keep in mind that they are based on a set of hypotheses regarding the program behavior. Up to now, no real attempt has been made to model the behavior of cooperating processes in a multitasked system. Much work remains to be done in the development and validation of these MIMD program models. In this respect, our approach breaks new ground in the field of modeling. It has the advantage of being both tractable and reasonable.

Acknowledgement

The authors express their gratitude to Saied Kazemi for his help in developing the simulation programs to verify the analytical models.

References

[BAD79] M. Badel and J. Leroudier, "Performance Evaluation of a Cache Memory for a Minicomputer," 4th International Symposium on Modelling and Performance Evaluation of Computer Systems, 1979.
[BEL74] J. D. Bell et al., "An Investigation of Alternative Cache Organizations," IEEE Transactions on Computers, Vol. C-23, No. 4, April 1974.
[BRI77] F. A. Briggs and E. S. Davidson, "Organization of Semiconductor Memories for Parallel Pipelined Processors," IEEE Transactions on Computers, Vol. C-26, February 1977.
[BRI81] F. A. Briggs and M. Dubois, "Performance of Cache-Based Multiprocessors," ACM Conference on Measurement and Modeling of Computer Systems, September 1981.
[CEN78] L. M. Censier and P. Feautrier, "A New Solution to Coherence Problems in Multicache Systems," IEEE Transactions on Computers, Vol. C-27, No. 12, December 1978.
[CON69] C. J. Conti, "Concepts for Buffer Storage," IEEE Computer Group News, Vol. 2, No. 8, March 1969.
[DEN80] P. Denning, "Working Sets Past and Present," IEEE Transactions on Software Engineering, Vol. SE-6, No. 1, January 1980.
[DUB81] M. Dubois and F. A. Briggs, "Efficient Interprocessor Communication for MIMD Multiprocessor Systems," Proceedings of the 8th International Symposium on Computer Architecture, May 1981.
[DUB82] M. Dubois and F. A. Briggs, "Analytical Methodologies for the Evaluation of Multiprocessing Structures," Purdue University, School of Electrical Engineering, Technical Report TR-EE 82-4, February 1982.

[KAP73] K. R. Kaplan and R. O. Winder, "Cache-Based Computer Systems," Computer, Vol. 6, No. 3, March 1973.
[KUN80] H. T. Kung, "The Structure of Parallel Algorithms," in Advances in Computers, 1980.
[SAL74] J. H. Saltzer, "A Simple Linear Model of Demand Paging Performance," Communications of the ACM, Vol. 17, April 1974.
[SAT80] M. Satyanarayanan, "Commercial Multiprocessing Systems," IEEE Computer, May 1980.
[SPI77] J. R. Spirn, Program Behavior: Models and Measurements, Elsevier Computer Science Library, 1977.
[STR76] W. D. Strecker, "Cache Memories for PDP-11 Family Computers," Proceedings of the 3rd Annual Symposium on Computer Architecture, January 1976.
[TAN76] C. K. Tang, "Cache System Design in the Tightly Coupled Multiprocessor System," Proceedings of the AFIPS, 1976.
[WID80] L. C. Widdoes, "The S-1 Project: Development of High Performance Digital Computers," Compcon Digest of Papers '80, IEEE Computer Society, San Francisco, CA, February 1980.

Fig. 1a. Multiprocessor System with Shared Caches.

Fig. 1b. Multiprocessor System with Private Caches.

Fig. 2. Flowcharts describing memory operation in a multiprocessor cache with coherency enabled: (a) Read Operation; (b) Write Operation. Legend -- GT: possible access to the Global Table; WB: possible write-back to main memory; PI: pure invalidation; MM: main memory.
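Reading the boxes that survive in the scan, the write flow of Fig. 2(b) consults the global table (GT), invalidates RO copies in other caches, has an RW owner write its copy back to main memory (MM) before invalidating it, and finally records the block as RW for the writing cache. The sketch below is our own minimal rendering of that flow, with dict-based caches and GT; it omits the block-replacement branch and is an illustration, not the paper's implementation.

```python
# Hedged sketch of the Fig. 2(b) write flow; data structures are our own.
def write_block(pid, blk, caches, gt, mem):
    """Coherent write by processor `pid` to block `blk` (replacement omitted)."""
    # Invalidate every other cached copy, writing back an RW copy first.
    for other, state in list(gt.get(blk, {}).items()):
        if other == pid:
            continue
        if state == "RW":
            mem[blk] = caches[other][blk]      # request cache `other` to WB to MM
        del caches[other][blk]                 # invalidate its copy
        del gt[blk][other]
    if blk not in caches[pid]:                 # write miss: fetch the block
        caches[pid][blk] = mem.get(blk, 0)
    caches[pid][blk] += 1                      # the write itself (toy "data")
    gt.setdefault(blk, {})[pid] = "RW"         # record block as RW in the GT

# Two caches hold block 7 with RO access; processor 0 then writes it.
caches = [{7: 5}, {7: 5}]
gt = {7: {0: "RO", 1: "RO"}}
mem = {7: 5}
write_block(0, 7, caches, gt, mem)
```

After the call, cache 1 no longer holds block 7 and the GT records cache 0 as the unique RW holder, which is the invariant the flowchart enforces.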

Fig. 7. Definition of virtual time for a multistream with four processors. The crosses correspond to the physical times of the references of each processor.

Fig. 3. Model of program behavior, with S-blocks accessed with a fixed probability distribution (IRM) and P-blocks referenced with an LRU stack model.

Fig. 8. Markov-graph description of S-block i stack position with the cache coherence mechanism enabled, where d_i = q_s p_i P_{wi}, e_i = (1-q_s p_i) P_{wi}, and h^* is the hit ratio function with the cache coherence mechanism enabled.

Table 1.a. Possible events and associated overhead between two references to a given S-block when the cache coherence check is enabled.

Case  | Previous Ref. | Event    | Overhead                                             | Next Reference | Probability of Event
Es1,i | RO            | I        | invalidate block                                     | not present    | P_{I,i}
Es2,i | RO            | R        | none                                                 | not present    | P_{R,i}
Es3,i | RO            | none     | none                                                 | present (RO)   | h_i^*
Ep1,i | RW            | I        | invalidate block, write back block                   | not present    | [w_i(1-P_{si})/(1-P_{si}(1-q_s p_i))] [1-P(Ep2,i)]
Ep2,i | RW            | R        | write back block                                     | not present    | \prod_{k=0}^{K-1} [1-h^*(k)]P_{si}(1-q_s p_i) / [1-h^*(k)P_{si}(1-q_s p_i)]
Ep3,i | RW            | none     | none                                                 | present (RW)   | [q_s p_i P_{si}/(1-P_{si}(1-q_s p_i))] [1-P(Ep2,i)]
Ep4,i | RW            | S then I | switch to RO, write back block, and invalidate block | not present    | P_{I,i} - P(Ep1,i)
Ep5,i | RW            | S then R | switch to RO, write back block                       | not present    | P_{R,i} - P(Ep2,i)
Ep6,i | RW            | S        | switch to RO, write back block                       | present (RO)   | h_i^* - P(Ep3,i)

Fig. 4. Markov graph of reference pattern to S-block i with the cache coherence mechanism disabled.

Fig. 5. Snapshot of the cache LRU stack.

Table 1.b. Possible events and associated overhead between two references to a given S-block when the cache coherence check is disabled.

Previous Ref. | Event | Overhead         | Next Reference
RO            | none  | none             | present (RO)
RO            | R     | none             | not present
RW            | none  | none             | present (RW)
RW            | R     | write back block | not present

Fig. 6. Markov description of S-block i stack position in a system with the cache coherence mechanism disabled, where c_i = q_s p_i and v_i(k) = 1-h(k)[1-c_i].

Table 2. Possible modifications of the state of an S-block i between two successive references in the cache, where P_{wi} = (1-q_s p_i w_i)^{P-1} and P_{si} = (1-q_s p_i)^{P-1}.

State of S-block i at previous reference | Changes in S-block state between the two references | Probability
Not present  | not present                   | 1
Present, RO  | present, RO                   | P_{wi}
Present, RO  | invalidated                   | 1-P_{wi}
Present, RW  | present, RW                   | P_{si}
Present, RW  | present, RO                   | P_{wi}-P_{si}
Present, RW  | present, RO, then invalidated | (1-P_{wi}) - w_i(1-P_{si})
Present, RW  | simply invalidated            | w_i(1-P_{si})
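The rows of Table 2 form, for each prior state (RO or RW), a complete set of mutually exclusive outcomes, so each group of probabilities must sum to one. A quick numerical check, using the running parameter values adopted in the paper (the variable names and the choice N_s = 64 are ours):

```python
# Table 2 transition probabilities; parameter values follow the paper's
# running example (P = 16, qs = 0.1, w_i = 0.1, uniform p_i = 1/Ns, Ns = 64).
qs, p_i, w_i, P = 0.1, 1 / 64, 0.1, 16
P_wi = (1 - qs * p_i * w_i) ** (P - 1)   # no other processor writes block i
P_si = (1 - qs * p_i) ** (P - 1)         # no other processor references block i

ro_rows = {"stays RO": P_wi, "invalidated": 1 - P_wi}
rw_rows = {
    "stays RW": P_si,
    "switched to RO": P_wi - P_si,
    "switched to RO, then invalidated": (1 - P_wi) - w_i * (1 - P_si),
    "simply invalidated": w_i * (1 - P_si),
}
assert abs(sum(ro_rows.values()) - 1.0) < 1e-12  # RO outcomes are exhaustive
assert abs(sum(rw_rows.values()) - 1.0) < 1e-12  # RW outcomes are exhaustive
```

Note also that P_{wi} > P_{si} (a write by another processor is rarer than any reference by another processor), so every row probability is nonnegative.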


Fig. 9. Hit ratio versus cache size (in blocks), for N_s = 16, N_p = 1024, w_i = 0.1, and q_s = 0.1 or q_s = 0.01, with one curve per value of P. The dotted curves are for the cases where the cache coherence mechanism is disabled.

Fig. 10. Comparison between simulations and the analytical approximation for P = 16, w_i = .1, q_s = .1, p_i = 1/N_s and N_p = 1024. The simulations are in solid lines and the model approximations are in dotted lines.

