VLSI Processor Architecture
Abstract - A processor architecture attempts to compromise between the needs of programs hosted on the architecture and the performance attainable in implementing the architecture. The needs of programs are most accurately reflected by the dynamic use of the instruction set as the target for a high level language compiler. In VLSI, the issue of implementation of an instruction set architecture is significant in determining the features of the architecture. Recent processor architectures have focused on two major trends: large microcoded instruction sets and simplified, or reduced, instruction sets. The attractiveness of these two approaches is affected by the choice of a single-chip implementation. The two different styles require different tradeoffs to attain an implementation in silicon with a reasonable area. The two styles consume the chip area for different purposes, thus achieving performance by different strategies. In a VLSI implementation of an architecture, many problems can arise from the base technology and its limitations. Although circuit design techniques can help alleviate many of these problems, the architects must be aware of these limitations and understand their implications at the instruction set level.

Index Terms - Computer organization, instruction issue, instruction set design, memory mapping, microprocessors, pipelining, processor architecture, processor implementation, VLSI.

Manuscript received April 30, 1984; revised July 31, 1984. This work was supported by the Defense Advanced Research Projects Agency under Grants MDA903-79-C-680 and MDA903-83-C-0335. The author is with the Computer Systems Laboratory, Stanford University, Stanford, CA 94305.

I. INTRODUCTION

ADVANCES in semiconductor fabrication capabilities have made it possible to design and fabricate chips with tens of thousands to hundreds of thousands of transistors, operating at clock speeds as fast as 16 MHz. Single-chip processors that have transistor complexity and performance comparable to CPU's found in medium- to large-scale mainframes can be designed. Indeed, both commercial and experimental nMOS processors have been built that match the performance of large minicomputers, such as DEC's VAX 11/780.

In the context of this paper, a processor architecture is defined by the view of the programmer; this view includes user visible registers, data types and their formats, and the instruction set. The memory system and I/O system architectures may be defined either on or off the chip. Because we are concerned with chip-level processors, we must also include the definition of the interface between the chip and its environment. The chip interface defines the use of individual pins, the bus protocols, and the memory architecture and I/O architecture to the extent that these architectures are controlled by the processor's external interface.

In many ways, the architecture and organization of these VLSI processors are similar to the designs used in the CPU's of modern machines implemented using standard parts and bipolar technology. However, the tremendous potential of MOS technology has not only made VLSI an attractive implementation medium, but it has also encouraged the use of the technology for new experimental architectures. These new architectures display some interesting concepts both in how they utilize the technology and in how they overcome performance limitations that arise both from the technology and from the standard barriers to high performance encountered in any CPU.

This paper investigates the architectural design of VLSI uniprocessors. We divide the discussion into six major segments. First, we examine the goals of a processor architecture; these goals establish a framework for examining various architectural approaches. In the second section, we explore the two major styles: reduced instruction set architectures and high level microcoded instruction set architectures. Some specific techniques for supporting both high level languages and operating systems functions are discussed in the third and fourth sections, respectively. The fifth section of the paper surveys several major processor architectures and their implementations; we concentrate on showing the salient features that make the processors unique. In the sixth section we investigate an all-important issue: implementation. In VLSI, the organization and implementation of a CPU significantly affect the architecture. Using some examples, we show how these features interact with each other, and we indicate some of the principles involved.

II. ARCHITECTURAL GOALS

A computer architecture is measured by its effectiveness as a host for applications and by the performance levels obtainable by implementations of the architecture. The applications are written in high level languages, translated to the processor's instruction set by a compiler, and executed on the processor using support functions provided by the operating system. Thus, the suitability of an architecture as a host is determined by two factors: its effectiveness in supporting high level languages, and the base it provides for system level functions. The efficiency of an architecture from an implementation viewpoint must be evaluated both on the cost and on the performance of implementations of that architecture. Since a computer's role as program host is so important, the instruction set designer must carefully consider both the usefulness of the instruction set for encoding programs and the performance of implementations of that instruction set.
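One compact way to keep these two concerns together, a standard decomposition from the architecture literature rather than a formula stated in this paper, is to factor the execution time of a program:

```latex
% Execution time of a program on a processor:
%   N_instr : instructions executed (set by architecture and compiler)
%   CPI     : average clock cycles per instruction (set by organization)
%   T_cycle : clock cycle time (set by implementation technology)
T_{\text{exec}} = N_{\text{instr}} \times \text{CPI} \times T_{\text{cycle}}
```

An instruction set that shortens programs (smaller N_instr) may raise CPI or stretch the cycle time; the two design styles examined below make opposite bets on these three factors.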
Although the instruction set design may have several goals, the most obvious and usually most important goal is performance. Performance can be measured in many ways; typical measurements include instructions per second, total required memory bandwidth, and instructions needed both statically and dynamically for an application. Although all these measurements have their place, they can also be misleading. They either measure an irrelevant point, or they assume that the implementation and the architecture are independent.

The key to performance is the ability of the architecture to execute high level language programs. Measures based on assembly language performance are much less useful because such measurements may not reflect the same patterns of instruction set usage as compiled code. Of course, compiler interaction clouds the issue of high level language performance; that is to be expected. The architecture also influences the ease and difficulty of building compilers.

Implementation related effects can cause serious problems if the abstract measurements are used as a gauge of the real hardware performance. The architecture profoundly influences the complexity, cost, and potential performance of the implementation. On the basis of abstract architecturally oriented benchmarks, the most complex, highest level instruction sets seem to make the most sense; these include machines like the VAX [1], the Intel-432 [2], the DEL approaches [3], and the Xerox Mesa architectures [4]. However, the cost of implementing such architectures is higher, and their performance is not necessarily as good as architectural measures, such as instructions executed per high level statement, might indicate. Many VAX benchmarks show impressive architectural measurements, especially for instruction bytes fetched. However, data from implementations of the architecture show that the same performance is not attained. VAX instructions are short; the instruction fetch unit must constantly prefetch instructions to keep the rest of the machine busy. This includes fetching one or more instructions that sequentially follow a branch. Since branches are frequent and they are taken with higher than 50 percent probability, the instructions fetched following a branch are most often not executed. This leads to a significantly higher instruction bandwidth than the architectural measurements indicate.

Since most programs are written in high level languages, the role of the architecture as a host for programs depends on its ability to serve as a target for the code generated by compilers for high level languages of interest. The effectiveness is a function of the architecture, the compiler technology, and, to a lesser extent, the programming language. Much commonality exists among languages in their need for hardware support; furthermore, compilers tend to translate common features to similar types of code sequences. Some special language features may be significant enough to influence the architecture. Examples of such features are support for tags, support for floating point arithmetic, and support for parallel constructs.

Program optimization is becoming a standard part of many compilers. Thus, the architecture should be designed to support the code produced by an optimizing compiler. An implication of this observation is that the architecture should expose the details of the hardware to allow the compiler to maximize the efficiency of its use of that hardware. The compiler should also be able to compare alternative instruction sequences and choose the more time or space efficient sequence. Unless the execution implications of each machine instruction are visible, the compiler cannot make a reasonable choice between two alternatives. Likewise, hidden computations cannot be optimized away. This view of the optimizing compiler argues for a simplified instruction set that maximizes the visibility of all operations needed to execute the program.

Large instruction set architectures are usually implemented with microcode. In VLSI, silicon area limitations often force the use of microcode for all but the smallest and simplest instruction sets: all of the commercial 16 and 32 bit processors make extensive use of microcode in their implementations. In a processor that is microcoded, an additional level of translation, from the machine code to microinstructions, is done by the hardware. By allowing the compiler to implement this level of translation, the cost of the translation is taken once at compile-time rather than repetitively every time a machine instruction is executed. The view of an optimizing compiler as generating microcode for a simplified instruction set is explained in depth in a paper by Hopkins [5]. In addition to eliminating a level of translation, the compiler "customizes" the generated code to fit the application [6]. This customizing by the compiler can be thought of as a realizable approach to dynamically microcoding the architecture. Both the IBM 801 and MIPS exploit this approach by "compiling down" to a low level instruction set.

The architecture and its strength as a compiler target determine much of the performance at the architectural level. However, to make the hardware usable an operating system must be created on the hardware. The operating system requires certain architectural capabilities to achieve full functional performance with reasonable efficiency. If the necessary features are missing, the operating system will be forced to forego some of its user-level functions, or accept significant performance penalties. Among the features considered necessary in the construction of modern operating systems are

* privileged and user modes, with protection of specialized machine instructions and of system resources in user mode;
* support for external interrupts and internal traps;
* memory mapping support, including support for demand paging, and provision for memory protection; and
* support for synchronization primitives, in multiprocessor configurations, if conventional instructions cannot be used for that purpose.

Some architectures provide additional instructions for supporting the operating system. These instructions are included for two primary reasons. First, they establish a standard interface for hardware dependent functions. Second, they may enhance the performance of the operating system by supporting some special operation in the architecture.

Standardizing an interface by including it in the architecture has been cited as a goal both for conventional high level instructions, e.g., on the VAX [7], and for operating system interfaces [2]. Standardizing an interface in the architectural specification can be more definitive, but it can carry performance penalties when compared to a standard at the assembly language level. Such a standard can instead be implemented by macros, or by standard libraries. Putting the interface into the architecture commits the hardware designers to supporting it, but it does not inherently enforce or solidify the interface.

Enhancing operating system performance via the architecture can be beneficial. However, such enhancements must be compared to alternative improvements that will increase general performance. Even when significant time is spent in the operating system, the bulk of the time is spent executing general code rather than special functions, which might be supported in the architecture. The architect must carefully weigh the proposed feature to determine how it affects other components of the instruction set (overhead costs, etc.), as well as the opportunity cost related to the components of the instruction set that could have been included instead. Many times the performance gained by such high level features is small because the feature is not heavily used or because it yields only a minor improvement over the same function implemented with a sequence of other instructions. Often the combination of a feature's cost and performance merit forms a strong argument against its presence in the architecture.

Hardware organization can dramatically affect performance. This is especially true when the implementation is in VLSI, where the interaction of the architecture and its implementation is more pronounced. Some of the more important architectural implications are as follows.

* The limited speed of the technology encourages the use of parallel implementations. That is, many slower components are used rather than a smaller number of fast components. This basic method has been used by many designers on projects as varied as systolic arrays [8] and the MicroVAX I datapath chip [9].
* The cost of complexity in the architecture. This is true in any implementation medium, but is exacerbated in VLSI, where complexity becomes more difficult to accommodate. A corollary of this rule is that no architectural feature is free.
* Communication is more expensive than computation. Architectures that require significant amounts of global interaction will suffer in implementation.
* The chip boundaries have two major effects. First, they impose hard limits on data bandwidth on and off the chip. Second, they create a substantial disparity between on-chip and off-chip communication delays.

The architecture affects the performance of the hardware primarily at the organizational level, where it imposes certain requirements. Smaller effects occur at the implementation level, where the technology and its properties become relevant. The technology acts strongly as a weighting factor favoring some organizational approaches and penalizing others. For example, VLSI technology typically makes the use of memory on the chip attractive: relatively high densities can be obtained and chip crossings can be eliminated.

A goal in implementation is to provide the fastest hardware possible; this translates into two rules.

1) Minimize the clock cycle of the system. This implies both reducing the overhead on instructions and organizing the hardware to minimize the delays in each clock cycle.

2) Minimize the number of cycles to perform each instruction. This minimization must be based on the expected dynamic frequency of instruction use. Of course, different programming languages may differ in their frequency of instruction usage.

This second rule may dictate sacrificing performance in some components of the architecture in return for increased performance of the more heavily used parts.

The observation that these types of tradeoffs are needed, together with the fact that larger architectures generate additional overhead, has led to the reduced (or simplified) instruction set approach [10], [11]. Such architectures are streamlined to eliminate instructions that occur with low frequency in favor of building such complex instructions out of sequences of simpler instructions. The overhead per instruction can be significantly reduced, and the implementor does not have to discriminate among the instructions in the architecture. In fact, most simplified instruction set machines use single cycle execution of every instruction; this eliminates complex tradeoffs both by the hardware implementor and the compiler writer. The simple instruction set permits a high clock speed for the instruction execution, and the one-cycle nature of the instructions simplifies the control of the machine. The simplification of control allows the implementation to more easily take advantage of parallelism through pipelining. The pipeline allows simultaneous execution of several instructions, similar to the parallel activity that would occur in executing microinstructions for the interpretation of a more complex instruction set.

III. BASIC ARCHITECTURAL TRENDS

The major trend that has emerged among computer architectures in the recent past has been the emphasis on targeting to and support for high level languages. This trend is especially noticeable within the microprocessor area, where it represents an abrupt change from the assembly-language-oriented architectures of the 1970's. The most recent generation of commercially available processors, the Motorola 68000, the Intel 80X86, the Intel iAPX-432, the Zilog 8000, and the National 16032, clearly shows the shift from the 8-bit assembly language oriented machines to the 16-bit compiled language orientation. The extent of this change is influenced by the degree of compatibility with previous processor designs. The machines that are more compatible (the Intel 80X86 and the Zilog processors) show their heritage, and the compatibility has an effect on the entire instruction set. The Motorola and National products show much less compatibility and more of a compiled language direction.

This trend is more obvious among designs done in universities. The Mead and Conway [12] structured design approach has made it possible to design VLSI processors within the university environment. These projects have been language-directed. The RISC project at Berkeley and the MIPS project at Stanford both aim to support high level languages with simplified instruction sets. The MIT Scheme project [13] supports LISP via a built-in interpreter for the language.
1224 IEEE TRANSACTIONS ON COMPUTERS, VOL. c-33, NO. 12, DECEMBER 1984
A. RISC-Style Machines

A RISC, or reduced instruction set computer, is a machine with a simplified instruction set. The architectures that are generally considered to be RISC's are the Berkeley RISC I and II processors, the Stanford MIPS processor, and the IBM 801 processor (which is not a microprocessor). These machines certainly have instruction sets that are simpler than most other machines; however, they may still have many instructions: the 801 has over 100 instructions, MIPS has over 60. They may also have conceptually complex details: the 801 has instructions for programmer cache management, while MIPS requires that pipeline dependence hazards be removed in software. All three architectures avoid features that require complex control structures, though they may use a complex implementation structure where the complexity is merited by the performance gained.

The adjective streamlined is probably a better description of the key characteristics of such architectures. The most important features are

1) regularity and simplicity in the instruction set, which allows the use of the same, simple hardware units in a common fashion to execute almost all instructions;

2) single cycle execution: most instructions execute in one machine (or pipeline) cycle. These architectures are register-oriented: all operations on data objects are done in the registers. Only load and store instructions access memory; and

3) fixed length instructions with a small variety of formats.

The advantages of streamlined instruction set architectures come from a close interaction between architecture and implementation. The simplicity of the architecture lends a simplicity to the implementation. The advantages gained from this include the following.

1) The simplified instruction formats allow very fast instruction decoding. This can be used to reduce the pipeline length (without reducing throughput), and/or shorten the instruction execution time.

2) Most instructions can be made to execute in a single cycle; the register-oriented (or load/store) nature of the architecture provides this capability.

3) The simplicity of the architecture means that the organization can be streamlined; the overhead on each instruction can be reduced, allowing the clock cycle to be shortened.

4) The simpler design allows silicon resources and human resources to be concentrated on features that enhance performance. These may be features that provide additional high level language performance, or resources may be concentrated on enhancing the throughput of the implementation.

5) The low level instruction set provides the best target for state-of-the-art optimizing compiler technology. Nearly every transformation done by the optimizer on the intermediate form will result in an improved running time because the transformation will eliminate one or more instructions. The benefits of register allocation are also enhanced by eliminating entire instructions needed to access memory.

6) The simplified instruction set provides an opportunity to eliminate a level of translation at runtime, in favor of translating at compile-time. The microcode of a complex instruction set is replaced by the compiler's code generation function.

The potential disadvantages of the streamlined architecture come from two areas: memory bandwidth and additional software requirements. Because a simplified instruction set will require more instructions to perform the same function, instruction memory bandwidth requirements are potentially higher than for a machine with more powerful and more tightly encoded instructions. Some of this disadvantage is mitigated by the fact that instruction fetching will be more complicated when the architecture allows multiple sizes of instructions, especially if the instructions require multiple fetches due to lack of alignment or instruction length.

Register-oriented architectures have significantly lower data memory bandwidth [10], [14]. Lower data memory bandwidth is highly desirable since data access is less predictable than instruction access and can cause more performance problems. The existing streamlined instruction set implementations achieve this reduction in data bandwidth either from special support for on-chip data accessing, as in the RISC register windows (see Section IV-A), or from compiler register allocation. The load/store nature of these architectures is very suitable for effective register allocation by the compiler; furthermore, each eliminated memory reference results in saving an entire instruction. In a memory-oriented instruction set, only a portion of an instruction is saved.

If implementations of the architecture are expected to have a cache, trading increased instruction bandwidth for decreased data bandwidth can be advantageous. Instruction caches typically achieve higher hit rates than data caches for the same number of lines because of greater locality in code. Instruction caches are also simpler since they can be read-only. Thus, a small on-chip instruction cache might be used to lower the required off-chip instruction bandwidth.

The question of instruction bandwidth is a tricky one. Statically, programs for machines with simpler, less densely encoded instruction sets will obviously be larger. This static size has some secondary effect on performance due to increased working set sizes both for the instruction cache and the virtual memory. However, the potentially higher bandwidth requirements are much more important. Here the picture is less clear.

While the streamlined machines will definitely need more instruction bytes fetched at the architectural level, they have some benefits at the implementation level. The MIPS and RISC architectures use delayed branches [15] to reduce the fetching of instructions that will not be executed. A delayed branch means that instructions following a branch will be executed until the branch destination can be gotten into the pipeline. Data taken on MIPS has shown that 21 percent of the instructions that are executed occur during a branch delay cycle; in the case of an architecture without the delayed branch, that 21 percent of the cycles would be wasted. In many machine implementations the instructions are independently fetched by an instruction prefetch unit, so that when the branch is taken the instruction prefetch is wasted. Another data point that points to the same conclusion is from the VAX; Clark found that 25 percent of the VAX instructions executed are taken branches. This means that 25 percent of the time, the fetched instruction (i.e., the one following the branch) is not executed. Thus the useful instruction bandwidth is only about 80 percent of the raw fetch bandwidth: for every 100 instructions executed, roughly 125 must be fetched.
HENNESSY: VLSI PROCESSOR ARCHITECTURE 1225
There are some important differences in peak bandwidth and average bandwidth for instruction memory. To be competitive in performance, the complex instruction set machines must come close to achieving single cycle execution for the simple instructions, e.g., register-register instructions. To achieve this goal, the peak bandwidth must at least come close to the same bandwidth that a reduced instruction set machine will require. This peak bandwidth determines the real complexity of the memory system needed to support the processor.

Code generation for streamlined machines and for more complex machines is believed to be equally difficult. In the case of the streamlined machine, optimization is more important, but code generation is simpler since alternative implementations of code sequences do not exist [16]. The use of code optimization, which is usually done on an intermediate form whose level is below the level of the machine instruction set, means that code generation must coalesce sequences of low level intermediate form instructions into larger, more powerful machine instructions. This process is complicated by the detail in the machine instruction set and by the complex tradeoffs the compiler faces in choosing what sequence of instructions to synthesize. Experience at Stanford with our retargetable compiler system [17] has shown that the streamlined instruction sets have an easier code generation problem than the more complex instruction machines. We have also found that the simplicity of the instruction set makes it easier to determine whether an optimizing transformation is effective. In retargeting the compiler system to multiple architectures, we have found better optimization results for simpler machines [18]. In an experiment at Berkeley, a program for the Berkeley RISC processor showed little improvement in running time between a compiled and a carefully hand-coded version, while substantial improvement was possible on the VAX [19]. Since the same compiler was used in both instances, a reasonable conclusion is that less work is needed to achieve good code for the RISC processor when compared to the VAX, and that a simpler compiler suffices for the RISC processor.

B. Microcoded Instruction Sets

The alternative to a streamlined machine is a higher level instruction set. For the purposes of this paper, we will use the term high level instruction set to mean an architecture with more powerful instructions; one of the key arguments of the RISC approach is that the high level nature of the instruction set is not necessarily a better fit for high level languages. The reader should take care to keep these two different interpretations of "high level" architecture distinct. The complications of such an instruction set will usually require that the implementation be done through microcode. A large instruction set with support for multiple data types and addressing modes must use a denser instruction encoding. In addition to more opcode space, the large number of combinations of opcode, data type, and addressing mode must be encoded efficiently to prevent an explosion in code size.

A high level instruction set has one major technological advantage and several strategic advantages. The denser encoding of the instruction set lowers the static size of the program; the dynamic instruction bandwidth depends on the static size of the most active portions of the program. The major strategic advantage for a high level microcoded instruction set comes from the ability to span a wide range of application environments. Although compilers will tend to use the simpler and more straightforward instructions more often, different applications will emphasize different parts of the instruction set [7], [20]. A large instruction set can attempt to accommodate a wide range of applications with high level instructions suited to the needs of these applications. This allows the standardization of the instruction set and the ability to interchange object code across a wide range of implementations of the architecture.

In addition to not sharing some of the implementation advantages of a simplified instruction set, a more complex architecture suffers from its own complexity. Instruction set complexity makes it more difficult to ensure correctness and achieve high performance in the implementation. The latter occurs because the size of the instruction set makes it more difficult to tune the sections that are critical to high performance. In fact, one of the advantages claimed for large instruction set machines is that they do not a priori discriminate against languages or applications by prejudicing the instruction set. However, similarities in the translation of high level languages could easily allow prejudices that benefit the most common languages and penalize other languages. There is also a question of design and implementation efficiency with this type of instruction set: some portions of it may see little use in many environments. However, the overhead of that portion of the instruction set is paid by all instructions to the extent that the critical path for the instructions runs through the control unit.

IV. ARCHITECTURAL SUPPORT FOR HIGH LEVEL LANGUAGES

Several computers have included special language support in the architecture. This support most often focuses on a small set of primitives for performing frequent language-oriented actions. The most often attacked area is support for procedure calls. This may include anything from a call instruction with simple program counter (PC) saving and branching, to very elaborate instructions that save the PC and some set of registers, set up the parameter list, and create the new activation record. A wide range of machines, from the Intel-432, to the VAX, to the Berkeley RISC microprocessor, all have special, reasonably powerful instructions for supporting procedure calls.

Extensive measurements of procedure call activity have been made. Source language measurements for C and Pascal have been done on the VAX by the RISC group at Berkeley [21]. Clark [7] has measured the VAX instruction set (including call) using a hardware monitor. These measurements confirm two facts. First, procedure calls are infrequent (about 10 percent of the high level statements) compared to the most common simpler instructions (data moves, adds, etc.). Second, the procedure call is one of the most costly instructions in terms of execution time; the data from Berkeley indicates that it is the most costly source language statement (i.e., more machine instructions are needed to execute this source statement than most others). This high cost is sufficient to make call one of the most expensive statements, both at the machine instruction set level and at the source language level.
1226 IEEE TRANSACTIONS ON COMPUTERS, VOL. c-33, NO. 12, DECEMBER 1984
10 percent of the high level statements) compared to the most common simpler instructions (data moves, adds, etc). Second, the procedure call is one of the most costly instructions in terms of execution time; the data from Berkeley indicates that it is the most costly source language statement (i.e., more machine instructions are needed to execute this source statement than most others). This high cost is sufficient to make call one of the most expensive statements, both at the machine instruction set level and at the source language level.

There are a few important caveats to examine when considering these data. The most important observation is that register allocation bloats the cost of procedure call. A simple procedure call in compiled code without register allocation is not very expensive: save the program counter, the old activation record pointer, and create a new activation record. This can be easily done in a few simple instructions, particularly if activation record maintenance is minimized. However, when an additional half-dozen register-allocated variables need to be saved the cost is in the neighborhood of 10-15 instructions. This additional cost is not inherent in the procedure call itself but is an artifact of the register allocator. Such costs should be accounted for by the register allocation algorithm [18], but are often ignored. Despite this, there is merit in lumping these saves and restores as part of the call, if this means that they can be reduced by an efficient method of executing procedure calls.

Before we look at such a method in detail, consider one other possible attack on the problem: reducing call frequency. Modern programming practice encourages the use of many small procedures; often procedures are called exactly once. While this may be good programming practice, an intelligent optimizer can expand inline any procedure that is called exactly once, and perhaps a large number of procedures that are small. For a small procedure, the call overhead may easily be comparable to the procedure size. In such cases, inline expansion of the procedure will increase the execution speed with little or no size penalty. The IBM PL.8 compiler [22] does inline expansion of all leaf-level procedures (i.e., ones that do not call another procedure), while the Stanford U-Code optimizer includes a cost-driven inline expansion phase [18].

A. Support for Procedure Call: The Register Stack

VLSI implementation greatly favors on-chip communication versus off-chip communication. This fact has led many designers to keep small caches (usually for instructions only) or instruction prefetch buffers on the chip as in the VAX microprocessors [23], [24] and the Motorola 68020. However, current limitations prevent the integration of a full size cache (e.g., 2K words) onto the same chip as the processor. An alternative approach is to use a large on-chip register set. This approach sacrifices the dynamic tracking ability of a cache, but it is possible to put a reasonably large register set on the chip because the area per stored bit can be smaller than in a cache. By allowing the compiler to allocate scalar locals and globals to the register set, the amount of main memory data traffic can be lowered substantially. Additionally, the use of register references versus memory references lowers the amount of addressing overhead. For example, in the Berkeley RISC, register-register instructions execute twice as fast as memory accesses. The compiler can be selective about its allocation, effectively increasing the "hit rate" of the register file. However, only scalar variables may be allocated to the registers. Thus, some programs may benefit little from this technique, although data [21] has shown that the bulk of the accessed variables are local and global scalars.

Any large register set can achieve the elimination of off-chip references and reduction of addressing overhead. However, to make use of such a large register set without burdening the cost of procedure call by an enormous amount, the register file can be organized as a stack of register sets, allocated dynamically on a per procedure basis. This concept was originally proposed for use in VLSI by Sites [25], expanded by Baskett [26], and has been studied by a wide range of people including Ditzel for a C machine [27], the BBN C machine [28], Lampson [29], and Wakefield for a direct execution style architecture [30]. A full exploration of the concept was done by the Berkeley RISC design group and implemented with some important extensions in their RISC-I microprocessor [21]. The Pyramid supermini computer [31] has a register stack as its main innovative architectural feature. We will explain the register stack concept in detail using the RISC design.

Numerous on-chip registers are arranged in a stack. On each call instruction a new frame, or window, of registers is allocated on the stack and the old set is pushed; on a return instruction the stack is popped. Of course, the push and pop actions are done by manipulation of pointers that indicate the current register frame. Each procedure addresses the registers as 0···n and gets a set of n registers. The compiler attempts to allocate variables to the register frame, eliminating memory accesses. Scalar global variables can be allocated to a base level frame that is accessible to all procedures and does not change during the running of the program. The effectiveness of this scheme for allocating global scalars is limited for languages that may use large numbers of base-level variables; many modern languages with module support, e.g., Ada and Modula, have this property. In addition, any variables that are visible to multiple, separately compiled routines cannot be allocated to registers. There are similar problems in allocating local variables to registers, when those variables may be referenced by inward-nested procedures; we will discuss this problem in detail shortly.

Although this concept is straightforward, there are a number of complications to consider. First, should these frames be fixed in size or variable, and if fixed how large? The advantage of using a fixed frame size is that an appropriately chosen frame size can avoid an addition cycle which is otherwise needed to choose the correct register from the register file. It also has some small simplifications in the call instruction. However, a fixed size frame will provide insufficient registers for some procedures and waste registers for others. Studies by various groups have shown that a small number of registers (around eight) works for most procedures and that an even smaller number can obtain over 80 percent
HENNESSY: VLSI PROCESSOR ARCHITECTURE 1227
of the benefits. Most implementations of register files use a fixed size frame with from 8 to 16 registers per frame. The stack cache design of Ditzel demonstrates an elegant variable size approach.

In today's technology a processor can contain only a small number of such register frames; e.g., the RISC-II processor has 8 such frames of 16 registers each. Increasing integrated circuit densities may allow more frames but the diminishing returns and implementation disadvantages, which we will discuss shortly, indicate that the number of frames should be kept low. Because it is impossible either to bound the calling depth at compile-time, or to restrict it a priori, the processor must deal with register stack overflow.

When the register stack overflows, which only happens on a call instruction as a new frame is allocated, the oldest frame must be migrated off the chip to main memory. This function can be done with hardware assist, in microcode as on the Pyramid, or in macrocode as on RISC. In a more complex processor, the oldest stack frames might be migrated off-chip in background using the available data memory cycles. When the processor returns from the call that caused the overflow, the register stack will have an empty frame and the frame saved on the overflow can be reloaded from memory. Alternatively, the reloading can be postponed until execution returns to the procedure whose frame was migrated.

One of the interesting results obtained by the studies done for the RISC register file concerns measurements done of call patterns and the implications for register migration strategies [32]. If we assume that calls are quite random in their behavior, the benefits of the register stack can be quite small. In particular, if the call depth varies widely, then a large number of saves and restores of the register stack frames will be needed. In such a case, the register stack with a fixed size frame can even be slower than a processor without such a stack because all registers are saved and restored whether or not they are being used. However, if the call pattern tends to be something like "call to depth k, make a significant number of calls from level k and higher but mostly within a few levels of k, before backing out," then the register stack scheme can perform quite well. It will need to save and restore frames getting to and returning from level k, but once at level k the number of migrations could be very small.

Data collected by the Berkeley RISC designers indicate the latter behavior dominates. This also leads to another important insight: it may be more efficient to migrate frames in batches, thus cutting down on the number of overflows and underflows encountered. However, a recent paper [32] shows that the optimal number of frames to move varies between programs. Furthermore, that study shows that past behavior is not necessarily a good guide when choosing the number of frames to migrate. Simple strategies of moving a single frame or two frames are a good static approximation and should be used.

Because the language C does not have nested scopes of reference, a register file scheme for C need provide addressability only to the local frame and the global frame. This can be easily done by splitting the register set seen by the procedure so that registers 0···m address m + 1 global registers and registers m···n reference the n − m + 1 local registers. Furthermore, since these global registers are the only globally accessible registers they are never swapped out.

Languages like Ada, Modula, and Pascal have nested scopes and allow up-level referencing from any nested scope to a surrounding scope. This means that the processor must allow addressing to all the register frames that are global to the currently active procedure. Because up-level referencing to intermediate scopes (i.e., to a scope that is not the most global scope) is rare, such addressing can be penalized without significant overall performance loss. In the simple case, the addressing is straightforward: the instruction can give a relative register-set number and a register number (offset in the register-set) and the processor can do the addressing. Even if this instruction is very slow, the performance penalty will be negligible. The complicated case arises when a register stack overflow has occurred and the addressed register frame has been swapped out. In this case, the register reference must become a memory reference.

A similar problem exists with reference (or pass by address) parameters. Variables that are passed as reference parameters may be allocated in a register and may not even have a memory address that can be passed. The language C allows the address of a variable to be obtained by an operator; this causes problems since register-allocated variables will not have addresses.

Fortunately, there are two solutions [29] to these problems. The first is to rely on a two-pass compilation scheme to detect all up-level references or address references and to prevent the associated variable from being allocated in the register stack. This requires a slightly more complex compiler and has some small performance impact. An alternative solution uses some additional hardware capability and will handle both types of nonlocal references. Let us assume that each register frame (and hence each register) has a main memory address, where it resides if it is swapped out. A nonlocal reference (up-level in the scope of the reference) can be translated by computing the address of the desired frame, which is a function of

* the address in memory for the current frame (which is based only on the absolute frame number), and
* the number of frames offset from the current frame, which is based on the differences in lexical levels between the current procedure and the scope of the referenced variable.

With these two pieces of information we can calculate the memory address of the desired frame. Likewise, for a reference parameter that is in a register we can calculate and pass the memory address assigned to the register location in the frame.

Now, this leaves only one problem: some memory addresses can refer to registers that may or may not currently be in the processor. If the referenced register window has overflowed into memory, then we can treat the reference as a conventional memory reference. If the register is currently on-chip, then we need to find the register set and access the on-chip version. Since this access need not be fast, it is easy to check the current register file and get the contents, or to allow the memory references to complete [33].
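The two-part calculation just described can be made concrete with a small Python sketch. The frame size, register width, stack base, and all function names are illustrative assumptions of ours, not values from the paper; the point is only that the target frame's home address follows from the current absolute frame number and the lexical-level offset.

```python
# Sketch of translating an up-level register reference into a memory
# address, assuming every register frame has a fixed "home" address in
# memory (as the text proposes).  All constants are illustrative.
FRAME_SIZE = 8        # registers per frame (assumed)
WORD = 4              # bytes per register (assumed)
STACK_BASE = 0x8000   # memory home of absolute frame 0 (hypothetical)

def frame_address(abs_frame):
    """Home address in memory of the frame with this absolute number."""
    return STACK_BASE + abs_frame * FRAME_SIZE * WORD

def uplevel_address(cur_frame, level_offset, reg):
    """Memory address of register `reg` in the frame that is
    `level_offset` frames shallower than the current one."""
    return frame_address(cur_frame - level_offset) + reg * WORD
```

Whether the resulting address is then serviced from memory or redirected to a still-resident on-chip frame is exactly the residency check discussed above.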
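To make the pointer-managed window stack and its overflow/underflow handling concrete, here is a minimal Python sketch. The frame size, the number of on-chip frames, and all names are our own illustrative choices, not the RISC values; a Python list stands in for main memory.

```python
# Toy register-window file: a circular set of on-chip frames with a
# current-window pointer; overflow migrates the oldest frame to
# "memory" and underflow reloads it.  Sizes are illustrative.
FRAME = 8       # registers per window
NFRAMES = 4     # on-chip windows

class WindowFile:
    def __init__(self):
        self.regs = [0] * (FRAME * NFRAMES)   # on-chip storage
        self.cwp = 0                          # current window pointer
        self.resident = 1                     # windows currently on chip
        self.spilled = []                     # frames migrated to memory

    def read(self, r):                        # registers addressed 0..FRAME-1
        return self.regs[self.cwp * FRAME + r]

    def write(self, r, v):
        self.regs[self.cwp * FRAME + r] = v

    def call(self):
        if self.resident == NFRAMES:          # overflow: migrate oldest frame
            oldest = (self.cwp + 1) % NFRAMES
            b = oldest * FRAME
            self.spilled.append(self.regs[b:b + FRAME])
            self.resident -= 1
        self.cwp = (self.cwp + 1) % NFRAMES   # push: advance the pointer
        self.resident += 1

    def ret(self):
        if self.resident == 1:                # underflow: reload parent frame
            b = ((self.cwp - 1) % NFRAMES) * FRAME
            self.regs[b:b + FRAME] = self.spilled.pop()
            self.resident += 1
        self.cwp = (self.cwp - 1) % NFRAMES   # pop
        self.resident -= 1
```

The push and pop are pure pointer arithmetic, as in the text; register traffic to memory occurs only on overflow or underflow.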
A register stack allows the use of a fairly simple register allocator, as well as mitigating the cost of register save/restore at call statements. Compilers often attempt to speed up procedure linkage by communicating parameters and return values in the registers. If the compiler is not doing global register allocation, this task is easy; otherwise, the compiler must integrate the register allocation in existence at the call site with the register usage needed for parameter passing. This communication of parameters in registers can improve performance by about 10 percent. However, using this improvement with a straightforward register stack is impossible since neither procedure can address the registers of the other in a fast and efficient manner.

The RISC processor extended the idea of the register stack to solve this problem [14]. On RISC the frames of a caller and callee overlap by a small number of registers. That is, the j high order registers of the caller correspond to the j low order registers of the callee. The caller uses these registers to pass the actual parameters, and the callee can use them to return the procedure result. The number of overlapping registers is based on the number of expected parameters and on hardware design considerations.

The disadvantages of the register set idea come from three areas. First, improved compiler technology, mostly in the form of good models for register allocation [34]-[36], makes it possible for compilers to achieve very high register "hit" rates and to more efficiently handle saving and restoring at procedure call boundaries. Good allocation of a single register set with a cache for unassigned references could be extremely effective. Since the registers are multiport, the size of the individual register cells and their decode logic means the silicon area per word of storage may approach the area occupied per word in a set associative cache. Another disadvantage with respect to a cache is that the register stack is inefficient: only a small fraction (i.e., one frame) is actively being used at any time. In a cache a larger portion of the storage could be used. Of course, the effectiveness of the register stack is increased when procedure calls are frequent and the portion of the register stack being used changes quickly.

A second disadvantage is that the use of a register set clearly increases the process switching time, by dramatically increasing the processor state. Although process switches happen much less frequently than procedure calls, the true cost of this impact is not known. Studies [37]-[39] have shown that the effect of process switches on TLB and cache hit ratios can be significant.

Third, the register set concept presents a challenging implementation problem, particularly in VLSI. The number of frames is ideally as large as possible; however, if the register file is to be fast it must be on-chip and close to the central data path. The size and tight coupling to the data path will result in slowing down the data path at a rate dependent on the size of the register file; this cost at some point exceeds the merit of a larger register file. The best size for the register stack and its impact on the cycle time is difficult to determine since it depends a great deal on both the implementation and the benchmarks chosen to measure performance. We will discuss the issue of implementation impact later in the paper. The final worth of the register stack ideas remains to be seen; they have been incorporated in a commercial machine [31] and used in the RISC chip. However, when measured against improved compiler technology and the cost in the cycle time, the real benefits remain unknown.

V. SYSTEMS SUPPORT

A processor executes compiled programs; however, without an operating system the processor is essentially useless. The operating system requires certain architectural capabilities to achieve functional performance with reasonable efficiency. Perhaps the most important area for operating systems support is in memory management.

Support for memory management has become a feature of almost all computer architectures. The initial microprocessors did not provide such support, and even in machines as late as the M68000 no support for demand paging is provided, although support is provided in the M68010. Current microprocessors must compromise between providing all necessary memory management features on-chip and the real limitations of silicon area and interchip delays. Thus, some design compromises are usually made to achieve an acceptable memory mapping mechanism. After looking at the requirements, we will examine the memory mapping support in three processors: the 8-chip VLSI VAX, the Intel iAPX432, and the Stanford MIPS processor. Each of these processors makes a different set of design compromises.

Modern memory systems provide virtual memory support for programs. In addition, the system must also implement memory protection and help to minimize the cost of using virtual memory as well as improve memory utilization. Program relocation is a function of the memory mapping system; segmentation provides a level of relocation that may be used instead of or in addition to paging. Implementing a paged virtual memory requires translation of virtual addresses into real addresses via some type of memory map. Support for demand paging will require the ability to stop and restart instructions when page faults occur. Protection can be provided by the hardware on a segment and/or page basis.

A. VAX and VLSI VAX Memory Management

The memory management scheme used in the VAX architecture is a fairly conventional paging strategy. Some of the more interesting aspects of the memory architecture arise when the implementation techniques used in the VLSI VAXs are examined.

The 2^32-byte virtual address space is broken into several segments. The main division into two halves provides for a system space (a system wide common address space) and a user process address space. The process address space is further subdivided into a P0 region, used for programs, and a P1 region, used for stack-allocated data. The heap, from which dynamically managed nonstack data are allocated, is placed above the code in the P0 region. Fig. 1 shows this breakdown. The P0 and P1 regions grow towards each other, while the system region grows towards its upper half, which is currently reserved. The decomposition into system and process space has two main effects: it guarantees in the architecture a shared region for processes as well as for the operating system, and it allows the processor implementation to
B. iAPX-432 Memory Management

compile-time, the compiler must assume that references to other parts of the stack and references to the heap will require a segment change. This will result in a performance loss.

offset: the segment designator is an access descriptor that contains the access rights for the segment, as well as information for addressing the segment. These access descriptors are similar to the concept of capabilities [41]. The access descriptors are collected into an access segment, which is indexed by a segment selector. The address portion of the access descriptor contains a pointer to a segment table, which specifies the entry providing the base address of the segment. The offset to the segment is part of the original operand address, whose format is described in a following section. This two-level mapping process is illustrated in Fig. 2. The 432's data processor chip contains a 22 element cache on the access segment and the segment table; 14 of the 22 entries are preassigned for each procedure, two are reserved for object table entries, and six entries are available for generic use. This cache reduces the frequency with which the hardware must examine the two-level map in memory.

Fig. 2. iAPX-432 address mapping. [diagram: a 32-bit virtual address divides into a segment selector and a displacement; the access descriptor (rights, segment pointer/offset) indexes the segment (or object) table, whose 128-bit entries give type, length, other information, and the segment base]

The 432 architecture uses the access segment to define a domain for a program. A program's domain of access consists of an access segment that provides addressing to multiple data and program segments. For program segments, the access descriptor indicates that the object is a program and checks that the execution of instructions occurs only from an instruction segment. Similarly, all branches are checked to be sure that they will transfer to an instruction segment. In addition to the instruction segments, the 432 defines both data and stack segments, as well as constant segments.

The 432 addressing scheme achieves two primary objectives: support for capabilities, and support for fine-grained protection. The major objection raised to the addressing scheme is that it is more complicated and powerful than is necessary. The use of capabilities has been explored in several systems [42], [43] with limited success, at least partially due to a lack of hardware support. Most of these systems found that capability based addressing was expensive and this may have prevented its use. An interesting discussion of the issues is contained in a paper by Wilkes [44]. The other major advantage claimed for the 432 is that it provides fine grained protection to allow users to protect against array bounds violations and references out of a module, by limiting the size of the segment. However, a careful examination of the requirements imposed by Ada, the host language for the 432, shows that the segment based approach is only usable when each object that can be indexed or addressed dynamically is in a single segment. When this is not the case, runtime checks are required by the compiler and these checks guarantee that the reference is legal, making the hardware segment checking superfluous. There are several reasons why allocating each such data object to a unique segment is an unsuitable approach. The most important reason is that it will cause a large increase in the number of segments (one per data object to be protected), which will decrease the locality of segment references and hamper the effectiveness of the address cache. Address cache misses must be translated at a much slower rate (approximately 5 µs per translation) causing a substantial degradation in performance.

Despite these objections, the 432 addressing mechanism does provide the most cost effective implementation of capabilities in hardware to date. Future evolution of software systems, such as Smalltalk, may make object-based environments more important. When such environments are very dynamic and a high level of protection is desired, the 432 capability-based mechanism offers an attractive vehicle for implementation. The challenge to such architectures will be to make the performance penalties for a capability-based system insignificant when compared to their functional benefits.

C. MIPS Memory Management

In addition to the standard requirements for virtual memory mapping, the Stanford MIPS processor attempts to support a large uniform address space for each process, and fast context switching. One mechanism for facilitating context switching is the incorporation of a process identification number into the virtual memory address. The use of the process id number helps achieve fast context switches by allowing the cache and memory address translation units to avoid the cold start penalties. These penalties appear in systems that require caches and translation buffers to be flushed because processes share the same virtual address space. The process id approach also allows the use of a large linear address space, avoiding the difficulties that arise when segment boundaries are introduced. The realities of the MIPS implementation technology (a 4 µm channel length nMOS) meant that it was not feasible to include all of the virtual to physical translation on the same chip as the processor. Consequently, a novel memory segmentation scheme was added to the architecture; it is designed to work with a conventional page mapping scheme implemented with the use of an off-chip TLB. Each process has a process address space of 2^32 words. The first step of the translation is to remove the top n bits of the address and replace them by an n-bit process identifier (PID). Fig. 3 shows the generation of this virtual address.
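The first translation step just described — dropping the top n bits of the process address and substituting a process identifier — can be sketched in a few lines of Python. The width n = 6 is an arbitrary choice of ours for illustration.

```python
# Sketch of MIPS system-virtual-address formation: the top n bits of a
# 32-bit process address are replaced by an n-bit process identifier
# (PID) before page translation.  n = 6 is an assumed value.
ADDR_BITS = 32
N = 6                                   # PID width (assumed)
LOW = ADDR_BITS - N                     # bits kept from the process address

def system_virtual_address(pid, process_addr):
    low = process_addr & ((1 << LOW) - 1)    # drop the top n bits
    return (pid << LOW) | low                # splice in the PID
```

Because the PID makes each process's addresses globally distinct, the cache and TLB need not be flushed on a context switch, as the text observes.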
[Fig. 3 (diagram): virtual address map; labels FFFF8000, FFFFFFFF]

TABLE I
SUMMARY OF TRANSLATION BUFFER FEATURES ON VAX IMPLEMENTATIONS

Machine       Entries                  Features
11/780        128                      Direct mapped; 1/2 system, 1/2 user
VLSI VAX      5 on processor chip      Fully associative; limit of one instruction entry
              512 on companion chip    Set associative; not reserved
MicroVAX-32   8                        Fully associative
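The "1/2 system, 1/2 user" split in the 11/780 row reflects the standard VAX convention that the high-order virtual address bits name the region: bit 31 separates system space from process space, and bit 30 separates P0 from P1. A small Python sketch (function names are ours):

```python
# Decode the region of a 32-bit VAX virtual address from its top two
# bits, and pick the 11/780-style translation-buffer half from bit 31.
def vax_region(va):
    return {0: "P0", 1: "P1", 2: "system", 3: "reserved"}[(va >> 30) & 0x3]

def tlb_half(va):
    """0 = process half, 1 = system half of a split translation buffer."""
    return (va >> 31) & 0x1
```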
processor was substantially reduced. However, this very loose encoding of instructions means that instruction density is much lower than for other architectures, in the range of 40-70 percent lower. The RISC processor is able to achieve a one machine cycle execution of register-register instructions and a two machine cycle memory access instruction. Its instruction set is summarized in Table II.

The major innovation of the RISC processor has been the addition of a large register stack with overlapping register windows. This idea was explained in detail in the section on register stacks. The register window concept is responsible for much of the performance benefits that RISC demonstrates. The simplicity of the other parts of the instruction set allow reduction in the silicon area needed to implement the processor's control portion, thus freeing up space for the large register file.

TABLE II
RISC INSTRUCTION SET

and     Rd,Rs,S2     Rd := Rs & S2                       bitwise AND
or      Rd,Rs,S2     Rd := Rs | S2                       bitwise OR
xor     Rd,Rs,S2     Rd := Rs xor S2                     bitwise EXCLUSIVE OR
sll     Rd,Rs,S2     Rd := Rs shifted by S2              shift left
srl     Rd,Rs,S2     Rd := Rs shifted by S2              shift right logical
sra     Rd,Rs,S2     Rd := Rs shifted by S2              shift right arithmetic
ldw     Rd,(Rx)S2    Rd := M[Rx+S2]                      load word
ldhu    Rd,(Rx)S2    Rd := M[Rx+S2] (align, zero-fill)   load half unsigned
ldhs    Rd,(Rx)S2    Rd := M[Rx+S2] (align, sign-ext)    load half signed
ldbu    Rd,(Rx)S2    Rd := M[Rx+S2] (align, zero-fill)   load byte unsigned
ldbs    Rd,(Rx)S2    Rd := M[Rx+S2] (align, sign-ext)    load byte signed
stw     Rm,(Rx)S2    M[Rx+S2] := Rm                      store word
sth     Rm,(Rx)S2    M[Rx+S2] := Rm (align)              store half
stb     Rm,(Rx)S2    M[Rx+S2] := Rm (align)              store byte
jmp     COND,(Rx)S2  if COND then PC := Rx+S2            cond. jump, indexed
jmpr    COND,Y       if COND then PC := PC+Y             cond. jump, PC-rel.
call    Rd,(Rx)S2    Rd := PC; PC := Rx+S2; CWP--        call indexed
callr   Rd,Y         Rd := PC; PC := PC+Y; CWP--         call PC-rel.
ret     (Rx)S2       PC := Rx+S2; CWP++                  return
ldhi    Rd,Y         Rd<31:13> := Y; Rd<12:0> := 0       load immediate high
gtlpc   Rd           Rd := lastPC                        save value for restarting pipeline
getpsw  Rd           Rd := PSW                           read status word
putpsw  Rm           PSW := Rm                           set status word
reti    (Rx)S2       PC := Rx+S2; CWP++                  return from interrupt
calli                                                    call an interrupt

Rd, Rs, Rx, Rm: a register (one of 32, where R0 = 0);
S2: either a register or a 13-bit immediate constant;
COND: 4-bit condition;
Y: 19-bit immediate constant;
PC: Program-Counter;
CWP: Current-Window-Pointer.
All instructions can optionally set the Condition-Codes.
TABLE III
MIPS ASSEMBLY INSTRUCTIONS

Operation  Operands                Comments

Arithmetic and logical operations
Add     src1, src2, dst         dst := src2 + src1                            Integer addition
And     src1, src2, dst         dst := src2 & src1                            Logical and
Ic      src1, src2, dst         byte src1 of dst is replaced by src2          Insert byte
Or      src1, src2, dst         dst := src2 | src1                            Logical or
Ric     src1, src2, src3, dst   dst := src2||src3 rotated by src1 positions   Rotate combined
Rol     src1, src2, dst         dst := src2 rotated by src1 positions         Rotate
Sll     src1, src2, dst         dst := src2 shifted left by src1 positions    Shift left logical
Sra     src1, src2, dst         dst := src2 shifted right by src1 positions   Shift right arithmetic
Srl     src1, src2, dst         dst := src2 shifted right by src1 positions   Shift right logical
Sub     src1, src2, dst         dst := src2 - src1                            Integer subtraction
Subr    src1, src2, dst         dst := src1 - src2                            Reverse integer subtraction
Xc      src1, src2, dst         dst := byte src1 of src2                      Extract byte
Xor     src1, src2, dst         dst := src2 xor src1                          Logical xor

Transport operations
Ld      A[src], dst             dst := M[A + src]                             Load based
Ld      [src1 + src2], dst      dst := M[src1 + src2]                         Load based-indexed
Ld      [src1>>src2], dst       dst := M[src1 shifted by src2]                Load based-shifted
Ld      A, dst                  dst := M[A]                                   Load direct
Ld      I, dst                  dst := I                                      Load immediate
Mov     src, dst                dst := src                                    Move (byte or register)
St      src1, A[src2]           M[A + src2] := src1                           Store based
St      src1, [src2 + src3]     M[src2 + src3] := src1                        Store based-indexed
St      src1, [src2>>src3]      M[src2 shifted by src3] := src1               Store based-shifted
St      src, A                  M[A] := src                                   Store direct

Other operations
SavePC  A                       M[A] := PC3                                   Save multi-stage PC after trap or interrupt
Set     Cond, src, dst          dst := -1 if Cond(src,dst);                   Set conditional
                                dst := 0 if not Cond(src,dst)
fits for both the compiler and the processor implementation [48]. It simplifies pipelining and branch handling in the implementation and eliminates the need to attempt optimization of the condition code setting.

The compiler and operating system would prefer to see a simple, well-structured instruction set. However, this conflicts with the goal of exposing all operations, and allowing the internal processor organization to closely match the architecture. To overcome these two conflicting requirements, the MIPS instruction set architecture is defined at two levels. The first level is visible to the compiler or assembly language programmer. It presents the MIPS machine as a simple streamlined processor. Table III summarizes the MIPS definition at this level.

Each MIPS assembly-level instruction is translated to machine level instructions; this translation process includes a number of machine-dependent optimizations: organizing the instructions to avoid pipeline interlocks and branch delays, expanding instructions that are macros, and packing multiple assembly language instructions into one machine instruction.

The machine-level instruction set of MIPS is closely tied to the pipeline structure. The pipeline structure can be explained by examining the memory and ALU utilization. Each instruction goes through the five stages of the pipeline, with an instruction started on every other stage. During a single instruction, two pipestages are allocated for instruction fetch and decode, two for ALU usage, and one for a data memory access cycle. The use of two ALU cycles makes it possible to accommodate compare and branch in a single instruction, although the data memory cycle is unused in such an instruction.

The execution of a load instruction requires the use of the ALU only once to compute the effective address of the item that is to be retrieved from memory. This arrangement leaves the ALU idle for one machine cycle during the execution of a simple load instruction. Since the ALU is not busy and most load instructions do not require a full 32-bit instruction word, an additional register-register operation can be done in the same instruction. The companion ALU instruction is an independent two operand register-register instruction.

Some forms of the load (e.g., long immediate) sacrifice the ALU instruction encoding space for another use. Since an ALU instruction only uses the ALU once and is a small instruction, the instruction set allows a three operand ALU operation and a two operand ALU operation to be combined in every instruction word. This combination is particularly effective for executing arithmetic code that tends to be operation intensive. Store instructions, and to a lesser extent branch instructions that involve a reference to memory, are treated correspondingly.

These two component instructions consist of two separate and independent halves. The same is true of most other instructions. For example, a compare-and-branch instruction involves a condition test and a PC-relative address calcu-
Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY WARANGAL. Downloaded on July 29,2024 at 08:45:15 UTC from IEEE Xplore. Restrictions apply.
1234 IEEE TRANSACTIONS ON COMPUTERS, VOL. c-33, NO. 12, DECEMBER 1984
lation. This separation of the instruction into two distinct parts allows the instruction to be viewed as a series of distinct single operators to be executed in the pipeline. This approach simplifies the pipeline control and allows the pipeline to run faster.

Translation between assembly language (the architectural level) and the hardware instructions (organizational level) is done by the reorganizer [49]. The reorganizer reorders the instructions for each basic block to satisfy the constraints imposed by the pipeline organization; this reorganization establishes at compile time the schedule of instruction execution. Scheduling instructions in software has two benefits: it enhances performance by eliminating instances of pipeline interlocking, and it simplifies the pipeline control hardware, allowing a shorter time per pipestage. The disadvantage in the MIPS case is that the absence of a legal instruction to schedule will force the insertion of a no-op instruction; this results in a slight code size increase (less than 5 percent in typical applications [49], [50]) but has no impact on execution speed. MIPS also includes a delayed branch, which is the natural extension of the absence of interlocks to the program counter.

Studies on the MIPS instruction set show that the combination of a simplified pipeline structure and the optimizations performed by the code reorganizer are responsible for a factor of two in performance improvement.

C. The Intel iAPX-432 Processor

The Intel iAPX-432 [2] represents the most complete approach to integrating the needs of an entire software environment onto silicon. Among the characteristics of the iAPX-432 architecture are the following:
* a dense encoding of instructions with variable instruction lengths in bits. Instructions may also start and stop on arbitrary bit boundaries;
* an object-oriented support mechanism, allowing for creation and protection of an object;
* a packet-switched bus protocol;
* support for many standard operating system functions;
* provision for transparent multiprocessing; and
* support for fault-tolerant operation.

The iAPX-432 also represents an architecture that meets many of the goals and specific design principles of Flynn's ideal machine [51]. The similarities between the iAPX-432 and the high level DEL machines proposed by Hoevel (for Fortran [52]) and Wakefield (for Pascal [30]) are considerable. The major difference is the absence of a register set or stack cache in the iAPX-432. However, the use of memory and stack operands, bit encoded instructions, data typing, and symmetric addressing are all key principles in the DEL designs.

The iAPX-432 implementation consists of three major chips: two of these comprise the general data processor (GDP) and the other is the interface processor (IP). The two-chip GDP consists of an instruction decode unit, which also contains most of the microcode, and the microexecution unit. In addition to executing microinstructions, the microexecution unit performs the memory mapping and protection functions within the iAPX-432 architecture. In this section we will concentrate on the components of the instruction set not related to memory mapping; we will briefly mention the other operating system support features.

Among the most significant features in the iAPX-432 architecture is its support for a wide variety of data types, including
* 8-bit characters,
* 16-bit signed and unsigned integers,
* 32-bit signed and unsigned integers,
* a variable length bit field: 1-31, or 1-16 bits in length, and
* 32-bit, 64-bit, and 80-bit reals.

Complete sets of arithmetic and (where appropriate) logical operators are defined for each data type. The iAPX-432 is the first microprocessor to define and implement floating point support in the architecture. Conditional branch instructions are defined as taking Boolean operands. The remainder of the instruction set is largely devoted to operators for object manipulation, protection, context management (which we discuss in Section V), and process communication.

Data operands reside either in data memory or on an operand stack implemented in memory with caching of the top element of the stack. Although the stack is efficient when measured by the number of bits needed to represent a computation, it is not believed to be a good representation for compilers and code optimization [53]. Memory-memory operations are efficient when measured by the number of operations needed for a program. However, since there are no on-chip registers, it is not possible to optimize references to commonly used variables.

The iAPX-432 instructions have one, two, or three operands and complete symmetry with respect to addressing modes. Since the instruction formats allow arbitrary bit lengths, memory operands can be mixed with stack operands with no loss of encoding efficiency. Of course, the task of instruction fetching and decoding is substantially more complex; we will discuss this topic further in a later section.

The iAPX-432 uses a two-part memory address consisting of a segment and a displacement. Segment-based addressing has been discussed in the earlier section on memory management and is summarized by Fig. 2. The segments may be up to 2^16 bytes long; although fixed limited size segments help provide memory protection, they pose a major problem for segments that need to grow larger than this size. Managing the activation record stack and heap as growing objects requires using a multisegment approach from the start. It also implies that most programs will need to include segment numbers in addresses.

The displacement portion of an address is the displacement within a segment and may be specified using one of four addressing modes. Each addressing mode is composed of a base address and an index; either component may be indirect, i.e., the address contains the value of the base or index. Indirect index values are scaled according to the byte size of the object being accessed.

The iAPX-432 is unique among architectures in its support for multiprocessing. Multiprocessing is supported by defining a number of instructions for both processor and process intercommunication and by the interconnect bus. The interconnect bus is a packet bus that allows multiple processors to
HENNESSY: VLSI PROCESSOR ARCHITECTURE 1235
be connected. The bus offers up to a 16 Mbyte bandwidth when the packets are the maximum size. Data from memory can be 1-10 bytes in length per access.

The process communication instructions include operations to send and receive messages, as well as conditional send and receive. Operations that send to processors as well as broadcasting to all processors are supported. Since these operations are supported by the architecture and the bus provides communication of these messages, multiprocessing can be performed independently of the process count, processor count, or distribution of processes on processors. However, the single bus provides a limit on the ability to do multiprocessing; the current bus design for the iAPX-432 can handle approximately three processors without undue bus contention. Peripheral chips have been developed to allow multiple buses to be incorporated into a design.

D. The Motorola 68000

The 68000 [54], [55] represents the first microprocessor to support a large, uniform (i.e., unsegmented), virtual addressing space (>2^16 bytes) and complete support for a 32-bit data type. The MC68000 architecture has many things in common with the PDP-11 architecture. It offers a number of addressing modes and features orthogonality between instructions and addressing modes for many, but not nearly all, instructions (as compared to a VAX). The MC68000 is a 16-bit implementation, but almost all the instructions support 32-bit data.

Some interesting compromises were made in the MC68000 architecture. Possibly the most obvious is the partitioning of the 16 general purpose registers into two sets: address and data registers. For the compiler this partitioning is troublesome since most addressing modes require the use of an address register and most arithmetic instructions use data registers. Because of this dichotomy, excess register copies are required and the number of data registers is too small to allow register allocation to be easily done. Because the split lowers the number of bits needed for a register designator from four to three bits, this choice is motivated by the instruction coding.

For the most part the addressing modes of the MC68000 follow those of the PDP-11; the major change is the elimination of the infrequently used indirect modes and their replacement with an indexed mode that computes the effective address as the sum of the contents of two registers plus an offset. The MC68000 is a one and a half address machine: instructions have a source and a source/destination specifier, and only one of these may be a memory operand. The major exception is the move instruction, which can move between two arbitrary operands.

One interesting new instruction in the MC68000 is "check register against bounds." This instruction checks a register's contents against an arbitrary upper bound and causes a trap if the contents exceed the upper bound. If the register contents is a zero-based array index, then this instruction can be used to do the upper array bound check and trap if the bound is exceeded. The MC68000 also obtains reasonably high code density due to its useful addressing modes, a good match between instructions and compiled code, and its support for a wide variety of immediate data. Besides having immediate addressing formats for byte, word, and long word data types, many of the arithmetic and logical instructions allow a short immediate constant (1-8) as an operand. This combination of immediate data types and the short immediate (quick) format helps increase code density substantially.

The MC68000 made two instruction set additions that help support high level languages. Support for procedure linkage was built in with several instructions; the most important addition was the link instruction, which can be used to set up and maintain activation records. The multiple register move instruction helps shorten the save/restore sequence during a call or return.

Since the original MC68000 was announced, two important new versions of the architecture have been produced. The MC68010 provides support for demand paging by providing instruction restartability in the event of a page fault. The three year delay between the original MC68000 and the MC68010 is a good indication of the complexity of this capability. The recently announced MC68020 provides some extensions to the instruction set, but more importantly represents a 32-bit implementation both internally in the chip and externally on the pins. This provides important performance improvements in instruction access and 32-bit data memory access.

E. The DEC VLSI-Based VAX Processors

There are now three VLSI-based implementations of the VAX architecture. They differ in chip count, amount of custom silicon, and performance. All three implementations are interesting because they reflect different design compromises needed to put the large instruction set into a chip-based implementation. The first implementation, the MicroVAX-I, uses a custom data path chip [9] and keeps the microcode and microsequencer off chip. The second implementation is the VLSI VAX [23], a nine-chip set that implements the full VAX instruction set. The third VLSI-based VAX, the MicroVAX-32 [24], is a single chip that implements a subset of the VAX architecture in hardware.

Several key features characterize the VAX instruction set and help provide organization for the 304 instructions and tens of thousands of combinations of instructions and addressing modes:
* a large number of instructions with nearly complete orthogonality among opcode, addressing mode, and data type;
* support for bytes, words (16 bits), and long words (32 bits) as data types, with special instructions for bit data types;
* many high level instructions, including procedure call and return, string instructions, and instructions for floating point and decimal arithmetic; and
* a large number of addressing modes, summarized in Table IV.

The table gives the frequency as a percent of all operand memory addressing; the notation (R) indicates the contents of register R. These memory addressing modes represent just less than one-half of the operands. The other operands are register and literal operands. The VAX supports a short literal mode (5 bits) and an immediate mode (defined as PC-relative followed by an autoincrement of the PC). Several common operand addressing formats are obtained using PC-relative addressing since the PC is in the register set. Hence PC-
TABLE V
SUMMARY COMPARISON OF THE VAX MICROPROCESSORS

                              VLSI VAX                  MicroVAX-32
Chip count (incl.             9                         2
  floating pt.)
Microcode (bits)              480K                      64K
Transistors                   1250K                     101K
TLB                           5 entry mini-TLB,         8 entry fully assoc.
                              512 entries off chip
Cache                         Yes                       No; instruction
                                                        prefetch buffer

tradeoffs, we have used examples from the MIPS processor. Although the examples are specific to that processor, the issues that they illustrate are common to most VLSI processor designs.

A. Organizational Techniques

Many of the techniques used to obtain high performance in conventional processor designs are applicable to VLSI processors. Some changes in these approaches have been made due to the implementation technology; some of these changes have been adapted into designs for non-VLSI processors. We will look at the motivating influences at the organizational level and then look at pipelining and instruction unit design.

MOS offers the designer a technology that sacrifices speed to obtain very high densities. Although switching time is somewhat slower than in bipolar technologies, communication speed has more effect on the organization and implementation. The organization of an architecture in MOS must attempt to exploit the density of the technology by favoring local computation over global communication.

1) Pipelining: A classical technique for enhancing the performance of a processor is pipelining. Pipelining increases performance by a factor determined by the depth of the pipeline: if the maximum rate at which operators can be executed is r, then pipelining to a depth of d provides an idealized execution rate of r x d. Since the speed with which individual operations can be executed is limited, this approach is an excellent technique to enhance performance in MOS.

The depth of the pipeline is an idealized performance multiplier. Several factors prevent achievement of this increase. First, delays are introduced whenever data needed to execute an instruction is still in the pipeline. Second, pipeline breaks occur because of branches. A branch requires that the processor calculate the effective destination of the branch and fetch that instruction; for conditional branches, it is impossible to do this without delaying the pipe for at least one stage (unless both successors of the branch instruction are fetched, or the branch outcome is correctly predicted). Conditional branches may cause further delays because they require the calculation of the condition, as well as the target address. For most programs and implementations, pipeline breaks due to branches are the most serious cause of degraded pipeline performance. Third, the complexity of managing the pipeline and handling breaks adds additional overhead to the basic logic, causing a degradation in the rate at which pipestages can be executed.

The designer, in an attempt to maximize performance, might increase the number of pipestages per instruction; this meets with two problems. First, not all instructions will contain the same number of pipestages. Many instructions, in particular the simpler ones, fit best in pipelines of length two, three, or four, at most. On average, longer pipelines will waste a number of cycles equal to the difference between the number of stages in the pipeline and the average number of stages per instruction. This might lead one to conclude that more complex instructions that could use more pipestages would be more effective. However, this potential advantage is negated by the two other problems: branch frequency and operand hazards.

The frequency of branches in compiled code limits the length of the pipeline since it determines the average number of instructions that occur before the pipeline must be flushed. This number of course depends on the instruction set. Measurements of the VAX taken by Clark [7] have shown an average of three instructions are executed between every taken branch. For simplicity, we call any instruction that changes the program counter (not including incrementing it to obtain the next sequential instruction) a taken branch. Measurements on the Pascal DEL architecture Adept have turned up even shorter runs between branches. Branches that are not taken may also cause a delay in the pipeline since the instructions following the branch may not change the machine state before the branch condition has been determined, unless such changes can be undone if the branch is taken.

Similar measurements for more streamlined architectures such as MIPS and the 801 have shown that branches (both taken and untaken) occupy 15-20 percent of the dynamic instruction mix. When the levels of the instruction set are accounted for and some special anomalies that increase the VAX branch frequency are eliminated, the VAX and streamlined machine numbers are equivalent. This should be the case: if no architectural anomalies that introduce branches exist, the branch frequency will reflect that in the source language programs. The number of operations (not instructions) between branches is independent of the instruction set. This number, often called the run length, and the ability to pipeline individual instructions should determine the optimal choice for the depth of the pipeline. Since more complex instruction sets have shorter run lengths, pipelining across instruction boundaries is less productive.

The streamlined VLSI processor designs have taken novel approaches to the control of the pipeline and attempted to improve the utilization of the pipeline by lowering the frequency of pipeline breaks. The RISC and MIPS processors have only delayed branches; thus, a pipeline break on a branch only occurs when the compiler cannot find useful instructions to execute during the stages that are needed to determine the branch address, test the branch condition, and prefetch the destination if the branch is taken. Measurements have found that these branch delays can be effectively used in 80-90 percent of the cases [15]. In fact, measurements of MIPS benchmarks have shown that almost 20 percent of the instructions executed by the processor occur during a branch delay slot! The 801 offers both delayed and nondelayed branches; the latter allow the processor to avoid inserting a no-op when a useful instruction cannot be found. This delayed branch approach is an interesting contrast to the branch prediction and multiple target fetch techniques used on high-
end machines. The delayed branch approach offers performance that is nearly as good as the more sophisticated approaches and does not consume any silicon area.

A stall in the pipeline caused by an instruction with an operand that is not yet available is called a data or operand hazard. MIPS, the 801, and some larger machines, such as the Cray-1, include pipeline scheduling as a process done by the compiler. This scheduling can be completely done for operations with deterministic execution times (such as most register-register operations) and be optimistically scheduled for operations whose execution time is indeterminate (such as memory references in a system with a cache). This optimization typically provides improvements in the 5-10 percent range. In MIPS, this improvement is compounded by the increase in execution rate achieved by simplifying the pipeline hardware when the interlocks are eliminated for register-register operations. Dealing with indeterminate occurrences, such as cache misses, requires stopping the pipeline. The algorithms used for scheduling the MIPS pipeline are discussed in [49]; Sites describes the scheduling process for the Cray-1 in [57].

Because the code sequences between branches are often short, it is often impossible for either the compiler or the hardware to reduce the effects of data dependencies between instructions in the sequence. There are simply not enough unrelated instructions in many segments to keep the pipeline busy executing interleaved and unrelated sequences of instructions. In such cases, neither a pipeline scheduling technique nor a sophisticated pipeline that allows instructions to execute out-of-order can find useful instructions to execute.

Operand hazards cause more difficulty for architectures with powerful instructions and shorter run lengths. When no pipeline scheduling is being done, the dependence between adjacent instructions is high. When scheduling is used, it may be ineffective since the small number of instructions between basic blocks makes it difficult to find useful instructions to place between two interdependent instructions.

Another approach to mitigating the effect of operand hazards and increasing pipeline performance is to allow out-of-order instruction execution. In the most straightforward scenario, the processor keeps a buffer of sequential instructions (up to and including a branch) and examines each of the instructions in parallel to decide if it is ready to execute. An instruction is executed as soon as its operands are available. In most implementations, instructions also complete out-of-order. The alternative is to buffer the results of an instruction until all the previous instructions have completed; this becomes complex, especially if an instruction can have results longer than a word (such as a block move instruction).

Out-of-order completion leads to a fundamental problem: imprecise interrupts. An imprecise interrupt occurs when a program is interrupted at an instruction that does not serve as a clean boundary between completed and uncompleted instructions; that is, some of the instructions before the interrupted instruction may not have completed and some of the instructions after the interrupted instruction may have been completed. Continuing execution of a program after an imprecise interrupt is nearly impossible; at best, to continue requires extensive analysis of the executing code segment and simulation of the uncompleted instructions to create a precise interrupt location. Imprecise interrupts can be largely avoided by choosing the instruction to interrupt as the successor of the last (in the sequence) that has completed; this will guarantee that no completed instructions follow the interrupted instruction. This approach has some performance penalty on interrupt speed and prohibits interrupts that cannot be scheduled, such as page faults. Because the occurrence of a page fault is not known until the instruction execution is attempted, imprecise interrupts cannot be tolerated on a processor that allows demand paging. This fundamental incompatibility has limited the use of out-of-order instruction issue and completion to high performance machines.

2) Instruction Fetch and Decode: One goal of pipelining is to approach as closely as possible the target of one instruction execution every clock cycle. For most instructions, this can be achieved in the execution unit of the machine. Long running instructions like floating point will take more time, but they can often be effectively pipelined within the execution box. More serious bottlenecks exist in the instruction unit.

As we discussed in an earlier segment, densely encoded instruction sets with multiple instruction lengths lower memory bandwidth but suffer a performance penalty during fetch and decode of the instructions. This penalty comes from the inability to decode the entire instruction in parallel due to the large number of possible interpretations of instruction fields and interdependencies among the fields. This penalty is serious for two reasons. First, it cannot be pipelined away. High level instruction sets have very short sequences between branches (due to the high level nature of the instruction set). Thus, the processor must keep the number of pipestages devoted to instruction fetch and decode to as near to one as possible. If more stages are devoted to this function, the processor will often have idle pipestages. This lack of ability to pipeline high level instruction sets has been observed for the DEL architecture Adept [30]. Note that the penalty will be seen both at instruction prefetch and instruction decode; both phases are made more complex by multiple instruction lengths.

The second reason is that most instructions that are executed are still simple instructions. The most common instructions for VAX, PDP-11, and S/370 style architectures are MOV and simple ALU instructions, combined with "register" and "register with byte displacement" addressing for the operands. Thus, the cost of the fetch and decode can often be as high or even higher than the execution cost. The complexities of instruction decoding can also cause the simple, short instructions to suffer a penalty. For example, on the VAX 11/780 register-register operands take two cycles to complete, although only one cycle is required for the data path to execute the operation. Half the cycle time is spent in fetch and decode; similar results can be found for DEL machines. In contrast, MIPS takes one third of the total cycle time of each instruction for fetch and decode. A processor can achieve single-cycle execution for the simple instructions in a complex architecture, but to do so requires very careful
design and an instruction encoding that simplifies fetch and decode for such instructions.

B. Control Unit Design

The structure of the control unit on a VLSI processor most clearly reflects the makeup of the instruction set. For example, streamlined architectures usually employ a single cycle decode because the simplicity of the instruction set allows the instruction contents to be decoded in parallel. Even in such a machine, a multistate microengine is needed to run the pipeline and control the processor during unpredictable events that cause significant change in the processor state, such as interrupts and page faults. However, the microengine does not participate in either instruction decoding or execution except to dictate the sequencing of pipestages. In a more complex architecture, the microcode must deal both with instruction sequencing and the handling of exceptional events. The cascading of logic needed to decode a complex instruction slows down the decode time, which impacts performance when the control unit is in the critical path. Since decoding is usually done with PLA's, ROM's, or similar programmable structures, substantial delays can be incurred communicating between these structures and in the logic delays within the structures, which themselves are usually clocked.

In addition to the instruction fetch and decode unit, the instruction set and system architecture have a profound effect on the design of the master control unit. This unit is responsible for managing the major cycles of the processor, including initiating normal processor instruction cycles under usual conditions and handling exceptional conditions (page faults, interrupts, cache misses, internal faults, etc.) when they arise. The difficult component of this task is in handling exceptional conditions that require the intervention of the operating system; the process typically involves shutting down the execution of the normal instruction stream, saving the state of execution, and transferring to supervisor-level code to save user state and begin handling the condition. Simpler conditions, such as a cache miss or DMA cycle, require only that the processor delay its normal cycle.

Exceptional conditions that require the interruption of execution during an instruction have a significant effect on the implementation. Two distinct types of problems arise: state saving and partially completed instructions. To allow processing of the interrupt, execution of the current instruction stream must be stopped and the machine state must be saved. In a machine with multicycle instructions, some of the internal instruction state may not be visible to user-level instructions. Forcing such state to be visible is often unworkable since the exact amount of state depends on the implementation. Defining such state in the instruction set locks in a particular implementation of the instruction. Thus, the processor must include microcode to save and restore the state of the partially executed instruction. To avoid this problem, some architectures force instructions that execute for a comparatively long time and generate results throughout the instruction to employ user-visible registers for their operation; most architectures that support long string instructions use just this approach. For example, on the S/370 long string instructions use the general purpose registers to hold the state of the instruction during execution; shorter instructions, such as Move Character (MVC), inhibit interrupts during execution. Because the MVC instruction can still access multiple memory words, the processor must first check to ensure that no page faults will occur before beginning instruction execution.

Instructions that do not have very long running times can be dealt with by a two-part strategy. The architecture may prohibit most interrupts during the execution of the instruction. For those interrupts that cannot be prohibited, e.g., a page fault in the executing instruction, the architecture can stop the execution of the instruction, process the interrupt, and restart the instruction. This process is reasonably straightforward, except when the instruction is permitted to alter the state of the processor before completion of the instruction without interrupt can be guaranteed. If such changes are allowed, then the implementation must either continue the instruction in the middle or restore the state of the processor before restarting the instruction. Neither of these approaches is particularly attractive since they require either special hardware support or extensive examination of the executing instruction. If the processor can decode the instruction and knows how much of the instruction was completed, the microcontrol could simulate the completion of the instruction or (in most cases) undo the effect of the completed portions. However, both of these approaches incur substantial overhead for determining the exact state of the partially executed instruction and taking the remedial action. Additionally, some classes of instructions may not be undone; for example, an instruction component that clears a register cannot be reversed without saving the contents of the register. Since this overhead must be taken on common types of interrupts, such as page faults, this solution is not attractive.

To circumvent these problems, the architecture must either prohibit such instructions, as streamlined architectures do, or provide hardware assist. To keep the amount of special hardware assistance needed within bounds, only limited types of changes in the state are allowed before completion is guaranteed. The most common example of such a limited feature is the autoincrement/autodecrement addressing modes. As with most instructions that change state midway through the instruction, only the general purpose registers can be changed. This offers an opportunity to try to restore the machine state to its state prior to instruction execution.

Let us consider the possibilities that occur on the VAX. The most obvious scheme would be to decode the faulting instruction and unwind its effect by inverting the increment or decrement (which can only change the register contents by a fixed constant). However, on the VAX, with up to five operands per instruction, decoding the faulting instruction and determining which registers have been changed is a major undertaking. Because the instruction cannot be restarted until all values that have been altered are restored, the cost would be prohibitive. The solution used for the MVC instruction on the S/370 (make sure you can complete the instruction before you start it) can be adapted.
1240 IEEE TRANSACTIONS ON COMPUTERS, VOL. c-33, NO. 12, DECEMBER 1984
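The check-before-start idea can be sketched in a few lines. This toy model (hypothetical names and page size; it does not reproduce real S/370 semantics) verifies that every page an MVC-like move will touch is resident before any byte moves, so the instruction can never fault partway through:

```python
PAGE_SIZE = 4096

class PageFault(Exception):
    def __init__(self, page):
        self.page = page

def pages_spanned(addr, length):
    """Page numbers touched by the operand [addr, addr + length)."""
    return range(addr // PAGE_SIZE, (addr + length - 1) // PAGE_SIZE + 1)

def move_characters(memory, resident, dst, src, length):
    """MVC-like move that is guaranteed to complete once started."""
    # Phase 1: simulate the instruction's accesses; fault (leaving all
    # state untouched) if any page it would touch is not resident.
    for operand in (src, dst):
        for page in pages_spanned(operand, length):
            if page not in resident:
                raise PageFault(page)
    # Phase 2: run to completion; no page fault is possible now, so the
    # instruction never needs to be stopped and restarted in the middle.
    for i in range(length):
        memory[dst + i] = memory[src + i]
```

Restartability falls out for free: a fault can only occur in phase 1, before any processor or memory state has changed.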
Because of the possibility of page faults, this approach requires that the instruction be simulated to determine that all the pages accessed by the instruction's operands are in memory. This could be quite expensive, especially if the addressing mode is used often. Because only limited modifications to the processor state are allowed before instruction completion, there are several hardware-based solutions that have smaller impacts on performance.

1) Save the register contents before they are changed, along with the register designator. Restore all the saved registers using their designators when an interrupt occurs.

2) Save the register designator and the amount of the increment or decrement (in the range of 1-4 on the VAX). If an interrupt occurs, compute the original value of the registers corresponding to the saved designators and constants.

3) Compute the altered register value, but do not store it into the register until the end of the instruction execution.

The above list gives the rough order of the hardware complexity of these solutions. The last solution is complicated because a list of changed registers and the register numbers must be kept until the instruction ends. It is also the least efficient solution; since most instructions do not fault, the cost of the update must be added to the execution time. The second solution is simpler and requires the least storage, but it still requires some decoding overhead. The first solution is the simplest; it can be implemented by saving the registers as they are read for incrementing/decrementing.

C. Data Path Design

The data paths of most VLSI processors share many common features since most instruction sets require a small number of basic micro-operations. Special features may be included to support structures such as the queue that saves altered registers during instruction execution.

The main data path of the processor is usually distinguished by the presence of two or more buses, serving as a common communication link among the components of the data path. Many common components may be associated with smaller, auxiliary data paths because they do not need frequent or time-critical access to the resources provided by the main data path, or for performance reasons, which we will discuss shortly.

The data path commonly includes the following components.

* A register file for the processor's general purpose registers and any other registers included in the main data path for performance. In a microprogrammed machine, temporary registers used by the microcode may reside here. The function of the register file depends on the instruction set. In some cases, it is removed from the data path for reasons we will discuss shortly.

* An ALU providing both addition/subtraction and some collection of logical operations, and perhaps providing support for multiplication and division. We will discuss the design of the ALU in more detail shortly.

* A shifter used to implement shifts and rotates and to implement bit string instructions or assist instruction decoding. Some processors include a barrel shifter (rather than a single-bit shifter) because, although barrel shifters consume a fair amount of area, they dramatically increase the speed of multiple-bit shifts.

* The program counter. Positioning the program counter in the main data path simplifies calculation of PC-based displacements. In a high performance or pipelined processor, the program counter will usually have its own incrementer. This allows both faster calculation of the next sequential instruction address and overlap of the PC increment with ALU operation. A pipelined processor will often have multiple PC registers to simplify state saving and returning from interrupts.

These are the primary components of the data path; microarchitectures may have special features designed to improve the performance of some particular part of the instruction set. Fig. 4 shows the data path from the MIPS processor. It is typical of the data path designs found on many VLSI processors. Some data paths are simpler (e.g., the RISC data path, ignoring the register stack) and some are more complicated (e.g., the VAX data path). Although the basic components are common, the communication paths are often customized to the needs of the instruction set, and varying speed, space, and power tradeoffs may be made in designing the data path components (e.g., a ripple carry adder versus a carry lookahead adder).

1) Data Bus Design: The minimum machine cycle time is limited by the time needed to move data from one resource to another in the data path. This delay consists of the propagation time on the control wires and the propagation time on the data buses, which are usually longer than the control lines. In a process with only one level of low resistance interconnect (metal), the data bus would be run in metal, while the control lines would run in polysilicon. The delay on the control lines can be reduced by minimizing the pitch in the data path. Partly because of these delays, almost all data paths in VLSI processors use a two bus design. The extra delays due to the wide data path pitch in a three bus design may not be compensated for by the extra throughput available on the third bus.

Power constraints and the need to communicate signals as quickly as possible across the data path lead to heavy use of bootstrapped control drivers. Large numbers of bootstrap drivers put a considerable load on clock signals, and the designer must be careful to avoid skew problems by routing clocks in metal and using low resistance crossovers. Bootstrap drivers require a setup period and cannot be used when a control signal is active on both clock phases. Static superbuffers can be used in such cases, but they have a much higher static power usage. The tight pitch and use of bootstrap drivers help minimize the control delay time. In MIPS, the tight pitch (33 λ) and the extensive use of dynamic bootstrap drivers hold the control delay to 10 ns.

Although reducing the control communication delays is important, the main bus delays normally constitute a much larger portion of the processor cycle time. The main reason for this is that the bus delay is proportional to the product of the bus capacitance and the voltage swing divided by the driver size. When the number of drivers on the bus gets large (25-50, or more), the bus capacitance is dominated by the drivers themselves, i.e., it is proportional to driver size times the driver count. Thus, the bus delay becomes proportional to the product of the driver count and the voltage swing!

This delay can be reduced either by lowering the number of drivers or by reducing the voltage swing.
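The proportionality argument just stated can be made concrete with a toy delay model (the constants below are illustrative placeholders, not real process parameters): delay ≈ C_bus × V_swing / I_drive, where C_bus is dominated by the drivers once their count is large.

```python
def bus_delay(n_drivers, v_swing, c_per_driver=0.05e-12,
              c_fixed=0.2e-12, i_drive=0.5e-3):
    """Switching time (seconds) of a shared bus: t = C_bus * V / I,
    where driver capacitance dominates C_bus at large driver counts."""
    c_bus = c_fixed + n_drivers * c_per_driver
    return c_bus * v_swing / i_drive

t_rail = bus_delay(n_drivers=40, v_swing=5.0)      # full rail-to-rail swing
t_clamped = bus_delay(n_drivers=40, v_swing=1.25)  # swing clamped by ~4x
```

Because the delay is linear in the swing, clamping the swing by a factor of 4 speeds the bus up by the same factor, while doubling the driver count nearly doubles the delay once the fixed wiring capacitance is negligible.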
For many data paths, the register file is the major source of bus drivers. Those bus drivers are directly responsible for a slower clock cycle. This penalty on processor cycle time is a major drawback for a large register file implemented in MOS technology. To partially overcome this problem, many processor designs implement the register file as a small RAM off of the data bus. Although this eliminates a large fraction of the load from drivers, it may introduce several other problems. The register file is usually a multiported device, for at least reads and sometimes for writes. The smallest RAM cell designs may not provide this capability. Thus, maintaining the same level of performance requires operating the RAM at a higher speed or duplicating the RAM to increase bandwidth (a typical technique for high performance ECL machines). Isolating the RAM or register file from the bus may also incur extra delays due to communication time or the presence of latches between the registers and the bus.

Fig. 4. MIPS data path block diagram (blocks include the memory data register, displacement generator, small constant port, process identifier and memory mapping unit, program counter, branch target, register file, barrel shifter, multiply/divide register, and ALU).

Fig. 5. MIPS current distribution.

Another approach is to try to reduce the switching time of the bus through circuit design techniques. There are three major styles of bus design that can be used:

* a nonprecharged rail-to-rail bus, which has the problem stated above;

* a precharged bus, which reduces the problem by eliminating the slower pull-up time while leaving the pull-down time unchanged; precharging requires a separate idle bus cycle to charge the bus to the high state; and

* a limited voltage-swing bus, which still allows the bus to be active on every clock cycle.

The use of precharged buses is discussed in many introductory texts on VLSI design [12]. Precharging is most useful in a design where the bus is idle every other cycle due to the organization of the processor. For example, if the ALU cycle time is comparatively long and the processor is otherwise idle during that time, the ALU can be isolated from the bus, and the precharge can occur during that cycle. When such idle cycles are not present in the global timing strategy, the attraction of precharging vanishes. The limited swing bus uses an approach similar to the techniques used in dynamic RAM design [58]. The bus is clamped to reduce its voltage swing, and sense-amplifier-like circuits are used to detect the change in voltage. A version of MIPS was fabricated using a clamped bus structure to reduce the effective voltage swing by about a factor of 4. This approach was the most attractive, since MIPS uses the bus on every clock phase. The use of a restricted voltage swing does require careful circuit design, since important margins, such as noise immunity, may be reduced.

2) The Data Path ALU: Arithmetic operations are often in a processor's critical timing paths and thus require careful logic and circuit design. Although some designs use straightforward Manchester-carry adders and universal logic blocks (see, e.g., the description of the OM2 [12]), more powerful techniques are needed to achieve high performance. Since the addition circuitry is usually the critical path, it can be separated from the logic operation unit to achieve minimal loading on the adder. A fast adder will need to use carry lookahead, carry bypass, or carry select. For example, MIPS uses a full carry-lookahead tree, with propagate and generate signals produced for each pair of bits, which results in a total ALU delay of less than 80 ns with a one-level metal 3 μm process. To obtain high speed addition, the ALU may also consume a substantial portion of the processor's power budget.

Supporting integer multiply and divide (and the floating point versions) with reasonable performance can provide a real challenge to the designer. One approach is to code these operations out of simpler instructions, using Booth's algorithm. This will result in multiply or divide performance at the rate of approximately one bit per every three or four instructions. The RISC processor uses this approach. Most microprocessors implement multiply/divide via microcode, using either individual shift and add operations or special support for executing Booth's algorithm. The 68000 uses this approach. MIPS implements special instructions for doing steps of a multiply or divide operation; these instructions are used to expand the macros for a 32-bit multiply or divide into a sequence of 8 or 16 instructions, respectively. This type of support, similar to that used in the 68000 microengine, requires the ability to do an add (depending on the low-order bits of the register) and a shift in the same instruction step.
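The step-wise expansion can be illustrated with a plain shift-and-add sketch. This is a simplification, not Booth recoding: it retires one bit per step, whereas real step instructions such as those in MIPS may retire several bits at a time.

```python
def multiply_step(acc, mcand, mplier):
    """One 'multiply step': conditionally add the multiplicand into the
    accumulator based on the low-order multiplier bit, then shift."""
    if mplier & 1:
        acc += mcand
    return acc, mcand << 1, mplier >> 1

def mul32(a, b):
    """Expand a 32-bit unsigned multiply into 32 step operations,
    yielding the full 64-bit product."""
    acc = 0
    for _ in range(32):
        acc, a, b = multiply_step(acc, a, b)
    return acc
```

A macro expansion on a real machine would unroll this loop into a short in-line sequence of step instructions rather than a software loop.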
Limited silicon area and power budgets often make it impractical to include hardware for more parallel multiplication on the CPU chip.

Fast arithmetic operations can be supported in a coprocessor that does both integer and floating point operations, as in the VLSI VAX. The design of a floating point coprocessor that achieves high performance for floating point operations can be extremely difficult. The coprocessor design must be taken into account in the design of the main CPU as well as in the software for the floating point routines. An inefficient or ineffective coprocessor interface will mean that the coprocessor does not perform as well as an integral floating point unit. Many current microprocessors exhibit this property: they execute integer operations at a rate close to that of a minicomputer, but are substantially slower on floating point instructions. Furthermore, the floating point instruction time is often dominated by communication and coordination with the coprocessor, not by the time for the arithmetic operation. A well-designed floating point coprocessor, such as the floating point processor for the VLSI VAX, can achieve performance equal to that obtained in an integral floating point unit.

3) The Package Constraint: Packaging introduces pin limitations and power constraints. Limited pins force the designer to choose his functional boundaries to minimize interconnection. Pin multiplexing can partially relieve the pin constraints, but it costs time, especially when the pins are frequently active.

Two types of power constraints exist: total static power and package inductance. The packaging technology defines the maximum static power the chip may consume. Because power can eliminate delays in the critical path, the power budget must be used carefully. Typical packages for processors with more than 64 pins can dissipate 2-3 W.

The problem of package inductance [59] is more subtle and can be difficult to overcome. Suppose the processor drives a large number of pins simultaneously, e.g., 32 data and 32 address pins; then the current required to drive the pins can be temporarily quite large. In such cases the package inductance (due largely to the power leads between the die and the package) can lead to a transient in the on-chip power supply voltage. This problem can be mitigated by using multiple power and ground wires or by more sophisticated die bonding and packaging technology.

The power distribution plot for MIPS (see Fig. 5) shows how this power budget might be consumed. Power is used for three principal goals in nMOS: to overcome delays due to serial combinations of gates, to reduce communication delays between functional blocks, and to reduce off-chip communication delays. The MIPS power distribution plot shows that the major power consumers are

* the ALU, with its extensive multilevel logic,
* the pins, with their drive logic, and
* the control bus, which provides most of the time-critical intrachip communication.

D. Summary

VLSI technology has a fundamental effect on the design decisions made in the architecture and organization of processors. Since pipelining is a basic technique by which VLSI processors achieve performance, the architect and designer must consider a series of issues that affect the performance improvements achievable by pipelining. These issues include the suitability of the instruction set for pipelining, the frequency of branches, the ease of decomposing instructions, and the interaction between instructions.

Pipelining adds a major complication to the task of controlling the execution of instructions. The parallel and simultaneous interpretation of multiple instructions dramatically complicates the control unit, since it must consider all the ways in which the instructions under execution can require special control. Complications in the instruction set can make this task overwhelming. In addition to controlling instruction sequencing, the control unit (or its neighbor) often contains the instruction decoding logic. The complexity and size of the decoding logic are influenced by the size and complexity of the instruction set and how the instruction set is encoded. The observation that most microprocessors use 50 percent or more of their limited silicon area for control functions was a consideration when RISC architectures were proposed [60].

Although the high level design of the data path is largely functionally independent of the architecture, the detailed requirements of data path components are affected by the architecture. For example, an architecture with instructions for bytes, half-words, and words requires special support in the register file (to read and write fragments) and in the ALU to detect overflow on small fragments (or to shift smaller data items into the high order bits of the ALU). Although the functionality of most data path components is independent of the processor architecture, the architecture and organization affect the data path design in two important ways. First, different processors will have different critical timing paths, and data path components in the critical path will need to be designed for maximum performance. Second, specific features of the architecture will cause specialization of the data path; examples of this specialization include support for bytes and half-words in a register file and the register stack used to handle autoincrement/autodecrement in VAX microprocessors. The role of good implementation is magnified in VLSI, where what is obtainable is much broader in range and much more significantly affected by the technology.

VIII. FUTURE TRENDS

VLSI processor technology combines several different areas: architecture, organization, and implementation technology. Until recently, technology has been the driving force: rapid improvements in density and chip size have made it possible to double the on-chip device count every few years. These improvements have led to both architectural changes (from 8- to 16- to 32-bit data paths, and to larger instruction sets) and organizational changes (incorporation of pipelining and caches). As the technology to implement a full 32-bit processor has become available, architectural issues, rather than implementation concerns, have assumed a larger role in determining what is designed.

A. Architectural Trends

In the past few years many designers have been occupied with exploring the tradeoffs between streamlined and more
complex architectures. Future architectures will probably embrace some combination of both these ideas. Three major areas stand out as foci of future architectures: parallel processing, support for nonprocedural languages, and more attention to systems-level issues.

Parallel processing is an ideal vehicle for increasing performance using VLSI-based processors. The low cost of replicating these processors makes a parallel processor attractive as a method for attaining higher performance. However, many unsolved problems still exist in this arena. Another paper in this issue addresses the development of concurrent processor architectures for VLSI in more detail [61].

Another architectural area currently being explored is the architecture of processors for nonprocedural languages, such as Lisp, Smalltalk, and Prolog. There are several important reasons for interest in this area. First, such languages perform less well than procedural languages (Pascal, Fortran, C, etc.) on most architectures. Thus, one goal of the architectural investigations is to determine whether there are significant ways to achieve improved performance for such languages through architectural support. A second important issue is the role of such languages in exploiting parallelism. Many advocates of this class of languages contend that they offer a better route to obtaining parallelism in programs. If efforts to develop parallel processors are successful, then this advantage can best be exploited by supporting the execution of programs in an efficient manner, both for sequential and parallel activities.

Several important VLSI processors have been designed to support this class of languages. The SCHEME chips [13], [62] (called SCHEME-79 and SCHEME-81) are processors designed at MIT to directly execute SCHEME, a statically-scoped variant of Lisp. In addition to direct support for interpreting SCHEME, the SCHEME chips include hardware support for garbage collection (a microcoded garbage collector) and dynamic type checking.

SCHEME-81 includes tag bits to type each data item. The tag specifies whether a word is a datum (e.g., list, integer, etc.) or an instruction. Special support is provided for accessing tags and using them either as opcodes, to be interpreted by the microcode, or as data type specifications, to be checked dynamically when the datum is used. A wide microcode word is used to control multiple sets of register-operator units that function in parallel within the data path. The SCHEME-81 design supports multiple SCHEME processors. The primary mechanism to support multiprocessing is the SBUS. The novel feature of the SBUS is that it provides a protocol to manipulate Lisp structures over the bus.

The SOAR (Smalltalk on a RISC) processor [63] is a chip designed at U.C. Berkeley to support Smalltalk. SOAR provides efficient execution of Smalltalk by concentrating on three key areas. First, SOAR supports the dynamic type checking of tagged objects required by Smalltalk. SOAR handles tagged data by executing instructions and checking the tag in parallel; if both operands are not simple integers, the processor traps to a routine for the data type specified by the tag. This makes the frequent case where both tags are integers extremely fast. Second, SOAR provides fast procedure call with a variation of the RISC register windowing scheme and with hardware support to simplify software caching of methods. In Smalltalk, the destination of a procedure call may depend on the argument passed. Caching the method in the instruction stream requires special support for nonreentrant code. Third, SOAR has hardware support for an efficient storage reclamation algorithm, called generation scavenging [64]. Supporting this technique requires the ability to trap on a small percentage of the store operations (about 0.2 percent). Checking for this infrequent trap condition is done by the SOAR hardware.

The SOAR architecture and implementation show how the RISC philosophy of building support for the most frequent cases can be extended to a dynamic object-oriented environment. Smalltalk is supported by providing fast and simple ways to handle the most common situations (e.g., integer add) and using traps to routines that handle the exceptional cases. This approach is very different from the Xerox Smalltalk implementations [65], [66], which use a custom instruction set that is heavily encoded and implemented with extensive microcode.

Another major problem facing VLSI processor architects arises as the performance of these architectures approaches mainframe performance. Prior to the most recent processor designs, architects did not have to devote as much attention to systems issues: memory speeds were adequate to keep the processor busy, off-chip memory maps sufficed, and simple bus designs were fast enough for the needs of the processor. As these processors have become faster and have been adopted into complete computers (with large mapped memories and multiple I/O devices), these issues assume increasing importance. VLSI processors will need to be more concerned with the memory system: how it is mapped, what memory hierarchy is available, and the design of special processor-memory paths that can keep the processor's bandwidth requirements satisfied. Likewise, interrupt structure and support for a wide variety of high speed I/O devices will become more important.

B. Organizational Trends

Increasing processor speeds will bring an increased need for memory bandwidth. Packaging constraints will make it increasingly disadvantageous to obtain that bandwidth from off-chip. Thus, caches will migrate onto the processor chip. Similarly, memory address translation support will also move onto the processor chip. Two important instances of this move can be seen: the Intel iAPX432 includes an address cache, while the Motorola 68020 includes a small (256 byte) instruction cache. Cache memory is an attractive use of silicon because it can directly improve performance and its regularity limits the design effort per unit of silicon area.

Although today's microprocessors are used as CPU's in many computers, much of the functionality required in the CPU is handled off-chip. Many of the required functions not supported on the processor require powerful coprocessors. Among the functions performed by coprocessors, floating point and I/O interfacing are the most common. In the case of floating point, limited on-chip silicon area prevents the integration of a high performance floating point unit onto the chip. For the near future, designers will be faced with the difficult task of choosing what to incorporate on the processor chip.
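The appeal of a small on-chip instruction cache can be seen in a toy direct-mapped model. The parameters are illustrative only (256 bytes as 16 lines of 16 bytes; this is not the 68020's actual organization):

```python
LINE_SIZE = 16   # bytes per cache line
N_LINES = 16     # 16 x 16 = 256 bytes of cache

class ICache:
    """Direct-mapped instruction cache: each address maps to one line."""
    def __init__(self):
        self.tags = [None] * N_LINES
        self.hits = self.misses = 0

    def fetch(self, addr):
        """Return True on a hit; on a miss, fill the line's tag."""
        line = (addr // LINE_SIZE) % N_LINES
        tag = addr // (LINE_SIZE * N_LINES)
        if self.tags[line] == tag:
            self.hits += 1
            return True
        self.tags[line] = tag   # a real line fill would go off-chip here
        self.misses += 1
        return False

# A 64-byte loop body executed twice: only the first pass goes off-chip.
cache = ICache()
for _ in range(2):
    for pc in range(0, 64, 4):   # 4-byte instructions
        cache.fetch(pc)
```

After the first iteration the loop runs entirely on-chip, which is the kind of memory bandwidth relief described above.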
floating point performance of the processor may suffer. Thus, using a separate coprocessor is a concession to the lack of silicon area. The challenge is to design a coprocessor interface that avoids performance loss due to the communication and coordination required between the processor and the coprocessor.

I/O coprocessors allow another processor to be devoted to the detailed control of an I/O device. A separate I/O processor not only eliminates the need for such functionality on the processor chip, it also supports overlapped processing by removing the I/O interface from the set of active tasks to be executed by the processor. As I/O processors become more powerful, they migrate from a coprocessor model to a separate I/O processor that uses DMA and the bus to interface to the memory and main processor.

C. Technology Trends

One of the most fundamental changes in technology is the shift to CMOS as the fabrication technology for VLSI processors [67]. The major advantage of CMOS is its low power consumption: CMOS designs use essentially no static power. This advantage simplifies circuit design and allows designers to use their power budget more effectively to reduce critical paths. Another advantage of CMOS is the absence of ratios in the design of logic structures; this simplifies the design process compared to nMOS design.

The major drawbacks of CMOS are in layout density. These disadvantages come from two factors: logic design and design rule requirements. Static logic designs in CMOS will often require more transistors, and hence more connections, than their nMOS counterparts. Many designs will also require both a signal and its complement; this increases the wiring space needed for the logic. CMOS designs can also take more space because of well separation rules. The p and n transistor types must be placed into different wells; since the well spacing rules are comparatively large, the separation between transistors of different types must be large. This can lead to cell designs whose density and area are dominated by the well spacing rules.

One important development that will help MOS technologies (but is particularly important for CMOS) is the availability of multiple levels of low resistance interconnect. The larger number of connections in CMOS makes this almost mandatory to avoid having layout density dominated by interconnection constraints. A two-level metal process provides another level of interconnect and is the best solution. A silicide process allows the designer access to a low resistance polysilicon layer; this allows polysilicon to be used for longer routes but does not provide an additional layer of interconnect.

The design of faster and larger VLSI processors will require improvements in packaging, both to lower delays and to increase package connectivity. The development of pin grid packages has helped solve both of these problems to a significant extent. Packaging technologies that use wafers with multiple levels of interconnect as a substrate are being developed. These wafer-based packaging technologies provide high density and a large number of connections; they offer an alternative to the multilayer ceramic package.

Two of the biggest areas of unknown opportunity are gallium arsenide (GaAs) and wafer-scale integration. A GaAs medium for integrated circuits offers the advantage of significantly higher switching speeds versus silicon-based integrated circuits [68]. The primary advantage of GaAs comes from the increased mobility of electrons, which leads to improvements in transistor switching speed over silicon by about an order of magnitude; furthermore, its power dissipation per gate is similar to nMOS (but still considerably higher than CMOS). Several fundamental problems must be overcome before GaAs becomes a viable technology for a processor. The most mature GaAs processes are for depletion mode MESFET's; logic design with such devices is more complex and consumes more transistors than MOS design. Currently, many problems prevent the fabrication of large (>10000 transistors) GaAs integrated circuits with acceptable yields. Until these problems are overcome, the advantages of silicon technologies will make them the choice for VLSI processors.

Wafer-scale integration allows effective use of silicon and high bandwidth interconnect between blocks on the same wafer. If the blocks represent components similar to individual IC's, their integration on a single wafer yields increased packing density and communication bandwidth because of shorter wires and more connections. Lower total packaging costs are also possible. There are several major hurdles that must be surmounted to make wafer-scale technology suitable for high performance custom VLSI processors. A major problem is to create a design methodology that generates individually testable blocks that have high yields and that can be selectively interconnected to other working blocks. The need for multiple connection paths among the blocks and the high bandwidth of these connections makes this problem very difficult.

D. Summary

New and future architectural concepts are serving as driving forces for the design of new VLSI processors. Interest in nonprocedural languages has led to the creation of processors such as SCHEME and SOAR that are specifically designed to support such languages. The potential of a parallel processor constructed using VLSI microprocessors is an exciting possibility. The Intel iAPX432 specifically provides for multiprocessing. The importance of this type of system architecture will influence other processors to provide support for multiprocessing.

The increasing performance of VLSI processors will force designers to consider system performance, memory hierarchies, and floating point performance. Systems level products constructed using these processors will require support for memory mapping, interrupts, and high speed I/O. To attain the desired performance goals, both on- and off-chip caches will be needed to reduce the bandwidth demands on main memory. The next generation of VLSI processors will be easily competitive with minicomputers and superminicomputers in integer performance; however, without floating point support they will be much slower than the larger machines with integrated floating point support. High performance floating point is a function both of the available coprocessor hardware for floating point and of a low overhead coprocessor interface.
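The transistor-count penalty of static CMOS described in the technology discussion above can be made concrete with textbook gate-level figures; this is a rough illustrative accounting, not measurements from any particular process. A static CMOS k-input NAND gate needs k p-type pull-ups in parallel and k n-type pull-downs in series, while a ratioed nMOS NAND needs k enhancement pull-downs and a single depletion-mode load.

```python
# Rough transistor-count accounting for a k-input NAND gate in static
# CMOS versus ratioed nMOS (textbook figures, for illustration only).

def cmos_nand(k):
    # k parallel p-type pull-ups + k series n-type pull-downs
    return 2 * k

def nmos_nand(k):
    # k series enhancement-mode pull-downs + 1 depletion-mode load
    return k + 1

for k in (2, 3, 4):
    print(f"{k}-input NAND: static CMOS {cmos_nand(k)}, nMOS {nmos_nand(k)} transistors")
```

The gap widens with fan-in, and these figures exclude the extra wiring that static CMOS often needs to route both a signal and its complement.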
HENNESSY: VLSI PROCESSOR ARCHITECTURE 1245
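The summary's closing point, that floating point performance depends on a low overhead coprocessor interface as much as on the coprocessor hardware itself, can be sketched with a back-of-envelope model; every cycle count below is invented for illustration and describes no real machine.

```python
# Back-of-envelope model of coprocessor interface overhead (all cycle
# counts invented for illustration): each operation pays the unit's raw
# execution time plus a fixed communication/coordination cost, so a slow
# interface can squander a fast floating point unit.

def effective_cycles(op_cycles, interface_cycles, n_ops):
    """Total cycles for n_ops operations routed through the coprocessor."""
    return n_ops * (op_cycles + interface_cycles)

FAST_UNIT = 10  # assumed raw cycles per floating point operation
for overhead in (0, 10, 40):
    total = effective_cycles(FAST_UNIT, overhead, n_ops=1000)
    print(f"interface overhead {overhead:2d} cycles -> {total} total cycles")
```

With these assumed numbers, a 40-cycle interface makes the 10-cycle unit deliver only one fifth of its raw throughput (50 cycles per operation instead of 10).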
Despite the increased input from the architectural and software directions to the design of VLSI processors, technology remains a powerful driving force. CMOS will bring relief from the power problems associated with large nMOS integrated circuits; the problems presented by CMOS technology are minor compared to its benefits. Steady improvements in packaging technology can be predicted; more radical packaging technologies offer substantial increases in packaging density and interconnection bandwidth.

Wafer-scale integration and GaAs FET's stand as two new technologies that may substantially alter VLSI processor design. Wafer-scale integration offers several benefits, but its success will depend on a balanced design methodology that can overcome fabrication defects without substantially impacting performance. GaAs offers very high speed devices; to be useful for large IC's, such as a VLSI CPU, it will require major improvements in yield.

IX. CONCLUSIONS

A processor architecture supplies the definition of a host environment for applications. The use of high level languages requires that we evaluate the environment defined by the instruction set in terms of its suitability as a target for compilers of these languages. New instruction set designs must use measurements based on compiled code to ascertain the effectiveness of certain features.

The architect must trade off the suitability of a feature (as measured by its use in compiled code) against its cost, which is measured by the execution speed in an implementation, the area and power (which help calibrate the opportunity cost), and the overhead imposed on other instructions by the presence of this instruction set feature or collection of features. This approach can be used to measure the effectiveness of support features for the operating system; the designer must consider the frequency of use of such a feature, the performance improvement gained, and the cost of the feature. All three of these measurements must be considered before deciding to include the feature.

The investigation of these tradeoffs has led to two significantly different styles of instruction sets: simplified instruction sets and microcoded instruction sets. These styles of instruction sets have devoted silicon resources to different uses, resulting in different performance tradeoffs. The simplified instruction set architectures use silicon area to implement more on-chip data storage. Processors with more powerful instructions and denser instruction encodings require more control logic to interpret the instructions. This use of available silicon leads to a tradeoff between data and instruction bandwidth: the simplified architectures have lower data bandwidth and higher instruction bandwidth than the microcode-based architectures.

VLSI appears to be the first choice implementation medium for many processor architectures. Increased densities and decreased switching times make the technology continuously more competitive. These advantages have motivated designers to use VLSI as the medium to explore new architectures. By combining improvements in technology, better processor organizations, and architectures that are good hosts for high level language programs, a VLSI processor can reach performance levels formerly attainable only by large-scale mainframes.

ACKNOWLEDGMENT

The material in this paper concerning MIPS and several of the figures are due to the collective efforts of the MIPS team: T. Gross, N. Jouppi, S. Przybylski, and C. Rowen; T. Gross and N. Jouppi also made suggestions on an early draft of the paper. M. Katevenis (who supplied the table of instructions for the Berkeley RISC processor) and J. Moussouris also made numerous valuable suggestions from a non-MIPS perspective.

REFERENCES

[1] DEC VAX11 Architecture Handbook, Digital Equipment Corp., Maynard, MA, 1979.
[2] J. Rattner, "Hardware/software cooperation in the iAPX 432," in Proc. Symp. Architectural Support for Programming Languages and Operating Systems, Ass. Comput. Mach., Palo Alto, CA, Mar. 1982, p. 1.
[3] M. Flynn, "Directions and issues in architecture and language: Language → Architecture → Machine →," Computer, vol. 13, no. 10, pp. 5-22, Oct. 1980.
[4] R. Johnson and D. Wick, "An overview of the Mesa processor architecture," in Proc. Symp. Architectural Support for Programming Languages and Operating Systems, Ass. Comput. Mach., Palo Alto, CA, Mar. 1982, pp. 20-29.
[5] M. Hopkins, "A perspective on microcode," in Proc. COMPCON Spring '83, IEEE, San Francisco, CA, Mar. 1983, pp. 108-110.
[6] M. Hopkins, "Compiling high-level functions on low-level machines," in Proc. Int. Conf. Computer Design, IEEE, Port Chester, NY, Oct. 1983.
[7] D. Clark and H. Levy, "Measurement and analysis of instruction use in the VAX 11/780," in Proc. 9th Annu. Symp. Computer Architecture, ACM/IEEE, Austin, TX, Apr. 1982.
[8] H. T. Kung and C. E. Leiserson, "Algorithms for VLSI processor arrays," in Introduction to VLSI Systems, C. A. Mead and L. Conway, Eds. Reading, MA: Addison-Wesley, 1978.
[9] G. Louie, T. Ho, and E. Cheng, "The MicroVAX I data-path chip," VLSI Design, vol. 4, no. 8, pp. 14-21, Dec. 1983.
[10] G. Radin, "The 801 minicomputer," in Proc. SIGARCH/SIGPLAN Symp. Architectural Support for Programming Languages and Operating Systems, Ass. Comput. Mach., Palo Alto, CA, Mar. 1982, pp. 39-47.
[11] D. A. Patterson and D. R. Ditzel, "The case for the reduced instruction set computer," Comput. Architecture News, vol. 8, no. 6, pp. 25-33, Oct. 1980.
[12] C. Mead and L. Conway, Introduction to VLSI Systems. Menlo Park, CA: Addison-Wesley, 1980.
[13] J. Holloway, G. Steele, G. Sussman, and A. Bell, "SCHEME-79: LISP on a chip," Computer, vol. 14, no. 7, pp. 10-21, July 1981.
[14] R. Sherburne, M. Katevenis, D. Patterson, and C. Sequin, "Local memory in RISCs," in Proc. Int. Conf. Computer Design, IEEE, Rye, NY, Oct. 1983, pp. 149-152.
[15] T. R. Gross and J. L. Hennessy, "Optimizing delayed branches," in Proc. Micro-15, IEEE, Oct. 1982, pp. 114-120.
[16] W. A. Wulf, "Compilers and computer architecture," Computer, vol. 14, no. 7, pp. 41-48, July 1981.
[17] J. Hennessy, "Overview of the Stanford UCode compiler system," Stanford Univ., Stanford, CA.
[18] F. Chow, "A portable, machine-independent global optimizer - design and measurements," Ph.D. dissertation, Stanford Univ., Stanford, CA, 1984.
[19] J. R. Larus, "A comparison of microcode, assembly code, and high level languages on the VAX-11 and RISC-I," Comput. Architecture News, vol. 10, no. 5, pp. 10-15, Sept. 1982.
[20] L. J. Shustek, "Analysis and performance of computer instruction sets," Ph.D. dissertation, Stanford Univ., Stanford, CA, May 1977; also published as SLAC Rep. 205.
[21] D. A. Patterson and C. H. Sequin, "A VLSI RISC," Computer, vol. 15, no. 9, pp. 8-22, Sept. 1982.
[22] M. Auslander and M. Hopkins, "An overview of the PL.8 compiler," in Proc. SIGPLAN Symp. Compiler Construction, Ass. Comput. Mach., Boston, MA, June 1982, pp. 22-31.
[23] W. Johnson, "A VLSI superminicomputer CPU," in Dig. 1984 Int. Solid-State Circuits Conf., IEEE, San Francisco, CA, Feb. 1984, pp. 174-175.
[24] J. Beck, D. Dobberpuhl, M. Doherty, E. Dornekamp, R. Grondalski, D. Grondalski, K. Henry, M. Miller, R. Supnik, S. Thierauf, and R. Witek, "A 32b microprocessor with on-chip virtual memory management," in Dig. 1984 Int. Solid-State Circuits Conf., IEEE, San Francisco, CA, Feb. 1984, pp. 178-179.
[25] R. Sites, "How to use 1000 registers," in Proc. 1st Caltech Conf. VLSI, California Inst. Technol., Pasadena, CA, Jan. 1979.
[26] F. Baskett, "A VLSI Pascal machine," Univ. California, Berkeley, lecture.
[27] D. Ditzel and R. McLellan, "Register allocation for free: The C machine stack cache," in Proc. Symp. Architectural Support for Programming Languages and Operating Systems, Ass. Comput. Mach., Palo Alto, CA, Mar. 1982, pp. 48-56.
[28] The C/70 Macroprogrammer's Handbook, Bolt, Beranek, and Newman, Inc., Cambridge, MA, 1980.
[29] B. Lampson, "Fast procedure calls," in Proc. SIGARCH/SIGPLAN Symp. Architectural Support for Programming Languages and Operating Systems, Ass. Comput. Mach., Mar. 1982, pp. 66-76.
[30] S. Wakefield, "Studies in execution architectures," Ph.D. dissertation, Stanford Univ., Stanford, CA, Jan. 1983.
[31] R. Ragan-Kelly, "Performance of the Pyramid computer," in Proc. COMPCON, Feb. 1983.
[32] Y. Tamir and C. Sequin, "Strategies for managing the register file in RISC," IEEE Trans. Comput., vol. C-32, no. 11, pp. 977-988, Nov. 1983.
[33] M. Katevenis, "Reduced instruction set computer architectures for VLSI," Ph.D. dissertation, Univ. California, Berkeley, Oct. 1983.
[34] G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. Hopkins, and P. W. Markstein, "Register allocation by coloring," IBM Watson Research Center, Res. Rep. 8395, 1981.
[35] B. Leverett, "Register allocation in optimizing compilers," Ph.D. dissertation, Carnegie-Mellon Univ., Pittsburgh, PA, Feb. 1981.
[36] F. C. Chow and J. L. Hennessy, "Register allocation by priority-based coloring," in Proc. 1984 Compiler Construction Conf., Ass. Comput. Mach., Montreal, P.Q., Canada, June 1984.
[37] A. J. Smith, "Cache memories," Ass. Comput. Mach. Comput. Surveys, vol. 14, no. 3, pp. 473-530, Sept. 1982.
[38] D. Clark, "Cache performance in the VAX 11/780," ACM Trans. Comput. Syst., vol. 1, no. 1, pp. 24-37, Feb. 1983.
[39] M. Easton and R. Fagin, "Cold start vs. warm start miss ratios," Commun. Ass. Comput. Mach., vol. 21, no. 10, pp. 866-872, Oct. 1978.
[40] D. Clark and J. Emer, "Performance of the VAX-11/780 translation buffer," to be published.
[41] R. Fabry, "Capability based addressing," Commun. Ass. Comput. Mach., vol. 17, no. 7, pp. 403-412, July 1974.
[42] W. Wulf, R. Levin, and S. Harbinson, Hydra/C.mmp: An Experimental Computer System. New York: McGraw-Hill, 1981.
[43] M. Wilkes and R. Needham, The Cambridge CAP Computer and Its Operating System. New York: North Holland, 1979.
[44] M. Wilkes, "Hardware support for memory management functions," in Proc. SIGARCH/SIGPLAN Symp. Architectural Support for Programming Languages and Operating Systems, Ass. Comput. Mach., Mar. 1982, pp. 107-116.
[45] J. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, and T. Gross, "Design of a high performance VLSI processor," in Proc. 3rd Caltech Conf. VLSI, California Inst. Technol., Pasadena, CA, Mar. 1983, pp. 33-54.
[46] S. Przybylski, T. Gross, J. Hennessy, N. Jouppi, and C. Rowen, "Organization and VLSI implementation of MIPS," J. VLSI Comput. Syst., vol. 1, no. 3, Spring 1984; see also Tech. Rep. 83-259.
[47] J. Hennessy, N. Jouppi, F. Baskett, and J. Gill, "MIPS: A VLSI processor architecture," in Proc. CMU Conf. VLSI Systems and Computations, Rockville, MD: Computer Science Press, Oct. 1981, pp. 337-346; see also Tech. Rep. 82-223.
[48] J. L. Hennessy, N. Jouppi, F. Baskett, T. R. Gross, and J. Gill, "Hardware/software tradeoffs for increased performance," in Proc. SIGARCH/SIGPLAN Symp. Architectural Support for Programming Languages and Operating Systems, Ass. Comput. Mach., Palo Alto, CA, Mar. 1982, pp. 2-11.
[49] J. L. Hennessy and T. R. Gross, "Postpass code optimization of pipeline constraints," ACM Trans. Programming Lang. Syst., vol. 5, no. 3, July 1983.
[50] T. R. Gross, "Code optimization of pipeline constraints," Ph.D. dissertation, Stanford Univ., Stanford, CA, Aug. 1983.
[51] M. Flynn, The Interpretive Interface: Resources and Program Representation in Computer Organization. New York: Academic, 1977, ch. 1-3, pp. 41-70; see also Proc. Symp. High Speed Computer and Algorithm Organization.
[52] M. Flynn and L. Hoevel, "Execution architecture: The DELtran experiment," IEEE Trans. Comput., vol. C-32, no. 2, pp. 156-174, Feb. 1983.
[53] G. Meyer, "The case against stack-oriented instruction sets," Comput. Architecture News, vol. 6, no. 3, Aug. 1977.
[54] MC68000 User's Manual, 2nd ed., Motorola Inc., Austin, TX, 1980.
[55] E. Stritter and T. Gunther, "A microprocessor architecture for a changing world: The Motorola 68000," Computer, vol. 12, no. 2, pp. 43-52, Feb. 1979.
[56] R. Schumann and W. Parker, "A 32b bus interface chip," in Dig. 1984 Int. Solid-State Circuits Conf., IEEE, San Francisco, CA, Feb. 1984, pp. 176-177.
[57] R. Sites, "Instruction ordering for the Cray-1 computer," Univ. California, San Diego, Tech. Rep. 78-CS-023, July 1978.
[58] J. Mavor, M. Jack, and P. Denyer, Introduction to MOS LSI Design. London, England: Addison-Wesley, 1983.
[59] A. Rainal, "Computing inductive noise of chip packages," Bell Lab. Tech. J., vol. 63, no. 1, pp. 177-195, Jan. 1984.
[60] D. A. Patterson and C. H. Sequin, "RISC-I: A reduced instruction set VLSI computer," in Proc. 8th Annu. Symp. Computer Architecture, Minneapolis, MN, May 1981, pp. 443-457.
[61] C. Seitz, "Concurrent VLSI architectures," IEEE Trans. Comput., this issue, pp. 1247-1265.
[62] J. Batali, E. Goodhue, C. Hanson, H. Shrobe, R. Stallman, and G. Sussman, "The SCHEME-81 architecture - system and chip," in Proc. Conf. Advanced Research in VLSI, Paul Penfield, Jr., Ed. Cambridge, MA: MIT Press, Jan. 1982, pp. 69-77.
[63] D. Ungar, R. Blau, P. Foley, D. Samples, and D. Patterson, "Architecture of SOAR: Smalltalk on a RISC," in Proc. 11th Symp. Computer Architecture, ACM/IEEE, Ann Arbor, MI, June 1984, pp. 188-197.
[64] D. Ungar, "Generation scavenging: A nondisruptive high performance storage reclamation algorithm," in Proc. Software Eng. Symp. Practical Software Development Environments, ACM, Pittsburgh, PA, Apr. 1984, pp. 157-167.
[65] A. Goldberg and D. Robson, Smalltalk-80: The Language and Its Implementation. Reading, MA: Addison-Wesley, 1983.
[66] L. Deutsch, "The Dorado Smalltalk-80 implementation: Hardware architecture's impact on software architecture," in Smalltalk-80: Bits of History, Words of Advice, Glenn Krasner, Ed. Reading, MA: Addison-Wesley, 1983, pp. 113-126.
[67] R. Davies, "The case for CMOS," IEEE Spectrum, vol. 20, no. 10, pp. 26-32, Oct. 1983.
[68] R. Eden, A. Livingston, and B. Welch, "Integrated circuits: The case for gallium arsenide," IEEE Spectrum, vol. 20, no. 12, pp. 30-37, Dec. 1983.

John L. Hennessy received the B.E. degree in electrical engineering from Villanova University, Villanova, PA, in 1973 and is the recipient of the 1983 John J. Gallen Memorial Award. He received the Master's and Ph.D. degrees in computer science from the State University of New York, Stony Brook, in 1975 and 1977, respectively.

Since September 1977 he has been with the Computer Systems Laboratory at Stanford University, where he is currently an Associate Professor of Electrical Engineering and Director of the Computer Systems Laboratory. He has done research on several issues in compiler design and optimization. Much of his current work is in VLSI. He is the designer of the SLIM system, which constructs VLSI control implementations from high level language specifications. He is also the leader of the MIPS project. MIPS is a high performance VLSI microprocessor designed to execute code for high level languages.