Computer System Organization: Processors
PROCESSORS
The central processing unit (CPU) executes programs stored in main memory by fetching their
instructions, examining them, and then executing them one after another.
The components of the computer are connected by a bus, a collection of parallel wires for
transmitting address, data, and control signals. Buses can be external to the CPU, connecting it
to memory and I/O devices, but also internal to the CPU.
The CPU is composed of several distinct parts: the control unit, which is responsible for
fetching instructions from main memory and determining their type, and the arithmetic logic
unit (ALU), which performs the operations needed to carry out the instructions.
The CPU also contains a small, high-speed memory made up of a number of registers, each with
a certain size and function, which can be read and written at high speed. The most important
register is the Program Counter (PC), which points to the next instruction to be fetched for
execution. Also important is the Instruction Register (IR), which holds the instruction currently
being executed.
Instruction Execution
The ALU input registers (often labeled A and B) hold the inputs while the ALU performs an
operation, yielding a result in the ALU output register.
Not all designs have the A, B, and output registers.
Most instructions can be divided into one of two categories: register-memory or
register-register.
Register-memory instructions allow memory words to be fetched into registers (‘word’ is
the unit of data moved between memory and registers) or allow registers to be stored
back into memory.
A typical register-register instruction fetches two operands from the registers to the
ALU input registers, and stores the result back in one of the registers.
The process of running two operands through the ALU and storing the result is called the
data path cycle and is the heart of most CPUs.
It defines what the machine can do.
Modern computers have multiple ALUs operating in parallel and specialized for
different functions.
The faster the data path cycle is, the faster the machine runs.
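A minimal sketch in C of one data path cycle for a hypothetical register-register ADD instruction (the register numbering and the instruction itself are invented purely for illustration): two operands pass from registers into the ALU input registers A and B, the ALU combines them, and the result is stored back in a register.

    #include <stdio.h>

    int main(void) {
        int registers[8] = {0, 5, 9, 0, 0, 0, 0, 0};   /* general-purpose registers */

        /* Hypothetical instruction: ADD R3, R1, R2 */
        int A = registers[1];          /* ALU input register A */
        int B = registers[2];          /* ALU input register B */
        int output = A + B;            /* ALU performs the operation */
        registers[3] = output;         /* result is stored back in a register */

        printf("R3 = %d\n", registers[3]);   /* prints 14 */
        return 0;
    }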
Instruction Execution and the Fetch-Decode-Execute Cycle
The CPU executes each instruction in a series of small steps known as the fetch-decode-execute
cycle, which is fundamental to all computers. These steps are as follows:
1. Fetch: The CPU fetches the next instruction from memory into the instruction register.
2. Change the Program Counter: The program counter is updated to point to the next
instruction.
3. Decode: The CPU determines the type of instruction that was fetched.
4. Address Calculation: If the instruction requires a memory word, the CPU determines the
location of that word.
5. Operand Fetch: The necessary data (operand) is fetched from memory, if required.
6. Execute: The CPU executes the instruction.
7. Repeat: The cycle starts again for the next instruction.
This cycle ensures that every instruction in a program is executed step-by-step, with the CPU
continuously fetching, decoding, and executing instructions.
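These steps can themselves be expressed as a small program. A minimal sketch in C (the one-accumulator machine, its four opcodes, and the memory layout are invented purely for illustration, not an instruction set from the notes):

    #include <stdio.h>

    enum { LOAD = 0, ADD = 1, STORE = 2, HALT = 3 };   /* hypothetical opcodes */

    int main(void) {
        /* Each instruction is an opcode followed by one operand address. */
        int memory[32] = { LOAD, 20, ADD, 21, STORE, 22, HALT, 0 };
        memory[20] = 7;                 /* data word */
        memory[21] = 35;                /* data word */

        int pc = 0;                     /* program counter */
        int acc = 0;                    /* accumulator register */
        int running = 1;

        while (running) {
            int ir = memory[pc];        /* fetch the instruction into the IR */
            int operand = memory[pc + 1];
            pc += 2;                    /* advance the program counter */

            switch (ir) {               /* decode, fetch operand if needed, execute */
            case LOAD:  acc = memory[operand];   break;
            case ADD:   acc += memory[operand];  break;
            case STORE: memory[operand] = acc;   break;
            case HALT:  running = 0;             break;
            }
        }
        printf("memory[22] = %d\n", memory[22]);   /* prints 42 */
        return 0;
    }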
The text explains how a program can mimic the function of a CPU through a method called
interpretation. A program (referred to as an interpreter) can fetch, decode, and execute the
instructions of another program. This makes it possible to run a program without needing
dedicated hardware for its machine language. The interpreter breaks down instructions into
smaller steps and executes them on a different machine, typically in software rather than
hardware.
Interpreter: A program that reads and executes instructions written in one language by
converting them into machine instructions of the hardware (L0). This method replaces
hardware execution with software, making it more flexible and cost-effective.
Translation: In contrast, translation involves converting an entire program into machine
language (L0) and then executing it, which is a different approach from interpretation.
3. Advantages of Interpretation
Early computers had relatively simple instruction sets, but the desire for more powerful
computers led to the introduction of more complex instructions. IBM's recognition that a single
architecture for an entire family of machines was beneficial led to the development of
interpreter-based designs.
The Digital Equipment Corporation's VAX computer is presented as an example of the zenith
of complex instruction sets. The VAX had a large number of instructions and ways to specify
operands, but its reliance on interpretation and the lack of high-performance design made it
inefficient. The complexity of the VAX architecture contributed to DEC's eventual downfall.
6. Impact on Microprocessors
By the late 1970s, interpreter-based designs were widespread, even in microprocessors. The
use of interpreters allowed microprocessor designers to manage the growing complexity enabled
by integrated circuits.
7. Performance Considerations
The text explains that interpreted instructions take longer to execute than directly executed
instructions, but not prohibitively longer. For example, if an interpreted instruction required 10
microinstructions and 2 memory accesses, it might take around 2000 nanoseconds (ns) to
execute. This is only a factor of two slower than direct execution (about 1000 ns), making the
performance penalty acceptable, especially considering the cost savings.
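To make the factor of two concrete (the per-operation times below are illustrative assumptions, not figures from the notes): suppose each memory access takes about 500 ns and each microinstruction about 100 ns. The interpreted instruction then costs roughly 10 × 100 ns + 2 × 500 ns ≈ 2000 ns, while direct execution, dominated by the same two memory references, costs roughly 2 × 500 ns ≈ 1000 ns.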
Summary:
Interpreters thus became a central feature in computer architecture, providing a way to manage
increasing complexity while keeping hardware costs manageable.
In the 1970s, computer designers were focused on creating more complex instructions to
close the "semantic gap" between what high-level programming languages needed and
what machines could do. The goal was to make computers more powerful, often using
interpreters to handle increasingly complex instructions.
John Cocke at IBM led a group that took an opposite approach: simplifying the
instruction set. Their experimental minicomputer, the 801, was the first attempt to create
a RISC machine, although it wasn’t commercially released.
In 1980, at UC Berkeley, David Patterson and Carlo Sequin designed a new type of
CPU chip called RISC I, and later RISC II. Around the same time, John Hennessy at
Stanford developed the MIPS chip. Both designs led to commercially important
products: SPARC (for Sun Microsystems) and MIPS.
The debate between RISC and CISC supporters became known as a "religious war" in the
computer architecture world:
o RISC supporters argued that simpler instructions, even if a program needed more of
them, were faster overall because each one could be executed in a single cycle of the
data path.
o CISC supporters pointed out that complex instructions were more efficient
because they could do more work in fewer steps.
RISC Advantages:
o Simpler, faster instructions.
o The ability to issue (start) more instructions per second, which led to better
performance.
o Less reliance on interpretation: as memory speeds improved, the overhead of
interpreting complex instructions became harder to justify, favoring directly
executed simple instructions.
Despite the theoretical performance advantages of RISC, CISC architectures, like the Intel
Pentium, continued to dominate the marketplace for several reasons:
1. Backward Compatibility:
o Billions of dollars had been invested in software for the Intel architecture, and
customers needed the ability to run older software on new machines without
modification.
2. Hybrid Approach:
o Intel found a way to incorporate RISC principles into their CISC designs.
Starting with the Intel 486, Intel CPUs included a RISC core to handle the
simplest and most common instructions efficiently, while using CISC
interpretation for more complex instructions.
o This hybrid approach allowed Intel to combine the advantages of both
architectures: faster execution of common instructions (like RISC) while
maintaining compatibility with older software (like CISC).
o While this hybrid design wasn’t as fast as a pure RISC design, it provided
competitive overall performance and kept Intel’s dominance in the market.
Key Takeaway:
While RISC theoretically offers better performance for simple instructions, CISC machines
have survived and thrived by adopting a hybrid approach. By incorporating RISC-like cores
within their CISC architectures, companies like Intel managed to maintain compatibility with
existing software while also improving performance, which helped them dominate the market.
The RISC vs. CISC debate underscores how different architectural philosophies can coexist and
evolve, particularly when driven by market demands such as backward compatibility and
performance.
Pipelining
Pipelining is one of the most common techniques used to exploit instruction-level parallelism in
modern processors. It involves breaking the execution of instructions into smaller stages, each
performed by a separate functional unit. The stages work simultaneously, so multiple instructions
are in various stages of execution at any given time.
In a basic pipeline:
1. Stage 1 (S1): Fetch the instruction from memory and place it in a buffer (instruction
register).
2. Stage 2 (S2): Decode the instruction and determine what operands are needed.
3. Stage 3 (S3): Fetch the operands from either the registers or memory.
4. Stage 4 (S4): Execute the instruction (usually arithmetic or logical operations).
5. Stage 5 (S5): Write the result back to the register.
This process ensures that while one instruction is being decoded in stage S2, the next
instruction can be fetched in stage S1, and so on. This simultaneous operation is what
makes pipelining effective.
Pipeline Example:
If each stage of the pipeline takes 2 nanoseconds (nsec), the total time to execute a single
instruction is 10 nsec (5 stages × 2 nsec). However, once the pipeline is full, one
instruction completes every clock cycle, so the processor finishes an instruction every
2 nsec. This gives a rate of 500 MIPS (Million Instructions Per Second), much faster
than the 100 MIPS it would achieve without pipelining (one instruction every 10 nsec).
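A minimal sketch in C of this behavior (the five-stage pipeline and the instruction count are illustrative): it prints which instruction occupies each stage on every clock cycle, showing that once the pipeline is full one instruction completes per cycle even though each instruction spends five cycles in flight.

    #include <stdio.h>

    #define STAGES 5
    #define NUM_INSTRUCTIONS 8

    int main(void) {
        int stage[STAGES];                      /* instruction index in each stage, -1 = empty */
        for (int s = 0; s < STAGES; s++) stage[s] = -1;

        int next = 0;                           /* next instruction to fetch */
        int total_cycles = NUM_INSTRUCTIONS + STAGES - 1;

        for (int cycle = 1; cycle <= total_cycles; cycle++) {
            /* On each clock cycle, every instruction advances one stage. */
            for (int s = STAGES - 1; s > 0; s--) stage[s] = stage[s - 1];
            stage[0] = (next < NUM_INSTRUCTIONS) ? next++ : -1;

            printf("cycle %2d:", cycle);
            for (int s = 0; s < STAGES; s++) {
                if (stage[s] >= 0) printf("  S%d=I%d", s + 1, stage[s]);
                else               printf("  S%d=--", s + 1);
            }
            printf("\n");
        }
        printf("%d instructions completed in %d cycles (about one per cycle once full)\n",
               NUM_INSTRUCTIONS, total_cycles);
        return 0;
    }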
Imagine a cake factory where five workers (one for each stage of the pipeline) perform different
tasks on each cake: placing an empty box on the belt, inserting the cake, sealing the box, labeling
it, and shipping it. If each worker takes 2 seconds, a given box needs 10 seconds to pass through
all five stations. But because the workers operate simultaneously on different boxes, a finished
box comes off the line every 2 seconds rather than every 10.
Superscalar Architectures
Superscalar architectures take the concept of pipelining further by having multiple pipelines
that work in parallel to issue multiple instructions in a single clock cycle. This allows a processor
to execute multiple instructions at the same time, significantly increasing throughput.
Dual Pipelines:
Intel Example:
The Intel Pentium processor introduced dual pipelines, which increased processing speed
compared to its predecessor, the Intel 486, which had only one pipeline.
The Pentium's first pipeline (the u pipeline) could execute any instruction, while the
second pipeline (the v pipeline) could handle simpler instructions like integer operations.
By running code optimized for dual pipelines, the Pentium could execute integer
programs twice as fast as the 486 at the same clock speed.
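A minimal C sketch of the kind of instruction-level independence such optimization relies on (the variable names are invented for illustration): the first pair of statements has no data dependence, so a dual-pipeline CPU can issue both in the same clock cycle, while the second pair must run one after the other.

    #include <stdio.h>

    int main(void) {
        int b = 1, c = 2, e = 3, f = 4;

        /* Independent: neither statement uses the other's result, so the u and
           v pipelines can each take one and issue them in the same cycle. */
        int a = b + c;
        int d = e + f;

        /* Dependent: the second statement needs g, the result of the first,
           so the two cannot be issued together. */
        int g = b + c;
        int h = g + e;

        printf("%d %d %d %d\n", a, d, g, h);
        return 0;
    }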
For even more parallelism, high-end CPUs use multiple functional units within a single
pipeline. These functional units can handle different types of operations (such as arithmetic or
floating-point calculations), allowing for multiple instructions to be executed simultaneously.
Superscalar Evolution:
The term superscalar architecture was coined in 1987 to describe CPUs that issue
multiple instructions per clock cycle, but its roots go back to the CDC 6600 computer,
which could issue one instruction every 100 nsec and hand it off to one of 10 functional
units.
Modern superscalar processors issue multiple instructions (often 4 or 6) in a single
cycle, which are then executed by the functional units. This increases the overall
throughput of the processor.
Performance Trade-offs
Latency vs. Bandwidth: Pipelining involves a trade-off between latency (the time it takes
to complete a single instruction) and processor bandwidth (the number of instructions
the processor can complete per second). The latency is proportional to the number of
pipeline stages times the stage time, while the bandwidth is determined by how quickly
the pipeline can accept a new instruction, that is, by the stage time (see the formulas below).
Instruction Issue Rate: In a superscalar design with several functional units in stage S4,
the S3 stage must issue instructions considerably faster than any single functional unit can
complete them; if it could not, at most one functional unit would ever be busy at a time,
and having several of them would be pointless.
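In symbols (n = number of stages, T = time per stage; this notation is not in the notes, it is just standard pipeline accounting): latency = n × T, and bandwidth = 1000/T MIPS when T is in nanoseconds. For the earlier example, n = 5 and T = 2 ns give a 10 ns latency and 500 MIPS.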
Processor-Level Parallelism
Processor-Level Parallelism (PLP) refers to the strategy of using multiple CPUs (or cores) to
solve problems simultaneously, offering substantial performance improvements beyond what is
possible with instruction-level parallelism (ILP) alone. This method of parallelism is essential
when dealing with highly complex computational tasks that exceed the performance capabilities
of a single CPU, even with advanced techniques like pipelining and superscalar execution.
As the demand for faster computers grows—for tasks like simulating astronomical events,
economic modeling, or high-performance gaming—CPU clock speeds and ILP techniques like
pipelining can only go so far. Clock speeds are limited by factors such as the speed of light and
heat dissipation, and ILP techniques provide limited gains, typically between 5x and 10x
performance improvements. To achieve much higher performance (e.g., 50x or 100x gains), the
solution is to design systems with multiple CPUs working in parallel.
A large number of computational problems, particularly in fields like science, engineering, and
graphics, involve repeated calculations on arrays of data. This makes them prime candidates for
data parallelism, where the same operations are performed on different sets of data. Data
parallelism is exploited in two primary architectures: SIMD (Single Instruction, Multiple Data)
processors and vector processors.
Both architectures are well-suited for applications that involve highly regular data patterns with
extensive opportunities for parallel execution.
1. SIMD Processors
SIMD processors consist of many identical processing units that all execute the same
instruction simultaneously on different sets of data. A single control unit broadcasts
instructions to all processors, which execute the instructions in lockstep on their individual data.
First SIMD Processor: The first SIMD processor, the ILLIAC IV, was developed at the
University of Illinois in the early 1970s. It consisted of a grid of processors where each
unit executed the same instruction but worked on different data. The ILLIAC IV was
intended to be a 1-gigaflop machine, but due to funding constraints, only one quadrant
(50 megaflops) was built.
Modern GPUs: Today, Graphics Processing Units (GPUs) are a primary example of
SIMD processors. They excel in parallel data processing because graphics algorithms
involve regular operations on large sets of data (e.g., pixels, textures, vertices). An
example is the Nvidia Fermi GPU, which contains up to 16 SIMD stream multiprocessors
(SMs), each of which controls 32 SIMD processors. A full Fermi GPU can therefore
execute up to 16 × 32 = 512 operations per cycle, offering vast computational power
compared to a typical quad-core CPU, which could perform only about 1/32 of that.
2. Vector Processors
Vector processors resemble SIMD processors in that they are efficient at performing operations
on pairs of data elements. However, there are key differences:
A vector processor operates with vector registers, which can hold multiple data
elements. The operations are performed on these vector registers using pipelined
functional units.
Instead of many processing units, vector processors use a single, heavily pipelined unit to
execute operations serially on the elements of the vector. For example, adding two
vectors would involve fetching data from two vector registers, adding the elements
pairwise using a pipelined adder, and then storing the result back into a vector register.
Cray Supercomputers: Seymour Cray's company, Cray Research, developed many
vector processors, starting with the Cray-1 in the mid-1970s. These machines were famous for
their ability to perform high-speed computations on vector data, particularly in scientific
and engineering applications.
Intel Core Architecture: Modern processors, such as those based on the Intel Core
architecture, include SSE (Streaming SIMD Extensions), which leverage vector
operations to accelerate regular computations. This is conceptually similar to SIMD but
with vector processing capabilities.
Both architectures are used to speed up regular computations by exploiting data parallelism,
but they differ in how they achieve that parallelism—SIMD uses multiple processors, while
vector processors use pipelined functional units and vector registers.
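As a concrete illustration of this style of data parallelism, the sketch below adds two arrays of floats four elements at a time using the SSE intrinsics (it assumes an x86 CPU with SSE and a compiler that provides <xmmintrin.h>; it is not code from the notes). Each _mm_add_ps call applies the same operation to four different data elements at once.

    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        float c[8];

        for (int i = 0; i < 8; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 floats */
            __m128 vc = _mm_add_ps(va, vb);    /* 4 additions in one instruction */
            _mm_storeu_ps(&c[i], vc);          /* store 4 results */
        }

        for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }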
Processor-level parallelism has become increasingly common in modern computing due to the
growing demand for massive computational power. While SIMD and vector processors have
played a significant role in specific domains, multi-core and multi-processor systems are the
main method for general-purpose parallelism in modern computers. By combining ILP
techniques with processor-level parallelism, these systems can handle complex, data-intensive
applications more efficiently than a single CPU ever could.
In summary:
SIMD processors use multiple processors to execute the same instruction on different
data streams.
Vector processors process arrays of data by using vector registers and pipelined
functional units.
Modern GPUs heavily rely on SIMD architectures for high-performance graphics
processing, while Intel Core processors utilize vector operations through SSE for
scientific and multimedia tasks.
Processor-level parallelism, through data-parallel architectures like SIMD and vector processors,
continues to push the boundaries of computational performance, allowing for faster processing of
tasks in scientific computing, multimedia, and gaming.
1. Multiprocessors
Multiprocessors are parallel systems in which multiple CPUs share a single common
memory. Because every CPU can read and write the shared memory directly, the CPUs
in a multiprocessor are said to be tightly coupled.
2. Multicomputers
Multicomputers are systems that consist of a large number of independent
computers, each with its own private memory. Unlike multiprocessors, these
CPUs do not share a common memory, and instead, they communicate by sending
messages to each other. The CPUs in a multicomputer are said to be loosely
coupled because they operate independently and interact via explicit
communication.
Characteristics of Multicomputers:
No Shared Memory: Each CPU has its own private memory, and the only
way for CPUs to share information is by sending messages to one another.
This is similar to sending emails, but at much faster speeds.
Message Passing: CPUs in a multicomputer communicate by sending
messages through a network of connections. For example, one CPU may
perform part of a computation and send the result to another CPU, which
uses it for further processing.
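A minimal sketch of this kind of message passing, written with MPI (a widely used message-passing library chosen here only for illustration; it is not mentioned in the notes): CPU 0 computes a partial result and sends it in a message to CPU 1, which uses it to continue the computation. It would typically be launched with something like "mpirun -np 2 ./a.out".

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which CPU am I? */

        if (rank == 0) {
            int partial = 21 * 2;               /* this CPU's part of the computation */
            MPI_Send(&partial, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int partial;
            MPI_Recv(&partial, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("CPU 1 received %d and continues the computation: %d\n",
                   partial, partial + 58);
        }

        MPI_Finalize();
        return 0;
    }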
Multicomputer Topologies:
Direct connections between every pair of CPUs are impractical for large systems, so
various network topologies are used to connect them, such as:
o 2D and 3D grids: CPUs are connected in grid-like structures where
messages must pass through intermediate CPUs or switches to reach
the destination.
o Trees and Rings: Other topologies such as tree or ring structures can
also be used, where messages travel through a sequence of nodes to
reach their target.
Message-Passing Delays: Even though CPUs are loosely coupled, the
message-passing times are still relatively fast, often on the order of a few
microseconds.
Advantages of Multicomputers:
Scalability: Multicomputers are easier to build at large scales compared to
multiprocessors. Systems with hundreds of thousands of CPUs, such as
IBM's Blue Gene/P, have been built, with each CPU operating
independently but connected via a message-passing network.
Less Contention: Since each CPU has its own private memory, there is no
contention over shared memory resources, making multicomputers ideal for
large-scale distributed computing tasks.