Computer System Organization: Processors
PROCESSORS
The central processing unit (CPU) executes programs stored in main memory by fetching their
instructions, examining them, and then executing them one after another.
The components of the computer are connected by a bus, a collection of parallel wires for
transmitting address, data, and control signals. Buses can be external to the CPU, connecting it
to memory and I/O devices, but also internal to the CPU.
The CPU is composed of several distinct parts: the control unit, which is responsible for
fetching instructions from main memory and determining their type, and the arithmetic logic
unit (ALU), which performs the operations needed to carry out the instructions.
The CPU also contains a small, high-speed memory made up of a number of registers, each with
a certain size and function, which can be read and written at high speed. The most important
register is the Program Counter (PC), which points to the next instruction to be fetched for
execution. Also important is the Instruction Register (IR), which holds the instruction currently
being executed.
Instruction Execution
The ALU input registers (often labeled A and B) hold the inputs while the ALU performs an
operation, yielding a result in the ALU output register.
Not all designs have the A, B, and output registers.
Most instructions can be divided into one of two categories: register-memory or
register-register.
Register-memory instructions allow memory words to be fetched into registers (‘word’ is
the unit of data moved between memory and registers) or allow registers to be stored
back into memory.
A typical register-register instruction fetches two operands from the registers to the
ALU input registers, and stores the result back in one of the registers.
The process of running two operands through the ALU and storing the result is called the
data path cycle and is the heart of most CPUs.
It defines what the machine can do.
Modern computers have multiple ALUs operating in parallel and specialized for
different functions.
The faster the data path cycle is, the faster the machine runs.
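A minimal sketch in C of one data path cycle for a hypothetical register-register ADD instruction (the register numbering and the instruction itself are invented purely for illustration): two operands pass from registers into the ALU input registers A and B, the ALU combines them, and the result is stored back in a register.

    #include <stdio.h>

    int main(void) {
        int registers[8] = {0, 5, 9, 0, 0, 0, 0, 0};   /* general-purpose registers */

        /* Hypothetical instruction: ADD R3, R1, R2 */
        int A = registers[1];          /* ALU input register A */
        int B = registers[2];          /* ALU input register B */
        int output = A + B;            /* ALU performs the operation */
        registers[3] = output;         /* result is stored back in a register */

        printf("R3 = %d\n", registers[3]);   /* prints 14 */
        return 0;
    }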
Instruction Execution and the Fetch-Decode-Execute Cycle
The CPU executes each instruction in a series of small steps known as the fetch-decode-execute
cycle, which is fundamental to all computers. These steps are as follows:
1. Fetch: The CPU fetches the next instruction from memory into the instruction register.
2. Change the Program Counter: The program counter is updated to point to the next
instruction.
3. Decode: The CPU determines the type of instruction that was fetched.
4. Address Calculation: If the instruction requires a memory word, the CPU determines the
location of that word.
5. Operand Fetch: The necessary data (operand) is fetched from memory, if required.
6. Execute: The CPU executes the instruction.
7. Repeat: The cycle starts again for the next instruction.
This cycle ensures that every instruction in a program is executed step-by-step, with the CPU
continuously fetching, decoding, and executing instructions.
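These steps can themselves be expressed as a small program. A minimal sketch in C (the one-accumulator machine, its four opcodes, and the memory layout are invented purely for illustration, not an instruction set from the notes):

    #include <stdio.h>

    enum { LOAD = 0, ADD = 1, STORE = 2, HALT = 3 };   /* hypothetical opcodes */

    int main(void) {
        /* Each instruction is an opcode followed by one operand address. */
        int memory[32] = { LOAD, 20, ADD, 21, STORE, 22, HALT, 0 };
        memory[20] = 7;                 /* data word */
        memory[21] = 35;                /* data word */

        int pc = 0;                     /* program counter */
        int acc = 0;                    /* accumulator register */
        int running = 1;

        while (running) {
            int ir = memory[pc];        /* fetch the instruction into the IR */
            int operand = memory[pc + 1];
            pc += 2;                    /* advance the program counter */

            switch (ir) {               /* decode, fetch operand if needed, execute */
            case LOAD:  acc = memory[operand];   break;
            case ADD:   acc += memory[operand];  break;
            case STORE: memory[operand] = acc;   break;
            case HALT:  running = 0;             break;
            }
        }
        printf("memory[22] = %d\n", memory[22]);   /* prints 42 */
        return 0;
    }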
The text explains how a program can mimic the function of a CPU through a method called
interpretation. A program (referred to as an interpreter) can fetch, decode, and execute the
instructions of another program. This makes it possible to run a program without needing
dedicated hardware for its machine language. The interpreter breaks down instructions into
smaller steps and executes them on a different machine, typically in software rather than
hardware.
Interpreter: A program that reads and executes instructions written in one language by
converting them into machine instructions of the hardware (L0). This method replaces
hardware execution with software, making it more flexible and cost-effective.
Translation: In contrast, translation involves converting an entire program into machine
language (L0) and then executing it, which is a different approach from interpretation.
3. Advantages of Interpretation
Early computers had relatively simple instruction sets, but the desire for more powerful
computers led to the introduction of more complex instructions. IBM's recognition that a single
architecture for an entire family of machines was beneficial led to the development of
interpreter-based designs.
The Digital Equipment Corporation's VAX computer is presented as an example of the zenith
of complex instruction sets. The VAX had a large number of instructions and ways to specify
operands, but its reliance on interpretation and the lack of high-performance design made it
inefficient. The complexity of the VAX architecture contributed to DEC's eventual downfall.
6. Impact on Microprocessors
By the late 1970s, interpreter-based designs were widespread, even in microprocessors. The
use of interpreters allowed microprocessor designers to manage the growing complexity enabled
by integrated circuits.
7. Performance Considerations
The text explains that interpreted instructions take longer to execute than directly executed
instructions, but not prohibitively longer. For example, if an interpreted instruction required 10
microinstructions and 2 memory accesses, it might take around 2000 nanoseconds (ns) to
execute. This is only a factor of two slower than direct execution (about 1000 ns), making the
performance penalty acceptable, especially considering the cost savings.
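To make the factor of two concrete (the per-operation times below are illustrative assumptions, not figures from the notes): suppose each memory access takes about 500 ns and each microinstruction about 100 ns. The interpreted instruction then costs roughly 10 × 100 ns + 2 × 500 ns ≈ 2000 ns, while direct execution, dominated by the same two memory references, costs roughly 2 × 500 ns ≈ 1000 ns.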
Summary:
Interpreters thus became a central feature in computer architecture, providing a way to manage
increasing complexity while keeping hardware costs manageable.
In the 1970s, computer designers were focused on creating more complex instructions to
close the "semantic gap" between what high-level programming languages needed and
what machines could do. The goal was to make computers more powerful, often using
interpreters to handle increasingly complex instructions.
John Cocke at IBM led a group that took an opposite approach: simplifying the
instruction set. Their experimental minicomputer, the 801, was the first attempt to create
a RISC machine, although it wasn’t commercially released.
In 1980, at UC Berkeley, David Patterson and Carlo Sequin designed a new type of
CPU chip called RISC I, and later RISC II. Around the same time, John Hennessy at
Stanford developed the MIPS chip. Both designs led to commercially important
products: SPARC (for Sun Microsystems) and MIPS.
The debate between RISC and CISC supporters became known as a "religious war" in the
computer architecture world:
o RISC supporters argued that simpler instructions, even if a program needed more of
them, were faster overall because each one could be executed in a single cycle of the
data path.
o CISC supporters pointed out that complex instructions were more efficient
because they could do more work in fewer steps.
RISC Advantages:
o Simpler, faster instructions.
o The ability to issue (start) more instructions per second, which led to better
performance.
o Less reliance on interpretation: as memory speeds improved, the overhead of
interpreting complex instructions became harder to justify, favoring directly
executed simple instructions.
Despite the theoretical performance advantages of RISC, CISC architectures, like the Intel
Pentium, continued to dominate the marketplace for several reasons:
1. Backward Compatibility:
o Billions of dollars had been invested in software for the Intel architecture, and
customers needed the ability to run older software on new machines without
modification.
2. Hybrid Approach:
o Intel found a way to incorporate RISC principles into their CISC designs.
Starting with the Intel 486, Intel CPUs included a RISC core to handle the
simplest and most common instructions efficiently, while using CISC
interpretation for more complex instructions.
o This hybrid approach allowed Intel to combine the advantages of both
architectures: faster execution of common instructions (like RISC) while
maintaining compatibility with older software (like CISC).
o While this hybrid design wasn’t as fast as a pure RISC design, it provided
competitive overall performance and kept Intel’s dominance in the market.
Key Takeaway:
While RISC theoretically offers better performance for simple instructions, CISC machines
have survived and thrived by adopting a hybrid approach. By incorporating RISC-like cores
within their CISC architectures, companies like Intel managed to maintain compatibility with
existing software while also improving performance, which helped them dominate the market.
The RISC vs. CISC debate underscores how different architectural philosophies can coexist and
evolve, particularly when driven by market demands such as backward compatibility and
performance.
Pipelining
Pipelining is one of the most common techniques used to exploit instruction-level parallelism in
modern processors. It involves breaking the execution of instructions into smaller stages, each
performed by a separate functional unit. The stages work simultaneously, so multiple instructions
are in various stages of execution at any given time.
In a basic pipeline:
1. Stage 1 (S1): Fetch the instruction from memory and place it in a buffer (instruction
register).
2. Stage 2 (S2): Decode the instruction and determine what operands are needed.
3. Stage 3 (S3): Fetch the operands from either the registers or memory.
4. Stage 4 (S4): Execute the instruction (usually arithmetic or logical operations).
5. Stage 5 (S5): Write the result back to the register.
This process ensures that while one instruction is being decoded in stage S2, the next
instruction can be fetched in stage S1, and so on. This simultaneous operation is what
makes pipelining effective.
Pipeline Example:
If each stage of the pipeline takes 2 nanoseconds (nsec), the total time to execute a single
instruction is 10 nsec (5 stages × 2 nsec). However, once the pipeline is full, one
instruction completes every clock cycle, so the processor finishes an instruction every
2 nsec. This gives a rate of 500 MIPS (Million Instructions Per Second), much faster
than the 100 MIPS it would achieve without pipelining (one instruction every 10 nsec).
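A minimal sketch in C of this behavior (the five-stage pipeline and the instruction count are illustrative): it prints which instruction occupies each stage on every clock cycle, showing that once the pipeline is full one instruction completes per cycle even though each instruction spends five cycles in flight.

    #include <stdio.h>

    #define STAGES 5
    #define NUM_INSTRUCTIONS 8

    int main(void) {
        int stage[STAGES];                      /* instruction index in each stage, -1 = empty */
        for (int s = 0; s < STAGES; s++) stage[s] = -1;

        int next = 0;                           /* next instruction to fetch */
        int total_cycles = NUM_INSTRUCTIONS + STAGES - 1;

        for (int cycle = 1; cycle <= total_cycles; cycle++) {
            /* On each clock cycle, every instruction advances one stage. */
            for (int s = STAGES - 1; s > 0; s--) stage[s] = stage[s - 1];
            stage[0] = (next < NUM_INSTRUCTIONS) ? next++ : -1;

            printf("cycle %2d:", cycle);
            for (int s = 0; s < STAGES; s++) {
                if (stage[s] >= 0) printf("  S%d=I%d", s + 1, stage[s]);
                else               printf("  S%d=--", s + 1);
            }
            printf("\n");
        }
        printf("%d instructions completed in %d cycles (about one per cycle once full)\n",
               NUM_INSTRUCTIONS, total_cycles);
        return 0;
    }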
Imagine a cake factory where five workers (one for each stage of the pipeline) perform different
tasks on each cake: placing an empty box on the belt, inserting the cake, sealing the box, labeling
it, and shipping it. If each worker takes 2 seconds, a given box needs 10 seconds to pass through
all five stations. But because the workers operate simultaneously on different boxes, a finished
box comes off the line every 2 seconds rather than every 10.
Superscalar Architectures
Superscalar architectures take the concept of pipelining further by having multiple pipelines
that work in parallel to issue multiple instructions in a single clock cycle. This allows a processor
to execute multiple instructions at the same time, significantly increasing throughput.
Dual Pipelines:
Intel Example:
The Intel Pentium processor introduced dual pipelines, which increased processing speed
compared to its predecessor, the Intel 486, which had only one pipeline.
The Pentium's first pipeline (the u pipeline) could execute any instruction, while the
second pipeline (the v pipeline) could handle simpler instructions like integer operations.
By running code optimized for dual pipelines, the Pentium could execute integer
programs twice as fast as the 486 at the same clock speed.
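A minimal C sketch of the kind of instruction-level independence such optimization relies on (the variable names are invented for illustration): the first pair of statements has no data dependence, so a dual-pipeline CPU can issue both in the same clock cycle, while the second pair must run one after the other.

    #include <stdio.h>

    int main(void) {
        int b = 1, c = 2, e = 3, f = 4;

        /* Independent: neither statement uses the other's result, so the u and
           v pipelines can each take one and issue them in the same cycle. */
        int a = b + c;
        int d = e + f;

        /* Dependent: the second statement needs g, the result of the first,
           so the two cannot be issued together. */
        int g = b + c;
        int h = g + e;

        printf("%d %d %d %d\n", a, d, g, h);
        return 0;
    }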
For even more parallelism, high-end CPUs use multiple functional units within a single
pipeline. These functional units can handle different types of operations (such as arithmetic or
floating-point calculations), allowing for multiple instructions to be executed simultaneously.
Superscalar Evolution:
The term superscalar architecture was coined in 1987 to describe CPUs that issue
multiple instructions per clock cycle, but its roots go back to the CDC 6600 computer,
which could issue one instruction every 100 nsec and hand it off to one of 10 functional
units.
Modern superscalar processors issue multiple instructions (often 4 or 6) in a single
cycle, which are then executed by the functional units. This increases the overall
throughput of the processor.
Performance Trade-offs
Latency vs. Bandwidth: Pipelining involves a trade-off between latency (the time it takes
to complete a single instruction) and processor bandwidth (the number of instructions
the processor can complete per second). The latency is proportional to the number of
pipeline stages times the stage time, while the bandwidth is determined by how quickly
the pipeline can accept a new instruction, that is, by the stage time (see the formulas below).
Instruction Issue Rate: In a superscalar design with several functional units in stage S4,
the S3 stage must issue instructions considerably faster than any single functional unit can
complete them; if it could not, at most one functional unit would ever be busy at a time,
and having several of them would be pointless.
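In symbols (n = number of stages, T = time per stage; this notation is not in the notes, it is just standard pipeline accounting): latency = n × T, and bandwidth = 1000/T MIPS when T is in nanoseconds. For the earlier example, n = 5 and T = 2 ns give a 10 ns latency and 500 MIPS.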
Processor-Level Parallelism
Processor-Level Parallelism (PLP) refers to the strategy of using multiple CPUs (or cores) to
solve problems simultaneously, offering substantial performance improvements beyond what is
possible with instruction-level parallelism (ILP) alone. This method of parallelism is essential
when dealing with highly complex computational tasks that exceed the performance capabilities
of a single CPU, even with advanced techniques like pipelining and superscalar execution.
As the demand for faster computers grows—for tasks like simulating astronomical events,
economic modeling, or high-performance gaming—CPU clock speeds and ILP techniques like
pipelining can only go so far. Clock speeds are limited by factors such as the speed of light and
heat dissipation, and ILP techniques provide limited gains, typically between 5x and 10x
performance improvements. To achieve much higher performance (e.g., 50x or 100x gains), the
solution is to design systems with multiple CPUs working in parallel.
A large number of computational problems, particularly in fields like science, engineering, and
graphics, involve repeated calculations on arrays of data. This makes them prime candidates for
data parallelism, where the same operations are performed on different sets of data. Data
parallelism is exploited in two primary architectures: SIMD (Single Instruction, Multiple Data)
processors and vector processors.
Both architectures are well-suited for applications that involve highly regular data patterns with
extensive opportunities for parallel execution.
1. SIMD Processors
SIMD processors consist of many identical processing units that all execute the same
instruction simultaneously on different sets of data. A single control unit broadcasts
instructions to all processors, which execute the instructions in lockstep on their individual data.
First SIMD Processor: The first SIMD processor, the ILLIAC IV, was developed at the
University of Illinois in the early 1970s. It consisted of a grid of processors where each
unit executed the same instruction but worked on different data. The ILLIAC IV was
intended to be a 1-gigaflop machine, but due to funding constraints, only one quadrant
(50 megaflops) was built.
Modern GPUs: Today, Graphics Processing Units (GPUs) are a primary example of
SIMD processors. They excel in parallel data processing because graphics algorithms
involve regular operations on large sets of data (e.g., pixels, textures, vertices). An
example is the Nvidia Fermi GPU, which contains up to 16 SIMD stream multiprocessors
(SMs), each of which controls 32 SIMD processors. A full Fermi GPU can therefore
execute up to 16 × 32 = 512 operations per cycle, offering vast computational power
compared to a typical quad-core CPU, which could perform only about 1/32 of that.
2. Vector Processors
Vector processors resemble SIMD processors in that they are efficient at performing operations
on pairs of data elements. However, there are key differences:
A vector processor operates with vector registers, which can hold multiple data
elements. The operations are performed on these vector registers using pipelined
functional units.
Instead of many processing units, vector processors use a single, heavily pipelined unit to
execute operations serially on the elements of the vector. For example, adding two
vectors would involve fetching data from two vector registers, adding the elements
pairwise using a pipelined adder, and then storing the result back into a vector register.
Cray Supercomputers: Seymour Cray's company, Cray Research, developed many
vector processors, starting with the Cray-1 in the mid-1970s. These machines were famous for
their ability to perform high-speed computations on vector data, particularly in scientific
and engineering applications.
Intel Core Architecture: Modern processors, such as those based on the Intel Core
architecture, include SSE (Streaming SIMD Extensions), which leverage vector
operations to accelerate regular computations. This is conceptually similar to SIMD but
with vector processing capabilities.
Both architectures are used to speed up regular computations by exploiting data parallelism,
but they differ in how they achieve that parallelism—SIMD uses multiple processors, while
vector processors use pipelined functional units and vector registers.
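As a concrete illustration of this style of data parallelism, the sketch below adds two arrays of floats four elements at a time using the SSE intrinsics (it assumes an x86 CPU with SSE and a compiler that provides <xmmintrin.h>; it is not code from the notes). Each _mm_add_ps call applies the same operation to four different data elements at once.

    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        float c[8];

        for (int i = 0; i < 8; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 floats */
            __m128 vc = _mm_add_ps(va, vb);    /* 4 additions in one instruction */
            _mm_storeu_ps(&c[i], vc);          /* store 4 results */
        }

        for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }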
Processor-level parallelism has become increasingly common in modern computing due to the
growing demand for massive computational power. While SIMD and vector processors have
played a significant role in specific domains, multi-core and multi-processor systems are the
main method for general-purpose parallelism in modern computers. By combining ILP
techniques with processor-level parallelism, these systems can handle complex, data-intensive
applications more efficiently than a single CPU ever could.
In summary:
SIMD processors use multiple processors to execute the same instruction on different
data streams.
Vector processors process arrays of data by using vector registers and pipelined
functional units.
Modern GPUs heavily rely on SIMD architectures for high-performance graphics
processing, while Intel Core processors utilize vector operations through SSE for
scientific and multimedia tasks.
Processor-level parallelism, through data-parallel architectures like SIMD and vector processors,
continues to push the boundaries of computational performance, allowing for faster processing of
tasks in scientific computing, multimedia, and gaming.
1. Multiprocessors
Multiprocessors are parallel systems in which multiple CPUs share a single common
memory. Because every CPU can read and write the shared memory directly, the CPUs
in a multiprocessor are said to be tightly coupled.
2. Multicomputers
Multicomputers are systems that consist of a large number of independent
computers, each with its own private memory. Unlike multiprocessors, these
CPUs do not share a common memory, and instead, they communicate by sending
messages to each other. The CPUs in a multicomputer are said to be loosely
coupled because they operate independently and interact via explicit
communication.
Characteristics of Multicomputers:
No Shared Memory: Each CPU has its own private memory, and the only
way for CPUs to share information is by sending messages to one another.
This is similar to sending emails, but at much faster speeds.
Message Passing: CPUs in a multicomputer communicate by sending
messages through a network of connections. For example, one CPU may
perform part of a computation and send the result to another CPU, which
uses it for further processing.
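A minimal sketch of this kind of message passing, written with MPI (a widely used message-passing library chosen here only for illustration; it is not mentioned in the notes): CPU 0 computes a partial result and sends it in a message to CPU 1, which uses it to continue the computation. It would typically be launched with something like "mpirun -np 2 ./a.out".

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which CPU am I? */

        if (rank == 0) {
            int partial = 21 * 2;               /* this CPU's part of the computation */
            MPI_Send(&partial, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int partial;
            MPI_Recv(&partial, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("CPU 1 received %d and continues the computation: %d\n",
                   partial, partial + 58);
        }

        MPI_Finalize();
        return 0;
    }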
Multicomputer Topologies:
Direct connections between every pair of CPUs are impractical for large systems, so
various network topologies are used to connect them, such as:
o 2D and 3D grids: CPUs are connected in grid-like structures where
messages must pass through intermediate CPUs or switches to reach
the destination.
o Trees and Rings: Other topologies such as tree or ring structures can
also be used, where messages travel through a sequence of nodes to
reach their target.
Message-Passing Delays: Even though CPUs are loosely coupled, the
message-passing times are still relatively fast, often on the order of a few
microseconds.
Advantages of Multicomputers:
Scalability: Multicomputers are easier to build at large scales compared to
multiprocessors. Systems with hundreds of thousands of CPUs, such as
IBM's Blue Gene/P, have been built, with each CPU operating
independently but connected via a message-passing network.
Less Contention: Since each CPU has its own private memory, there is no
contention over shared memory resources, making multicomputers ideal for
large-scale distributed computing tasks.