Complete CO Notes 5UNITS
Course Objectives:
• To learn the fundamentals of computer organization and its relevance to classical and
modern problems of computer design
• To make the students understand the structure and behavior of various functional
modules of a computer.
• To understand the techniques that computers use to communicate with I/O devices
• To study the concepts of pipelining and the way it can speed up processing.
• To understand the basic characteristics of multiprocessors
Course Outcomes:
Unit I:
Basic Structure of Computer: Computer Types, Functional Units, Basic operational Concepts,
Bus Structure, Software, Performance, Multiprocessors and Multicomputers.
Machine Instructions and Programs: Numbers, Arithmetic Operations and Programs,
Instructions and Instruction Sequencing, Addressing Modes, Basic Input/output Operations, Stacks
and Queues, Subroutines, Additional Instructions.
Unit II:
Arithmetic: Addition and Subtraction of Signed Numbers, Design and Fast Adders, Multiplication
of Positive Numbers, Signed-operand Multiplication, Fast Multiplication, Integer Division,
Floating-Point Numbers and Operations.
Basic Processing Unit: Fundamental Concepts, Execution of a Complete Instruction, Multiple-
Bus Organization, Hardwired Control, Microprogrammed Control.
Unit III:
The Memory System: Basic Concepts, Semiconductor RAM Memories, Read-Only Memories,
Speed, Size and Cost, Cache Memories, Performance Considerations, Virtual Memories, Memory
Management Requirements, Secondary Storage.
Unit IV:
Input/output Organization: Accessing I/O Devices, Interrupts, Processor Examples, Direct
Memory Access, Buses, Interface Circuits, Standard I/O Interfaces.
Unit V:
Pipelining: Basic Concepts, Data Hazards, Instruction Hazards, Influence on Instruction Sets.
Large Computer Systems: Forms of Parallel Processing, Array Processors, The Structure of
General-Purpose Multiprocessors, Interconnection Networks.
Textbook:
1) “Computer Organization”, Carl Hamacher, Zvonko Vranesic, Safwat Zaky, McGraw Hill
Education, 5th Edition, 2013.
Reference Textbooks:
1. Computer System Architecture, M. Morris Mano, Pearson Education, 3rd Edition.
2. Computer Organization and Architecture, Themes and Variations, Alan Clements, CENGAGE
Learning.
3. Computer Organization and Architecture, Smruti Ranjan Sarangi, McGraw Hill Education.
4. Computer Architecture and Organization, John P. Hayes, McGraw Hill Education.
CONTENTS
1 Unit-I:
1.1 Introduction
1.2 Unit-I notes
1.3 Part A Questions
1.4 Part B Questions
2 Unit-II:
2.1 Introduction
2.2 Unit-II notes
2.3 Part A Questions
2.4 Part B Questions
3 Unit-III:
3.1 Introduction
3.2 Unit-III notes
3.3 Part A Questions
3.4 Part B Questions
4 Unit-IV:
4.1 Introduction
4.2 Unit-IV notes
4.3 Part A Questions
4.4 Part B Questions
5 Unit-V:
5.1 Introduction
5.2 Unit-V notes
5.3 Part A Questions
5.4 Part B Questions
UNIT-1
BASIC STRUCTURE OF COMPUTERS
Computer types
A computer can be defined as a fast electronic calculating machine that
accepts digitized input information (data), processes it according to a list of
internally stored instructions, and produces the resulting output information.
The list of instructions is called a program, and the internal storage is called
computer memory.
Fig : Functional units of computer
The input device accepts coded information, such as a source program written in a
high-level language. This is either stored in the memory or immediately used by the
processor to perform the desired operations. The program stored in the memory
determines the processing steps. Essentially, the computer converts a source
program into an object program, i.e., into machine language.
Finally, the results are sent to the outside world through an output device.
All of these actions are coordinated by the control unit.
Input unit: -
The source program/high-level language program/coded information/simply
data is fed to a computer through input devices; the keyboard is the most common type.
Whenever a key is pressed, the corresponding letter or digit is translated into its
equivalent binary code and transmitted over a cable to either the memory or the processor.
Joysticks, trackballs, mice, scanners, etc. are other input devices.
Memory unit: -
Its function is to store programs and data. It is basically of two types:
1. Primary memory
2. Secondary memory
1. Primary memory: - This is the memory directly associated with the processor, and it
operates at electronic speeds. Programs must be stored in this memory while they
are being executed. The memory contains a large number of semiconductor
storage cells, each capable of storing one bit of information. These cells are processed
in groups of fixed size called words.
To provide easy access to a word in memory, a distinct address is associated
with each word location. Addresses are numbers that identify memory location.
Number of bits in each word is called word length of the computer.
Programs must reside in the memory during execution. Instructions and data can be
written into the memory or read out under the control of processor.
Memory in which any location can be reached in a short and fixed
amount of time after specifying its address is called random-access memory (RAM).
The time required to access one word is called the memory access time.
Memory which is only readable by the user and whose contents cannot be altered
is called read-only memory (ROM); it typically holds the bootstrap program used to
start the operating system.
Caches are small, fast RAM units which are coupled with the processor
and are often contained on the same IC chip to achieve high performance. Although
primary storage is essential, it tends to be expensive.
2. Secondary memory: - This is used where large amounts of data and programs have to
be stored, particularly information that is accessed infrequently.
Examples: magnetic disks and tapes, optical disks (i.e., CD-ROMs), floppies, etc.
Arithmetic logic unit (ALU):-
Most computer operations, such as addition, subtraction, multiplication, and
division, are executed in the ALU of the processor. The operands are brought into the
ALU from memory and stored in high-speed storage elements called registers. Then,
according to the instructions, the operations are performed in the required sequence.
The control unit and the ALU are many times faster than other devices connected to
a computer system. This enables a single processor to control a number of external
devices such as keyboards, displays, magnetic and optical disks, sensors and other
mechanical controllers.
Output unit:-
These are the counterparts of the input unit. Their basic function is to send
the processed results to the outside world.
Examples:- Printer, speakers, monitor etc.
Control unit:-
It effectively is the nerve center that sends signals to other units and senses
their states. The actual timing signals that govern the transfer of data between
input unit, processor, memory and output unit are generated by the control unit.
Basic operational concepts
To perform a given task an appropriate program consisting of a list of
instructions is stored in the memory. Individual instructions are brought from the
memory into the processor, which executes the specified operations. Data to be
used as operands are also stored in the memory.
Example: Add LOCA, R0
This instruction adds the operand at memory location LOCA to the operand in
register R0 and places the sum into register R0. Executing this instruction requires
several steps:
1. First, the instruction is fetched from the memory into the processor.
2. The operand at LOCA is fetched and added to the contents of R0.
3. Finally, the resulting sum is stored in register R0.
The preceding Add instruction combines a memory access operation with an ALU
operation. In some other types of computers, these two operations are
performed by separate instructions for performance reasons.
Load LOCA, R1
Add R 1 , R0
Transfers between the memory and the processor are started by sending
the address of the memory location to be accessed to the memory unit and
issuing the appropriate control signals. The data are then transferred to or from the
memory.
The figure below shows how the memory and the processor can be connected. In addition to
the ALU and the control circuitry, the processor contains a number of registers used for
several different purposes.
The instruction register (IR):- Holds the instruction that is currently being executed.
Its output is available to the control circuits, which generate the timing signals
that control the various processing elements involved in executing the instruction.
The program counter (PC):-
This is another specialized register that keeps track of the execution of a program.
It contains the memory address of the next instruction to be fetched and executed.
Besides IR and PC, there are n-general purpose registers R0 through Rn-1.
Fig: Connections between the processor and the memory. The processor contains the
control unit, the ALU, the registers PC, IR, MAR and MDR, and n general-purpose
registers R0 through Rn-1.
The other two registers which facilitate communication with memory are: -
1. MAR – (Memory Address Register):- It holds the address of the location to
be accessed.
2. MDR – (Memory Data Register):- It contains the data to be written into or
read out of the address location.
Operating steps are
1. Programs reside in the memory and usually get there through the input unit.
2. Execution of the program starts when the PC is set to point at the first
instruction of the program.
3. Contents of PC are transferred to MAR and a Read Control Signal is sent to
the memory.
4. After the time required to access the memory elapses, the addressed word is read
out of the memory and loaded into the MDR.
5. Now contents of MDR are transferred to the IR & now the instruction is
ready to be decoded and executed.
6. If the instruction involves an operation by the ALU, it is necessary to obtain
the required operands.
7. An operand in the memory is fetched by sending its address to MAR &
Initiating a read cycle.
8. When the operand has been read from the memory to the MDR, it is
transferred from MDR to the ALU.
9. After one or two such repeated cycles, the ALU can perform the
desired operation.
10. If the result of this operation is to be stored in the memory, the result is
sent to MDR.
11. Address of location where the result is stored is sent to MAR & a write cycle
is initiated.
12. The contents of PC are incremented so that PC points to the next instruction
that is to be executed.
Normal execution of a program may be preempted (temporarily
interrupted) if some device requires urgent servicing; to do this, the device raises an
interrupt signal.
An interrupt is a request signal from an I/O device for service by the
processor. The processor provides the requested service by executing an
appropriate interrupt service routine.
Since this diversion may change the internal state of the processor, its state must be
saved in memory locations before servicing the interrupt. When the interrupt-service
routine is completed, the state of the processor is restored so that the
interrupted program may continue.
Bus structure
A bus is the simplest and most common way of interconnecting the various parts of
a computer. To achieve a reasonable speed of operation, a computer must be
organized so that all its units can handle one full word of data at a given time. A
group of lines that serves as a connecting path for several devices is called a bus.
In addition to the lines that carry the data, the bus must have lines for address
and control purposes. The simplest way to interconnect the units is to use a single
bus, as shown in the figure.
Performance
The most important measure of the performance of a computer is how
quickly it can execute programs. The speed with which a computer executes programs
is affected by the design of its hardware. For best performance, it is necessary to
design the compilers, the machine instruction set, and the hardware in a coordinated
way.
The total time required to execute a program, called the elapsed time, is a measure of
the performance of the entire computer system. It is affected by the speed of the
processor, the disk, and the printer. The time needed by the processor to execute the
instructions of a program is called the processor time.
Just as the elapsed time for the execution of a program depends on all units in a
computer system, the processor time depends on the hardware involved in the
execution of individual machine instructions. This hardware comprises the
processor and the memory, which are usually connected by a bus as shown in
fig c.
The pertinent parts of the fig. c are repeated in fig. d which includes the
cache memory as part of the processor unit.
Let us examine the flow of program instructions and data between the
memory and the processor. At the start of execution, all program instructions and the
required data are stored in the main memory. As the execution proceeds,
instructions are fetched one by one over the bus into the processor, and a copy is
placed in the cache. Later, if the same instruction or data item is needed a second time,
it is read directly from the cache.
The processor and relatively small cache memory can be fabricated on a
single IC chip. The internal speed of performing the basic steps of instruction
processing on chip is very high and is considerably faster than the speed at which
the instruction and data can be fetched from the main memory. A program will
be executed faster if the movement of instructions and data between the main
memory and the processor is minimized, which is achieved by using the cache.
For example:- Suppose a number of instructions are executed repeatedly over a
short period of time as happens in a program loop. If these instructions are
available in the cache, they can be fetched quickly during the period of repeated use.
The same applies to the data that are used repeatedly.
Processor clock: -
Processor circuits are controlled by a timing signal called a clock. The
clock defines regular time intervals called clock cycles. To execute a machine
instruction, the processor divides the action to be performed into a sequence of
basic steps such that each step can be completed in one clock cycle. The length P of one
clock cycle is an important parameter that affects processor performance; the
corresponding clock rate is R = 1/P. Processors used in today's personal computers and
workstations have clock rates that range from a few hundred million to over a billion
cycles per second.
The contents of R1 & R2 are first transferred to the inputs of ALU. After
the addition operation is performed, the sum is transferred to R3. The processor can
read the next instruction from the memory, while the addition operation is being
performed. Then of that instruction also uses, the ALU, its operand can be transferred
to the ALU inputs at the same time that the add instructions is being transferred to
R3.
In the ideal case, if all instructions are overlapped to the maximum degree
possible, execution proceeds at the rate of one instruction completed in each
clock cycle. Individual instructions still require several clock cycles to complete, but
for the purpose of computing T, the effective value of S is 1.
According to the basic performance equation T = (N x S)/R, where N is the number of
machine instructions executed, S is the average number of basic steps per instruction,
and R is the clock rate, a simple instruction set leads to a large value of N and a small
value of S. On the other hand, if individual instructions perform more complex
operations, fewer instructions will be needed, leading to a lower value of N and a
larger value of S. It is not obvious whether one choice is better than the other.
But complex instructions combined with pipelining (effective value of S = 1)
would achieve the best performance. However, it is much easier to implement
efficient pipelining in processors with simple instruction sets.
Performance measurements
It is very important to be able to assess the performance of a computer; computer
designers use performance estimates to evaluate the effectiveness of new features.
The previous argument suggests that the performance of a computer is
given by the execution time T for the program of interest.
The programs selected range from game playing, compilers, and database
applications to numerically intensive programs in astrophysics and quantum
chemistry. In each case, the program under test is compiled, and its running time on a
real computer is measured. The same program is also compiled and run on one
computer selected as a reference.
The 'SPEC' rating is computed as follows:
SPEC rating = (Running time on the reference computer) / (Running time on the computer under test)
A SPEC rating of 50 means that the computer under test is 50 times as fast as the
reference computer (an UltraSPARC 10) for that program. This is repeated for all the
programs in the SPEC suite, and the geometric mean of the results is computed.
Let SPECi be the rating for program i in the suite. The overall SPEC rating
for the computer is given by
SPEC rating = (SPEC1 x SPEC2 x ... x SPECn)^(1/n)
where n is the number of programs in the suite.
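As a quick illustration (not from the original notes), the overall rating can be computed as a geometric mean in a few lines of Python; the running times below are made-up values for three hypothetical benchmark programs.

from math import prod

def spec_rating(ref_times, test_times):
    # Per-program ratings (reference time / test time), combined
    # with a geometric mean across all programs in the suite.
    ratings = [r / t for r, t in zip(ref_times, test_times)]
    return prod(ratings) ** (1.0 / len(ratings))

# made-up running times (seconds) for three benchmark programs
print(spec_rating([500.0, 900.0, 300.0], [10.0, 20.0, 6.0]))  # ~50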
Number Representation
We obviously need to represent both positive and negative numbers. Three
systems are used for representing such numbers:
• Sign-and-magnitude
• 1’s-complement
• 2’s-complement
In all three systems, the leftmost bit is 0 for positive numbers and 1 for negative
numbers. Fig 2.1 illustrates all three representations using 4-bit numbers. Positive
values have identical representations in all systems, but negative values have
different representations. In the sign-and-magnitude system, negative values are
represented by changing the most significant bit (b3 in figure 2.1) from 0 to 1 in the
B vector of the corresponding positive value. For example, +5 is represented by
0101, and -5 is represented by 1101. In 1's-complement representation, negative
values are obtained by complementing each bit of the corresponding positive
number. Thus, the representation for -3 is obtained by complementing each bit in the
vector 0011 to yield 1100. Clearly, the same operation, bit complementing, is done in
converting a negative number to the corresponding positive value. Converting
either way is referred to as forming the 1's-complement of a given number.
Finally, in the 2's-complement system, forming the 2's-complement of an n-bit number
is done by subtracting that number from 2^n.
Hence, the 2's-complement of a number is obtained by adding 1 to the 1's-complement
of that number.
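As a small illustration (not part of the original notes), the following Python sketch derives all three 4-bit encodings of a signed value; the helper name and the n = 4 default are just assumptions for the example.

def representations(value, n=4):
    # Return (sign-and-magnitude, 1's-complement, 2's-complement)
    # encodings of a signed integer as n-bit strings.
    assert -(2 ** (n - 1)) < value < 2 ** (n - 1)
    mag = abs(value)
    if value >= 0:
        sm = oc = tc = mag
    else:
        sm = (1 << (n - 1)) | mag        # set the sign bit, keep magnitude
        oc = (~mag) & ((1 << n) - 1)     # complement each bit
        tc = (oc + 1) & ((1 << n) - 1)   # 1's-complement plus 1
    return tuple(format(x, "0{}b".format(n)) for x in (sm, oc, tc))

print(representations(+5))   # ('0101', '0101', '0101')
print(representations(-5))   # ('1101', '1010', '1011')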
Addition of 1-bit numbers:
  0       1       0       1
+ 0     + 0     + 1     + 1
----    ----    ----    ----
  0       1       1      1 0   (carry-out)
Memory locations and addresses: The memory consists of a large number of storage
cells, each of which can store one bit of information. Bits are seldom handled
individually; the usual approach is to deal with them in groups of fixed size that can
be stored and retrieved in a single, basic operation. Each group of n bits is referred
to as a word of information, and n is called the word length. The memory of a
computer can be schematically represented as a collection of words as shown in
figure (a).
BYTE ADDRESSABILITY:-
We now have three basic information quantities to deal with: the bit, the byte,
and the word. A byte is always 8 bits, but the word length typically ranges from 16 to
64 bits. The most practical assignment is to have successive addresses refer to
successive byte locations in the memory.
Fig a Memory words
(b) Four characters
This byte-location assignment is used in most modern computers,
and is the one we will normally use in this book. The term byte-addressable memory
is used for this assignment. Byte locations have addresses 0, 1, 2, .... Thus, if the word
length of the machine is 32 bits, successive words are located at addresses 0, 4, 8, ...,
with each word consisting of four bytes.
WORD ALIGNMENT:-
In the case of a 32-bit word length, natural word boundaries occur at
addresses 0, 4, 8, ..., as shown in the figure above. We say that the word locations have
aligned addresses. In general, words are said to be aligned in memory if they begin at a
byte address that is a multiple of the number of bytes in a word. The number of bytes
in a word is a power of 2. Hence, if the word length is 16 (2 bytes), aligned words
begin at byte addresses 0, 2, 4, ..., and for a word length of 64 (2^3 = 8 bytes), aligned
words begin at byte addresses 0, 8, 16, ....
There is no fundamental reason why words cannot begin at an arbitrary
byte address. In that case, words are said to have unaligned addresses. While
the most common case is to use aligned addresses, some computers allow the
use of unaligned word addresses.
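A one-function Python sketch (an illustration, not from the notes) of the alignment rule just stated:

def is_aligned(byte_address, word_length_bits):
    # Aligned words begin at byte addresses that are a multiple of
    # the number of bytes in a word (word lengths are powers of 2).
    bytes_per_word = word_length_bits // 8
    return byte_address % bytes_per_word == 0

print(is_aligned(4, 32))    # True : 0, 4, 8, ... for 32-bit words
print(is_aligned(6, 32))    # False
print(is_aligned(16, 64))   # True : 0, 8, 16, ... for 64-bit words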
Strings of characters are stored in consecutive byte locations, with the address of the
string being that of its leading byte. There are two ways to indicate the length of the
string. A special control character with the meaning "end of string" can be used as the
last character in the string, or a separate memory word location or processor register
can contain a number indicating the length of the string in bytes.
For example, the statement Move LOC, R1 transfers the contents of memory location
LOC into processor register R1. The contents of LOC are unchanged by the execution
of this instruction, but the old contents of register R1 are overwritten.
The second example of adding two numbers contained in processor
registers R1 and R2 and placing their sum in R3 can be specified by the assembly
language statement
Add R1, R2, R3
BASIC INSTRUCTIONS:-
The operation of adding two numbers is a fundamental capability in
any computer. The statement
C=A+B
In a high-level language program is a command to the computer to add
the current values of the two variables called A and B, and to assign the sum to
a third variable, C. When the program containing this statement is compiled, the
three variables, A, B, and C, are assigned to distinct locations in the memory. We
will use the variable names to refer to the corresponding memory location
addresses. The contents of these locations represent the values of the three
variables. Hence, the above high-level language statement requires the action.
C <= [A] + [B]
To carry out this action, the contents of memory locations A and B are
fetched from the memory and transferred into the processor where their sum is
computed. This result is then sent back to the memory and stored in location C.
An alternative approach is to use a sequence of simpler instructions to
perform the same task, with each instruction having only one or two operands.
Suppose that two-address instructions of the form
Operation Source, Destination
are available. An Add instruction of this type is
Add A, B
which performs the operation B <= [A] + [B].
A single two-address instruction cannot be used to solve our original problem,
which is to add the contents of locations A and B, without destroying either of
them, and to place the sum in location C. The problem can be solved by using another
two-address instruction that copies the contents of one memory location into
another. Such an instruction is
Move B, C
which performs the operation C <= [B], leaving the contents of location B unchanged.
Using only one-address instructions, the operation C< = [A] + [B] can
be performed by executing the sequence of instructions
Load A
Add B
Store C
Some early computers were designed around a single accumulator
structure. Most modern computers have a number of general-purpose processor
registers – typically 8 to 32, and even considerably more in some cases. Access to
data in these registers is much faster than to data stored in memory locations because
the registers are inside the processor.
Let Ri represent a general-purpose register. The
instructions
Load A, Ri
Store Ri, A and
Add A, Ri
are generalizations of the Load, Store, and Add instructions for the single-
accumulator case, in which register Ri performs the function of the accumulator.
When a processor has several general-purpose registers, many
instructions involve only operands that are in registers. In fact, in many modern
processors, computations can be performed directly only on data held in
processor registers. Instructions such as
Add Ri, Rj
or
Add Ri, Rj, Rk
are of this type.
In both of these instructions, the source operands are the contents of
registers Ri and Rj. In the first instruction, Rj also serves as the destination register,
whereas in the second instruction, a third register, Rk, is used as the destination.
It is often necessary to transfer data between different locations. This is achieved with
the instruction
Move Source, Destination
When data are moved to or from a processor register, the Move instruction
can be used rather than the Load or Store instructions because the order of the
source and destination operands determines which operation is intended.
In processors where arithmetic operations are allowed only on operands that are
processor registers, the C = A + B task can be performed by the instruction sequence
Move A, Ri
Move B, Rj
Add Ri, Rj
Move Rj, C
In processors where one operand may be in the memory but the other must
be in a register, an instruction sequence for the required task would be
Move A, Ri
Add B, Ri
Move Ri, C
The speed with which a given task is carried out depends on the time it takes to
transfer instructions from memory into the processor and to access the
operands referenced by these instructions. Transfers that involve the memory are
much slower than transfers within the processor.
We have discussed three-, two-, and one-address instructions. It is also possible
to use instructions in which the locations of all operands are defined implicitly.
Such instructions are found in machines that store operands in a structure called a
pushdown stack. In this case, the instructions are called zero-address instructions.
Fig 2.8 shows a possible program segment for this task as it appears in the memory of
a computer. We have assumed that the computer allows one memory operand per
instruction.
This instruction is placed in the instruction register (IR) in the processor.
The instruction in IR is examined to determine which operation is to be
performed. The specified operation is then performed by the processor. This
often involves fetching operands from the memory or from processor registers,
performing an arithmetic or logic operation, and storing the result in the destination
location.
BRANCHING:-
Consider the task of adding a list of n numbers. Instead of using a long list of
Add instructions, it is possible to place a single Add instruction in a program loop, as
shown in fig b. The loop is a straight-line sequence of instructions executed as
many times as needed. It starts at location LOOP and ends at the instruction
Branch>0. During each pass through this loop, the address of the next list entry is
determined, and that entry is fetched and added to the sum being accumulated in
register R0.
Fig b Using a loop to add n numbers
The instruction
Decrement R1
reduces the contents of R1 by 1 each time through the loop. The loop is repeated as
long as there are entries still to be added, which is tested with a conditional branch
instruction. A branch instruction loads a new value into the program counter. As a
result, the processor fetches and executes the instruction at this new address,
called the branch target, instead of the instruction at the location that follows the
branch instruction in sequential address order. A conditional branch instruction
causes a branch only if a specified condition is satisfied. If the condition is not
satisfied, the PC is incremented in the normal way, and the next instruction in
sequential address order is fetched and executed. In fig b, the instruction
Branch>0 LOOP
(branch if greater than 0) causes a branch to location LOOP if the result of the
immediately preceding instruction, the decremented value in register R1, is greater
than zero.
CONDITION CODES:-
The processor keeps track of information about the results of various
operations for use by subsequent conditional branch instructions. This is
accomplished by recording the required information in individual bits, often called
condition code flags. These flags are usually grouped together in a special
processor register called the condition code register or status register. Individual
condition code flags are set to 1 or cleared to 0, depending on the outcome of the
operation performed.
Four commonly used flags are:
N (negative): Set to 1 if the result is negative; otherwise, cleared to 0.
Z (zero): Set to 1 if the result is 0; otherwise, cleared to 0.
V (overflow): Set to 1 if arithmetic overflow occurs; otherwise, cleared to 0.
C (carry): Set to 1 if a carry-out results from the operation; otherwise, cleared to 0.
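The following Python sketch (added for illustration; the function name is ours) shows one way these four flags could be derived when two n-bit 2's-complement operands are added:

def add_with_flags(a, b, n=8):
    # Add two n-bit values and report the N, Z, V and C flags.
    mask = (1 << n) - 1
    raw = (a & mask) + (b & mask)
    result = raw & mask
    sign = 1 << (n - 1)
    flags = {
        "N": 1 if result & sign else 0,     # result is negative
        "Z": 1 if result == 0 else 0,       # result is zero
        # overflow: operands share a sign that differs from the result's
        "V": 1 if (a & sign) == (b & sign) != (result & sign) else 0,
        "C": 1 if raw > mask else 0,        # carry-out was produced
    }
    return result, flags

print(add_with_flags(0x70, 0x70))  # positive + positive overflows: V = 1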
The instruction Branch > 0, discussed in the previous section, is an example of
a branch instruction that tests one or more of the condition flags. It causes a
branch if the value tested is neither negative nor equal to zero. That is, the branch is
taken if neither N nor Z is 1. The conditions are given as logic expressions
involving the condition code flags.
In some computers, the condition code flags are affected automatically by
instructions that perform arithmetic or logic operations. However, this is not
always the case. A number of computers have two versions of an Add instruction.
GENERATING MEMORY ADDRESSES:-
Let us return to fig b. The purpose of the instruction block at LOOP is to add a
different number from the list during each pass through the loop. Hence, the
Add instruction in the block must refer to a different address during each pass.
How are the addresses to be specified? The memory operand address cannot be
given directly in a single Add instruction in the loop; otherwise, it would need to be
modified on each pass through the loop.
The instruction set of a computer typically provides a number of such
methods, called addressing modes. While the details differ from one computer to
another, the underlying concepts are the same.
Table 2.1 Generic addressing modes
EA = effective address
Value = a signed number
Register mode – The operand is the contents of a processor register; the name
(address) of the register is given in the instruction.
Absolute mode – The operand is in a memory location; the address of this location
is given explicitly in the instruction. (In some assembly languages, this mode is called
Direct.)
The instruction
Move LOC, R2
uses the Absolute mode for the source operand LOC and the Register mode for the
destination R2. Processor registers are used as temporary storage locations, and the
data in a register are accessed using the Register mode. The Absolute mode can
represent global variables in a program. A declaration such as
Integer A, B;
in a high-level language program causes the compiler to allocate a memory location
to each of the variables A and B, which can then be accessed using the Absolute mode.
Immediate mode – The operand is given explicitly in the instruction. For example,
the instruction
Move 200immediate, R0
(where the subscript denotes the Immediate mode) places the value 200 in register R0.
Clearly, the Immediate mode is only used to specify the value of a source operand.
Using a subscript to denote the Immediate mode is not appropriate in assembly
languages. A common convention is to use the sharp sign (#) in front of the value to
indicate that this value is to be used as an immediate operand. Hence, we write the
instruction above in the form
Move #200, R0
Indirect mode – The effective address of the operand is the contents of a register
or memory location whose address appears in the instruction.
To execute the Add instruction in fig (a), the processor uses the value B, which is in
register R1, as the effective address of the operand. It requests a read operation from the
memory to read the contents of location B. The value read is the desired operand,
which the processor adds to the contents of register R0. Indirect addressing through a
memory location is also possible, as shown in fig (b). In this case, the processor first
reads the contents of memory location A, then requests a second read operation using
the value B as an address to obtain the operand.
Move N, R1          (initialize R1 with the number of entries)
Move #NUM, R2       (R2 points to the first number in the list)
Clear R0            (R0 accumulates the sum)
LOOP Add (R2), R0   (add the entry addressed indirectly through R2)
Add #4, R2          (advance R2 to the next word)
Decrement R1
Branch>0 LOOP
Move R0, SUM
Index mode – The effective address of the operand is generated by adding a
constant value to the contents of a register.
The register used may be either a special register provided for this purpose or,
more commonly, any one of a set of general-purpose registers in the processor.
In either case, it is referred to as an index register. We indicate the Index mode
symbolically as
X(Ri)
where X denotes the constant value contained in the instruction and Ri is
the name of the register involved. The effective address of the operand is given by
EA = X + [Ri]
The contents of the index register are not changed in the process of
generating the effective address. In an assembly language program, the constant X
may be given either as an explicit number or as a symbolic name representing a
numerical value.
The figures illustrate two ways of using the Index mode. In fig a, the index
register, R1, contains the address of a memory location, and the value X defines
an offset (also called a displacement) from this address to the location where the
operand is found. An alternative use is illustrated in fig b. Here, the constant X
corresponds to a memory address, and the contents of the index register define
the offset to the operand. In either case, the effective address is the sum of two
values; one is given explicitly in the instruction, and the other is stored in a register.
Fig (b) Offset is in the index register
Several variations of this basic form of indexed addressing provide very efficient
access to memory operands in practical programming situations. For example, a
second register may be used, in which case the Index mode is written as
(Ri, Rj)
Here, the effective address is the sum of the contents of registers Ri and Rj; the
second register is usually called the base register. Yet another version uses two
registers plus a constant, written as X(Ri, Rj). In this case, the effective address is the
sum of the constant X and the contents of registers Ri and Rj. This added flexibility
is useful in accessing multiple components inside each item in a record, where the
beginning of an item is specified by the (Ri, Rj) part of the addressing mode. In
effect, this mode implements a three-dimensional array.
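To make the address arithmetic concrete, here is a small Python sketch (ours, not from the notes) that evaluates the three index-mode variants described above:

def effective_address(mode, regs, X=0, Ri=None, Rj=None):
    # regs maps register names to their contents; X is the constant
    # carried in the instruction.
    if mode == "X(Ri)":          # EA = X + [Ri]
        return X + regs[Ri]
    if mode == "(Ri,Rj)":        # EA = [Ri] + [Rj]
        return regs[Ri] + regs[Rj]
    if mode == "X(Ri,Rj)":       # EA = X + [Ri] + [Rj]
        return X + regs[Ri] + regs[Rj]
    raise ValueError("unknown mode: " + mode)

regs = {"R1": 1000, "R2": 40}
print(effective_address("X(Ri)", regs, X=20, Ri="R1"))              # 1020
print(effective_address("X(Ri,Rj)", regs, X=20, Ri="R1", Rj="R2"))  # 1060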
RELATIVE ADDRESSING:-
We have defined the Index mode using general-purpose processor
registers. A useful version of this mode is obtained if the program counter, PC, is
used instead of a general purpose register. Then, X(PC) can be used to address a
memory location that is X bytes away from the location presently pointed to by the
program counter.
Relative mode – The effective address is determined by the Index mode using
the program counter in place of the general-purpose register Ri.
This mode can be used to access data operands, but its most common use is
to specify the target address in branch instructions. An instruction such as
Branch>0 LOOP
causes program execution to go to the branch target location identified by the name
LOOP if the branch condition is satisfied. This location can be computed by
specifying it as an offset from the current contents of the program counter.
Basic input/output operations
We now examine the means by which data are transferred between the
memory of a computer and the outside world. Input/Output (I/O) operations are
essential, and the way they are performed can have a significant effect on the
performance of the computer.
Consider a task that reads in character input from a keyboard and
produces character output on a display screen. A simple way of performing such I/O
tasks is to use a method known as program-controlled I/O. The rate of data transfer
from the keyboard to a computer is limited by the typing speed of the user, which is
unlikely to exceed a few characters per second. The rate of output transfers from
the computer to the display is much higher. It is determined by the rate at which
characters can be transmitted over the link between the computer and the display
device, typically several thousand characters per second. However, this is still
much slower than the speed of a processor that can execute many millions of
instructions per second. The difference in speed between the processor and I/O
devices creates the need for mechanisms to synchronize the transfer of data
between them.
The keyboard and the display are separate devices, as shown in fig a. The action
of striking a key on the keyboard does not automatically cause the corresponding
character to be displayed on the screen. One block of instructions in the I/O program
transfers the character into the processor, and another associated block of
instructions causes the character to be displayed.
Striking a key stores the corresponding character code in an 8-bit buffer register
associated with the keyboard. Let us call this register DATAIN, as shown in fig a. To
inform the processor that a valid character is in DATAIN, a status control flag, SIN, is
set to 1. A program monitors SIN, and when SIN is set to 1, the processor reads the
contents of DATAIN. When the character is transferred to the processor, SIN is
automatically cleared to 0. If a second character is entered at the keyboard, SIN is again
set to 1, and the process repeats.
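The busy-wait loop at the heart of program-controlled I/O can be sketched in Python as below. This is purely illustrative: sin_flag() and read_datain() are hypothetical stand-ins for reading the SIN status flag and the DATAIN buffer register, and the automatic clearing of SIN on a read is not modeled.

import random
import time

def sin_flag():
    # Hypothetical: reports whether a valid character is in DATAIN.
    return random.random() < 0.3

def read_datain():
    # Hypothetical: in real hardware this read also clears SIN to 0.
    return random.choice("abc")

def read_char():
    while not sin_flag():     # keep testing SIN until it becomes 1
        time.sleep(0.001)     # a real program would simply branch back
    return read_datain()

print(read_char())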
Stacks
A stack is a list of data elements, usually words or bytes, with the accessing restriction
that elements can be added or removed at one end of the list only. This end is called
the top of the stack, and the other end is called the bottom. Another descriptive
phrase, last-in-first-out (LIFO) stack, is also used to describe this type of storage
mechanism; the last data item placed on the stack is the first one removed when
retrieval begins. The terms push and pop are used to describe placing a new item on
the stack and removing the top item from the stack, respectively.
Fig b shows a stack of word data items in the memory of a computer. It contains
numerical values, with 43 at the bottom and -28 at the top.
A processor register is used to keep track of the address of the element of the stack
that is at the top at any given time. This register is called the stack pointer (SP). It
could be one of the general-purpose registers or a register dedicated to this function.
Fig b A stack of words in the memory
Another useful data structure that is similar to the stack is called a queue.
Data are stored in and retrieved from a queue on a first-in-first-out (FIFO) basis. Thus,
if we assume that the queue grows in the direction of increasing addresses in the
memory, which is a common practice, new data are added at the back (high-
address end) and retrieved from the front (low-address end) of the queue.
There are two important differences between how a stack and a queue are
implemented. One end of the stack is fixed (the bottom), while the other end
rises and falls as data are pushed and popped; a single pointer is needed to point to
the top of the stack at any given time. In a queue, on the other hand, both ends move
to higher addresses as data are added at the back and removed from the front, so
two pointers are needed to keep track of the two ends of the queue.
Another difference between a stack and a queue is that, without further control, a
queue would continuously move through the memory of a computer in the
direction of higher addresses. One way to limit the queue to a fixed region in
memory is to use a circular buffer. Let us assume that memory addresses from
BEGINNING to END are assigned to the queue.
The first entry in the queue is entered into location BEGINNING, and successive
entries are appended to the queue by entering them at successively higher addresses.
By the time the back of the queue reaches END, space will have been created at the
beginning if some items have been removed from the queue. Hence, the back
pointer is reset to the value BEGINNING and the process continues. As in the case of
a stack, care must be taken to detect when the region assigned to the data structure is
either completely full or completely empty.
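A minimal Python sketch (ours) of the circular-buffer idea just described; the region BEGINNING..END is modeled as the index range of a fixed-size list, and the back pointer wraps around when it runs off the end.

class CircularQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity          # region BEGINNING..END
        self.front = self.back = self.count = 0

    def enqueue(self, item):
        if self.count == len(self.buf):
            raise OverflowError("region completely full")
        self.buf[self.back] = item
        self.back = (self.back + 1) % len(self.buf)   # wrap to BEGINNING
        self.count += 1

    def dequeue(self):
        if self.count == 0:
            raise IndexError("region completely empty")
        item = self.buf[self.front]
        self.front = (self.front + 1) % len(self.buf)
        self.count -= 1
        return item

q = CircularQueue(4)
for x in (1, 2, 3):
    q.enqueue(x)
print(q.dequeue(), q.dequeue())   # 1 2 (first-in, first-out)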
Subroutines
In a given program, it is often necessary to perform a particular subtask
many times on different data-values. Such a subtask is usually called a subroutine. For
example, a subroutine may evaluate the sine function or sort a list of values into
increasing or decreasing order.
It is possible to include the block of instructions that constitute a
subroutine at every place where it is needed in the program. However, to save
space, only one copy of the instructions that constitute the subroutine is placed in
the memory, and any program that requires the use of the subroutine simply
branches to its starting location. When a program branches to a subroutine we say
that it is calling the subroutine.
The instruction that performs this branch operation is named a Call instruction.
After a subroutine has been executed, the calling program must
resume execution, continuing immediately after the instruction that called the
subroutine. The subroutine is said to return to the program that called it by executing
a Return instruction.
The way in which a computer makes it possible to call and return from
subroutines is referred to as its subroutine linkage method. The simplest
subroutine linkage method is to save the return address in a specific location,
which may be a register dedicated to this function. Such a register is called the
link register. When the subroutine completes its task, the Return instruction returns to
the calling program by branching indirectly through the link register.
The Call instruction is just a special branch instruction that performs the
following operations:
• Store the contents of the PC in the link register.
• Branch to the target address specified by the instruction.
The Return instruction is a special branch instruction that performs the operation:
• Branch to the address contained in the link register.
PARAMETER PASSING:-
When calling a subroutine, a program must provide to the subroutine the
parameters, that is, the operands or their addresses, to be used in the computation.
Later, the subroutine returns other parameters, in this case, the results of the
computation. This exchange of information between a calling program and a subroutine
is referred to as parameter passing. Parameter passing may be accomplished in
several ways. The parameters may be placed in registers or in memory locations,
where they can be accessed by the subroutine. Alternatively, the parameters may be
placed on the processor stack used for saving the return address.
The stack can also be used to provide space for local variables of
the subroutine, created at the time the subroutine is entered and freed up when the
subroutine returns control to the calling program. Such space is called a stack frame.
The pointers SP and FP are manipulated as the stack frame is built, used, and
dismantled for a particular invocation of the subroutine. We begin by assuming that
SP points to the old top-of-stack (TOS) element in fig b. Before the subroutine is
called, the calling program pushes the four parameters onto the stack. The Call
instruction is then executed, resulting in the return address being pushed onto the
stack. Now, SP points to this return address, and the first instruction of the subroutine
is about to be executed. This is the point at which the frame pointer FP is set to
contain the proper memory address. Since FP is usually a general-purpose register, it
may contain information of use to the calling program; therefore, its contents are
saved by pushing them onto the stack. Since SP now points to this position, its
contents are copied into FP.
After these instructions are executed, both SP and FP point to the saved FP
contents.
Space for three local variables is then allocated on the stack by executing the instruction
Subtract #12, SP
Finally, the contents of processor registers R0 and R1 are saved by pushing them
onto the stack. At this point, the stack frame has been set up as shown in the fig.
The subroutine now executes its task. When the task is completed, the subroutine
pops the saved values of R1 and R0 back into those registers, removes the local variables
from the stack frame by executing the instruction.
Add #12, SP
And pops the saved old value of FP back into FP. At this point, SP points to the
return address, so the Return instruction can be executed, transferring control back to the
calling program.
1.19 Logic instructions
Logic operations such as AND, OR, and NOT, applied to individual bits, are the basic
building blocks of digital circuits. It is also useful to be able to perform logic
operations in software, which is done using instructions that apply these operations
to all bits of a word or byte independently and in parallel. For example, the instruction
Not dst
complements all bits contained in the destination operand, changing 1s to 0s and
0s to 1s.
Logical shifts:-
Two logical shift instructions are needed, one for shifting left (LShiftL) and
another for shifting right (LShiftR). These instructions shift an operand over a number
of bit positions specified in a count operand contained in the instruction. The
general form of a logical left shift instruction is
LShiftL count, dst
Rotate Operations:-
In the shift operations, the bits shifted out of the operand are lost, except
for the last bit shifted out, which is retained in the Carry flag C. To preserve all
bits, a set of rotate instructions can be used. They move the bits that are shifted out
of one end of the operand back into the other end. There are two versions of both the
left and right rotate instructions: one rotates only the bits of the operand, and the
other includes the Carry flag in the rotation.
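A Python sketch (for illustration only; the function names mirror the generic LShiftL/RotateL mnemonics rather than any particular machine) of a logical left shift that retains the last bit shifted out in C, and of a left rotate that does not involve the Carry flag:

def lshiftl(x, count, n=8):
    # Logical left shift of an n-bit operand; returns (result, C),
    # where C is the last bit shifted out.
    mask = (1 << n) - 1
    carry = (x >> (n - count)) & 1 if count else 0
    return (x << count) & mask, carry

def rotatel(x, count, n=8):
    # Left rotate without carry: bits leaving on the left re-enter
    # on the right, so no bits are lost.
    count %= n
    return ((x << count) | (x >> (n - count))) & ((1 << n) - 1)

print(lshiftl(0b10010011, 2))       # (0b01001100, 0)
print(bin(rotatel(0b10010011, 2)))  # 0b1001110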
Part-A
2 MARKS QUESTIONS WITH ANSWERS
1. Define a BUS and its types?
Ans: These are the wires or electronic pathways that join various components together so
that they can communicate with each other. This network of wires or electronic pathways is
known as a 'BUS'. Thus, a BUS is simply a set of wires or lines that connects various
components inside a computer.
A computer's BUS can be divided into two types:
• Internal Bus
• External Bus
RAM requires a flow of electricity to retain data (e.g. the computer powered on).
ROM will retain data without the flow of electricity (e.g. when computer is powered
off).
RAM is a type of volatile memory. Data in RAM is not permanently written. When you
power off your computer the data stored in RAM is deleted.
ROM is a type of non- volatile memory. Data in ROM is permanently written and is not
erased when you power off your computer.
6. What are the types of RAM?
Ans: There are different types of RAM, including DRAM (Dynamic Random Access Memory)
and SRAM (Static Random Access Memory).
Dynamic RAM: loses its stored information in a very short time (a few milliseconds)
even when the power supply is on, so it must be refreshed periodically. DRAMs are
cheaper and slower.
Static RAM uses a completely different technology. SRAM retains stored information
as long as the power supply is on. Static RAMs are faster but costlier.
7. What is Flash EEPROM?
Ans: EEPROM (Electrically Erasable PROM): The chip can be erased and reprogrammed on
the board easily, byte by byte. It can be erased within a few milliseconds. There is a limit on
the number of times EEPROMs can be reprogrammed, usually around 10,000 times.
8. Define addressing Mode and its types?
Ans: The various types of addressing modes for microprocessors are as follows:
- Direct Mode: The address of the operand is given explicitly in the instruction, so the
CPU can access the operand's memory location directly.
- Indirect Mode: In this type of addressing the address which has been specified within the
instruction contains the address of the operands.
- Register direct and indirect modes: In this mode as the name suggests no addresses are
specified in this mode instead registers are specified.
- Immediate Mode: In this type of addressing mode the operand itself is the data.
- Relative Mode: In this mode the operand specified in the instruction is an offset address and
not the actual address.
9. What are the functional units of a computer?
Ans: The main functional units of a computer are:
1) input unit
2) memory unit
3) arithmetic and logic unit
4) output unit
5) control unit
11. What are the examples for input and output devices?
Input Devices            Output Devices
1. Digital camera        1. Monitor
2. Gamepad               2. Printer
3. Joystick              3. Plotters
4. Keyboard              4. Projector
5. Microphone            5. LCD Projection Panels
IMPORTANT QUESTIONS
2 MARKS
1. Differentiate between computer organization and architecture? [Jun 2015]
2. List any five assembly language instructions? [Jun 2015]
3. What are the phases in instruction cycle? [Jun 2015]
4. Describe basic function of assembler and compiler? [Dec 2015]
5. What is use of program counter? [Dec 2015]
Part-B
10 MARKS
1. List and explain the functional units of a computer with a neat diagram? [Jun 2015]
2. Explain the computer levels of programming languages? [Jun 2015]
3. Draw and explain Von Neumann Architecture? [Dec 2015]
4. Describe following instructions with suitable examples? [Dec 2015]
(a) RAR (b) ADC (c) JZ (d) PUSH (e) SHL
5. (A). How the data can be read from memory? Explain with timing diagrams for memory read
and memory write operations? [Dec 2015]
(B). Explain briefly about three categories of computer programming languages?
6. (A). Explain briefly about the two commonly used organizations for multi-byte data
organization?
(B). What are the different addressing modes used in assembly language instructions? [Dec 2015]
UNIT-2
ARITHMETIC AND BASIC PROCESSING UNIT
In figure-1, the function table of a full adder is shown; the sum and carry-out are the
outputs for adding equally weighted bits xi and yi of two numbers X and Y.
The logic expressions for these functions are
si = xi XOR yi XOR ci
ci+1 = xi.yi + xi.ci + yi.ci
shown along with an example of addition of the 4-bit unsigned numbers 7 and 6.
Note that each stage of the addition process must accommodate a carry-in bit. We use
ci to represent the carry-in to the ith stage, which is the same as the carry-out from
the (i-1)th stage.
The logic expression for si in Figure-1 can be implemented with a 3-input XOR gate.
The carry-out function, ci+1, is implemented with a two-level AND-OR logic
circuit. A convenient symbol for the complete circuit for a single stage of addition,
called a full adder (FA), is as shown in the figure-1a.
A cascaded connection of such n full adder blocks, as shown in Figure 1b, forms
a parallel adder & can be used to add two n-bit numbers. Since the carries must
propagate, or ripple, through this cascade, the configuration is called an n-bit ripple-
carry adder. 44
The carry-in, c0, into the least-significant-bit (LSB) position (first stage) provides a
convenient means of adding 1 to a number. For instance, forming the 2's-complement
of a number involves adding 1 to the 1's-complement of the number.
The carry signals are also useful for interconnecting k adders to form an adder
capable of handling input numbers that are kn bits long, as shown in Figure-1c.
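The ripple-carry behavior is easy to simulate; here is a short Python sketch (ours, not from the text) in which sum bits are produced stage by stage and the carry ripples through the cascade:

def full_adder(x, y, c):
    # One FA stage: si = xi XOR yi XOR ci; carry-out by majority logic.
    return x ^ y ^ c, (x & y) | (x & c) | (y & c)

def ripple_carry_add(X, Y, c0=0):
    # X, Y are equal-length bit lists, least significant bit first.
    S, c = [], c0
    for x, y in zip(X, Y):
        s, c = full_adder(x, y, c)
        S.append(s)
    return S, c          # sum bits (LSB first) and the final carry-out

# 7 + 6 = 13: X = 0111 and Y = 0110, written LSB first
print(ripple_carry_add([1, 1, 1, 0], [0, 1, 1, 0]))  # ([1, 0, 1, 1], 0)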
FIG: Addition of binary vectors.
DESIGN OF FAST ADDERS:
In a ripple-carry (parallel) adder, the nth-stage adder cannot complete the addition
before all its previous stages have completed their additions, even if its input bits are
ready. This is because the carry bit from the previous stage has to be made available
before the present stage can finish its addition.
In a carry-lookahead adder, the carry into stage i+1 is written as
ci+1 = Gi + Pi.ci
where Gi = xi.yi and Pi = xi + yi are called the carry generate and propagate
functions for stage i. If the generate function for stage i is equal to 1, then ci+1 = 1,
independent of the input carry ci; this occurs when both xi and yi are 1. The
propagate function means that an input carry will produce an output carry when
either xi or yi or both are equal to 1. Using the Gi and Pi functions, we can determine
the carry for stage i even before its previous stages have completed their addition
operations. All Gi and Pi functions can be formed independently and in parallel in
only one gate delay after the Xi and Yi inputs are applied to an n-bit adder. Each bit
stage contains an AND gate to form Gi, an OR gate to form Pi, and a three-input
XOR gate to form si. However, a much simpler circuit can be derived by using the
propagate function Pi = xi XOR yi, which differs from Pi = xi + yi only when
xi = yi = 1, a case in which Gi = 1 (so it does not matter whether Pi is 0 or 1). Then,
the basic cell in Figure-5 can be used in each bit stage to predict the carry ahead of
any stage completing its addition.
Consider the ci+1 expression:
ci+1 = Gi + Pi.ci
This is because ci = Gi-1 + Pi-1.ci-1. Further, ci-1 = Gi-2 + Pi-2.ci-2, and so on.
Expanding in this fashion, the final carry expression can be written as
ci+1 = Gi + Pi.Gi-1 + Pi.Pi-1.Gi-2 + ... + Pi.Pi-1...P1.G0 + Pi.Pi-1...P0.c0
Thus, all carries can be obtained in three gate delays after the input signals Xi,
Yi and c0 are applied: only one gate delay is needed to develop all the Pi and Gi
signals, followed by two gate delays in the AND-OR circuit (SOP expression) for
ci+1. After a further XOR gate delay, all sum bits are available.
Therefore, independent of n, the number of stages, the n-bit addition process
requires only four gate delays.
Now, consider the design of a 4-bit parallel adder. The carries can be implemented as
c1 = G0 + P0.c0 (i = 0)
c2 = G1 + P1.G0 + P1.P0.c0 (i = 1)
c3 = G2 + P2.G1 + P2.P1.G0 + P2.P1.P0.c0 (i = 2)
c4 = G3 + P3.G2 + P3.P2.G1 + P3.P2.P1.G0 + P3.P2.P1.P0.c0 (i = 3)
The complete 4-bit adder is shown in Figure 5b, where each B cell produces Gi, Pi,
and si. The carries are implemented in the block labeled carry-lookahead logic. An
adder implemented in this form is called a carry-lookahead adder. Delay through the
adder is 3 gate delays for all carry bits and 4 gate delays for all sum bits. In
comparison, note that a 4-bit ripple-carry adder requires 2n - 1 = 7 gate delays for s3
and 2n = 8 gate delays for c4.
If we try to extend the carry-lookahead adder of Figure 5b to longer
operands, we run into a problem of gate fan-in constraints. From the expanded
expression for ci+1, we see that the last AND gate and the OR gate require a fan-in of
i + 2 in generating ci+1. For c4 (i = 3) in the 4-bit adder, a fan-in of 5 is required. This
puts a limit on practical implementation, so the adder design shown in Figure 5b
cannot be directly extended to longer operand sizes. However, if we cascade a
number of 4-bit adders, it is possible to build longer adders without the practical
problems of fan-in. An example of a 16-bit carry-lookahead adder built by cascading
four 4-bit adders is shown in figure 6; eight 4-bit carry-lookahead adders can be
cascaded in the same way to form a 32-bit adder.
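The generate/propagate idea can be checked with a few lines of Python (an illustration, not the book's circuit). Note the software loop below is sequential; in the hardware, each carry is produced by its own fully expanded AND-OR expression, which is what removes the rippling.

def cla_carries(X, Y, c0=0):
    # X, Y are bit lists (LSB first). Gi = xi AND yi, Pi = xi OR yi,
    # and ci+1 = Gi + Pi.ci (expanded in parallel in real hardware).
    G = [x & y for x, y in zip(X, Y)]
    P = [x | y for x, y in zip(X, Y)]
    c = [c0]
    for i in range(len(X)):
        c.append(G[i] | (P[i] & c[i]))
    return c             # c0 .. cn

# X = 0111 (7) and Y = 0110 (6), LSB first
print(cla_carries([1, 1, 1, 0], [0, 1, 1, 0]))  # [0, 0, 1, 1, 0]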
FIG: 16-bit carry-lookahead adder.
Multiplication of positive numbers can be performed directly in hardware using a
combinational array multiplier method. The multiplier array and the components of
each bit cell are indicated in the diagram, while the flow diagram shown explains the
multiplication procedure.
FIG-7b
The following SHIFT & ADD method flow chart depicts the multiplication logic for
unsigned numbers.
Despite the use of a combinational network, there is a considerable
amount of delay associated with the arrangement shown. Although the preceding
combinational multiplier is easy to understand, it uses many gates for multiplying
numbers of practical size, such as 32- or 64-bit numbers. The worst case signal
propagation delay path is from the upper right corner of the array to the high-order
product bit output at the bottom left corner of the array. The path includes the two
cells at the right end of each row, followed by all the cells in the bottom row.
Assuming that there are two gate delays from the inputs to the outputs of a full
adder block, the path has a total of 6(n - 1) - 1 gate delays, including the initial AND
gate delay in all cells, for an n x n array. The factor (n - 1) appears because only the
AND gates are actually needed in the first row of the array, since the incoming
(initial) partial product PP0 is zero.
Multiplication can also be performed using a mixture of combinational array
techniques (similar to those shown in Figure 7) and sequential techniques requiring less
combinational logic. Multiplication is usually provided as an instruction in the
machine instruction set of a processor. High-performance processor chips use an
appreciable area of the chip to perform arithmetic functions on both integer and
floating-point operands. Sacrificing chip area for these arithmetic circuits increases
the speed of processing. Generally, processors built for real-time applications have
an on-chip multiplier.
Another simple way to perform multiplication is to use the adder circuitry in
the ALU for a number of sequential steps. The block diagram in Figure 8a
shows the hardware arrangement for sequential multiplication. This circuit
performs multiplication by using a single n-bit adder n times to implement the
spatial addition performed by the n
rows of ripple-carry adders. Registers A and Q are combined to hold PPi while
multiplier bit qi generates the signal Add/No-add. This signal controls the addition of the
multiplicand M to PPi to generate PP(i + 1). The product is computed in n cycles. The
partial product grows in length by one bit per cycle from the initial vector, PPO, of
n 0s in register A. The carry-out from the adder is stored in flip-flop C. To begin
with, the multiplier is loaded into register Q, the multiplicand into register M and
registers C and A are cleared to 0. At the end of each cycle C, A, and Q are shifted
right one bit position to allow for growth of the partial product as the multiplier is
shifted out of register Q. Because of this shifting, multiplier bit qi, appears at the LSB
position of Q to generate the Add/No-add signal at the correct time, starting with
qo during the first cycle, q1 during the second cycle, and so on. After they are used,
the multiplier bits are discarded by the right-shift operation. Note that the carry-out
from the adder is the leftmost bit of PP(i + 1), and it must be held in the C flip-flop
to be shifted right with the contents of A and Q. After n cycles, the high-order half
of the product is held in register A and the low-order half is in register Q. The
multiplication example used above is shown in Figure 8b as it would be performed
by this hardware arrangement.
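The register-level behavior just described can be mimicked in Python (a sketch under the stated C/A/Q arrangement; the variable names follow the figure):

def sequential_multiply(multiplicand, multiplier, n):
    # Shift-and-add multiplication of two n-bit unsigned numbers.
    A, Q, M = 0, multiplier, multiplicand
    for _ in range(n):
        if Q & 1:                  # multiplier bit qi = 1 -> Add
            A += M
            C = (A >> n) & 1       # carry-out held in flip-flop C
            A &= (1 << n) - 1
        else:                      # qi = 0 -> No-add
            C = 0
        # shift C, A and Q right one bit position
        Q = ((A & 1) << (n - 1)) | (Q >> 1)
        A = (C << (n - 1)) | (A >> 1)
    return (A << n) | Q            # high half in A, low half in Q

print(sequential_multiply(13, 11, 5))   # 143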
SIGNED-OPERAND MULTIPLICATION:
For a negative multiplier, a straightforward solution is to form the 2's-
complement of both the multiplier and the multiplicand and proceed as in the case of a
positive multiplier. This is possible because complementation of both operands does
not change the value or the sign of the product. In order to take care of both
negative and positive multipliers, BOOTH algorithm can be used.
            1 0 0 1 1    (-13)
          x 0 1 0 1 1    (+11)
-------------------
1 1 1 1 1 1 0 0 1 1      (sign-extended partial products)
1 1 1 1 1 0 0 1 1
0 0 0 0 0 0 0 0
1 1 1 0 0 1 1
0 0 0 0 0 0
-------------------
1 1 0 1 1 1 0 0 0 1      (-143)
Booth Algorithm
The Booth algorithm generates a 2n-bit product and both positive and negative
2's-complement n-bit operands are uniformly treated. To understand this
algorithm, consider a multiplication operation in which the multiplier is positive
and has a single block of 1s, for example, 0011110 (+30). To derive the product by the
normal standard procedure, we would add four appropriately shifted versions of the
multiplicand. However, using the Booth algorithm, we can reduce the number of
required operations by regarding this multiplier as the difference between 32 and 2,
as shown below:
  0 1 0 0 0 0 0    (32)
- 0 0 0 0 0 1 0    (2)
---------------
  0 0 1 1 1 1 0    (30)
Thus, the multiplier 0011110 can be recoded as 0 +1 0 0 0 -1 0. In general, in the
Booth scheme, -1 times the shifted multiplicand is selected when moving from 0 to 1,
and +1 times the shifted multiplicand is selected when moving from 1 to 0, as the
multiplier is scanned from right to left.
Figure 10 illustrates the normal and the Booth algorithms for the said
example. The Booth algorithm clearly extends to any number of blocks of 1s in a
multiplier, including the situation in which a single 1 is considered a block. See
Figure 11a for another example of recoding a multiplier. The case when the least
significant bit of the multiplier is 1 is handled by assuming that an implied 0 lies to
its right. The Booth algorithm can also be used directly for negative multipliers, as
shown in Figure 11a.
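A minimal sketch of Booth recoding and multiplication in Python is given below (Python integers stand in for the registers; a negative multiplier should be passed as its n-bit 2's-complement bit pattern, e.g. q & ((1 << n) - 1)):

def booth_recode(multiplier, n):
    # Scan from right to left with an implied 0 to the right of the LSB:
    # a 0-to-1 transition gives -1, a 1-to-0 transition gives +1, else 0.
    digits, prev = [], 0
    for i in range(n):
        bit = (multiplier >> i) & 1
        digits.append(prev - bit)
        prev = bit
    return digits                     # LSB digit first

def booth_multiply(m, q, n):
    # sum each recoded digit times the multiplicand at its shift position
    return sum((d * m) << i for i, d in enumerate(booth_recode(q, n)))

assert booth_multiply(-13, 11, 5) == -143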
To verify the correctness of the Booth algorithm for negative multipliers, we use
the following property of negative-number representations in the 2's-complement
system: the value of an n-bit number b(n-1)b(n-2)...b(0) is
V(b) = -b(n-1) x 2^(n-1) + b(n-2) x 2^(n-2) + ... + b(0) x 2^0.
Fig 12: Booth multiplier recoding table.
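Since the figure itself is not reproduced, the standard recoding rules it summarizes are reconstructed here:

  Multiplier bit i   Multiplier bit i-1   Version of multiplicand selected by bit i
        0                   0                     0 x M
        0                   1                    +1 x M
        1                   0                    -1 x M
        1                   1                     0 x M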
The Booth algorithm has two attractive features. First, it handles both positive and
negative multipliers uniformly. Second, it achieves some efficiency in the number of
additions required when the multiplier has a few large blocks of 1s. The speed gained
by skipping over 1s depends on the data. On average, the speed of doing multiplication
with the Booth algorithm is the same as with the normal algorithm.
FAST MULTIPLICATION:
There are two techniques for speeding up the multiplication operation. The first
technique guarantees that the maximum number of summands (versions of the
multiplicand) that must be added is n/2 for n-bit operands. The second technique reduces
the time needed to add the summands (carry-save addition of summands
method).
This bit-pair recoding technique halves the maximum number of summands. It
is derived from the Booth algorithm. Group the Booth-recoded multiplier bits in
pairs, and observe the following: the pair (+1 -1) is equivalent to the pair (0 +1). That
is, instead of adding -1 times the multiplicand M at shift position i to +1 x M at
position i + 1, the same result is obtained by adding +1 x M at position i. Other
examples are: (+1 0) is equivalent to (0 +2), (-1 +1) is equivalent to (0 -1), and so
on. Thus, if the Booth-recoded multiplier is examined two bits at a time, starting from
the right, it can be rewritten in a form that requires at most one version of the
multiplicand to be added to the partial product for each pair of multiplier bits.
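Extending the earlier sketch, bit-pair recoding can be expressed as follows (n is assumed even, and booth_recode is the function from the sketch above):

def bitpair_recode(multiplier, n):
    # The Booth-digit pair (hi, lo) at positions (2i+1, 2i) is equivalent
    # to the single digit lo + 2*hi, one of -2, -1, 0, +1, +2.
    booth = booth_recode(multiplier, n)
    return [booth[i] + 2 * booth[i + 1] for i in range(0, n, 2)]

def bitpair_multiply(m, q, n):
    # one summand per bit pair, attached to the even shift positions
    return sum((d * m) << (2 * i) for i, d in enumerate(bitpair_recode(q, n)))

assert bitpair_multiply(-13, 11, 6) == -143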
FIG – 15
INTEGER DIVISION:
FIG – 17: Binary Division
Restoring Division:
Figure 17 shows a logic circuit arrangement that implements restoring division. Note
its similarity to the structure for multiplication that was shown in Figure 8. An n-bit
positive divisor is loaded into register M and an n-bit positive dividend is loaded into
register Q at the start of the operation. Register A is set to 0. After the division is
complete, the n-bit quotient is in register Q and the remainder is in register A. The
required subtractions are facilitated by using 2's-complement arithmetic. The extra
bit position at the left end of both A and M accommodates the sign bit during
subtractions. The following algorithm performs restoring division.
Do the following n times:
1. Shift A and Q left one binary position.
2. Subtract M from A, and place the answer back in A.
3. If the sign of A is 1, set q0 to 0 and add M back to A (that is, restore A);
otherwise, set q0 to 1.
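A minimal sketch of this algorithm in Python follows (Python integers stand in for the registers, and the test A < 0 plays the role of examining the sign bit):

def restoring_divide(dividend, divisor, n):
    # A: remainder register, Q: dividend/quotient register, M: divisor.
    A, Q, M = 0, dividend, divisor
    for _ in range(n):
        A = (A << 1) | ((Q >> (n - 1)) & 1)   # shift A and Q left
        Q = (Q << 1) & ((1 << n) - 1)
        A -= M                                # subtract M from A
        if A < 0:
            A += M                            # sign of A is 1: restore, q0 = 0
        else:
            Q |= 1                            # otherwise set q0 = 1
    return Q, A                               # quotient, remainder

assert restoring_divide(8, 3, 4) == (2, 2)    # 8 / 3 = 2 remainder 2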
FIG – 18: Restoring Division
Non-restoring Division:
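The non-restoring algorithm consists of two steps:
Step 1 (repeat n times): If the sign of A is 0, shift A and Q left one bit position and subtract M from A; otherwise, shift A and Q left and add M to A. Then, if the sign of A is 0, set q0 to 1; otherwise, set q0 to 0.
Step 2: If the sign of A is 1, add M to A.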
Step 2 is needed to leave the proper positive remainder in A at the end of
the n cycles of Step 1. The logic circuitry in Figure 17 can also be used to
perform this algorithm. Note that the Restore operations are no longer needed,
and that exactly one Add or Subtract operation is performed per cycle. Figure 19
shows how the division example in Figure 18 is executed by the non-restoring division
algorithm.
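The same register model supports a sketch of the non-restoring algorithm (again with Python integers standing in for the registers):

def nonrestoring_divide(dividend, divisor, n):
    A, Q, M = 0, dividend, divisor
    for _ in range(n):                        # Step 1, performed n times
        A = (A << 1) | ((Q >> (n - 1)) & 1)   # shift A and Q left
        Q = (Q << 1) & ((1 << n) - 1)
        A = A - M if A >= 0 else A + M        # subtract or add by sign of A
        if A >= 0:
            Q |= 1                            # set q0 = 1 when sign of A is 0
    if A < 0:                                 # Step 2: one corrective addition
        A += M
    return Q, A

assert nonrestoring_divide(8, 3, 4) == (2, 2)

Note that exactly one add or subtract is performed per cycle, with a single corrective addition at the end, which is the efficiency gain over the restoring method.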
There are no simple algorithms for directly performing division on signed operands
that are comparable to the algorithms for signed multiplication. In division, the
operands can be preprocessed to transform them into positive values. After using
one of the algorithms
just discussed, the results are transformed to the correct signed values, as necessary
FLOATING-POINT NUMBERS AND OPERATIONS:
The value of a signed binary fraction B = b0.b-1b-2...b-(n-1) is
F(B) = -b0 x 2^0 + b-1 x 2^-1 + b-2 x 2^-2 + ... + b-(n-1) x 2^-(n-1),
where the range of F is -1 <= F(B) <= 1 - 2^-(n-1). Because such fixed-point
representations can cover only a limited range of values, it is convenient
to define a floating-point number representation as one in which a number is
represented by its sign, a string of significant digits, commonly called the
mantissa, and an exponent to an implied base for the scale factor.
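As a concrete illustration of the sign/mantissa/exponent idea, the sketch below decomposes an IEEE 754 single-precision number into its three fields (the field widths are those of that widely used standard, which these notes do not define themselves):

import struct

def decompose(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF        # 8-bit biased exponent
    mantissa = bits & 0x7FFFFF            # 23-bit fraction (implied leading 1)
    return sign, exponent, mantissa

assert decompose(1.0) == (0, 127, 0)      # 1.0 = +1.0 x 2^(127 - 127)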
BASIC PROCESSING UNIT:
The heart of any computer is the central processing unit (CPU). The CPU
executes all the machine instructions and coordinates the activities of all other
units during the execution of an instruction. This unit is also called as the
Instruction Set Processor (ISP). By looking at its internal structure, we can
understand how it performs the tasks of fetching, decoding, and executing
instructions of a program. The processor is generally called the central processing
unit (CPU) or micro processing unit (MPU). A high-performance processor can be built
by making various functional units operate in parallel. High-performance processors
have a pipelined organization where the execution of one instruction is started
before the execution of the preceding instruction is completed. In another
approach, known as superscalar operation, several instructions are fetched and
executed at the same time. Pipelining and superscalar architectures can provide very
high performance.
A typical computing task consists of a series of steps specified by a sequence of
machine instructions that constitute a program. A program is a set of
instructions performing a meaningful task. An instruction is a command to the
processor and is executed by carrying out a sequence of sub-operations called micro-
operations. Figure 1 indicates various blocks of a typical processing unit. It consists of
PC, IR, ID, MAR, MDR, a set of register arrays for temporary storage, Timing and
Control unit as main units.
FUNDAMENTAL CONCEPTS:
Fig-1
In cases where an instruction occupies more than one word, steps 1 and 2 must
be repeated as many times as necessary to fetch the complete instruction. These two
steps together are usually referred to as the fetch phase; step 3 constitutes the
decoding phase; and step 4 constitutes the execution phase.
To study these operations in detail, let us examine the internal organization of
the processor. The main building blocks of a processor are interconnected in a
variety of ways. A very simple organization is shown in Figure 2. A more complex
structure that provides high performance will be presented at the end.
Figure shows an organization in which the arithmetic and logic unit (ALU) and
all the registers are interconnected through a single common bus, which is
internal to the processor. The data and address lines of the external memory bus are
shown in Figure 2 connected to the internal processor bus via the memory data
register, MDR, and the memory address register, MAR, respectively. Register MDR
has two inputs and two outputs. Data may be loaded into MDR either from the
memory bus or from the internal processor bus. The data stored in MDR may be
placed on either bus. The input of MAR is connected to the internal bus, and its
output is connected to the external bus. The control lines of the memory bus are
connected to the instruction decoder and control logic block. This unit is responsible
for issuing the signals that control the operation of all the units inside the processor
and for interacting with the memory bus.
The number and use of the processor registers R0 through R(n - 1) vary
considerably from one processor to another. Registers may be provided for general-
purpose use by the programmer. Some may be dedicated as special-purpose
registers, such as index registers or stack pointers. Three registers, Y, Z, and TEMP in
Figure 2, have not been mentioned before. These registers are transparent to the
programmer, that is, the programmer need not be concerned with them because
they are never referenced explicitly by any instruction. They are used by the
processor for temporary storage during execution of some instructions. These
registers are never used for storing data generated by one instruction for later
use by another instruction.
The multiplexer MUX selects either the output of register Y or a constant value 4 to be
provided as input A of the ALU. The constant 4 is used to increment the contents of
the program counter. We will refer to the two possible values of the MUX
control input Select as Select4 and Select Y for selecting the constant 4 or register Y,
respectively.
As instruction execution progresses, data are transferred from one register to
another, often passing through the ALU to perform some arithmetic or logic
operation. The instruction decoder and control logic unit is responsible for
implementing the actions specified by the instruction loaded in the IR register. The
decoder generates the control signals needed to select the registers involved and
direct the transfer of data. The registers, the ALU, and the interconnecting bus
are collectively referred to as the data path.
With few exceptions, an instruction can be executed by performing one or more of the
following operations in some specified sequence:
1. Transfer a word of data from one processor register to another or to the ALU.
2. Perform an arithmetic or a logic operation and store the result in a processor register.
3. Fetch the contents of a given memory location and load them into a processor register.
4. Store a word of data from a processor register into a given memory location.
We now consider in detail how each of these operations is implemented, using the
simple processor model in Figure.
Instruction execution involves a sequence of steps in which data are transferred from
one register to another. For each register, two control signals are used to place the
contents of that register on the bus or to load the data on the bus into the register.
This is represented symbolically in Figure 3. The input and output of register Ri are
connected to the bus via switches controlled by the signals Riin and Riout,
respectively. When Riin is set to 1, the data on the bus are loaded into Ri.
Similarly, when Riout is set to 1, the contents of register Ri are placed on the
bus. While Riout is equal to 0, the bus can be used for transferring data from
other registers.
Suppose that we wish to transfer the contents of register R1 to register R4. This can be
accomplished as follows:
1. Enable the output of register R1 by setting R1out to 1. This places the
contents of R1 on the processor bus.
2. Enable the input of register R4 by setting R4in to 1. This loads data from
the processor bus into register R4.
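A minimal sketch of this single-bus transfer in Python (the dictionary standing in for the register file and the one-cycle model are our illustrative simplifications):

def clock_cycle(regs, out_reg, in_regs):
    # out_reg is the register whose Rout signal is 1: its tri-state gate
    # drives the bus; in_regs lists registers whose Rin signals are 1.
    bus = regs[out_reg]
    for r in in_regs:
        regs[r] = bus                  # loaded at the active clock edge

regs = {"R1": 0x1234, "R4": 0x0000}
clock_cycle(regs, "R1", ["R4"])        # R1out = 1, R4in = 1
assert regs["R4"] == 0x1234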
All operations and data transfers within the processor take place within time
periods defined by the processor clock. The control signals that govern a particular
transfer are asserted at the start of the clock cycle. In our example, R1out and R4in
are set to 1. The registers consist of edge-triggered flip-flops. Hence, at the next
active edge of the clock, the flip-flops that constitute R4 will load the data present
at their inputs. At the same time, the control signals R1out and R4in will return to
0. We will use this simple model of the timing of data transfers for the rest of this
chapter. However, we should point out that other schemes are possible. For
example, data transfers may use both the rising and falling edges of the clock. Also,
when edge-triggered flip-flops are not used, two or more clock signals may be
needed to guarantee proper transfer of data. This is known as multiphase
clocking.
An implementation for one bit of register Ri is shown in Figure 3 as an
example. A two-input multiplexer is used to select the data applied to the input of
an edge-triggered D flip-flop. When the control input Riin is equal to 1, the
multiplexer selects the data on the bus. This data will be loaded into the flip-flop at
the rising edge of the clock. When Riin is equal to 0, the multiplexer feeds back the
value currently stored in the flip-flop.
The Q output of the flip-flop is connected to the bus via a tri-state gate. When
Riout is equal to 0, the gate's output is in the high-impedance (electrically
disconnected) state. This corresponds to the open-circuit state of a switch. When
Riout = 1, the gate drives the bus to 0 or 1, depending on the value of Q.
EXECUTION OF A COMPLETE INSTRUCTION:
Let us now put together the sequence of elementary operations required to execute
one instruction. Consider the instruction
Add (R3), R1
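This instruction adds the contents of the memory location whose address is in register R3 to register R1. The listing below is a reconstruction (the original is a figure that is not reproduced), based on the step-by-step description that follows and the control-signal names introduced earlier:

1  PCout, MARin, Read, Select4, Add, Zin
2  Zout, PCin, Yin, WMFC
3  MDRout, IRin
4  R3out, MARin, Read
5  R1out, Yin, WMFC
6  MDRout, SelectY, Add, Zin
7  Zout, R1in, End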
The listing shown above indicates the sequence of control steps
required to perform these operations for the single-bus architecture of Figure
2. Instruction execution proceeds as follows. In step 1, the instruction fetch
operation is initiated by loading the contents of the PC into the MAR and sending a
Read request to the memory. The Select signal is set to Select4, which causes the
multiplexer MUX to select the constant 4. This value is added to the operand at
input B, which is the contents of the PC, and the result is stored in register Z. The
updated value is moved from register Z back into the PC during step 2, while waiting
for the memory to respond. In step 3, the word fetched from the memory is loaded
into the IR. Steps 1 through 3 constitute the instruction fetch phase, which is the same
for all instructions. The instruction decoding circuit interprets the contents of the
IR at the beginning of step 4. This enables the control circuitry to activate the
control signals for steps 4 through 7, which constitute the execution phase. The
contents of register R3 are transferred to the MAR in step 4, and a memory read
operation is initiated.
Then the contents of R1 are transferred to register Y in step 5, to prepare for the
addition operation. When the Read operation is completed, the memory
operand is available in register MDR, and the addition operation is performed in step
6. The contents of MDR are gated to the bus, and thus also to the B input of the ALU,
and register Y is selected as the second input to the ALU by choosing Select Y.
The sum is stored in register Z, then transferred to R1 in step 7. The End signal
causes a new instruction fetch cycle to begin by returning to step 1.
This discussion accounts for all control signals in the listing above except Yin in step
2. There is no need to copy the updated contents of the PC into register Y when
executing the Add instruction. But, in Branch instructions the updated value of the
PC is needed to compute the Branch target address. To speed up the execution of
Branch instructions, this value is copied into register Y in step 2. Since step 2 is part
of the fetch phase, the same action will be performed for all instructions. This
does not cause any harm because register Y is not used for any other purpose at
that time.
Branch Instructions:
A branch instruction replaces the contents of the PC with the branch
target address. This address is usually obtained by adding an offset X, which is
given in the branch instruction, to the updated value of the PC. Listing in figure 8
below gives a control sequence that implements an unconditional branch
instruction. Processing starts, as usual, with the fetch phase. This phase ends when
the instruction is loaded into the IR in step 3. The offset value is extracted from
the IR by the instruction decoding circuit, which will also perform sign extension if
required. Since the value of the updated PC is already available in register Y, the
offset X is gated onto the bus in step 4, and an addition operation is performed.
The result, which is the branch target address, is loaded into the PC in step 5.
The offset X used in a branch instruction is usually the difference between the
branch target address and the address immediately following the branch instruction.
Fig 8
For example, if the branch instruction is at location 2000 and if the branch
target address is 2050, the value of X must be 46. The reason for this can be readily
appreciated from the control sequence in Figure 7. The PC is incremented during
the fetch phase, before knowing the type of instruction being executed. Thus, when
the branch address is computed in step 4, the PC value used is the updated value,
which points to the instruction following the branch instruction in the memory.
Consider now a conditional branch. In this case, we need to check the status of
the condition codes before loading a new value into the PC. For example, for a
Branch-on-negative (Branch<0) instruction, step 4 is replaced with
Offset-field-of-IRout, Add, Zin, If N = 0 then End
so that, if the N (negative) flag is 0, the End signal terminates the instruction; if N is 1,
the branch target address is computed and loaded into the PC as before.
MULTIPLE-BUS ORGANIZATION:
The resulting control sequences shown are quite long because only one data
item can be transferred over the bus in a clock cycle. To reduce the number of
steps needed, most commercial processors provide multiple internal paths that
enable several transfers to take place in parallel.
Figure 9 depicts a three-bus structure used to connect the registers and the ALU
of a processor. All general-purpose registers are combined into a single block
called the register file. In VLSI technology, the most efficient way to implement
a number of registers is in the form of an array of memory cells similar to those used
in the implementation of random-access memories (RAMs) described in the unit on the memory system.
The register file in Figure 9 is said to have three ports. There are two outputs, allowing
the contents of two different registers to be accessed simultaneously and have their
contents placed on buses A and B. The third port allows the data on bus C to be
loaded into a third register during the same clock cycle.
Buses A and B are used to transfer the source operands to the A and B
inputs of the ALU, where an arithmetic or logic operation may be performed. The
result is transferred to the destination over bus C. If needed, the ALU may simply pass
one of its two input operands unmodified to bus C. We will call the ALU control
signals for such an operation R=A or R=B. The three-bus arrangement obviates the
need for registers Y and Z in Figure 2.
Fig 9: Multiple-bus organization
The control sequence for executing a three-operand instruction such as Add R4, R5, R6
(an assumed example) is described next. In step
1, the contents of the PC are passed through the ALU, using the R=B control signal, and
1, the contents of the PC are passed through the ALU, using the R=B control signal, and
loaded into the MAR to start a memory read operation. At the same time the
PC is incremented by 4. Note that the value loaded into MAR is the original contents
of the PC. The incremented value is loaded into the PC at the end of the clock
cycle and will not affect the contents of MAR. In step 2, the processor waits for
MFC and loads the data received into MDR, then transfers them to IR in step 3.
Finally, the execution phase of the instruction requires only one control step to
complete, step 4.
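For reference, the four steps just described can be written out as a control sequence (Add R4, R5, R6 is the assumed example instruction, and the signal names follow the description above):

1  PCout, R=B, MARin, Read, IncPC
2  WMFC
3  MDRoutB, R=B, IRin
4  R4outA, R5outB, SelectA, Add, R6in, End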
By providing more paths for data transfer, a significant reduction in the
number of clock cycles needed to execute an instruction is achieved.
HARDWIRED CONTROL:
To execute instructions, the processor must have some means of generating the
control signals needed in the proper sequence. Computer designers use a wide
variety of techniques to solve this problem. The approaches used fall into one of
two categories: hardwired control and microprogrammed control. We discuss each of
these techniques in detail, starting with hardwired control in this section.
Consider the sequence of control signals given in Figure 7. Each step in
this sequence is completed in one clock period. A counter may be used to keep
track of the control steps, as shown in Figure 11. Each state, or count, of this
counter corresponds to one control step. The required control signals are
determined by the following information:
1. Contents of the control step counter
2. Contents of the instruction register
3. Contents of the condition code flags
4. External input signals, such as MFC and interrupt requests
To gain insight into the structure of the control unit, we start with a simplified
view of the hardware involved. The decoder/encoder block in Figure is a
combinational circuit that generates the required control outputs, depending on the
state of all its inputs. By separating the decoding and encoding functions, we obtain the
more detailed block diagram in Figure 12. The step decoder provides a separate signal
line for each step, or time slot, in the control sequence. Similarly, the output of the
instruction decoder consists of a separate line for each machine instruction. For any
instruction loaded in the IR, one of the output lines INS1 through INSm is set to 1, and
all other lines are set to 0. (For design details of decoders, refer to Appendix A.) The
input signals to the encoder block in Figure 12 are combined to generate the
individual control signals Yin, PCout, Add, End, and so on. An example of how the
encoder generates the Zin control signal for the processor organization in Figure 2
is given in Figure 13. This circuit implements the logic function
Zin = T1 + T6 . ADD + T4 . BR + ...
This signal is asserted during time slot T1 for all instructions, during T6 for an
Add instruction, during T4 for an unconditional branch instruction, and so on.
The logic function for Zin is derived from the control sequences in Figures 7 and
8. As another example, Figure 14 gives a circuit that generates the End control
signal from the logic function
End = T7 . ADD + T5 . BR + (T5 . N + T4 . N') . BRN + ...
The End signal starts a new instruction fetch cycle by resetting the control step
counter to its starting value. Figure 12 contains another control signal called RUN.
When set to 1, RUN causes the counter to be incremented by one at the end of every
clock cycle. When RUN is equal to 0, the counter stops counting. This is needed
whenever the WMFC signal is issued, to cause the processor to wait for the reply from
the memory.
Fig 12
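As a rough sketch of how such encoder logic can be expressed (the names follow the two logic functions above; representing the one-hot time slots T1..T7 as booleans is an illustrative simplification, not a hardware description):

def encoder(T, ins, N):
    # T[1]..T[7]: step-decoder outputs; ins: decoded instruction;
    # N: condition-code flag.
    Zin = T[1] or (T[6] and ins == "ADD") or (T[4] and ins == "BR")
    End = ((T[7] and ins == "ADD") or (T[5] and ins == "BR") or
           (ins == "BRN" and ((T[5] and N) or (T[4] and not N))))
    return Zin, End

T = [False] * 8
T[6] = True                                   # time slot T6
assert encoder(T, "ADD", False) == (True, False)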
Fig 13a
The control hardware shown can be viewed as a state machine that changes
from one state to another in every clock cycle, depending on the contents of the
instruction register, the condition codes, and the external inputs. The outputs of the
state machine are the control signals. The sequence of operations carried out by this
machine is determined by the wiring of the logic elements, hence the name
"hardwired." A controller that uses this approach can operate at high speed.
However, it has little flexibility, and the complexity of the instruction set it can
implement is limited.
Fig 13b
MICROPROGRAMMED CONTROL
ALU is the heart of any computing system, while Control unit is its brain. The
design of a control unit is not unique; it varies from designer to designer.
Some of the commonly used control logic design methods are:
• Sequence register & decoder method
Fig 15
The control signals required inside the processor can be generated using
a control step counter and a decoder/ encoder circuit. Now we discuss an alternative
scheme, called micro programmed control, in which control signals are generated
by a program similar to machine language programs.
First, we introduce some common terms. A control word (CW) is a word whose
individual bits represent the various control signals in Figure 12. Each of the control
steps in the control sequence of an instruction defines a unique combination of 1s and
0s in the CW. The CWs corresponding to the 7 steps of Figure 7 are shown in Figure 15.
We have assumed that SelectY is represented by Select = 0 and Select4 by Select = 1.
A sequence of CWs corresponding to the control sequence of a machine
instruction constitutes the micro routine for that instruction, and the individual
control words in this micro routine are referred to as microinstructions.
The micro routines for all instructions in the instruction set of a computer
are stored in a special memory called the control store. The control unit can
generate the control signals for any instruction by sequentially reading the CWs of the
corresponding micro routine from the control store. This suggests organizing the
control unit as shown in Figure 16. To read the control words sequentially from
the control store, a micro program counter (µPC) is used. Every time a new
instruction is loaded into the IR, the output of the block labeled "starting address
generator" is loaded into the µPC. The µPC is then automatically incremented by
the clock, causing successive microinstructions to be read from the control store.
Hence, the control signals are delivered to various parts of the processor in the
correct sequence.
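A minimal sketch of this sequencing idea in Python follows (the control words are those of the 7-step Add example above; the flat control store and the automatic increment are simplifications of the organization in Figure 16):

control_store = [
    {"PCout", "MARin", "Read", "Select4", "Add", "Zin"},   # step 1
    {"Zout", "PCin", "Yin", "WMFC"},                       # step 2
    {"MDRout", "IRin"},                                    # step 3
    {"R3out", "MARin", "Read"},                            # step 4
    {"R1out", "Yin", "WMFC"},                              # step 5
    {"MDRout", "SelectY", "Add", "Zin"},                   # step 6
    {"Zout", "R1in", "End"},                               # step 7
]

def run(start=0):
    upc = start                    # starting address generator output
    while True:
        cw = control_store[upc]    # control signals driven this cycle
        yield cw
        if "End" in cw:            # End returns control to instruction fetch
            return
        upc += 1                   # uPC incremented by the clock

for cw in run():
    print(sorted(cw))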
One important function of the control unit cannot be implemented by the simple
organization in Figure 16. This is the situation that arises when the control unit
is required to check the status of the condition codes or external inputs to choose
between alternative courses of action. In the case of hardwired control, this situation
is handled by including an appropriate logic function, in the encoder circuitry. In
micro programmed control, an alternative approach is to use conditional
branch microinstructions. In addition to the branch address, these
microinstructions specify which of the external inputs, condition codes, or, possibly,
bits of the instruction register should be checked as a condition for branching to take
place.
The instruction Branch<0 may now be implemented by a micro routine such as
that shown in Figure 17. After loading this instruction into the IR, a branch
microinstruction transfers control to the corresponding micro routine in the
control store.
Fig 16
Fig 17
Fig 18
Microinstructions
Having described a scheme for sequencing microinstructions, we now take a
closer look at the format of individual microinstructions. A straightforward way to
structure microinstructions is to assign one bit position to each control signal, as in
Figure 15.
However, this scheme has one serious drawback — assigning individual bits to
each control signal results in long microinstructions because the number of required
signals is usually large. Moreover, only a few bits are set to 1 (to be used for active
gating) in any given microinstruction, which means the available bit space is poorly
used. Consider again the simple processor of Figure 2, and assume that it contains
only four general-purpose registers, R0, R1, R2, and R3. Some of the connections
in this processor are permanently enabled, such as the output of the IR to the decoding
circuits and both inputs to the ALU. The remaining connections to various registers require a
total of 20 gating signals. Additional control signals not shown in the figure are also
needed, including the Read, Write, Select, WMFC, and End signals. Finally, we must specify
the function to be performed by the ALU. Let us assume that 16 functions are provided,
including Add, Subtract, AND, and XOR. These functions depend on the particular ALU used
and do not necessarily have a one-to-one correspondence with the machine instruction OP
codes. In total, 42 control signals are needed.
If we use the simple encoding scheme described earlier, 42 bits would be needed in
each microinstruction. Fortunately, the length of the microinstructions can be reduced
easily. Most signals are not needed simultaneously, and many signals are mutually
exclusive. For example, only one function of the ALU can be activated at a time. The
source for a data transfer must be unique because it is not possible to gate the contents of
two different registers onto the bus at the same time. Read and Write signals to the
memory cannot be active simultaneously. This suggests that signals can be grouped so that
all mutually exclusive signals are placed in the same group. Thus, at most one micro
operation per group is specified in any microinstruction. Then it is possible to use a binary
coding scheme to represent the signals within a group. For example, four bits suffice to
represent the 16 available functions in the ALU. Register output control signals can be
placed in a group consisting of PCout, MDRout, Zout, Offsetout, R0out, R1out, R2out, R3out,
and TEMPout. Any one of these can be selected by a unique 4-bit code.
Further natural groupings can be made for the remaining signals. Figure 19 shows an
example of a partial format for the microinstructions, in which each group occupies a
field large enough to contain the required codes. Most fields must include one inactive code
for the case in which no action is required. For example, the all-zero pattern in F1 indicates
that none of the registers that may be specified in this field should have its contents placed
on the bus. An inactive code is not needed in all fields.
For example, F4 contains 4 bits that specify one of the 16 operations performed in the
ALU. Since no spare code is included, the ALU is active during the execution of every
microinstruction. However, its activity is monitored by the rest of the machine through
register Z, which is loaded only when the Zin signal is activated.
Grouping control signals into fields requires a little more hardware because decoding
circuits must be used to decode the bit patterns of each field into individual control
signals. The cost of this additional hardware is more than offset by the reduced
number of bits in each microinstruction, which results in a smaller
control store. In Figure 19, only 20 bits are needed to store the patterns for the 42
signals.
Fig 19
Such full encoding is likely to further reduce the length of micro words but also to
increase the complexity of the required decoder circuits.
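To make the field-encoding idea concrete, here is a minimal sketch in Python (the particular field layout and code assignments are illustrative assumptions, not the exact format of Figure 19):

F1 = {"none": 0, "PCout": 1, "MDRout": 2, "Zout": 3, "R0out": 4,
      "R1out": 5, "R2out": 6, "R3out": 7, "TEMPout": 8, "Offsetout": 9}
F4 = {"Add": 0, "Sub": 1, "AND": 2, "XOR": 3}   # 4 bits allow 16 ALU codes

def encode(src, alu):
    # pack the 4-bit F1 code and the 4-bit F4 code into one microword
    return (F1[src] << 4) | F4[alu]

def decode(word):
    # decoding circuits expand each field back into individual signals
    src = {v: k for k, v in F1.items()}[(word >> 4) & 0xF]
    alu = {v: k for k, v in F4.items()}[word & 0xF]
    return src, alu

w = encode("PCout", "Add")
assert decode(w) == ("PCout", "Add")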
Part-A
2 MARKS QUESTIONS WITH ANSWERS
1. What is instruction cycle?
Ans: An instruction cycle (sometimes called fetch-decode-execute cycle) is the basic
operation cycle of a computer. It is the process by which a computer retrieves a program instruction
from its memory, determines what actions the instruction requires, and carries out those actions.
2. What are the lists of memory reference instructions?
Ans: Memory reference instructions
Instructions whose execution requires reading an operand from, or writing a result to,
memory are called memory reference instructions. The micro-operations they require
must be precisely defined so that the instructions can be carried out in sequence. The
basic memory reference instructions are:
1. AND to AC
2. ADD to AC
3. LDA: Load to AC
4. STA: Store AC
5. BUN: Branch Unconditionally
6. BSA: Branch and Save Return Address
7. ISZ: Increment and Skip if Zero
3. What are the types of interrupts?
Ans: An interrupt is a signal that causes the processor to suspend the currently running
process and transfer control to a routine that services the new request.
Types of Interrupts
Generally there are three types of interrupts:
1) Internal Interrupt
2) Software Interrupt
3) External Interrupt
IMPORTANT QUESTIONS
2 MARKS
1. Define Interrupt? List the types of interrupts? [Jun 2015]
2. Write about register transfer notations? [Jun 2015]
3. Define the terms guard bits and rounding with respect to floating point operations? [Dec 2015]
4. Design an adder to add two 4 bit numbers? [Dec 2015]
5. Explain software and hardware interrupts? [III-Dec 2015]
6. Define effective address? [III-Dec 2015]
7. Write any two main features of Booth’s algorithm? [III-Dec 2015]
Part- B
10 MARKS
1. Draw the flow chart for Booth’s algorithm for multiplication of signed 2’s complement numbers
and explain with an example?
[Jun 2015]
2. What is an addressing mode? A two word instruction is stored on memory at an address designated
by symbol W. The address field of instruction (stored at W+1) is designated by the symbol Y. The
operand used during the execution of the instruction is stored at an address symbolized by Z. An
index register contains the value X. State how Z is calculated from the other addresses if the
addressing mode of the instruction is (i) Direct (ii) Indirect (iii) Relative (iv) Indexed
[Jun 2015]
3. Multiply 7 and 3 using Booth’s Algorithm? [Dec 2015]
4. Perform division of 1000 and 0011 using restoring division algorithm? [Dec 2015]
5. Explain the way that the interrupts are handled by the computer by means of flow chart? [III-Dec
2015]
6. (A). Explain LDA and STA memory reference instructions with suitable examples?
(B). Describe briefly about the register organization for floating point operations? [III-Dec 2015]
UNIT 3
MEMORY SYSTEM
BASIC CONCEPTS
The maximum size of the memory that can be used in any computer is determined by
the addressing scheme.
For example, a 16-bit computer that generates 16-bit addresses is capable of addressing
up to 2^16 = 64K memory locations. If a machine generates 32-bit addresses, it can
access up to 2^32 = 4G memory locations. This number represents the size of the address
space of the computer.
With the above structure a READ or WRITE may involve an entire memory word
or it may involve only a byte. In the case of byte read, other bytes can also be read but
ignored by the CPU. However, during a write cycle, the control circuitry of the MM must
ensure that only the specified byte is altered. In this case, the higher-order 30 bits can
specify the word and the lower-order 2 bits can specify the byte within the word.
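A small sketch of this address split (4-byte words are assumed, and the function name is ours):

def split_address(addr):
    word = addr >> 2        # higher-order 30 bits select the word
    byte = addr & 0x3       # lower-order 2 bits select the byte within it
    return word, byte

assert split_address(0x1007) == (0x401, 3)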
CPU-Main Memory Connection - A block schematic:
(Figure: the CPU's MAR drives a k-bit address bus and its MDR an n-bit data bus,
addressing up to 2^k locations in the main memory.)
From the system standpoint, the Main Memory (MM) unit can be viewed as a "black
box". Data transfer between the CPU and the MM takes place through the use of two CPU
registers, usually called MAR (Memory Address Register) and MDR (Memory Data
Register). If MAR is k bits long and MDR is n bits long, then the MM unit may contain
up to 2^k addressable locations, each n bits wide, so the word length is n bits. During a
"memory cycle", n bits of data may be transferred between the MM and the CPU. This
transfer takes place over the processor bus, which has k address lines (address bus),
n data lines (data bus) and control lines like Read, Write, Memory Function Completed
(MFC), byte specifiers etc. (control bus). For a read operation, the CPU loads the
address into MAR, sets READ to 1 and sets other control signals if required. The data
from the MM are loaded into MDR and MFC is set to 1. For a write operation, MAR and MDR
are suitably loaded by the CPU, WRITE is set to 1 and other control signals are set
suitably. The MM control circuitry loads the data into the appropriate locations and
sets MFC to 1. This organization is shown in the block schematic above.
Some Basic Concepts
Memory Access Time: -
It is a useful measure of the speed of the memory unit. It is the time that elapses
between the initiation of an operation and the completion of that operation (for
example, the time between READ and MFC).
Memory Cycle Time :-
It is an important measure of the memory system. It is the minimum time delay
required between the initiations of two successive memory operations (for example, the
time between two successive READ operations). The cycle time is usually slightly longer
than the access time.
Cache Memory:-
The CPU of a computer can usually process instructions and data faster than they
can be fetched from a compatibly priced main memory unit. Thus the memory cycle time
becomes the bottleneck in the system. One way to reduce the memory access time is to
use a cache memory. This is a small, fast memory that is inserted between the larger,
slower main memory and the CPU. It holds the currently active segments of a program
and their data. Because of the locality of address references, the CPU can, most of the
time, find the relevant information in the cache memory itself (cache hit) and only
infrequently needs access to the main memory (cache miss). With a suitable size of
cache memory, cache hit rates of over 90% are possible, leading to a cost-effective
increase in the performance of the system.
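As a worked example with assumed, illustrative numbers: if the cache access time is C = 10 ns, the main memory access time is M = 100 ns, and the hit rate is h = 0.95, the average access time is
t_avg = h x C + (1 - h) x M = 0.95 x 10 + 0.05 x 100 = 14.5 ns,
which is close to the speed of the cache rather than that of the main memory.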
Memory Interleaving: -
This technique divides the memory system into a number of memory modules
and arranges addressing so that successive words in the address space are placed in
different modules. When requests for memory access involve consecutive addresses,
the access will be to different modules. Since parallel access to these modules is
possible, the average rate of fetching words from the Main Memory can be increased.
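A minimal sketch of low-order interleaving (the module count m = 4 is an assumed example):

def interleave(addr, m=4):
    module = addr % m           # low-order bits pick the module
    offset = addr // m          # remaining bits pick the word within it
    return module, offset

# consecutive addresses 0..3 land in different modules 0, 1, 2, 3,
# so they can be fetched in parallel
assert [interleave(a)[0] for a in range(4)] == [0, 1, 2, 3]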
Virtual Memory: -
In a virtual memory System, the address generated by the CPU is referred to as
a virtual or logical address. The corresponding physical address can be different and the
required mapping is implemented by a special memory control unit, often called
the memory management unit. The mapping function itself may be changed during
program execution according to system requirements.
Because of the distinction made between the logical (virtual) address space and
the physical address space, the former can be as large as the addressing
capability of the CPU, while the actual physical memory can be much smaller. Only the
active portion of the virtual address space is mapped onto the physical memory and the
rest of the virtual address space is mapped onto the bulk storage device used. If the
addressed information is in the Main Memory (MM), it is accessed and execution
proceeds. Otherwise, an exception is generated, in response to which the memory
management unit transfers a contiguous block of words containing the desired word
from the bulk storage unit to the MM, displacing some block that is currently inactive.
If the memory is managed in such a way that such transfers are required relatively
infrequently (i.e., the CPU will generally find the required information in the MM), the
virtual memory system can provide reasonably good performance and succeed in
creating the illusion of a large memory with a small, inexpensive MM.
Memory cells are usually organized in the form of an array, in which each cell is
capable of storing one bit of information. Each row of cells constitutes a memory word
and all cells of a row are connected to a common line called the word line. The cells in
each column are connected to a Sense/Write circuit by two bit lines. Figure 3.1 shows
a possible arrangement of memory cells.
The Sense/Write circuits are connected to the data input or output lines of the chip.
During a write operation, the Sense/Write circuit receives input information and stores
it in the cells of the selected word. The data input and data output of each Sense/Write
circuit are connected to a single bidirectional data line that can be connected to a data
bus of the CPU.
CS → Chip Select input selects a given chip in the multi-chip memory system
SEMICONDUCTOR RAM MEMORIES:
Semiconductor memories are available in a wide range of speeds. Their cycle
times range from 100 ns down to 10 ns. When first introduced in the late 1960s, they
were much more expensive. But now they are very cheap, and are used almost exclusively
in implementing main memories.
STATIC MEMORIES:
Memories that consist of circuits capable of retaining their state as long as power is
applied are known as static memory.
Static random-access memory (SRAM) is a type of semiconductor memory that uses
bistable latching circuitry to store each bit. The term static differentiates it from dynamic
RAM (DRAM) which must be periodically refreshed. SRAM exhibits data remanence, but
is still volatile in the conventional sense that data is eventually lost when the memory is
not powered. Figure 3.2 shows the implementation of static RAM.
Two inverters are cross connected to form a latch. The latch is connected to two bit lines
by transistors T1 and T2. These transistors act as switches that can be opened / closed
under the control of the word line. When the word line is at ground level, the transistors
are turned off and the latch retains its state.
Read Operation:
In order to read the state of the SRAM cell, the word line is activated to close switches T1
and T2. If the cell is in state 1, the signal on bit line b is high and the signal on bit line
bꞌ is low. Thus b and bꞌ are complements of each other. The Sense/Write circuit at the end
of the bit lines monitors the states of b and bꞌ and sets the output accordingly.
Write Operation:
The state of the cell is set by placing the appropriate value on bit line b and its
complement on bꞌ and then activating the word line. This forces the cell into the
corresponding state. The required signals on the bit lines are generated by the
Sense/Write circuit.
CMOS RAM CELL
Transistor pairs (T3, T5) and (T4, T6) form the inverters in the latch. In state 1, the
voltage at point X is maintained high by having T3 and T6 on, while T4 and T5 are off.
Thus, if T1 and T2 are turned on (closed), bit lines b and bꞌ will have high and low
signals, respectively. The CMOS cell requires 5 V (in older versions) or 3.3 V (in newer
versions) of power supply voltage.
Merits:
• It has low power consumption because the current flows in the cell only when the
cell is being accessed.
• Static RAMs can be accessed quickly. Their access time is a few nanoseconds.
Demerit:
• SRAMs are said to be volatile memories because their contents are lost when the
power is interrupted.
Asynchronous DRAMs:
Less expensive RAMs can be implemented if simpler cells are used. Such cells cannot
retain their state indefinitely. Hence they are called Dynamic RAMs (DRAMs). The
information stored in a dynamic memory cell is in the form of a charge on a capacitor,
and this charge can be maintained only for a few milliseconds. The contents must be
periodically refreshed by restoring the capacitor charge to its full value.
In order to store information in the cell, the transistor T is turned on and the appropriate
voltage is applied to the bit line, which charges the capacitor. After the transistor is
turned off, the capacitor begins to discharge through its own leakage resistance. Hence
the information stored in the cell can be retrieved correctly only before the charge on
the capacitor drops below a threshold value, as shown in Figure 3.4.
If charge on capacitor > threshold value →Bit line will have logic value 1.
If charge on capacitor < threshold value → Bit line will set to logic value 0.
During a read operation, the transistor is turned on and a sense amplifier connected to
the bit line detects whether the charge on the capacitor is above the threshold value.
A 16-megabit DRAM chip, configured as 2M x 8, is shown in Figure 3.5.
DESCRIPTION:
The 4096 cells in each row are divided into 512 groups of 8, so each row stores 512
bytes. A 21-bit address is needed to access a byte in the memory (12 bits to select a
row, and 9 bits to specify the group of 8 bits in the selected row).
A(20-9) → Row address of a byte.
A(8-0) → Column address of a byte.
During a Read/Write operation, the row address is applied first. It is loaded into the row
address latch in response to a signal pulse on the Row Address Strobe (RAS) input of the
chip. When a Read operation is initiated, all cells on the selected row are read and
refreshed. Shortly after the row address is loaded, the column address is applied to the
address pins and loaded into the column address latch under control of the Column
Address Strobe (CAS) input. The information in this latch is decoded and the appropriate
group of 8 Sense/Write circuits is selected. When R/W = 1 (read operation), the output
values of the selected circuits are transferred to the data lines D0-D7. When R/W = 0
(write operation), the information on D0-D7 is transferred to the selected circuits.
RAS and CAS are active low (usually written with overbars), so that they cause the
latching of addresses when they change from high to low. To ensure that the
contents of a DRAM are maintained, each row of cells must be accessed periodically.
A refresh operation usually performs this function automatically. A specialized memory
controller circuit provides the necessary control signals, RAS and CAS, that govern the
timing. The processor must take into account the delay in the response of the memory.
Such memories are referred to as Asynchronous DRAMs.
Fast Page Mode:
Transferring successive bytes is achieved by applying a consecutive sequence of
column addresses under the control of successive CAS signals. This scheme allows
transferring a block of data at a faster rate. This block transfer capability is
called fast page mode.
Synchronous DRAM:
Here the operations are directly synchronized with clock signal. The address and data
connections are buffered by means of registers. The output of each sense amplifier is
connected to a latch. A Read operation causes the contents of all cells in the selected
row to be loaded in these latches. The Figure 3.6 shows the structure of SDRAM.
Data held in the latches that correspond to the selected columns are transferred into
the data output register, thus becoming available on the data output pins.
First, the row address is latched under control of RAS signal. The memory
typically takes 2 or 3 clock cycles to activate the selected row. Then the column address is
latched under the control of CAS signal. After a delay of one clock cycle, the first set of data bits
is placed on the data lines. The SDRAM automatically increments the column address to access
the next 3 sets of bits in the selected row, which are placed on the data lines in the next 3 clock
cycles. A timing diagram for a typical burst read of length 4 is shown in Figure 3.7.
Two parameters are commonly used to characterize the performance of a memory:
• Latency
• Bandwidth
Latency refers to the amount of time it takes to transfer a word of data to or from the
memory. For a transfer of single word, the latency provides the complete indication of
memory performance. For a block transfer, the latency denotes the time it takes to
transfer the first word of data.
Bandwidth is defined as the number of bits or bytes that can be transferred in one
second. Bandwidth mainly depends upon the speed of access to the stored data and on
the number of bits that can be accessed in parallel.
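As a rough worked example with assumed numbers: a memory with an 8-byte (64-bit) data bus clocked at 133 MHz can transfer at most 8 x 133 x 10^6 ≈ 1.06 GB/s if one transfer occurs per clock cycle; transferring data on both clock edges, as the DDR SDRAM described next does, doubles this peak bandwidth to about 2.1 GB/s.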
Double Data Rate SDRAM (DDR-SDRAM):
The standard SDRAM performs all actions on the rising edge of the clock signal.
The double data rate SDRAM transfers data on both edges of the clock (the leading
edge and the trailing edge). The bandwidth of DDR-SDRAM is doubled for long burst
transfers. To make it
possible to access the data at high rate, the cell array is organized into two banks. Each
bank can be accessed separately. Consecutive words of a given block are stored in
different banks. Such interleaving of words allows simultaneous access to two words
that are transferred on successive edge of the clock.
Larger Memories: Dynamic Memory System:
The physical implementation is done in the form of memory modules. If a large
memory is built by placing DRAM chips directly on the main system printed circuit board
that contains the processor (often referred to as the motherboard), it will occupy a large
amount of space on the board. These packaging considerations have led to the
development of larger memory units known as SIMMs and DIMMs.
MEMORY SYSTEM CONSIDERATION:
To reduce the number of pins, the dynamic memory chips use multiplexed
address inputs. The address is divided into two parts. They are,
High Order Address Bit (Select a row in cell array and it is provided first and latched
into memory chips under the control of RAS signal).
Low Order Address Bit (Selects a column and they are provided on same address
pins and latched using CAS signals).
The Multiplexing of address bit is usually done by Memory Controller Circuit, as
shown in Figure 3.8.
The Controller accepts a complete address and R/W signal from the processor, under
the control of a request signal which indicates that a memory access operation is
needed. The Controller then forwards the row and column portions of the address to the
memory and generates RAS and CAS signals. It also sends R/W and CS signals to the
memory. The CS signal is usually active low, hence it is shown with an overbar as CS.
Refresh Overhead:
All dynamic memories have to be refreshed. In a typical DRAM, the period for refreshing
all rows is 16 ms, whereas it is 64 ms in SDRAM.
Rambus Memory:
The use of a wide bus is expensive. Rambus therefore developed an implementation using a
narrow bus. Rambus technology is a fast signaling method used to transfer information
between chips. Instead of using signals that have voltage levels of either 0 or Vsupply to
represent the logical values, the signals consist of much smaller voltage swings around a
reference voltage Vref. The reference Voltage is about 2V and the two logical values are
represented by 0.3V swings above and below Vref.
A two-channel Rambus has 18 data lines and no separate address lines. It is also
called Direct RDRAM. Communication between the processor (or some other device that
can serve as a master) and RDRAM modules (which serve as slaves) is carried out by means
of packets transmitted on the data lines.
There are 3 types of packets. They are,
• Request
• Acknowledge
• Data
Figure 3.9: ROM cell
At logic value '0' → Transistor (T) is connected to the ground point (P). The transistor
switch is closed and the voltage on the bit line drops nearly to zero.
At logic value '1' → The transistor switch is open. The bit line remains at a high voltage.
To read the state of the cell, the word line is activated. A Sense circuit at the end of
the bit line generates the proper output value.
Different types of non-volatile memory are:
• PROM
• EPROM
• EEPROM
• Flash Memory
Merits:
• It provides flexibility.
• It is faster.
• It is less expensive because they can be programmed directly by the user.
EPROM:
In an EPROM, a special transistor is used, which has the ability to function either as a
normal transistor or as a disabled, permanently open switch, by injecting charge into it
that becomes trapped inside.
Erasure requires dissipating the charges trapped in the transistor of memory
cells. This can be done by exposing the chip to ultraviolet light, so that EPROM chips are
mounted in packages that have transparent windows.
Merits:
• It provides flexibility during the development phase of digital system.
• It is capable of retaining the stored information for a long time.
Demerits:
• The chip must be physically removed from the circuit for reprogramming and its
entire contents are erased by UV light.
EEPROM:
When larger amounts of static data are to be stored (such as in USB flash drives) a
specific type of EEPROM such as flash memory is more economical than traditional
EEPROM devices. EEPROMs are realized as arrays of floating-gate transistors. EEPROM is
user modifiable read only memory (ROM) that can be erased and reprogrammed
(written to) repeatedly through the application of higher than normal electrical voltage
generated externally or internally in the case of modern EEPROMs.
EPROM usually must be removed from the device for erasing and programming, whereas
EEPROMs can be programmed and erased in circuit. Originally, EEPROMs were limited to
single byte operations which made them slower, but modern EEPROMs allow multi-byte
page operations. It also has a limited life - that is, the number of times it could be
reprogrammed was limited to tens or hundreds of thousands of times.
That limitation has been extended to a million write operations in modern EEPROMs.
In an EEPROM that is frequently reprogrammed while the computer is in use, the life of
the EEPROM can be an important design consideration. It is for this reason that
EEPROMs were used for configuration information, rather than random access memory.
Merits:
It can be both programmed and erased electrically.
It allows cell contents to be erased selectively.
Demerits:
It requires different voltage for erasing, writing and reading the stored data.
FLASH MEMORY:
In EEPROM, it is possible to read and write the contents of a single cell. In Flash
device, it is possible to read the contents of a single cell but it is only possible to write
the entire contents of a block. Prior to writing, the previous contents of the block are
erased.
E.g.: In an MP3 player, the flash memory stores the data that represents sound.
Single flash chips cannot provide sufficient storage capacity for embedded system
applications. There are 2 methods for implementing larger memory modules consisting of
a number of chips. They are,
• Flash Cards
• Flash Drives.
Merits:
• Flash drives have greater density which leads to higher capacity and low cost per
bit.
• It requires single power supply voltage and consumes less power in their
operation.
Flash Cards:
One way of constructing a larger module is to mount flash chips on a small card.
Such flash cards have a standard interface. The card is simply plugged into a conveniently
accessible slot. Typical memory sizes are 8, 32 and 64 MB. E.g.: A minute of music can be
stored in about 1 MB of memory, so a 64 MB flash card can store an hour of music.
Flash Drives:
Larger flash memory modules, called flash drives, can be developed to replace a hard
disk drive. The flash drives are designed to fully emulate hard disks. The flash drives
are solid state electronic devices that have no movable parts.
Merits:
• They have shorter seek and access times, which results in a faster response.
• They have low power consumption, which makes them attractive for battery-driven
applications.
Demerits:
• The capacity of a flash drive (<1 GB) is less than that of a hard disk (>1 GB).
• It leads to a higher cost per bit.
• Flash memory will deteriorate after it has been written a number of times
(typically at least 1 million times).
SPEED, SIZE AND COST:
Magnetic Disk:
A huge amount of cost-effective storage can be provided by magnetic disks. The main
memory can be built with DRAMs, which leaves SRAMs to be used in smaller units where
speed is of the essence.
MEMORY HIERARCHY
To this point in our study of systems, we have relied on a simple model of a computer
system as a CPU that executes instructions and a memory system that holds instructions
and data for the CPU. In our simple model, the memory system is a linear array of bytes,
and the CPU can access each memory location in a constant amount of time. While this is an
effective model as far as it goes, it does not reflect the way that modern systems really
work. In practice, a memory system is a hierarchy of storage devices with different
capacities, costs, and access times. CPU registers hold the most frequently used data. Small,
fast cache memories nearby the CPU act as staging areas for a subset of the data and
instructions stored in the relatively slow main memory.
The main memory stages data stored on large, slow disks, which in turn often serve as
staging areas for data stored on the disks or tapes of other machines connected by
networks.
Memory hierarchies work because well written programs tend to access the storage at
any particular level more frequently than they access the storage at the next lower level.
So the storage at the next level can be slower, and thus larger and cheaper per bit. The
overall effect is a large pool of memory that costs as much as the cheap storage near the
bottom of the hierarchy, but that serves data to programs at the rate of the fast storage
near the top of the hierarchy.
As a programmer, you need to understand the memory hierarchy because it has a big
impact on the performance of your applications. If the data your program needs are
stored in a CPU register, then they can be accessed in zero cycles during the execution of
the instruction. If stored in a cache, 1 to 30 cycles. If stored in main memory, 50 to 200
cycles. And if stored on disk, tens of millions of cycles! The entire computer memory can
be realized as the hierarchy shown in the figure below.
Here, then, is a fundamental and enduring idea in computer systems: If you understand
how the system moves data up and down the memory hierarchy, then you can write
your application programs so that their data items are stored higher in the hierarchy,
where the CPU can access them more quickly. This idea centers on a fundamental
property of computer programs known as locality.
Programs with good locality tend to access the same set of data items over and over
again, or they tend to access sets of nearby data items. Programs with good locality tend
to access more data items from the upper levels of the memory hierarchy than programs
with poor locality, and thus run faster.
For example, the running times of different matrix multiplication kernels that perform
the same number of arithmetic operations, but have different degrees of locality, can
vary by a factor of 20!
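A sketch of two such kernels is given below. Python is used only to make the access patterns explicit; the factor-of-20 effect shows up in compiled, row-major languages such as C, where the ijk order strides down the columns of B (poor spatial locality) while the ikj order walks B and C along rows (good spatial locality):

def matmul_ijk(A, B, C, n):
    # inner loop reads a column of B: in row-major storage these elements
    # are far apart, so each access may touch a new cache block
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]

def matmul_ikj(A, B, C, n):
    # inner loop reads a row of B and a row of C: consecutive elements,
    # so most accesses hit the block already in the cache
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]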
Figure: Memory Hierarchy
Types of Cache Memory:
The Cache memory is of 2 types. They are:
• Primary /Processor Cache (Level1 or L1 cache)
• Secondary Cache (Level2 or L2 cache)
Primary Cache → It is always located on the processor chip.
Secondary Cache → It is placed between the primary cache and the rest of the
memory.
The main memory is implemented using the dynamic components (SIMM, RIMM,
and DIMM). The access time for main memory is about 10 times longer than the access
time for L1 cache.
CACHE MEMORY
Cache memory is an integral part of every system now. Cache memory is random
access memory (RAM) that a computer microprocessor can access more quickly than it
can access regular RAM. As the microprocessor processes data, it looks first in the cache
memory and if it finds the data there (from a previous reading of data), it does not have
to do the more time consuming reading of data from larger memory.
The effectiveness of cache mechanism is based on the property of Locality of
reference.
Locality of Reference:
Many instructions in localized areas of a program are executed repeatedly during some
time period, while the remainder of the program is accessed relatively infrequently.
This behavior manifests itself in 2 ways. They are Temporal (the recently executed
instructions are likely to be executed again very soon) and Spatial (the instructions in
close proximity to recently executed instructions are likely to be executed soon). If the
active segment of the program is placed in cache memory, then the total execution time
can be reduced significantly.
If the active segment of a program can be placed in a fast cache memory, then the
total execution time can be reduced significantly. The operation of a cache memory is
very simple. The memory control circuitry is designed to take advantage of the property
of locality of reference. The term Block refers to the set of contiguous address locations
of some size. The cache line is used to refer to the cache block.
Fig: Cache between the processor and the main memory
The Figure shows the arrangement of the cache between the processor and the main
memory. The cache memory stores a reasonable number of blocks at a given time, but
this number is small compared to the total number of blocks available in the main
memory. The correspondence between main memory blocks and blocks in the cache
memory is specified by a mapping function. The cache control hardware decides which
block should be removed to create space for a new block that contains the referenced
word. The collection of rules for making this decision is called the replacement algorithm.
The cache control circuit determines whether the requested word currently exists in the
cache. If it exists, then the Read/Write operation takes place on the appropriate cache
location; in this case a Read/Write hit is said to occur. In a Read operation, the main
memory is not involved.
Write-through protocol:
Here the cache location and the main memory locations are updated
simultaneously.
Write-back protocol:
This technique updates only the cache location and marks it as updated with an
associated flag bit called the dirty/modified bit. The word in the main memory is
updated later, when the block containing this marked word is removed from the
cache to make room for a new block. If the requested word does not exist in
the cache during a read operation, a read miss occurs. To reduce the delay of a read miss,
the load-through / early-restart protocol is used.
Read Miss:
The block of words that contains the requested word is copied from the main
memory into the cache. Normally, the requested word is forwarded to the processor
after the entire block has been loaded into the cache.
Load-through:
Instead of waiting for the entire block to be loaded, the requested word is forwarded
to the processor as soon as it is read from the main memory; this reduces the processor's
waiting time. If the requested word does not exist in the cache during a write
operation, a Write Miss occurs. If the write-through protocol is used, the information
is written directly into the main memory. If the write-back protocol is used, the block
containing the addressed word is first brought into the cache and then the desired word
in the cache is overwritten with the new information.
CACHE MEMORY DESIGN PARAMETERS
There are two cache design parameters that dramatically influence cache
performance: the block size and the cache associativity. There are also many other
implementation techniques, both hardware and software, that improve cache
performance, but they are not discussed here.
The simplest way to reduce the miss rate is to increase the block size. However,
increasing the block size also increases the miss penalty (the time to load a block
from main memory into the cache), so there is a trade-off between the block size and
the miss penalty. We can increase the block size up to the point where the miss rate is
still decreasing, but we also have to be sure that this benefit is not outweighed by the
increased miss penalty.
The second cache design parameter that reduces cache misses is the associativity. There
is an empirical result called the 2:1 rule of thumb, which states that a direct-mapped
cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.
Unfortunately, increased associativity comes with a longer hit time. More time is
needed to retrieve a block from an associative cache than from a direct-mapped cache:
to retrieve a block in an associative cache, the block must be searched for within an entire
set, since there is more than one place where the block can be stored.
Based on the cause of a cache miss, we can classify cache misses as
compulsory, capacity and conflict misses. This classification is called the 3C model.
Compulsory misses occur on the first access to a block that has never been in the
cache, so the block must be brought into the cache. Increasing the block size can reduce
compulsory misses, because loading a larger block effectively prefetches the other
elements in the block. If the cache cannot contain all the blocks needed during the
execution of a program, capacity misses will occur due to blocks being discarded and
later retrieved. If the block-placement strategy is set-associative or direct-mapped,
conflict misses (in addition to compulsory and capacity misses) will occur, because a
block can be discarded and later retrieved if too many blocks map to its set. Increasing
the associativity in general reduces the number of conflict misses and hence the runtime
of programs. However, this is not true all the time: minimizing the cache misses does not
necessarily minimize the runtime. For example, a program can have fewer cache misses
but more memory accesses overall.
Mapping Function:
Direct Mapping:
It is the simplest technique, in which block j of the main memory maps onto block
j modulo 128 of the cache. Thus whenever one of the main memory blocks 0, 128, 256, …
is loaded into the cache, it is stored in cache block 0; blocks 1, 129, 257, … are stored in
cache block 1, and so on. Contention may arise for a cache block position even when the
cache is not full, since more than one memory block is mapped onto the same position.
The contention is resolved by allowing the new block to overwrite the currently resident
block. Placement of a block in the cache is determined from the memory address.
The memory address is divided into 3 fields, namely,
• Low-order 4-bit field (word) → Selects one of the 16 words in a block.
• 7-bit cache block field → When a new block enters the cache, these 7 bits determine
the cache position in which the block must be stored.
• 5-bit Tag field → The high-order 5 bits of the memory address of the block are
stored in 5 tag bits associated with its location in the cache.
As execution proceeds, the high-order 5 bits of the address are compared with the tag bits
associated with that cache location. If they match, then the desired word is in that block
of the cache. If there is no match, then the block containing the required word must
first be read from the main memory and loaded into the cache. The direct mapping is shown
in Figure 15.2.
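The field extraction described above is easy to express in code. This is a minimal Python sketch for the 16-bit word address of this example (5-bit tag, 7-bit block, 4-bit word); the function name is ours, not part of the text:

    def direct_map(addr):
        # 16-bit word address laid out as | 5-bit tag | 7-bit block | 4-bit word |
        word  = addr & 0xF           # selects one of the 16 words in the block
        block = (addr >> 4) & 0x7F   # cache position = (memory block) mod 128
        tag   = addr >> 11           # high-order 5 bits, stored with the block
        return tag, block, word

    # Memory block 2817 (= 22 * 128 + 1) maps to cache block 2817 mod 128 = 1:
    print(direct_map(0b10110_0000001_0011))   # (22, 1, 3)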
Merit:
• It is easy to implement.
Demerit:
• It is not very flexible.
Associative Mapping:
Here, a main memory block can be placed into any cache block position. 12 tag
bits are required to identify a memory block when it is resident in the cache. The tag bits
of an address received from the processor are compared to the tag bits of each block of
the cache to see if the desired block is present. This is called associative mapping. It gives
complete freedom in choosing the cache location. A new block that has to be brought
into the cache has to replace (eject) an existing block only if the cache is full.
In this method, the cache control hardware has to determine whether a given block is in
the cache. A search of this kind is called an associative search. The associative-mapped
cache is shown in Figure 15.3.
Merit:
• It is more flexible than direct mapping technique.
Demerit:
• Its cost is high.
Set-Associative Mapping:
It is a combination of direct and associative mapping. The blocks of the cache are
grouped into sets, and the mapping allows a block of the main memory to reside in any
block of a specific set. In this example, the cache has two blocks per set, so the memory
blocks 0, 64, 128, …, 4032 map into cache set 0 and they can occupy either of the two
block positions within that set.
6-bit set field → Determines which set of the cache contains the desired block.
6-bit tag field → The tag field of the address is compared to the tags of the two
blocks of the set to check if the desired block is present.
No. of blocks per set        No. of set-field bits
2                            6
4                            5
8                            4
A cache that contains 1 block per set is a direct-mapped cache. A cache that
has k blocks per set is called a k-way set-associative cache. Each block contains a
control bit called a valid bit, which indicates whether the block contains valid
data. A dirty bit indicates whether the block has been modified during its cache
residency. The set-associative mapping is shown in Figure 15.4.
If a main memory block is updated by another source (such as a DMA transfer from a
disk) and the block also exists in the cache, the valid bit of the cache block is cleared to 0.
The problem of keeping the copies of data used by the processor and by the DMA device
identical is called the Cache Coherence Problem.
Merit:
• The contention problem of direct mapping is solved by having a few choices for block
placement.
• The hardware cost is decreased by reducing the size of the associative search.
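The table above follows directly from the cache geometry: with 128 cache blocks and k blocks per set there are 128/k sets, so the set field needs log2(128/k) bits. A small Python sketch of ours (the helper name and the sample address are illustrative) reproduces the table and splits an address for the 2-way case:

    import math

    def set_assoc_fields(addr, k, cache_blocks=128, word_bits=4):
        # Splits a 16-bit word address into tag, set and word fields
        # for a k-way set-associative cache of 128 16-word blocks.
        sets = cache_blocks // k
        set_bits = int(math.log2(sets))
        word = addr & ((1 << word_bits) - 1)
        s = (addr >> word_bits) & (sets - 1)
        tag = addr >> (word_bits + set_bits)
        return tag, s, word

    for k in (2, 4, 8):
        print(k, "blocks/set ->", int(math.log2(128 // k)), "set-field bits")
    print(set_assoc_fields(0b101100_000001_0011, k=2))   # (44, 1, 3)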
REPLACEMENT ALGORITHM:
In a direct-mapped cache, the position of each block is predetermined; hence, no
replacement strategy exists. In associative and set-associative caches there exists some
flexibility. When a new block is to be brought into the cache and all the positions that it
may occupy are full, the cache controller must decide which of the old blocks to
overwrite. This is an important issue, because the decision can be a strong determining
factor in system performance. In general, the objective is to keep in the cache the blocks
that are likely to be referenced in the near future. However, it is not easy to determine
which blocks are about to be referenced.
Suppose it is required to track the LRU block of a four-block set in a set-associative
cache. A 2-bit counter can be used for each block. When a hit occurs, the counter of the
block that is referenced is set to 0. Counters with values originally lower than the
referenced one are incremented by one, and all others remain unchanged. When a miss
occurs and the set is not full, the counter associated with the new block loaded from the
main memory is set to 0, and the values of all other counters are increased by one. When
a miss occurs and the set is full, the block with the counter value 3 is removed, the new
block is put in its place, and its counter is set to 0.
The other three block counters are incremented by one (It can be easily verified that the
counter values of occupied blocks are always distinct). The LRU algorithm has been used
extensively. Although it performs well for many access patterns, it can lead to poor
performance in some cases. For example, it produces disappointing results when
accesses are made to sequential elements of an array that is slightly too large to fit into
the cache. Performance of the LRU algorithm can be improved by introducing a small
amount of randomness in deciding which block to replace.
Several other replacement algorithms are also used in practice. An intuitively reasonable
rule would be to remove the "oldest" block from a full set when a new block must be
brought in. However, because this algorithm does not take into account the recent
pattern of access to blocks in the cache, it is generally not as effective as the LRU
algorithm in choosing the best blocks to remove. The simplest algorithm is to randomly
choose the block to be overwritten. Interestingly enough, this simple algorithm has been
found to be quite effective in practice.
Example: Consider 4 blocks/set in a set-associative cache, where a 2-bit counter is used
for each block. When a hit occurs, the referenced block's counter is set to 0, the counters
with values originally lower than the referenced one are incremented by 1, and all others
remain unchanged. When a miss occurs and the set is full, the block with
counter value 3 is removed, the new block is put in its place with its counter set to 0,
and the other block counters are incremented by 1.
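The counter scheme in this example can be simulated directly. A minimal Python sketch of one four-block set (the function name and the reference string are our illustrative choices):

    def lru_access(blocks, counters, ref):
        # 2-bit LRU counters for a 4-block set, as described above.
        if ref in blocks:                           # hit
            i = blocks.index(ref)
            for j in range(4):                      # bump counters below the hit one
                if blocks[j] is not None and counters[j] < counters[i]:
                    counters[j] += 1
            counters[i] = 0
        else:                                       # miss
            if None in blocks:                      # set not full: take a free slot
                i = blocks.index(None)
            else:                                   # set full: evict counter == 3
                i = counters.index(3)
            blocks[i] = ref
            for j in range(4):
                if j != i and blocks[j] is not None:
                    counters[j] += 1
            counters[i] = 0

    blocks, counters = [None] * 4, [0] * 4
    for ref in ["A", "B", "C", "D", "B", "E"]:      # "E" evicts "A", the LRU block
        lru_access(blocks, counters, ref)
    print(blocks)                                   # ['E', 'B', 'C', 'D']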
Merit:
The performance of the LRU algorithm can be improved by introducing a small amount
of randomness in deciding which block is to be overwritten.
PERFORMANCE CONSIDERATION:
Two key factors in commercial success are performance and cost; the best possible
performance at the lowest cost is sought. A common measure of success is the
price/performance ratio.
Performance depends on how fast machine instructions can be brought to the processor
and how fast they can be executed. One way to introduce parallelism into the memory
system, so that several memory accesses can proceed at the same time, is interleaving.
Interleaving:
If the main memory is structured as a collection of physically separate modules, two
address layouts are possible. In the first arrangement, the high-order bits of the address
name a module and the low-order bits name a location within it, so consecutive
addresses lie in the same module and only one module is involved in a block transfer. At
the same time, devices with DMA ability may be accessing information in other modules.
The second arrangement, shown in part (b) of the figure, is called memory
interleaving. The low-order k bits of the memory address select a module, and the high-
order m bits name a location within the module. Thus, any component of the system that
generates requests for access to consecutive memory locations can keep several
modules busy at any one time, which results in both faster access to a block of data and
higher average utilization of the memory system as a whole.
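The two address layouts can be contrasted with a small Python sketch (the function and the 16-bit address width are our illustrative choices):

    def module_of(addr, k_bits=2, addr_bits=16, interleaved=True):
        # Returns (module number, address within the module) for 2**k_bits modules.
        if interleaved:
            # Low-order k bits select the module: consecutive addresses fall
            # in consecutive modules and can be accessed in parallel.
            return addr & ((1 << k_bits) - 1), addr >> k_bits
        # High-order bits select the module: consecutive addresses stay
        # in the same module.
        m_bits = addr_bits - k_bits
        return addr >> m_bits, addr & ((1 << m_bits) - 1)

    for a in range(4):       # four consecutive addresses, four modules
        print(a, "plain:", module_of(a, interleaved=False),
                 "interleaved:", module_of(a, interleaved=True))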
The performance penalty of cache misses is ultimately reflected in the time that the CPU
is stalled because the required instructions or data are not available for execution. In
general, the miss penalty is the time needed to bring a block of data from a slower unit
in the memory hierarchy to a faster unit. The miss penalty is reduced if efficient
mechanisms for transferring data between the various units of the hierarchy are
implemented. The previous section shows how an interleaved memory can reduce the
miss penalty substantially. Consider now the impact of the cache on the overall
performance of the computer. Let h be the hit rate, M the miss penalty (that is, the time
to access information in the main memory), and C the time to access information in the
cache. The average access time experienced by the CPU is hC + (1 – h)M.
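As a quick numeric check of this formula (the numbers below are illustrative, not from the text):

    def average_access_time(h, C, M):
        # t_avg = h*C + (1 - h)*M, the expression given above.
        return h * C + (1 - h) * M

    # 95% hit rate, 1-cycle cache access, 100-cycle main memory access:
    print(average_access_time(0.95, 1, 100))   # 5.95 cycles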
OTHER ENHANCEMENTS
In addition to the main design issues just discussed, several other possibilities
exist for enhancing performance. We discuss three of them in this section.
Write Buffer
When the write-through protocol is used, each write operation results in writing
a new value into the main memory. If the CPU must wait for the memory function to be
completed, as we have assumed until now, then the CPU is slowed down by all write
requests. Yet the CPU typically does not immediately depend on the result of a write
operation, so it is not necessary for the CPU to wait for the write request to be
completed. To improve performance, a write buffer can be included for temporary
storage of write requests. The CPU places each write request into this buffer and
continues execution of the next instruction. The write requests stored in the write buffer
are sent to the main memory whenever the memory is not responding to read requests.
Note that it is important that the read requests be serviced immediately, because the
CPU usually cannot proceed without the data that is to be read from the memory.
Hence, these requests are given priority over write requests. The write buffer may hold a
number of write requests. Thus, it is possible that a subsequent read request may refer
to data that are still in the write buffer. To ensure correct operation, the addresses of
data to be read from the memory are compared with the addresses of the data in the
write buffer. In case of a match, the data in the write buffer are used. This need for
address comparison entails considerable cost, but the cost is justified by improved
performance.
A different situation occurs with the write-back protocol. In this case, the write
operations are
simply performed on the corresponding word in the cache. But consider what happens
when a new block of data is to be brought into the cache as a result of a read miss, which
replaces an existing block that has some dirty data. The dirty block has to be written into
the main memory. If the required write-back is performed first, then the CPU will have to
wait longer for the new block to be read into the cache. It is more prudent to read the
new block first. This can be arranged by providing a fast write buffer for temporary
storage of the dirty block that is ejected from the cache while the new block is being
read. Afterward, the contents of the buffer are written into the main memory. Thus, the
write buffer also works well for the write-back protocol.
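The behaviour described above, namely queueing writes, matching read addresses against the queue, and draining the queue when the memory is idle, can be sketched as follows. This is a minimal Python model of ours; the class and method names are illustrative:

    from collections import deque

    class WriteBuffer:
        def __init__(self):
            self.pending = deque()               # queued (address, data) writes

        def write(self, addr, data):
            self.pending.append((addr, data))    # CPU continues immediately

        def read(self, addr, memory):
            # Reads must check the buffer: the newest queued write wins.
            for a, d in reversed(self.pending):
                if a == addr:
                    return d
            return memory.get(addr)

        def drain(self, memory):
            # Done whenever the memory is not servicing read requests.
            while self.pending:
                a, d = self.pending.popleft()
                memory[a] = d

    memory, wb = {}, WriteBuffer()
    wb.write(0x100, 42)
    print(wb.read(0x100, memory))   # 42, found in the buffer, not in memory
    wb.drain(memory)
    print(memory[0x100])            # 42, now written to main memory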
Prefetching
In the previous discussion of the cache mechanism, we assumed that new data
are brought into the cache when they are first needed. A read miss occurs, and the
desired data are loaded from the main memory. The CPU has to pause until the new data
arrive, which is the effect of the miss penalty. To avoid stalling the CPU, it is possible to
prefetch the data into the cache before they are needed. The simplest way to do this is
through software. A special prefetch instruction may be provided in the instruction set of
the processor. Executing this instruction causes the addressed data to be loaded into the
cache, as in the case of a read miss. However, the processor does not wait for the
referenced data. A prefetch instruction is inserted in a program to cause the data to be
loaded in the cache by the time they are needed in the program. The hope is that
prefetching will take place while the CPU is busy executing instructions that do not result
in a read miss, thus allowing accesses to the main memory to be overlapped with
computation in the CPU. Prefetch instructions can be inserted into a program either by
the programmer or by the compiler. It is obviously preferable to have the compiler insert
these instructions, which can be done with good success for many applications. Note
that software prefetching entails a certain overhead, because the inclusion of prefetch
instructions increases the length of programs. Moreover, some prefetches may load into
the cache data that will not be used by the instructions that follow. This can happen if the
prefetched data are ejected from the cache by a read miss involving other data. However, the
overall effect of software prefetching on performance is positive, and many processors
(including the PowerPC) have machine instructions to support this feature. Prefetching
can also be done through hardware. This involves adding circuitry that attempts to
discover a pattern in memory references, and then prefetches data according to this
pattern.
Lockup-Free Cache
The software prefetching scheme just discussed does not work well if it
interferes significantly with the normal execution of instructions. This is the case if the
action of prefetching stops other accesses to the cache until the prefetch is completed.
A cache of this type is said to be locked while it services a miss. We can solve this
problem by modifying the basic cache structure to allow the CPU to access the cache
while a miss is being serviced. In fact, it is desirable that more than one outstanding miss
can be supported. A cache that can support multiple outstanding misses is called lockup-
free. Since it can service more than one miss at a time, it must include circuitry that keeps
track of all outstanding misses. This may be done with special registers that hold the
pertinent information about these misses. Lockup-free caches were first used in the
early 1980s in the Cyber series of computers manufactured by Control Data.
We have used software prefetching as an obvious motivation for a cache that is not
locked by a read miss. A much more important reason is that, in a processor that uses a
pipelined organization, which overlaps the execution of several instructions, a read miss
caused by one instruction could stall the execution of other instructions. A lockup-free
cache reduces the likelihood of such stalling.
VIRTUAL MEMORY:
Techniques that automatically move program and data blocks into the physical main memory
when they are required for execution are called virtual-memory techniques.
The binary addresses that the processor issues, for either instructions or data, are called
virtual or logical addresses.
A virtual address is translated into a physical address by a combination of hardware and
software components. This kind of address translation is done by the MMU (Memory
Management Unit). When the desired data are in the main memory, they are fetched/accessed
immediately. If the data are not in the main memory, the MMU causes the operating system to
bring the data into memory from the disk. Transfer of data between the disk and the main
memory is performed using the DMA scheme.
Address Translation:
In address translation, all programs and data are composed of fixed-length units called pages.
A page consists of a block of words that occupy contiguous locations in the main
memory. Pages commonly range from 2K to 16K bytes in length.
Virtual memory bridges the speed gap between the main memory and secondary
storage, and it is implemented largely by software techniques. Each virtual address
generated by the processor contains a virtual page number (high-order bits) and an
offset (low-order bits); together, the virtual page number and the offset specify the
location of a particular byte (or word) within a page.
Page Table:
It contains information about the main memory address where each page is stored
and the current status of the page.
Page Frame:
An area in the main memory that can hold one page is called a page frame.
Page Table Base Register:
It contains the starting address of the page table.
Adding the virtual page number to the contents of the page table base register gives the
address of the corresponding entry in the page table; that entry gives the starting address
of the page if the page currently resides in memory.
Control Bits in Page Table:
The control bits specify the status of the page while it is in the main memory.
Function:
One control bit indicates the validity of the page, i.e., whether the page is actually loaded
in the main memory.
Another bit indicates whether the page has been modified during its residency in the
memory; this information is needed to determine whether the page should be written back to
the disk before it is removed from the main memory to make room for another page.
Fig: Virtual Memory Address Translation
The page table information is used by the MMU for every read and write access. The page
table is placed in the main memory, but a copy of a small portion of it is located within
the MMU.
This small portion, or small cache, is called the Translation Lookaside Buffer (TLB). It
consists of the page table entries that correspond to the most recently accessed pages,
together with the virtual address of each entry.
Fig: Use of Associative-Mapped TLB
When the operating system changes the contents of the page table, the control bit in the
TLB invalidates the corresponding TLB entry. Given a virtual address, the MMU looks
in the TLB for the referenced page. If the page table entry for this page is found in the TLB,
the physical address is obtained immediately. If there is a miss in the TLB, then the required
entry is obtained from the page table in the main memory and the TLB is updated. When a
program generates an access request to a page that is not in the main memory, a Page Fault
occurs; the whole page must be brought from the disk into the memory before the access can
proceed.
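The translation path just described, TLB first, then the page table, then a page fault, looks roughly like this in a minimal Python sketch (the 4K-byte page size and all names are our illustrative assumptions):

    PAGE_BITS = 12                       # assume 4K-byte pages

    def translate(vaddr, tlb, page_table):
        vpn, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
        if vpn in tlb:                   # TLB hit: physical address immediately
            frame = tlb[vpn]
        elif vpn in page_table:          # TLB miss: consult page table, update TLB
            frame = page_table[vpn]
            tlb[vpn] = frame
        else:                            # page not in main memory
            raise RuntimeError("page fault: OS must fetch the page from disk")
        return (frame << PAGE_BITS) | offset

    page_table, tlb = {5: 2}, {}         # virtual page 5 lives in page frame 2
    print(hex(translate(0x5ABC, tlb, page_table)))   # 0x2abc (TLB miss, then hit)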
When it detects a page fault, the MMU raises an interrupt to ask the operating system to
intervene. The operating system suspends the execution of the task that caused the page fault
and begins execution of another task whose pages are in the main memory, because a long
delay occurs while the page transfer takes place. When the task resumes, either the interrupted
instruction must continue from the point of interruption or the instruction must be
restarted.
If a new page is brought from the disk when the main memory is full, it must replace one of
the resident pages. In that case, an LRU algorithm is used, which removes the least recently
used page. A modified page has to be written back to the disk before it is removed from the
main memory; this corresponds to a write-back policy, since writing every update through to
the disk would be far too slow.
MEMORY MANAGEMENT REQUIREMENTS:
Memory management routines are part of the operating system. The virtual address space
into which the OS routines are assembled is called the System Space. The virtual space in
which the user application programs reside is called the User Space. Each user space has a
separate page table. The MMU uses a page table base register to determine the address of the
page table to be used in the translation process.
Hence, by changing the contents of this register, the OS can switch from one space to another.
The processor has two states of operation. They are,
➢ User state
➢ Supervisor state.
User State:
In this state, the processor executes user programs.
Supervisor State:
When the processor executes operating system routines, the processor is in the
supervisor state.
Privileged Instructions:
In the user state, privileged machine instructions cannot be executed. Hence a user program is
prevented from accessing the page tables of other user spaces or of the system space.
The control bits in each page table entry can be set to control the access privileges granted to
each program. One program may be allowed to read and write a given page, while other
programs may be given only read access.
SECONDARY STORAGE:
Semiconductor memories do not provide all the storage capability needed in a computer
system. Secondary storage devices provide for the larger storage requirements.
Some of the secondary storage devices are,
➢ Magnetic disks
➢ Optical disks
➢ Magnetic tapes.
Magnetic Disk:
A magnetic disk system consists of one or more disks mounted on a common spindle.
A thin magnetic film is deposited on each disk, usually on both sides.
The disks are placed in a rotary drive so that the magnetized surfaces move in close proximity
to read/write heads. Each head consists of a magnetic yoke and a magnetizing coil.
Digital information can be stored on the magnetic film by applying current pulses of
suitable polarity to the magnetizing coil. Only changes in the magnetic field under the
head can be sensed during the Read operation. Therefore, if the binary states 0 and 1 are
represented by two opposite states of magnetization, a voltage is induced in the head
only at 0-to-1 and 1-to-0 transitions in the bit stream. Long strings of 0s and 1s are
resolved using a clock, which is mainly used for synchronization. Phase
Encoding or Manchester Encoding is the technique used to combine the clocking
information with the data. The read/write heads must be maintained at a very small
distance from the moving disk surfaces in order to achieve high bit densities.
When the disks are rotating at their steady rate, air pressure develops between the disk
surface and the head and forces the head away from the surface. A flexible spring connection
between the head and its arm mounting permits the head to fly at the desired distance away
from the surface.
Winchester Technology:
In Winchester technology, the read/write heads are placed in a sealed, air-filtered
enclosure.
In such units, the read/write heads can operate closer to the magnetized track surfaces because
the dust particles, which are a problem in unsealed assemblies, are absent.
Merits:
Such units have a larger capacity for a given physical size. The data density is high because the
storage medium is not exposed to contaminating elements. The read/write heads of a disk
system are movable.
Each surface is divided into concentric tracks, and each track is divided into sectors.
The set of corresponding tracks on all surfaces of a stack of disks forms a logical cylinder.
The data are accessed by specifying the surface number, the track number and the
sector number.
Read/Write operations start at sector boundaries. Data bits are stored serially
on each track, and each sector usually contains 512 bytes.
Sector header → contains identification information; it helps to find the desired
sector on the selected track.
ECC (error-correcting code) → used to detect and correct errors. An unformatted disk has
no information on its tracks.
The formatting process divides the disk physically into tracks and sectors, and this
process may discover some defective sectors on some tracks. The disk controller keeps
a record of such defects.
The disk is divided into logical partitions. They are,
➢ Primary partition
➢ Secondary partition
In the diagram, each track has the same number of sectors, so all tracks have the same
storage capacity. Thus the stored information is packed more densely on the inner
tracks than on the outer tracks.
Access time
There are 2 components involved in the time delay between receiving an address
and the beginning of the actual data transfer. They are,
➢ Seek time
➢ Rotational delay / Latency
Seek time – the time required to move the read/write head to the proper track.
Latency – the amount of time that elapses after the head is positioned over the
correct track until the starting position of the addressed sector passes under the
read/write head; on average, this is the time for half a rotation of the disk.
Seek time + Latency = Disk access time
Typical disks
One-inch disk: weight = 1 ounce, size comparable to a matchbook, capacity = 1 GB.
A 3.5-inch disk may have the following parameters:
Recording surfaces = 20
Tracks = 15000 tracks/surface
Sectors = 400 sectors/track
Each sector stores 512 bytes of data
Capacity of formatted disk = 20 × 15000 × 400 × 512 = 61.44 × 10^9 bytes ≈ 60 GB
Seek time = 3 ms
Platter rotation = 10000 rev/min
Latency = 3 ms
Internal transfer rate = 34 MB/s
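These figures are easy to verify; a small Python check of the arithmetic above:

    surfaces, tracks, sectors, sector_bytes = 20, 15000, 400, 512
    capacity = surfaces * tracks * sectors * sector_bytes
    print(capacity)                  # 61_440_000_000 bytes, i.e. about 60 GB

    seek = 3e-3                      # 3 ms seek time
    rotation = 60 / 10000            # one revolution at 10000 rev/min = 6 ms
    latency = rotation / 2           # half a revolution on average = 3 ms
    print((seek + latency) * 1e3)    # 6.0 ms average access time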
Data Buffer / Cache
A disk drive that incorporates the required SCSI circuitry is referred to as a SCSI drive.
The SCSI bus can transfer data at a higher rate than the rate at which data can be read
from the disk tracks. An efficient way to deal with this difference in transfer rates
between the disk and the SCSI bus is to include a data buffer in the drive; this buffer is a
semiconductor memory.
The data buffer can also provide a cache mechanism for the disk: when a read request
arrives at the disk, the controller first checks whether the data are available in the
cache (buffer). If the data are available in the cache, they can be accessed and placed on
the SCSI bus immediately. If not, the data are retrieved from the disk itself.
Disk Controller
The disk controller acts as an interface between the disk drive and the system bus.
It uses the DMA scheme to transfer data between the disk and the main memory.
When the OS initiates a transfer by issuing a Read/Write request, the controller's
registers are loaded with the relevant information (such as the main memory address,
the disk address and the word count).
Part-A
2 MARKS QUESTIONS WITH ANSWERS
1. What is General purpose Register?
Ans: Fifteen general-purpose registers are visible at any one time, depending on the current
processor mode. These are R0-R12, SP, LR. The PC (R15) is not considered a general-
purpose register. SP (or R13) is the stack pointer.
2. What is Accumulator?
Ans: In a computer's central processing unit (CPU), an accumulator is a register in which
intermediate arithmetic and logic results are stored.
3. What is Micro operation?
Ans: Usually, micro-operations perform basic operations on data stored in one or more
registers, including transferring data between registers or between registers and external
buses of the central processing unit (CPU), and performing arithmetic or logical
operations on registers.
4. What is Register Transfer language?
Ans: In computer science, register transfer language (RTL) is a kind of intermediate
representation (IR) that is very close to assembly language, such as that which is used in a
compiler. Academic papers and textbooks also often use a form of RTL as an architecture-
neutral assembly language.
5. List some examples for Register Transfer Operations?
Ans: Example Description
R3 ← R1 + R2 Addition
R3 ← R1 - R2 (R1 + R2' + 1) Subtraction
R2 ← R2' Complement (really a logic operation)
R2 ← -R2 (R2' + 1) Negation
R1 ← R1 + 1 Increment
R1 ← R1 - 1 Decrement
6. What is Control Memory?
Ans: Control memory is a random access memory (RAM) consisting of addressable storage
registers. It is primarily used in mini and mainframe computers. It is used as a temporary
storage for data. Access to control memory data requires less time than to main memory; this
speeds up CPU operation by reducing the number of memory references for data storage and
retrieval. Access is performed as part of a control section sequence while the master clock
oscillator is running.
7. What are Logical Micro operations?
Ans: Logic micro-operations specify binary operations on strings of bits stored in
registers. Each bit of a register is treated as a separate binary variable; in other words,
each individual bit of a register is operated on together with the corresponding bit of
another register. Various logic operations such as OR, AND and NAND can be performed.
The special symbols ∨ and ∧ are used to represent the OR and AND operators, respectively.
8. What is Control unit?
Ans: The control unit (CU) is a component of a computer's central processing unit (CPU)
that directs operation of the processor. It tells the computer's memory,
arithmetic/logic unit and input and output devices how to respond to a program's
instructions.
9. What is Control Word?
Ans: It is a word stored in a register (the control register) used to control the operation of a
programmable digital device.
10. List some examples for Arithmetic Micro operations?
Ans: Arithmetic Micro Operations
The basic arithmetic micro operations are
1. Addition
2. Subtraction
3. Increment
4. Decrement
The additional arithmetic micro operations are
1. Add with carry
2. Subtract with borrow
3. Transfer/Load
IMPORTANT QUESTIONS
2 MARKS
1. Give a note on control address register? [Jun 2015]
2. Write micro operations for ADD R1, R2? [Dec 2015]
3. What is the function of Control Memory? [Dec 2015]
4. How does the high-impedance state behave in a bus system? [Dec 2015]
5. Explain the need of bootstrap loader? [III-Dec 2015]
Part-B
10 MARKS
1. (a) Define register transfer language? Explain the instructions with example?[Dec 2015]
(b) With a neat diagram explain 4-bit binary incrementer?
2. Draw and explain the micro program sequencer for the Control memory?[Jun 2015]
3. Write control sequence for fetching a word from memory with neat diagram? [Dec 2015]
4. (a) Draw and explain how to construct a bus line using three state buffers? [Jun 2015,
Dec 2015]
(b) How to multiply and divide a number by 2 using shift operations with example?
5. (a) Explain address sequencing capabilities required for the Control memory?[Dec 2015]
(b) Explain how to construct a common bus for 4 registers of n-bits each using 3 state
buffers?
6. Describe micro instruction sequencing with neat block diagram? [Dec 2015]
UNIT-4
INPUT/OUTPUT ORGANIZATION
ACCESSING I/O DEVICES
A simple arrangement to connect I/O devices to a computer is to use a single
bus arrangement, as shown in Figure 4.1. The bus enables all the devices connected to
it to exchange information. Typically, it consists of three sets of lines used to carry
address, data, and control signals. Each I/O device is assigned a unique set of
addresses. When the processor places a particular address on the address lines, the
device that recognizes this address responds to the commands issued on the control
lines. The processor requests either a read or a write operation, and the requested
data are transferred over the data lines. When I/O devices and the memory share
the same address space, the arrangement is called memory-mapped I/O.
With memory-mapped I/O, any machine instruction that accesses memory can
be used to transfer data to or from an I/O device. For example, if DATAIN is the
address of the input buffer associated with the keyboard, the instruction
Move DATAIN, R0
reads the data from DATAIN and stores them into processor register R0. Similarly,
the instruction
Move R0, DATAOUT
sends the contents of register R0 to location DATAOUT, which may be the output
data buffer of a display unit or a printer.
Most computer systems use memory-mapped I/O. Some processors have
special In and Out instructions to perform I/O transfers. For example, processors in the
Intel family have special I/O instructions and a separate 16-bit address space for I/O
devices. When building a computer system based on these processors, the designer
has the option of connecting I/O devices to use the special I/O address space or
simply incorporating them as part of the memory address space.
The latter approach is by far the most common as it leads to simpler software.
One advantage of a separate I/O address space is that I/O devices deal with fewer
address lines. Note that a separate I/O address space does not necessarily mean that
the I/O address lines are physically separate from the memory address lines. A
special signal on the bus indicates that the requested read or write transfer is an
I/O operation. When this signal is asserted, the memory unit ignores the requested
transfer. The I/O devices examine the low-order bits of the address bus to determine
whether they should respond.
Figure 4.2 illustrates the hardware required to connect an I/O device to the
bus. The address decoder enables the device to recognize its address when this
address appears on the address lines. The data register holds the data being
transferred to or from the processor. The status register contains information relevant
to the operation of the I/O device. Both the data and status registers are connected to
the data bus and assigned unique addresses. The address decoder, the data and status
registers, and the control circuitry required to coordinate I/O transfers constitute the
device's interface circuit.
I/O devices operate at speeds that are vastly different from that of the
processor. When a human operator is entering characters at a keyboard, the
processor is capable of executing millions of instructions between successive
character entries. An instruction that reads a character from the keyboard
should be executed only when a character is available in the input buffer of
the keyboard interface. Also, we must make sure that an input character is read
only once.
This example illustrates program-controlled I/O, in which the
processor repeatedly checks a status flag to achieve the required
synchronization between the processor and an input or output device. We say
that the processor polls the device. There are two other commonly used
mechanisms for implementing I/O operations: interrupts and direct memory
access. In the case of interrupts, synchronization is achieved by having the I/O
device send a special signal over the bus whenever it is ready for a data
transfer operation. Direct memory access is a technique used for high-speed I/O
devices. It involves having the device interface transfer data directly to or from
the memory, without continuous involvement by the processor.
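Program-controlled I/O amounts to a busy-wait loop on a status flag. A minimal Python sketch of the idea, using the DATAIN register mentioned above and a status flag (the class is a toy model of ours, not a real interface):

    class KeyboardInterface:
        def __init__(self):
            self.SIN, self.DATAIN = 0, 0     # status flag and data register

        def keystroke(self, ch):             # device side: a key is pressed
            self.DATAIN, self.SIN = ord(ch), 1

        def read(self):                      # processor side: reading clears SIN,
            self.SIN = 0                     # so a character is read only once
            return self.DATAIN

    kbd = KeyboardInterface()
    kbd.keystroke('A')
    while kbd.SIN == 0:                      # the processor polls the device
        pass
    print(chr(kbd.read()))                   # 'A'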
INTERRUPTS
When a program enters a wait loop, it repeatedly checks the device status; during
this period, the processor does not perform any useful computation. To avoid this, the
device can alert the processor when it becomes ready by sending a hardware signal,
called an interrupt request, over the interrupt-request line of the bus. Since the processor
no longer has to poll the device, it can perform useful computation during the waiting
period.
The routine executed in response to an interrupt request is called the
interrupt-service routine, which is the PRINT routine in our example. Interrupts bear
considerable resemblance to subroutine calls. Assume that an interrupt
request arrives during execution of instruction i in figure 1.
Figure 1. Transfer of control through the use of interrupts
The processor first completes execution of instruction i. Then, it
loads the program counter with the address of the first instruction of the interrupt-
service routine. For the time being, let us assume that this address is hardwired in the
processor. After execution of the interrupt-service routine, the processor has to come
back to instruction i + 1. Therefore, when an interrupt occurs, the current contents of
the PC, which point to instruction i + 1, must be put in temporary storage in a known
location. A Return-from-interrupt instruction at the end of the interrupt-service routine
reloads the PC from the temporary storage location, causing execution to resume at
instruction i + 1. In many processors, the return address is saved on the processor
stack.
We should note that as part of handling interrupts, the processor must inform
the device that its request has been recognized so that it may remove its interrupt-
request signal. This may be accomplished by means of a special control signal on the bus,
called an interrupt-acknowledge signal. Alternatively, the execution of an instruction in
the interrupt-service routine that accesses a status or data register in the device interface
implicitly informs the device that its interrupt request has been recognized.
INTERRUPT HARDWARE:
We pointed out that an I/O device requests an interrupt by activating a
bus line called interrupt-request. Most computers are likely to have several I/O
devices that can request an interrupt. A single interrupt-request line may be used
to serve n devices, as shown in the figure.
All devices are connected to the line via switches to ground. To request
an interrupt, a device closes its associated switch. Thus, if all interrupt-request
signals INTR1 to INTRn are inactive, that is, if all switches are open, the voltage on
the interrupt-request line will be equal to Vdd. This is the inactive state of the line.
Since the closing of one or more switches will cause the line voltage to drop to 0, the
value of INTR is the logical OR of the requests from individual devices, that is,
INTR = INTR1 + INTR2 + · · · + INTRn
It is customary to use the complemented form, INTR (with an overbar), to name the
interrupt-request signal on the common line, because this signal is active when in the
low-voltage state.
Let us consider in detail the specific case of a single interrupt request from
one device. When a device activates the interrupt-request signal, it keeps this signal
activated until it learns that the processor has accepted its request. This means
that the interrupt- request signal will be active during execution of the interrupt-
service routine, perhaps until an instruction is reached that accesses the device in
question.
The first possibility is to have the processor hardware ignore the interrupt-
request line until the execution of the first instruction of the interrupt-service
routine has been completed. Then, by using an Interrupt-disable instruction as the
first instruction in the interrupt-service routine, the programmer can ensure that
no further interruptions will occur until an Interrupt-enable instruction is executed.
Typically, the Interrupt-enable instruction will be the last instruction in the interrupt-
service routine before the Return- from-interrupt instruction. The processor must
guarantee that execution of the Return- from-interrupt instruction is completed
before further interruption can occur.
The second option, which is suitable for a simple processor with only one
interrupt-request line, is to have the processor automatically disable interrupts
before starting the execution of the interrupt-service routine. After saving the
contents of the PC
and the processor status register (PS) on the stack, the processor performs the
equivalent of executing an Interrupt-disable instruction. It is often the case that
one bit in the PS register, called Interrupt-enable, indicates whether interrupts are
enabled.
In the third option, the processor has a special interrupt-request line for which
the interrupt-handling circuit responds only to the leading edge of the signal. Such
a line is said to be edge-triggered.
Before proceeding to study more complex aspects of interrupts, let us
summarize the sequence of events involved in handling an interrupt request from a
single device.
1. The device raises an interrupt request.
2. The processor interrupts the program currently being executed.
3. Interrupts are disabled by changing the control bits in the PS (except in the case
of edge-triggered interrupts).
4. The device is informed that its request has been recognized, and in
response, it deactivates the interrupt-request signal.
5. The action requested by the interrupt is performed by the interrupt-service routine.
6. Interrupts are enabled and execution of the interrupted program is resumed.
When multiple devices can raise interrupts, several problems arise, such as how to
identify the requesting device and how to handle simultaneous requests. The means by
which these problems are resolved vary from one computer to another, and the approach
taken is an important consideration in determining the computer's suitability for a given
application.
When a request is received over the common interrupt-request line, additional
information is needed to identify the particular device that activated the line.
The information needed to determine whether a device is requesting an
interrupt is available in its status register. When a device raises an interrupt request,
it sets to 1 one of the bits in its status register, which we will call the IRQ bit. For
example, bits KIRQ and DIRQ are the interrupt request bits for the keyboard and
the display, respectively. The simplest way to identify the interrupting device is to
have the interrupt-service routine poll all the I/O devices connected to the bus. The
first device encountered with its IRQ bit set is the device that should be serviced.
An appropriate subroutine is called to provide the requested service.
The polling scheme is easy to implement. Its main disadvantage is the time
spent interrogating the IRQ bits of all the devices that may not be requesting any
service. An alternative approach is to use vectored interrupts, which we describe
next.
Vectored Interrupts:-
To reduce the time involved in the polling process, a device requesting an
interrupt may identify itself directly to the processor. Then, the processor
can immediately start executing the corresponding interrupt-service routine.
The term vectored interrupts refers to all interrupt-handling schemes based on this
approach.
A device requesting an interrupt can identify itself by sending a special
code to the processor over the bus. This enables the processor to identify individual
devices even if they share a single interrupt-request line. The code supplied by
the device may represent the starting address of the interrupt-service routine for
that device. The code length is typically in the range of 4 to 8 bits. The remainder of
the address is supplied by the processor based on the area in its memory where the
addresses for interrupt-service routines are located.
Interrupt Nesting: -
Consider, for example, a real-time clock that interrupts the processor at regular intervals
so that it can keep track of time. Proper operation requires that the delay in responding
to an interrupt request from the real-time clock be small in comparison with the interval
between two successive requests. To ensure that this requirement is satisfied in the
presence of other interrupting devices, it may be necessary to accept an interrupt request
from the clock during the execution of an interrupt-service routine for another
device.
This example suggests that I/O devices should be organized in a priority
structure. An interrupt request from a high-priority device should be accepted while the
processor is servicing another request from a lower-priority device.
Fig: Priority arbitration circuit
Simultaneous Requests:-
Polling the status registers of the I/O devices is the simplest such
mechanism. In this case, priority is determined by the order in which the devices
are polled. When vectored interrupts are used, we must ensure that only one
device is selected to send its interrupt vector code. A widely used scheme is to connect
the devices to form a daisy chain, as shown in figure 3a. The interrupt-request line INTR
is common to all devices. The interrupt-acknowledge line, INTA, is connected in a
daisy-chain fashion, such that the INTA signal propagates serially through the devices.
When several devices raise an interrupt request and the INTR line is activated, the
processor responds by setting the INTA line to 1. This signal is received by device 1.
Device 1 passes the signal on to device 2 only if it does not require any service.
If device 1 has a pending request for interrupt, it blocks the INTA signal and proceeds
to put its identifying code on the data lines. Therefore, in the daisy-chain
arrangement, the device that is electrically closest to the processor has the
highest priority. The second device along the chain has second highest priority,
and so on.
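The priority rule of the daisy chain, where the INTA signal is consumed by the first requesting device it reaches, can be captured in a few lines. An illustrative Python sketch of ours (device numbering starts at the processor end):

    def daisy_chain_acknowledge(requests):
        # INTA propagates from the processor down the chain; the first
        # device with a pending request blocks it and returns its number.
        for device, pending in enumerate(requests, start=1):
            if pending:
                return device        # electrically closest requester wins
        return None

    # Devices 2 and 4 request simultaneously; device 2 is closer, so it wins:
    print(daisy_chain_acknowledge([False, True, False, True]))   # 2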
The scheme in figure 3.a requires considerably fewer wires than the
individual connections in figure 2. The main advantage of the scheme in figure 2 is
that it allows the processor to accept interrupt requests from some devices but not
from others, depending upon their priorities. The two schemes may be
combined to produce the more general structure in figure 3b. Devices are
organized in groups, and each group is connected at a different priority level.
Within a group, devices are connected in a daisy chain. This organization is used
in many computer systems.
CONTROLLING DEVICE REQUESTS:
Until now, we have assumed that an I/O device interface generates an
interrupt request whenever it is ready for an I/O transfer, for example whenever
the SIN flag is 1. It is important to ensure that interrupt requests are generated
only by those I/O devices that are being used by a given program. Idle devices
must not be allowed to generate interrupt requests, even though they may
be ready to participate in I/O transfer operations. Hence, we need a
mechanism in the interface circuits of individual devices to control whether a device
is allowed to generate an interrupt request.
EXCEPTIONS:
An interrupt is an event that causes the execution of one program to be
suspended and the execution of another program to begin. So far, we have dealt only
with interrupts caused by requests received during I/O data transfers. However,
the interrupt mechanism is used in a number of other situations.
The term exception is often used to refer to any event that causes an
interruption. Hence, I/O interrupts are one example of an exception. We now
describe a few other kinds of exceptions.
Recovery from Errors:
Computers use a variety of techniques to ensure that all hardware
components are operating properly. For example, many computers include an error-
checking code in the main memory, which allows detection of errors in the stored
data. If an error occurs, the control hardware detects it and informs the processor by
raising an interrupt.
Debugging:
Another important type of exception is used as an aid in debugging
programs. System software usually includes a program called a debugger,
which helps the programmer find errors in a program. The debugger uses
exceptions to provide two important facilities called trace and breakpoints.
Breakpoints provide a similar facility, except that the program being
debugged is interrupted only at specific points selected by the user. An
instruction called Trap or Software-interrupt is usually provided for this
purpose. Execution of this instruction results in exactly the same actions as
when a hardware interrupt request is received. While debugging a program, the
user may wish to interrupt program execution after instruction i. The debugging
routine saves instruction i + 1 and replaces it with a software-interrupt instruction.
When the program is executed and reaches that point, it is interrupted and
the debugging routine is activated. This gives the user a chance to examine
memory and register contents. When the user is ready to continue executing the
program being debugged, the debugging routine restores the saved instruction
at location i + 1 and executes a Return-from-interrupt instruction.
Privilege Exception:
To protect the operating system of a computer from being corrupted by user
programs, certain instructions can be executed only while the processor is in
supervisor mode. These are called privileged instructions. For example, when
the processor is running in the user mode, it will not execute an instruction that
changes the priority level of the processor or that enables a user program to
access areas in the computer memory that have been allocated to other users.
An attempt to execute such an instruction will produce a privilege exception,
causing the processor to switch to the supervisor mode and begin executing an
appropriate routine in the operating system.
DIRECT MEMORY ACCESS (DMA)
A special control unit may be provided to allow the transfer of a large block of data at
high speed directly between an external device and the main memory, without continuous
intervention by the processor. This approach is called DMA.
DMA transfers are performed by a control circuit called the DMA controller. To initiate
the transfer of a block of words, the processor sends the following information to the
controller:
• Starting address
• Number of words in the block
• Direction of transfer.
When a block of data is transferred, the DMA controller increments the memory
address for successive words and keeps track of the number of words transferred; it
informs the processor that the transfer is complete by raising an interrupt signal.
While the DMA transfer is taking place, the program that requested the transfer cannot
continue, but the processor can be used to execute another program.
After the DMA transfer is completed, the processor returns to the program that requested
the transfer.
R/W = 1: the DMA controller performs a read operation, i.e., it transfers data from the
memory to the I/O device.
Done flag = 1: the controller has completed transferring a block of data and is ready to
receive another command.
IE = 1: causes the controller to raise an interrupt (Interrupt Enable) after it has
completed transferring the block of data.
Fig: Registers in a DMA interface: a 32-bit status and control register (containing the
IRQ, Done, IE and R/W flag bits), a starting-address register and a word-count register.
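A minimal Python model of such a controller (register and method names are our illustrative choices) shows how the address is incremented and the word count decremented until the Done and IRQ bits are set:

    class DMAController:
        def __init__(self, memory):
            self.memory = memory
            self.done = self.irq = 0     # status bits of the control register

        def start(self, start_addr, word_count, device_buffer, ie=1):
            # R/W = 1 case: transfer words from memory to the I/O device.
            addr, count = start_addr, word_count
            while count > 0:
                device_buffer.append(self.memory[addr])
                addr += 1                # increment address for successive words
                count -= 1               # keep track of the number of words
            self.done = 1                # block transfer complete
            if ie:                       # IE = 1: raise an interrupt when done
                self.irq = 1

    memory = {100 + i: i * i for i in range(4)}
    buffer = []
    dma = DMAController(memory)
    dma.start(100, 4, buffer)
    print(buffer, dma.done, dma.irq)     # [0, 1, 4, 9] 1 1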
Fig: Use of DMA controllers in a computer system
A DMA controller connects a high-speed network to the computer bus. The disk
controller, which controls two disks, also has DMA capability and provides two DMA
channels.
To start a DMA transfer of a block of data from the main memory to one of the disks,
the program writes the address and word-count information into the registers of the
corresponding channel of the disk controller.
When the DMA transfer is completed, this is recorded in the status and control registers
of the DMA channel, i.e., Done bit = IRQ = IE = 1.
Cycle Stealing:
Requests by DMA devices for using the bus have higher priority than processor
requests.
Top priority is given to high-speed peripherals such as a disk, a high-speed network
interface, or a graphics display device.
Since the processor originates most memory access cycles, the DMA controller can be said
to steal memory cycles from the processor.
This interweaving technique is called cycle stealing.
Burst Mode:
The DMA controller may be given exclusive access to the main memory to
transfer a block of data without interruption. This is known as burst (or block) mode.
Bus Master:
The device that is allowed to initiate data transfers on the bus at any given time is
called the bus master.
Bus Arbitration:
It is the process by which the next device to become the bus master is selected and the
bus mastership is transferred to it.
Types:
There are 2 approaches to bus arbitration. They are,
Centralized Arbitration:
Here the processor is the bus master, and it may grant bus mastership to one of the
DMA controllers.
A DMA controller indicates that it needs to become the bus master by activating the
Bus Request line (BR), which is an open-drain line.
The signal on BR is the logical OR of the bus requests from all devices connected to it.
When BR is activated, the processor activates the Bus Grant signal (BG1), indicating
to the DMA controllers that they may use the bus when it becomes free.
This signal is connected to all devices using a daisy-chain arrangement.
If DMA controller 1 is requesting the bus, it blocks the propagation of the grant signal
to the other devices; otherwise it passes the grant downstream. The controller that
receives the grant indicates to all devices that it is using the bus by activating an
open-collector line, Bus Busy (BBSY).
Fig: Sequence of signals during transfer of bus mastership
The timing diagram shows the sequence of events for the devices connected to the
processor. DMA controller 2 requests and acquires bus mastership and later
releases the bus. During its tenure as bus master, it may perform one or more data
transfers.
After it releases the bus, the processor resumes bus mastership.
Distributed Arbitration:
It means that all devices waiting to use the bus have equal responsibility in carrying out
the arbitration process.
BUSES
A bus protocol is the set of rules that govern the behavior of the various devices
connected to the bus, i.e., when to place information on the bus, when to assert control
signals, and so on.
The bus lines used for transferring data are grouped into 3 types. They are,
• Address lines
• Data lines
• Control lines.
Control signals specify whether a read or a write operation is to be performed.
They also carry timing information, i.e., they specify the times at which the processor
and the I/O devices may place data on the bus or receive data from the bus.
During a data transfer operation, one device plays the role of a Master.
The master device initiates the data transfer by issuing a read or write command on the
bus; hence it is also called an Initiator.
The device addressed by the master is called the Slave or Target.
Types of Buses:
There are 2 types of buses. They are,
• Synchronous Bus
• Asynchronous Bus.
Synchronous Bus:-
In a synchronous bus, all devices derive timing information from a common clock line.
Equally spaced pulses on this line define equal time intervals; during each such interval,
called a bus cycle, one data transfer can take place.
In the timing diagram, the crossing points indicate the times at which the signal
patterns change. A signal line in an indeterminate / high-impedance state is represented
by an intermediate level halfway between the low and high signal levels.
The diagram shows two views of each signal, except the clock: one view shows each
signal as seen by the master, and the other as seen by the slave. The master
sends the address and command signals on the rising edge at the beginning of the clock
period (t0). These signals do not actually appear on the bus until tAM.
Signals do not appear on the bus as soon as they are placed on the bus, due to the
propagation delay in the interface circuits.
Signals reach the devices after a propagation delay that depends on the
characteristics of the bus.
Data must remain on the bus for some time after t2, equal to the hold time of the buffer.
Asynchronous Bus:-
An alternate scheme for controlling data transfer on the bus is based on the use of
„handshake‟ between Master & the Slave. The common clock is replaced by two timing
control lines.
They are
Master–ready
Slave ready.
The handshake protocol proceeds as follows:
At t0, the master places the address and command information on the bus, and all devices on the bus begin to decode this information.
At t1, the master sets the Master-ready line to 1 to inform the I/O devices that the address and command information is ready.
The delay t1 – t0 is intended to allow for any skew that may occur on the bus. Skew occurs when two signals transmitted simultaneously from one source arrive at the destination at different times.
Thus, to guarantee that the Master-ready signal does not arrive at any device ahead of the address and command information, the delay t1 – t0 should be larger than the maximum possible bus skew.
In the output-transfer diagram, the master places the output data on the data lines at the same time as it transmits the address and command information.
The selected slave strobes the data into its output buffer when it receives the Master-ready signal, and it indicates this by setting the Slave-ready signal to 1.
From t0 to t1 and from t3 to t4, the master compensates for bus skew.
A change of state in one signal is followed by a change in the other signal. Hence this scheme is called a full handshake.
It provides a high degree of flexibility and reliability.
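The sequence of events for one output transfer can be summarized as a simple event list, a sketch following the t0 to t5 labels used above; the exact delays between events are implementation dependent.

handshake = [
    ("t0", "Master places address, command and output data on the bus"),
    ("t1", "Master-ready <- 1 (t1 - t0 exceeds the maximum bus skew)"),
    ("t2", "Selected slave strobes the data into its output buffer, Slave-ready <- 1"),
    ("t3", "Master sees Slave-ready = 1 and sets Master-ready <- 0"),
    ("t4", "Slave sees Master-ready = 0 and sets Slave-ready <- 0"),
    ("t5", "Master removes address and data; the bus is free for the next cycle"),
]
for t, event in handshake:
    print(f"{t}: {event}")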
INTERFACE CIRCUITS:
Parallel Port:
A parallel port transfers data in the form of a number of bits, normally 8 or 16, to or from the device.
A serial port transmits and receives data one bit at a time.
The processor communicates with either kind of port over the bus in the same way; conversion between the parallel and serial formats takes place inside the interface circuit.
In the examples below, the processor is 32-bit and uses memory-mapped I/O and the asynchronous bus protocol.
On the processor side of the interface we have:
- Data lines.
- Address lines
- Control or R/W line.
- Master-ready signal and
- Slave-ready signal.
The output of the encoder consists of the bits that represent the encoded character and one signal called Valid, which indicates that a key has been pressed.
This information is sent to the interface circuit, which contains a data register, DATAIN, and a status flag, SIN.
When a key is pressed, the Valid signal changes from 0 to 1, causing the ASCII code to be loaded into DATAIN and SIN to be set to 1.
The status flag SIN is set to 0 when the processor reads the contents of the DATAIN register.
The interface circuit is connected to the asynchronous bus, on which transfers are controlled using the handshake signals Master-ready and Slave-ready.
The SIN signal is generated by a status flag circuit. It is connected to line D0 of the processor bus using a three-state driver. An address decoder selects the input interface based on bits A1 through A31. Bit A0 determines whether the status register or the data register is to be read when Master-ready is active. In response, the interface activates the Slave-ready signal when either Read-status or Read-data is equal to 1, which depends on line A0.
Fig: Output Interface Circuit
The data lines of the processor bus are connected to the DATAOUT register of the interface. The status flag SOUT is connected to data line D1 using a three-state driver, which is turned on when the control line Read-status is 1. The address decoder selects the output interface using address lines A1 through A31. Address line A0 determines whether data is to be loaded into the DATAOUT register or the status flag is to be read. If the Load-data line is 1, the Valid line is set to 1; if the Idle line is 1, the status flag SOUT is set to 1.
The data lines to the I/O device are bidirectional: lines P7 through P0 can be used for both input and output. In fact, some lines can be used for input and some for output, depending on the pattern in the Data Direction Register (DDR). The processor places an 8-bit pattern into the DDR; if a given bit position in the DDR is 1, the corresponding data line acts as an output line, otherwise it acts as an input line. Control lines C1 and C2 govern the interaction between the interface circuit and the I/O device. The Ready and Accept lines are the handshake control lines on the processor-bus side, and are connected to Master-ready and Slave-ready. The input signal My-address is connected to the output of an address decoder, and three register-select lines allow up to 8 registers in the interface to be selected.
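The DDR rule (1 = output, 0 = input) is easy to illustrate. The following sketch decodes an assumed pattern 0xF0, which makes P7 through P4 outputs and P3 through P0 inputs.

def line_directions(ddr: int) -> dict[str, str]:
    # Bit i of the DDR sets the direction of line Pi: 1 = output, 0 = input.
    return {f"P{bit}": "output" if (ddr >> bit) & 1 else "input"
            for bit in range(8)}

for line, direction in sorted(line_directions(0xF0).items()):
    print(line, direction)    # P0..P3 input, P4..P7 output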
Serial Port:
A serial port is used to connect the processor to I/O devices that require transmission of data one bit at a time.
A serial port communicates in a bit-serial fashion on the device side and in a bit-parallel fashion on the bus side.
The transformation between the parallel and serial formats is achieved with shift registers that have parallel access capability.
The input shift register accepts input one bit at a time from the I/O device. Once all 8 bits are received, the contents of the input shift register are loaded in parallel into the DATAIN register. Similarly, output data in the DATAOUT register are loaded into the output shift register, from which bits are shifted out and sent to the I/O device one bit at a time.
Because the input shift register and the DATAIN register are both used on the input side, the shift register can start receiving the next set of 8 bits from the input device as soon as its contents have been loaded into DATAIN, before the processor reads the contents of DATAIN. This is called double-buffering.
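A minimal model of this double-buffered input path, with the class and method names invented for illustration:

class SerialInput:
    def __init__(self):
        self.shift_reg, self.nbits = 0, 0
        self.datain, self.sin = None, 0      # SIN = 1 means a byte is ready

    def receive_bit(self, bit: int) -> None:
        # Bits arrive serially from the device, most significant bit first.
        self.shift_reg = (self.shift_reg << 1) | (bit & 1)
        self.nbits += 1
        if self.nbits == 8:                  # parallel load into DATAIN
            self.datain, self.sin = self.shift_reg, 1
            self.shift_reg, self.nbits = 0, 0   # free to receive the next byte

    def read_datain(self) -> int:            # a processor read clears SIN
        self.sin = 0
        return self.datain

port = SerialInput()
for bit in [0, 1, 0, 0, 0, 0, 0, 1]:         # 0x41 = 'A'
    port.receive_bit(bit)
print(hex(port.read_datain()))               # 0x41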
Serial interfaces require fewer wires, and hence serial transmission is convenient for connecting devices that are physically distant from the computer.
The speed of transmission of data over a serial interface is known as the 'bit rate'. The bit rate depends on the nature of the devices connected. In order to accommodate devices with a range of speeds, a serial interface must be able to use a range of clock speeds.
Several standard serial interfaces have been developed:
• Universal Asynchronous Receiver Transmitter (UART) for low-speed serial devices.
• RS-232-C for connection to communication links.
STANDARD I/O INTERFACES:
Some standards emerged when a particular design became commercially successful. For example, IBM developed a standard called ISA (Industry Standard Architecture) for their personal computer known at the time as the PC AT. The popularity of that computer led other manufacturers to produce ISA-compatible interfaces for their I/O devices, thus making ISA a de facto standard.
Some standards have been developed through industrial cooperative efforts, even among competing companies driven by their common self-interest in having compatible products. In some cases, organizations such as the IEEE (Institute of Electrical and Electronics Engineers), ANSI (American National Standards Institute), or international bodies such as ISO (International Standards Organization) have blessed these standards and given them an official status.
In this section, we present three widely used bus standards, PCI (Peripheral
Component Interconnect), SCSI (Small Computer System Interface), and USB (Universal
Serial Bus). The way these standards are used in a typical computer system is illustrated
in Figure 12.3. The PCI standard defines an expansion bus on the motherboard, SCSI and
USB are used for connecting additional devices, both inside and outside the computer
box. The SCSI bus is a high-speed parallel bus intended for devices such as disks and video
displays. The USB bus uses serial transmission to suit the needs of equipment ranging from
keyboards to game controls to internet connections. The figure shows an interface
circuit that enables devices compatible with the earlier ISA standard, such as the popular
IDE (Integrated Device Electronics) disk, to be connected. It also shows a connection to an
Ethernet. The Ethernet is a widely used local area network, providing a high-speed
connection among computers in a building or a university campus.
A given computer may use more than one bus standard. A typical Pentium
computer has both a PCI bus and an ISA bus, thus providing the user with a wide range
of devices to choose from.
PCI BUS
The PCI follows a sequence of bus standards that were used primarily in IBM
PCs. Early PCs used the 8-bit XT bus, whose signals closely mimicked those of Intel's
80x86 processors. Later, the 16-bit bus used on the PC AT computers became known as
the ISA bus. Its extended 32-bit version is known as the EISA bus. Other buses developed in the eighties with similar capabilities are the Micro Channel used in IBM PCs and the NuBus
used in Macintosh computers.
The PCI was developed as a low-cost bus that is truly processor independent. Its
design anticipated a rapidly growing demand for bus bandwidth to support high-speed
disks and graphic and video devices, as well as the specialized needs of multiprocessor
systems. As a result, the PCI is still popular as an industry standard almost a decade after
it was first introduced in 1992.
An important feature that the PCI pioneered is a plug-and-play capability for
connecting I/O devices. To connect a new device, the user simply connects the device
interface board to the bus.
Data Transfer
In today's computers, most memory transfers involve a burst of data rather than
just one word. The reason is that modern processors include a cache memory. Data are
transferred between the cache and the main memory in bursts of several words each.
The words involved in such a transfer are stored at successive memory locations. When
the processor (actually the cache controller) specifies an address and requests a
read operation from the main memory, the memory responds by sending a sequence of
data words starting at that address. Similarly, during a write operation, the processor
sends a memory address followed by a sequence of data words, to be written in
successive memory locations starting at that address. The PCI is designed primarily to
support this mode of operation. A read or a write operation involving a single word is
simply treated as a burst of length one.
The bus supports three independent address spaces: memory, I/O, and
configuration. The first two are self-explanatory. The I/O address space is intended for
use with processors, such as Pentium, that have a separate I/O address space.
However, the system designer may choose to use memory-mapped I/O even when a
separate I/O address space is available. In fact, this is the approach recommended by the
PCI standard for wider compatibility. The configuration space is intended to give the PCI
its plug-and- play capability. A 4-bit command that accompanies the address identifies
which of the three spaces is being used in a given data transfer operation.
Figure 12.3 shows the main memory of the computer connected directly to the processor
bus. An alternative arrangement that is used often with the PCI bus is shown in Figure 12.4.
The PCI Bridge provides a separate physical connection for the main memory. For electrical
reasons, the bus may be further divided into segments connected via bridges. However,
regardless of which bus segment a device is connected to, it may still be mapped into the
processor's memory address space.
Name            Function
FRAME#          Sent by the initiator to indicate the duration of a transaction
AD              32 address/data lines
C/BE#           4 command / byte-enable lines
IRDY#, TRDY#    Initiator Ready and Target Ready signals
DEVSEL#         A response from the device indicating that it has recognized its address and is ready for a data transfer transaction
IDSEL#          Initialization Device Select
During clock cycle 3, the initiator asserts IRDY#, to indicate that it is ready to receive data.
If the target has data ready to send, it asserts TRDY#. In our example, the target sends 3 more words of data in clock cycles 4 to 6.
The initiator uses FRAME# to indicate the duration of the burst; since it needs to read 4 words, the initiator negates FRAME# during clock cycle 5.
After sending the 4th word, the target disconnects its drivers and negates DEVSEL# during clock cycle 7.
Device Configuration
When an I/O device is connected to a computer, several actions are needed to configure both the device and the software that communicates with it. Traditionally this was done manually, for example by setting switches on the interface board to select which address and interrupt-request line are used, and so on.
The PCI simplifies this process by incorporating in each I/O device interface a
small configuration ROM memory that stores information about that device. The
configuration ROMs of all devices are accessible in the configuration address space. The
PCI initialization software reads these ROMs whenever the system is powered up or
reset. In each case, it determines whether the device is a printer, a keyboard, an Ethernet
interface, or a disk controller. It can further learn about various device options and
characteristics.
Devices are assigned addresses during the initialization process. Each device has an input signal called IDSEL# (Initialization Device Select), which is connected to one of the 21 upper address lines, AD11 through AD31.
During a configuration operation, the address is applied to the AD inputs of the device: the AD line corresponding to the selected device is set to 1 and all other lines are set to 0.
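One common wiring convention, assumed here purely for illustration, connects device n's IDSEL# to address line AD(11+n). Generating the one-hot AD pattern for a configuration access then looks like this:

def config_select_pattern(device: int) -> int:
    # Exactly one of the upper address lines AD11..AD31 is driven to 1.
    assert 0 <= device <= 20, "only 21 lines, AD11..AD31, are available"
    return 1 << (11 + device)

print(f"{config_select_pattern(0):032b}")   # AD11 = 1, all other AD lines 0
print(f"{config_select_pattern(2):032b}")   # AD13 = 1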
The PCI bus can operate with either a 5-V or a 3.3-V power supply, and the connectors are keyed so that a board can be plugged only into a motherboard that uses the same signaling voltage.
SCSI BUS
SCSI refers to a standard bus defined by ANSI (American National Standards Institute).
The SCSI bus has several options. It may be,
• Narrow bus: has 8 data lines and transfers 1 byte at a time.
• Wide bus: has 16 data lines and transfers 2 bytes at a time.
• Single-Ended (SE) transmission: each signal uses a separate wire.
• HVD (High-Voltage Differential): uses 5-V (TTL-level) signaling.
• LVD (Low-Voltage Differential): uses 3.3-V signaling.
Because of these various options, a SCSI connector may have 50, 68 or 80 pins. The data transfer rate ranges from 5 MB/s up to 160 MB/s, with later versions of the standard reaching 320 MB/s and 640 MB/s.
The transfer rate depends on,
• the length of the cable, and
• the number of devices connected.
To achieve a high transfer rate, the bus length should be at most 1.6 m for SE signaling and 12 m for LVD signaling.
The SCSI bus is connected to the processor bus through a SCSI controller. The data are stored on a disk in blocks called sectors.
Each sector contains several hundred bytes, and the sectors holding a given file need not be contiguous on the disk.
The SCSI protocol is designed to retrieve the data in the first sector or any other selected sectors.
Using the SCSI protocol, bursts of data are transferred at high speed. The controllers connected to the SCSI bus are of 2 types. They are,
• Initiator
• Target
Initiator:
An initiator has the ability to select a particular target and to send commands specifying the operations to be performed. Initiators are the controllers on the processor side.
Target:
The disk controller operates as a target. It carries out the commands it receives from the initiator. The initiator establishes a logical connection with the intended target.
Steps:
Consider a disk read operation; it has the following sequence of events.
The SCSI controller, acting as the initiator, contends for control of the bus. After winning arbitration, it selects the target controller and hands over control of the bus to it.
The target starts an output operation; in response, the initiator sends a command specifying the required read operation.
Realizing that it needs to perform a disk seek operation, the target sends a message to the initiator indicating that it will temporarily suspend the connection between them. Then it releases the bus.
The target controller sends a command to the disk drive to move the read head to the first sector involved in the requested read operation, and reads the data from that sector into a data buffer. When it is ready to begin transferring data to the initiator, the target requests control of the bus. After it wins arbitration, it reselects the initiator controller, thus restoring the suspended connection.
The target transfers the contents of the data buffer to the initiator and then suspends the connection again. Data are transferred either 8 or 16 bits in parallel, depending on the width of the bus.
The target controller sends a command to the disk drive to perform another seek operation. Then it transfers the contents of the second disk sector to the initiator. At the end of this transfer, the logical connection between the two controllers is terminated.
As the initiator controller receives the data, it stores them into the main memory using the DMA approach.
The SCSI controller sends an interrupt to the processor to inform it that the requested
operation has been completed.
Bus Signals:-
The bus has no address lines. Instead, it has data lines to identify the bus controllers involved in the selection, reselection and arbitration processes.
For a narrow bus, there are 8 possible controllers, numbered 0 to 7; for a wide bus, there are 16.
Once a connection is established between two controllers, there is no further need for addressing, and the data lines are used to carry data.
All signal names are preceded by a minus sign. This indicates that the signals are active, or that a data line represents a 1, when they are in the low-voltage state.
Fig: Arbitration and selection on the SCSI bus. Device 6 wins arbitration and selects device 2.
Selection:
Here device 6 wins arbitration and asserts the –BSY and –DB6 signals. The selected target controller responds by asserting –BSY. This informs the initiator that the connection it requested has been established.
Reselection:
The connection between the two controllers is reestablished, with the target in control of the bus, as required for the data transfer to proceed.
USB (UNIVERSAL SERIAL BUS)
The USB is designed to meet the following objectives:
• It provides a simple, low-cost and easy-to-use interconnection system that overcomes the difficulties due to the limited number of I/O ports available on a computer.
• It accommodates a wide range of data transfer characteristics for I/O devices, including telephone and Internet connections.
• It enhances user convenience through a 'Plug & Play' mode of operation.
Port Limitation:-
Normally a system has only a few I/O ports.
To add new ports, the user must open the computer box to gain access to the internal expansion bus and install a new interface card. The user may also need to know how to configure the device and the software.
Merits of USB:-
USB helps to add many devices to a computer system at any time without opening the
computer box.
Device Characteristics:-
The kinds of devices that may be connected to a computer cover a wide range of functionality.
The speed, volume and timing constraints associated with data transfers to and from these devices vary significantly.
Example 1 - Keyboard: Since the event of pressing a key is not synchronized to any other event in a computer system, the data generated by the keyboard are called asynchronous. The rate of data generation depends on the speed of the human operator, which is about 100 bytes/sec.
USB Architecture:-
The USB has a serial bus format, which satisfies the low-cost and flexibility requirements.
Clock and data information are encoded together and transmitted as a single signal. Hence there are no limitations on clock frequency or distance arising from data skew, and it is possible to provide a high data transfer bandwidth by using a high clock frequency.
To accommodate a large number of devices that can be added or removed at any time, the USB has a tree structure.
Each node of the tree has a device called a hub, which acts as an intermediate
control point between the host and the I/O devices. At the root of the tree, a root
hub connects the entire tree to the host computer. The leaves of the tree are the
I/O devices being served (for example, keyboard, Internet connection, speaker, or
digital TV)
In normal operation, a hub copies a message that it receives from its upstream
connection to all its downstream ports. As a result, a message sent by the host
computer is broadcast to all I/O devices, but only the addressed device will respond
to that message. However, a message from an I/O device is sent only upstream
towards the root of the tree and is not seen by other devices. Hence, the USB
enables the host to communicate with the I/O devices, but it does not enable these
devices to communicate with each other.
Addressing:
When a USB is connected to a host computer, its root hub is attached to the
processor bus, where it appears as a single device. The host software communicates
with individual devices attached to the USB by sending packets of information,
which the root hub forwards to the appropriate device in the USB tree.
Each device on the USB, whether it is a hub or an I/O device, is assigned a 7-bit
address. This address is local to the USB tree and is not related in any way to the
addresses used on the processor bus.
A hub may have any number of devices or other hubs connected to it, and addresses
are assigned arbitrarily. When a device is first connected to a hub, or when it is
powered on, it has the address 0. The hardware of the hub to which this device is
connected is capable of detecting that the device has been connected, and it
records this fact as part of its own status information. Periodically, the host polls
each hub to collect status information and learn about new devices that may have
been added or disconnected.
When the host is informed that a new device has been connected, it uses a
sequence of commands to send a reset signal on the corresponding hub port, read
information from the device about its capabilities, send configuration information to
the device, and assign the device a unique USB address. Once this sequence is
completed the device begins normal operation and responds only to the new
address.
USB protocols:
All information transferred over the USB is organized in packets, where a packet
consists of one or more bytes of information. There are many types of packets that
perform a variety of control functions.
The information transferred on the USB can be divided into two broad categories:
control and data.
A packet consists of one or more fields containing different kinds of information.
The first field of any packet is called the packet identifier, PID, which identifies the
type of that packet.
The four PID bits identify one of 16 different packet types. They are transmitted twice: the first time with their true values, and the second time with each bit complemented, so that the receiver can detect a corrupted identifier. Some control packets, such as ACK (Acknowledge), consist only of the PID byte.
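This check can be expressed directly: in the packet identifier byte, the low four bits hold the PID and the high four bits hold its bit-by-bit complement, so a receiver validates a PID as sketched below (the ACK value 0b0010 is from the USB specification).

def pid_is_valid(pid_byte: int) -> bool:
    # Low nibble: the four PID bits; high nibble: the same bits complemented.
    pid   = pid_byte & 0x0F
    check = (pid_byte >> 4) & 0x0F
    return check == (~pid & 0x0F)

ACK = 0b0010                                   # 4-bit PID of an ACK packet
ack_byte = ACK | ((~ACK & 0x0F) << 4)
print(hex(ack_byte), pid_is_valid(ack_byte))   # 0xd2 True
print(pid_is_valid(0xFF))                      # False: corrupted identifier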
Control packets used for controlling data transfer operations are called token packets.
Figure: An output transfer
Isochronous Traffic on USB:
One of the key objectives of the USB is to support the transfer of isochronous data.
Devices that generate or receive isochronous data require a time reference to control the sampling process.
To provide this reference, transmission over the USB is divided into frames of equal length.
The root hub generates a Start-of-Frame (SOF) control packet precisely once every 1 ms to mark the beginning of a new frame.
The arrival of an SOF packet at any device constitutes a regular clock signal that the device can use for its own purposes.
To assist devices that may need longer periods of time, the SOF packet carries an 11-bit frame number.
Following each SOF packet, the host carries out input and output transfers for isochronous devices. This means that each device will have an opportunity for an input or output transfer once every 1 ms.
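A small sketch of the frame counter implied above: with 11 bits and one SOF packet per millisecond, the frame number wraps around every 2048 ms.

def sof_frame_numbers(start: int, n: int) -> list[int]:
    # One SOF per millisecond; the 11-bit counter wraps at 2**11 = 2048.
    return [(start + i) % (1 << 11) for i in range(n)]

print(sof_frame_numbers(2046, 4))   # [2046, 2047, 0, 1]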
Electrical Characteristics:
USB connections consist of four wires: two carry power (+5 V and Ground) and two carry signals. Thus, a hub or an I/O device may be powered directly from the bus, or it may have its own external power connection.
At low speed, 1s and 0s are transmitted by sending a high-voltage state (5 V) on one or the other of the two signal wires. For high-speed links, differential transmission is used.
Part-A
Part B
1. Explain briefly about INPUT/OUTPUT organization?
2. What are the different Interrupt units in I/O organization?
3. What is Auxiliary Memory and explain its types with examples?
4. Explain briefly about Interface circuits?
5. What is Asynchronous data transfer and explain its types?
6. Explain briefly about DMA Controller?
7. What is Main Memory and explain its types?
UNIT-5
PIPELINING
BASIC CONCEPTS
Consider how the idea of pipelining can be used in a computer. The processor executes a program by fetching and executing instructions, one after the other. Let Fi and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program consists of a sequence of fetch and execute steps, as shown in Figure 5.1a.
Now consider a computer that has two separate hardware units, one for fetching
instructions and another for executing them, as shown in Figure 5.1b. The instruction
fetched by the fetch unit is deposited in an intermediate storage buffer, B1. This buffer
is needed to enable the execution unit to execute the instruction while the fetch unit is
fetching the next instruction. The results of execution are deposited in the destination
location specified by the instruction. For the purposes of this discussion, we assume
that both the source and the destination of the data operated on by the instructions
are inside the block labeled “Execution unit.”
Fig:5.1 Basic idea of instruction pipelining: (a) sequential execution of I1, I2, I3; (b) hardware organization with an interstage buffer B1 between the instruction fetch unit and the execution unit; (c) pipelined execution, in which the fetch of each instruction overlaps the execution of its predecessor.
The computer is controlled by a clock whose period is such that the fetch and
execute steps of any instruction can each be completed in one clock cycle. Operation of
the computer proceeds as in Figure 5.1c. In the first clock cycle, the fetch unit fetches an
instruction I1 (step F1 ) and stores it in buffer B1 at the end of the clock cycle. In the
second clock cycle, the instruction fetch unit proceeds with the fetch operation for
instruction I2 (step F2 ). Meanwhile, the execution unit performs the operation specified by
instruction I1 , which is available to it in buffer B1 (step E1 ).
By the end of the second clock cycle, the execution of instruction I1 is completed and instruction I2 is available. Instruction I2 is stored in B1, replacing I1, which is no longer needed. Step E2 is performed by the execution unit during the third clock cycle, while instruction I3 is being fetched by the fetch unit. In this manner, both the fetch and execute units are kept busy all the time. If the pattern in Figure 5.1c can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by the sequential operation depicted in Figure 5.1a.
In summary, the fetch and execute units in Figure 5.1b constitute a two-stage
pipeline in which each stage performs one step in processing an instruction. An inter-
stage storage buffer, B1, is needed to hold the information being passed from one stage
to the next. New information is loaded into this buffer at the end of each clock cycle.
The processing of an instruction need not be divided into only two steps. For example, a pipelined processor may process each instruction in four steps, as follows:
F  Fetch: read the instruction from the memory.
D  Decode: decode the instruction and fetch the source operand(s).
E  Execute: perform the operation specified by the instruction.
W  Write: store the result in the destination location.
The sequence of events for this case is shown in Figure 5.2a. Four instructions are in progress at any given time. This means that four distinct hardware units are needed, as shown in Figure 5.2b. These units must be capable of performing their tasks simultaneously and without interfering with one another. Information is passed from
one unit to the next through a storage buffer. As an instruction progresses through the
pipeline, all the information needed by the stages downstream must be passed along.
For example, during clock cycle 4, the information in the buffers is as follows:
• Buffer B1 holds instruction I3 , which was fetched in cycle 3 and is
being decoded by the instruction-decoding unit.
• Buffer B2 holds both the source operands for instruction I2 and the
specification of the operation to be performed. This is the information produced by the
decoding hardware in cycle 3. The buffer also holds the information needed for the write
step of instruction I2 (step W2 ). Even though it is not needed by stage E, this information
must be passed on to stage W in the following clock cycle to enable that stage to
perform the required Write operation.
• Buffer B3 holds the results produced by the execution unit and the
destination information for instruction I1 .
ROLE OF CACHE MEMORY
Each stage in a pipeline is expected to complete its operation in one clock cycle.
Hence, the clock period should be sufficiently long to complete the task being
performed in any stage. If different units require different amounts of time, the
clock period must allow the longest task to be completed. A unit that completes its
task early is idle for the remainder of the clock period. Hence, pipelining is most
effective in improving performance if the tasks being performed in different stages require about the same amount of time.
Fig:5.2 A 4-stage pipeline: (a) instruction execution divided into four steps, F (fetch instruction), D (decode instruction and fetch operands), E (execute operation) and W (write results), shown for instructions I1 to I4 over clock cycles 1 to 7; (b) hardware organization with interstage buffers B1, B2 and B3 between the stages.
This consideration is particularly important for the instruction fetch step, which is
assigned one clock period in Figure 5.2a. The clock cycle has to be equal to or greater
than the time needed to complete a fetch operation. However, the access time of the
main memory may be as much as ten times greater than the time needed to perform
basic pipeline stage operations inside the processor, such as adding two numbers. Thus,
if each instruction fetch required access to the main memory, pipelining would be of
little value.
The use of cache memories solves the memory access problem. In particular, when a
cache is included on the same chip as the processor, access time to the cache is usually
the same as the time needed to perform other basic operations inside the processor. This
makes it possible to divide instruction fetching and processing into steps that
are more or less equal in duration. Each of these steps is performed by a different
pipeline stage, and the clock period is chosen to correspond to the longest one.
PIPELINE PERFORMANCE
The pipelined processor in Figure 5.2 completes the processing of one instruction
in each clock cycle, which means that the rate of instruction processing is four times that
of sequential operation. The potential increase in performance resulting from
pipelining is proportional to the number of pipeline stages. However, this increase
would be achieved only if pipelined operation as depicted in Figure 5.2a could be
sustained without interruption throughout program execution. Unfortunately, this is not
the case.
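The proportionality claim can be checked with simple arithmetic: an ideal n-stage pipeline needs n cycles to fill and then completes one instruction per cycle, so for a long instruction stream the speedup over sequential execution approaches n. A minimal sketch:

def ideal_speedup(num_instructions: int, stages: int) -> float:
    sequential_cycles = num_instructions * stages          # one instruction at a time
    pipelined_cycles  = stages + (num_instructions - 1)    # fill, then 1 per cycle
    return sequential_cycles / pipelined_cycles

print(ideal_speedup(1000, 4))   # ~3.99, approaching 4 for the 4-stage pipeline
print(ideal_speedup(1000, 2))   # ~2.00, the two-stage case discussed earlier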
For a variety of reasons, one of the pipeline stages may not be able to complete its
processing task for a given instruction in the time allotted. For example, stage E in the
four-stage pipeline of Figure 5.2b is responsible for arithmetic and logic operations,
and one clock cycle is assigned for this task. Although this may be sufficient for
most operations, some operations, such as divide, may require more time to complete.
Figure 5.3 shows an example in which the operation specified in instruction I2 requires
three cycles to complete, from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the
Write stage must be told to do nothing, because it has no data to work with. Meanwhile,
the information in buffer B2 must remain intact until the Execute stage has completed
its operation. This means that stage 2 and, in turn, stage 1 are blocked from accepting
new instructions because the information in B1 cannot be overwritten. Thus, steps D4 and F5 must be postponed as shown.
Figure 5.3 Effect of an execution operation taking more than one clock cycle.
Pipelined operation in Figure 5.3 is said to have been stalled for two clock cycles.
Normal pipelined operation resumes in cycle 7. Any condition that causes the pipeline to
stall is called a hazard. We have just seen an example of a data hazard. A data hazard is any
condition in which either the source or the destination operands of an instruction are not
available at the time expected in the pipeline. As a result some operation has to be delayed,
and the pipeline stalls.
The pipeline may also be stalled because of a delay in the availability of an instruction. For example, this may be a result of a miss in the cache, requiring the instruction to be fetched from the main memory. Such hazards are often called control hazards or instruction hazards. The effect of a cache miss on pipelined operation is illustrated in Figure 5.4. Instruction I1 is fetched from the cache in cycle 1, and its execution proceeds normally. However, the fetch operation for instruction I2, which is started in cycle 2, results in a cache miss. The instruction fetch unit must now suspend any further fetch requests and wait for I2 to arrive. We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5. The pipeline resumes its normal operation at that point.
Fig:5.4 Pipeline stall caused by a cache miss in F2, shown both as the steps of each instruction per clock cycle and as the contents of each pipeline stage; the Fetch stage repeats F2 until the instruction arrives at the end of cycle 5.
A third type of hazard, called a structural hazard, arises when two instructions require the use of a given hardware resource at the same time. The most common case is access to memory: one instruction may need to access memory as part of the Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other instruction is delayed. Many processors use separate instruction and data caches to avoid this delay.
An example of a structural hazard is shown in Figure 5.5. This figure shows how the load instruction
Load X(R1),R2
can be accommodated in our example 4-stage pipeline. The memory address, X+[R1], is computed in step E2 in cycle 4, then memory access takes place in cycle 5. The operand read from memory is written into register R2 in cycle 6. This means that the execution step of this instruction takes two clock cycles (cycles 4 and 5). It causes the pipeline to stall for one cycle, because both instructions I2 and I3 require access to the register file in cycle 6, even though the instructions and their data are all available.
The pipeline is stalled because one hardware resource, the register file,
cannot handle two operations at once. If the register file had two input ports, that
is, if it allowed two simultaneous write operations, the pipeline would not be
stalled. In general, structural hazards are avoided by providing sufficient hardware
resources on the processor chip.
It is important to understand that pipelining does not result in individual instructions being executed faster; rather, it is the throughput that increases, where throughput is measured by the rate at which instruction execution is completed. Any time one of the stages in the pipeline cannot complete its operation in one clock cycle, the pipeline stalls, and some degradation in performance occurs. Thus, the performance level of one instruction completion in each clock cycle is actually the upper limit for the throughput achievable in a pipelined processor organized as in Figure 5.2b.
An important goal in designing processors is to identify all hazards that may cause
the pipeline to stall and to find ways to minimize their impact. In the following sections
we discuss various hazards, starting with data hazards, followed by control hazards. In
each case we present some of the techniques used to mitigate their negative effect
on performance.
DATA HAZARDS
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason, as illustrated in Figure 5.3. We will now examine the issue of availability of data in some detail.
Consider a program that contains two instructions, I1 followed by I2 . When
this program is executed in a pipeline, the execution of I2 can begin before the
execution of I1 is completed. This means that the results generated by I1 may not
be available for use by I2 . We must ensure that the results obtained when
instructions are executed in a pipelined processor are identical to those obtained
when the same instructions are executed sequentially. The potential for obtaining
incorrect results when operations are performed concurrently can be demonstrated
by a simple example. Assume that A = 5, and consider the following two operations:
A ← 3 + A
B ← 4 × A
When these operations are performed in the order given, the result is B = 32.
But if they are performed concurrently, the value of A used in computing B would
be the original value, 5, leading to an incorrect result. If these two operations are
performed by instructions in a program, then the instructions must be executed
one after the other, because the data used in the second instruction depend on the
result of the first instruction. On the other hand, the two operations
A ← 5 × C
B ← 20 + C
can be performed concurrently, because these operations are independent.
As an example of a data dependency in a pipeline, the two instructions
Mul R2,R3,R4
Add R5,R4,R6
give rise to a data dependency. The result of the multiply instruction is placed into register R4, which in turn is one of the two source operands of the Add instruction. Assuming that the multiply operation takes one clock cycle to complete, execution would proceed as shown in Figure 5.6.
As the Decode unit decodes the Add instruction in cycle 3, it realizes that R4 is used as a source operand. Hence, the D step of that instruction cannot be completed until the W step of the multiply instruction has been completed. Completion of step D2 must therefore be delayed to clock cycle 5, and is shown as step D2A in the figure. Instruction I3 is fetched in cycle 3, but its decoding must be delayed because step D3 cannot precede D2. Hence, pipelined execution is stalled for two cycles.
5.2.1 OPERAND FORWARDING
The data hazard just described arises because one instruction, instruction I2 in Figure
5.6, is waiting for data to be written in the register file. However, these data are
available at the output of the ALU once the Execute stage completes step E1 . Hence,
the delay can
be reduced, or possibly eliminated, if we arrange for the result of instruction I1 to be
forwarded directly for use in step E2 .
Figure 5.7a shows a part of the processor datapath involving the ALU and the register file. This arrangement is similar to the three-bus structure except that registers SRC1, SRC2, and RSLT have been added. These registers constitute the interstage buffers needed for pipelined operation, as illustrated in Figure 5.7b. With reference to Figure 5.2b, registers SRC1 and SRC2 are part of buffer B2 and RSLT is part of B3. The data forwarding mechanism is provided by the blue connection lines in the figure. The two multiplexers connected at the inputs to the ALU allow the data on the destination bus to be selected instead of the contents of either the SRC1 or SRC2 register.
When the instructions in Figure 5.6 are executed in the datapath of Figure 5.7,
the operations performed in each clock cycle are as follows. After decoding
instruction I2 and detecting the data dependency, a decision is made to use data
forwarding. The operand not involved in the dependency, register R2, is read and
loaded in register SRC1 in clock cycle 3. In the next clock cycle, the product
produced by instruction I1 is available in register RSLT, and because of the forwarding
connection, it can be used in step E2 . Hence, execution of I2 proceeds without
interruption.
Figure 5.7 Operand forwarding in a pipelined processor.
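A toy sketch of the effect of forwarding on the Mul/Add pair above; the two-cycle stall matches Figure 5.6, and the function and argument names are invented for illustration.

def stall_cycles(producer_dest: str, consumer_srcs: set[str],
                 forwarding: bool) -> int:
    # In the 4-stage pipeline of Figure 5.2, a dependent D step must wait for
    # the producer's W step (two stall cycles) unless the ALU result is
    # forwarded directly into the consumer's E step.
    if producer_dest not in consumer_srcs:
        return 0               # independent instructions: no stall
    return 0 if forwarding else 2

# Mul R2,R3,R4 followed by Add R5,R4,R6 (R4 is a source of the Add):
print(stall_cycles("R4", {"R5", "R4"}, forwarding=False))  # 2, as in Figure 5.6
print(stall_cycles("R4", {"R5", "R4"}, forwarding=True))   # 0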
If no forwarding hardware is provided, the data hazard can instead be handled in software: the compiler detects the dependency and inserts NOP (no-operation) instructions to delay the dependent instruction, as in
I1: Mul R2,R3,R4
    NOP
    NOP
I2: Add R5,R4,R6
The data dependencies encountered in the preceding examples are explicit and easily
detected because the register involved is named as the destination in instruction I1 and
as a source in I2 . Sometimes an instruction changes the contents of a register other
than the one named as the destination. An instruction that uses an auto increment or
auto decrement addressing mode is an example. In addition to storing new data in its
destination location, the instruction changes the contents of a source register used to
access one of its operands. All the precautions needed to handle data dependencies
involving the destination location must also be applied to the registers affected by an
auto increment or auto decrement operation. When a location other than one explicitly
named in an instruction as a destination operand is affected, the instruction is said to
have a side effect. For example, stack instructions, such as push and pop, produce
similar side effects because they implicitly use the auto increment and auto decrement
addressing modes.
Another possible side effect involves the condition code flags, which are used by
instructions such as conditional branches and add-with-carry. Suppose that registers
R1 and R2 hold a double-precision integer number that we wish to add to another
double- precision number in registers R3 and R4. This may be accomplished as
follows:
Add R1,R3
Add With Carry R2,R4
An implicit dependency exists between these two instructions through the carry flag. This flag is set by the first instruction and used in the second instruction, which performs the operation
R2 ← [R2] + [R4] + carry
Instructions that have side effects give rise to multiple data dependencies, which lead to a substantial increase in the complexity of the hardware or software needed to resolve them. For this reason, instructions designed for execution on pipelined hardware should have few side effects. Ideally, only the contents of the destination location, either a register or a memory location, should be affected by any given instruction. Side effects, such as setting the condition code flags or updating the contents of an address pointer, should be kept to a minimum. Note, however, that the auto increment and auto decrement addressing modes are potentially useful. Condition code flags are also needed for recording such information as the generation of a carry or the occurrence of overflow in an arithmetic operation. Later in this unit we show how such functions can be provided by other means that are consistent with a pipelined organization and with the requirements of optimizing compilers.
INSTRUCTION HAZARDS
The purpose of the instruction fetch unit is to supply the execution units with a
steady stream of instructions. Whenever this stream is interrupted, the pipeline
stalls, as Figure 5.4 illustrates for the case of a cache miss. A branch instruction
may also cause the pipeline to stall. We will now examine the effect of branch
instructions and the techniques that can be used for mitigating their impact. We start
with unconditional branches.
UNCONDITIONAL BRANCHES
When a branch is taken, the instructions fetched after the branch instruction must be discarded; the time lost as a result is referred to as the branch penalty. In Figure 5.8, the branch penalty is one clock cycle. For a longer pipeline, the branch penalty may be higher. For example, Figure 5.9a shows the effect of a branch instruction on a four-stage pipeline. We have assumed that the branch address is computed in step E2. Instructions I3 and I4 must be discarded, and the target instruction, Ik, is fetched in clock cycle 5. Thus, the branch penalty is two clock cycles.
Reducing the branch penalty requires the branch address to be computed earlier in the pipeline. Typically, the instruction fetch unit has dedicated hardware to identify a branch instruction and compute the branch target address as quickly as possible after an instruction is fetched. With this additional hardware, both of these tasks can be performed in step D2, leading to the sequence of events shown in Figure 5.9b. In this case, the branch penalty is only one clock cycle.
Fig:5.9 Branch timing
Instruction Queue and Prefetching
To reduce the effect of these interruptions, many processors employ sophisticated fetch units that can fetch instructions before they are needed and put them in a queue. Typically, the instruction queue can store several instructions. A separate unit, which we call the dispatch unit, takes instructions from the front of the queue and sends them to the execution unit. This leads to the organization shown in Figure 5.10. The dispatch unit also performs the decoding function.
To be effective, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions. It attempts to keep the instruction queue filled at all times to reduce the impact of occasional delays when fetching instructions. When the pipeline stalls because of a data hazard, for example, the dispatch unit is not able to issue instructions from the instruction queue. However, the fetch unit continues to fetch instructions and add them to the queue. Conversely, if there is a delay in fetching instructions because of a branch or a cache miss, the dispatch unit continues to issue instructions from the instruction queue.
We have assumed that initially the queue contains one instruction. Every fetch
operation adds one instruction to the queue and every dispatch operation reduces
the queue length by one. Hence, the queue length remains the same for the first
four clock cycles. (There is both an F and a D step in each of these cycles.) Suppose
that instruction I1 introduces a 2-cycle stall. Since space is available in the queue,
the fetch unit continues to fetch instructions and the queue length rises to 3 in
clock cycle 6.
Instruction I5 is a branch instruction. Its target instruction, Ik , is fetched in cycle
7, and instruction I6 is discarded. The branch instruction would normally cause a
stall in cycle 7 as a result of discarding instruction I6 . Instead, instruction I4 is
dispatched from the queue to the decoding stage. After discarding I6 , the queue
length drops to 1 in cycle 8. The queue length will be at this value until another stall
is encountered.
CONDITIONAL BRANCHES AND BRANCH PREDICTION
Conditional branches occur frequently; they represent about 20 percent of the dynamic instruction count of most programs. Combined with the branch penalty, this large percentage would reduce the gain in performance expected from pipelining. Fortunately, branch instructions can be handled in several ways to reduce their negative impact on the rate of execution of instructions.
Delayed Branch
In Figure 5.8, the processor fetches instruction I3 before it determines whether
the current instruction, I2 , is a branch instruction. When execution of I2 is
completed and a branch is to be made, the processor must discard I3 and fetch
the instruction at the branch target. The location following a branch instruction is
called a branch delay slot. There may be more than one branch delay slot, depending
on the time it takes to execute a branch instruction. For example, there are two
branch delay slots in Figure 5.9a and one delay slot in Figure 5.9b. The instructions
in the delay slots are always fetched and at least partially executed before the
branch decision is made and the branch target address is computed.
Fig:5.12 Reordering of instructions for a delayed branch
Fig:5.13 Execution timing showing the delay slot being filled during the last two passes
The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions as in Figure 5.12. Experimental data collected from many programs indicate that sophisticated compilation techniques can use one branch delay slot in as many as 85 percent of the cases. For a processor with two branch delay slots, the compiler attempts to find two instructions preceding the branch instruction that it can move into the delay slots without introducing a logical error. The chances of finding two such instructions are considerably less than the chances of finding one. Thus, if increasing the number of pipeline stages involves an increase in the number of branch delay slots, the potential gain in performance may not be fully realized.
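The 85 percent figure translates into expected wasted cycles per branch as follows; the 50 percent fill rate assumed for the two-slot case is purely illustrative.

def wasted_cycles_per_branch(p_fill: float, delay_slots: int = 1) -> float:
    # Each unfilled delay slot executes a NOP, i.e. one wasted cycle.
    return delay_slots * (1.0 - p_fill)

print(wasted_cycles_per_branch(0.85))      # 0.15: one slot, 85% fill rate
print(wasted_cycles_per_branch(0.50, 2))   # 1.0: two slots, assumed 50% fill rate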
Branch Prediction
Another technique for reducing the branch penalty associated with conditional
branches is to attempt to predict whether or not a particular branch will be taken. The
simplest form of branch prediction is to assume that the branch will not take place and to
continue to fetch instructions in sequential address order. Until the branch condition is
evaluated, instruction execution along the predicted path must be done on a speculative
basis. Speculative execution means that instructions are executed before the processor
is certain that they are in the correct execution sequence. Hence, care must be taken that
no processor registers or memory locations are updated until it is confirmed that these
instructions should indeed be executed. If the branch decision indicates otherwise, the
instructions and all their associated data in the execution units must be purged, and the
correct instructions fetched and executed.
The figure shows a Compare instruction followed by a Branch>0 instruction. Branch prediction takes place in cycle 3, while instruction I3 is being fetched. The fetch unit predicts that the branch will not be taken, and it continues to fetch instruction I4 as I3 enters the Decode stage. The results of the compare operation are available at the end of cycle 3. Assuming that they are forwarded immediately to the instruction fetch unit, the branch condition is evaluated in cycle 4. At this point, the instruction fetch unit realizes that the prediction was incorrect, and the two instructions in the execution pipe are purged. A new instruction, Ik, is fetched from the branch target address in clock cycle 5.
If branch outcomes were random, then half the branches would be taken. Then
the simple approach of assuming that branches will not be taken would save the time
lost to conditional branches 50 percent of the time. However, better performance can
be achieved if we arrange for some branch instructions to be predicted as taken and
others as not taken, depending on the expected program behavior. For example, a branch
instruction at the end of a loop causes a branch to the start of the loop for every pass
through the loop except the last one. Hence, it is advantageous to assume that this
branch will be taken and to have the instruction fetch unit start to fetch instructions at
the branch target address. On the other hand, for a branch instruction at the beginning
of a program loop, it is advantageous to assume that the branch will not be taken.
A decision on which way to predict the result of the branch may be made in
hardware by observing whether the target address of the branch is lower than or higher
than the address of the branch instruction. A more flexible approach is to have the
compiler decide whether a given branch instruction should be predicted taken or not
taken. The branch instructions of some processors, such as SPARC, include a branch
prediction bit, which is set to 0 or 1 by the compiler to indicate the desired behavior.
The instruction fetch unit checks this bit to predict whether the branch will be taken or
not taken.
With either of these schemes, the branch prediction decision is always the same
every time a given instruction is executed. Any approach that has this characteristic is
called static branch prediction. Another approach in which the prediction decision may
change depending on execution history is called dynamic branch prediction.
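The cost of branches under a given prediction scheme can be estimated with simple arithmetic. All of the rates below are illustrative assumptions except the roughly 20 percent branch frequency and the 50 percent random-outcome case mentioned above.

def branch_overhead_cpi(branch_fraction: float, mispredict_rate: float,
                        penalty_cycles: int) -> float:
    # Extra cycles per instruction caused by mispredicted branches.
    return branch_fraction * mispredict_rate * penalty_cycles

print(branch_overhead_cpi(0.20, 0.50, 1))   # 0.10: predict-not-taken, random outcomes
print(branch_overhead_cpi(0.20, 0.10, 1))   # 0.02: a better predictor (assumed rate)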
INFLUENCE ON INSTRUCTION SETS
We have seen that some instructions are much better suited to pipelined execution than others. For example, instruction side effects can lead to undesirable data dependencies. In this section, we examine the relationship between pipelined execution and machine instruction features. We discuss two key aspects of machine instructions: addressing modes and condition code flags.
ADDRESSING MODES
Addressing modes should provide the means for accessing a variety of data structures
simply and efficiently. Useful addressing modes include index, indirect, autoincrement,
and autodecrement. Many processors provide various combinations of these modes to
increase the flexibility of their instruction sets. Complex addressing modes, such as those
involving double indexing, are often encountered.
In choosing the addressing modes to be implemented in a pipelined processor, we must
consider the effect of each addressing mode on instruction flow in the pipeline. Two
important considerations in this regard are the side effects of modes such as
autoincrement and autodecrement and the extent to which complex addressing modes
cause the pipeline to stall. Another important factor is whether a given mode is likely to
be used by compilers.
To compare various approaches, we assume a simple model for accessing operands in the
memory. The load instruction Load X(R1),R2 takes five cycles to complete execution, as
indicated in Figure 5.5. However, the instruction
Load (R1),R2
can be organized to fit a four-stage pipeline because no address computation is required.
Access to memory can take place in stage E. A more complex addressing mode may require several accesses to the memory to reach the named operand. For example, the instruction
Load (X(R1)),R2
may be executed as shown in Figure 5.16a, assuming that the index offset, X, is given in the instruction word. After computing the address in cycle 3, the processor needs to access memory twice: first to read location X+[R1] in clock cycle 4, and then to read location [X+[R1]] in cycle 5. If R2 is a source operand in the next instruction, that instruction would be stalled for three cycles, which can be reduced to two cycles with operand forwarding, as shown.
Figure 5.16 Equivalent operations using complex and simple addressing modes.
To implement the same Load operation using only simple addressing modes
requires several instructions. For example, on a computer that allows three
operand addresses, we can use
Add #X,R1,R2
Load (R2),R2
Load (R2),R2
The Add instruction performs the operation R2 ← X+ [R1]. The two Load instructions
fetch the address and then the operand from the memory. This sequence of
instructions takes exactly the same number of clock cycles as the original, single
Load instruction, as shown in Figure 5.16b.
The main objection to complex addressing modes is that they require complex hardware to decode and execute them. Also, they are not convenient for compilers to work with.
The instruction sets of modern processors are designed to take maximum
advantage of pipelined hardware. Because complex addressing modes are not suitable
for pipelined execution, they should be avoided. The addressing modes used in
modern processors often have the following features:
• Access to an operand does not require more than one access to the memory.
• Only load and store instructions access memory operands.
• The addressing modes used do not have side effects.
Three basic addressing modes that have these features are register, register
indirect, and index. The first two require no address computation. In the index
mode, the address can be computed in one cycle, whether the index value is given
in the instruction or in a register. Memory is accessed in the following cycle. None of
these modes has any side effects, with one possible exception. Some architectures,
such as ARM, allow the address computed in the index mode to be written back
into the index register. This is a side effect that would not be allowed under the
guidelines above. Note also that relative addressing can be used; this is a special
case of indexed addressing in which the program counter is used as the index
register.
CONDITION CODES
In many processors, the condition code flags are stored in the processor status
register. They are either set or cleared by many instructions, so that they can be
tested by subsequent conditional branch instructions to change the flow of
program execution. An optimizing compiler for a pipelined processor attempts to
reorder instructions to avoid stalling the pipeline when branches or data
dependencies between successive instructions occur. In doing so, the compiler
must ensure that reordering does not cause a change in the outcome of a
computation. The dependency introduced by the condition-code flags reduces the
flexibility available for the compiler to reorder instructions.
Consider the sequence of instructions in Figure 5.17a, and assume that the execution of the Compare and Branch=0 instructions proceeds as in Figure 5.14. The branch decision takes place in step E2 rather than D2 because it must await the result of the Compare instruction. The execution time of the Branch instruction can be reduced by interchanging the Add and Compare instructions, as shown in Figure 5.17b.
Figure 5.17 Instruction reordering.
This will delay the branch instruction by one cycle relative to the Compare instruction. As a
result, at the time the Branch instruction is being decoded the result of the Compare
instruction will be available and a correct branch decision will be made. There would be no
need for branch prediction. However, interchanging the Add and Compare instructions can
be done only if the Add instruction does not affect the condition codes.
These observations lead to two important conclusions about the way condition codes should be
handled. First, to provide flexibility in reordering instructions, the condition-code flags should be
affected by as few instructions as possible. Second, the compiler should be able to specify in
which instructions of a program the condition codes are affected and in which they are not. An
instruction set designed with pipelining in mind usually provides the desired flexibility. Figure
5.17b shows the instructions reordered assuming that the condition code flags are affected only
when this is explicitly stated as part of the instruction OP code. The SPARC and ARM
architectures provide this flexibility.
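The point can be made concrete with a minimal C simulation (a hypothetical mini-machine of our own, not SPARC or ARM): an Add that sets the flags, when moved between the Compare and the branch decision, silently changes the branch outcome.

#include <stdio.h>
#include <stdbool.h>

/* Toy machine state: a few registers and a Z (zero) condition flag. */
typedef struct { int r[5]; bool z; } Cpu;

void compare(Cpu *c, int a, int b)   { c->z = (c->r[a] == c->r[b]); }
void add_setcc(Cpu *c, int d, int s) { c->r[d] += c->r[s]; c->z = (c->r[d] == 0); } /* Add that writes Z  */
void add_nocc(Cpu *c, int d, int s)  { c->r[d] += c->r[s]; }                        /* Add that leaves Z  */

int main(void) {
    Cpu c = { .r = {0, 3, 5, 7, 7} };
    compare(&c, 3, 4);     /* R3 == R4, so Z = 1: Branch=0 should be taken     */
    add_setcc(&c, 1, 2);   /* reordered Add clobbers Z (R1 becomes 8, Z = 0)   */
    printf("Branch=0 taken? %d  (wrong: Add overwrote Z)\n", c.z);

    c = (Cpu){ .r = {0, 3, 5, 7, 7} };
    compare(&c, 3, 4);
    add_nocc(&c, 1, 2);    /* flag-preserving Add is safe to reorder */
    printf("Branch=0 taken? %d  (correct)\n", c.z);
    return 0;
}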
Large Computer Systems
Parallel Processing
Parallel processing is a term used to denote a large class of techniques that are used to
provide simultaneous data-processing tasks for the purpose of increasing the computational
speed of a computer system.
The purpose of parallel processing is to speed up the computer processing capability and
increase its throughput, that is, the amount of processing that can be accomplished during a
given interval of time. The amount of hardware increases with parallel processing, and with it,
the cost of the system increases.
Parallel processing can be viewed from various levels of complexity. At the lowest level, we
distinguish between parallel and serial operations by the type of registers used, e.g. shift
registers versus registers with parallel load.
Parallel processing can be classified in a variety of ways:
• Internal organization of the processors
• Interconnection structure between processors
• The flow of information through the system
M. J. Flynn considers the organization of a computer system by the number of instructions
and data items that are manipulated simultaneously.
SISD
Represents the organization of a single computer containing a control unit, a processor unit, and a
memory unit. Instructions are executed sequentially and the system may or may not have internal
parallel processing capabilities. Parallel processing may be achieved by means of multiple
functional units or by pipeline processing.
SIMD
Represents an organization that includes many processing units under the supervision of a common
control unit. All processors receive the same instruction from the control unit but operate on
different items of data. The shared memory unit must contain multiple modules so that it can
communicate with all the processors simultaneously.
MISD & MIMD
MISD structure is only of theoretical interest since no practical system has been constructed using
this organization.
MIMD organization refers to a computer system capable of processing several programs at the same
time. e.g. multiprocessor and multicomputer system
Flynn’s classification depends on the distinction between the performance of the control unit and
the data-processing unit. It emphasizes the behavioral characteristics of the computer system
rather than its operational and structural interconnections. One type of parallel processing that
does not fit Flynn’s classification is pipelining.
Array Processing
An array processor is a processor that performs computations on large arrays of data. The term is
used to refer to two different types of processors:
• Attached array processor: an auxiliary processor intended to improve the performance of the
host computer in specific numerical computation tasks.
• SIMD array processor: a processor with a single-instruction, multiple-data organization. It
manipulates vector instructions by means of multiple functional units responding to a common
instruction.
The purpose of an attached array processor is to enhance the performance of the computer by
providing vector processing for complex scientific applications.
Fig: Parallel processing with multiple functional units
The figure below shows the interconnection of an attached array processor to a host computer.
For example, when attached to a VAX 11 computer, the FPS-164/MAX from Floating-Point Systems
increases the computing power of the VAX to 100 megaflops. The objective of the attached array
processor is to provide vector manipulation capabilities to a conventional computer at a fraction
of the cost of a supercomputer.
Characteristics of multiprocessors
A multiprocessor system is an interconnection of two or more CPUs with memory and input-output
equipment. The term “processor” in multiprocessor can mean either a central processing unit
(CPU) or an input-output processor (IOP). Multiprocessors are classified as multiple instruction
stream, multiple data stream (MIMD) systems.
The similarity and distinction between a multiprocessor and a multicomputer are:
• Similarity: both support concurrent operations.
• Distinction: a multicomputer network consists of several autonomous computers that may or may
not communicate with each other, whereas a multiprocessor system is controlled by one operating
system that provides interaction between processors, and all the components of the system
cooperate in the solution of a problem.
Multiprocessing improves the reliability of the system. The benefit derived from a multiprocessor
organization is improved system performance, obtained in two ways:
• Multiple independent jobs can be made to operate in parallel.
• A single job can be partitioned into multiple parallel tasks.
Multiprocessing can improve performance by decomposing a program into parallel executable tasks.
One way is for the user to explicitly declare that certain tasks of the program be executed in
parallel; this must be done prior to loading the program by specifying the parallel executable
segments. The other is to provide a compiler with multiprocessor software that can automatically
detect parallelism in a user’s program. Multiprocessors are classified by the way their memory
is organized.
5.3.2 NUMA (Non-Uniform Memory Access) Multiprocessors
5.5.4 Interconnection Structures:
The components that form a multiprocessor system are CPUs, IOPs connected to input-output
devices, and a memory unit. The interconnection between the components can have different
physical configurations, depending on the number of transfer paths that are available:
• Between the processors and memory in a shared-memory system
• Among the processing elements in a loosely coupled system
There are several physical forms available for establishing an interconnection network:
• Time-shared common bus
• Multiport memory
• Crossbar switch
• Multistage switching network
• Hypercube system
Fig: Time-shared common bus organization
Multiport Memory:
A multiport memory system employs separate buses between each memory module and
each CPU. The module must have internal control logic to determine which port will have
access to memory at any given time. Memory access conflicts are resolved by assigning
fixed priorities to each memory port.
Advantage: a high transfer rate can be achieved because of the multiple paths.
Disadvantage: it requires expensive memory control logic and a large number of cables
and connections.
Fig: Multiport memory organization
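A fixed-priority port arbiter of the kind described above is simple to express in code. The sketch below (our own illustration; the names are hypothetical) grants one memory module's port to the requesting CPU with the highest fixed priority:

#include <stdio.h>
#include <stdbool.h>

#define NCPU 4

/* Fixed priority: CPU0 > CPU1 > CPU2 > CPU3.
   Returns the granted CPU index, or -1 if no CPU is requesting. */
int arbitrate(const bool request[NCPU]) {
    for (int i = 0; i < NCPU; i++)
        if (request[i]) return i;
    return -1;
}

int main(void) {
    bool req[NCPU] = { false, true, false, true };     /* CPU1 and CPU3 both request */
    printf("port granted to CPU%d\n", arbitrate(req)); /* prints: port granted to CPU1 */
    return 0;
}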
Crossbar Switch:
A crossbar switch consists of a number of crosspoints that are placed at intersections between
processor buses and memory module paths. The small square at each crosspoint is a switch that
determines the path from a processor to a memory module.
Advantage: supports simultaneous transfers from all memory modules.
Disadvantage: the hardware required to implement the switch can become quite large and
complex. The figure below shows the functional design of a crossbar switch connected to one
memory module.
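A crossbar is nonblocking: any set of processors can be connected simultaneously to any set of distinct memory modules. The sketch below (our own illustration, not from the textbook) models the crosspoint grid as a matrix and closes one switch per requested connection:

#include <stdio.h>
#include <stdbool.h>

#define NPROC 4
#define NMEM  4

int main(void) {
    bool point[NPROC][NMEM] = { { false } };  /* crosspoint switches, all open    */
    int want[NPROC] = { 2, 0, 3, 1 };         /* processor i wants module want[i] */

    /* The requested modules are distinct, so all requests are granted at once. */
    for (int p = 0; p < NPROC; p++)
        point[p][want[p]] = true;

    /* Print the crosspoint grid: X = closed switch, . = open. */
    for (int p = 0; p < NPROC; p++) {
        for (int m = 0; m < NMEM; m++)
            putchar(point[p][m] ? 'X' : '.');
        putchar('\n');
    }
    return 0;
}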
Multistage Switching Network: The basic component of a multistage network is a two-input,
two-output interchange switch, as shown in the figure below. Using the 2x2 switch as a building
block, it is possible to build a multistage network to control the communication between a
number of sources and destinations. To see how this is done, consider the binary tree shown in
the figure below. Certain request patterns cannot be satisfied simultaneously; for example, if
P1 is connected to one of the destinations 000 through 011, then P2 can be connected only to
one of the destinations 100 through 111. One such topology is the omega switching network shown
in the figure below.
Fig: 8 x 8 Omega Switching Network
Some request patterns cannot be connected simultaneously; for example, two sources cannot be
connected simultaneously to destinations 000 and 001. In a tightly coupled multiprocessor
system, the source is a processor and the destination is a memory module. In a loosely coupled
multiprocessor system, both the source and destination are processing elements.
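Routing through an omega network is usually done by destination-tag routing, a standard technique not spelled out above: at each stage, one bit of the destination address (most significant bit first) sets the 2x2 switch to its upper (0) or lower (1) output. A minimal sketch for the 8 x 8 network:

#include <stdio.h>

#define STAGES 3  /* log2(8) stages of 2x2 switches */

/* Print the switch settings that steer a message from src to dst. */
void route(unsigned src, unsigned dst) {
    printf("source %u -> destination %u%u%u:", src,
           (dst >> 2) & 1, (dst >> 1) & 1, dst & 1);
    for (int k = STAGES - 1; k >= 0; k--) {
        unsigned bit = (dst >> k) & 1;   /* examine destination bits MSB first */
        printf("  stage %d: %s", STAGES - 1 - k, bit ? "lower" : "upper");
    }
    printf("\n");
}

int main(void) {
    route(1, 6);  /* P1 to destination 110 */
    route(5, 1);  /* P5 to destination 001 */
    return 0;
}

Because the switch settings depend only on the destination, two messages whose paths need the same switch output at some stage conflict, which is exactly the blocking described above.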
Hypercube System:
The hypercube or binary n-cube multiprocessor structure is a loosely coupled
system composed of N = 2^n processors interconnected in an n-dimensional
binary cube. Each processor forms a node of the cube; in effect, it contains
not only a CPU but also local memory and an I/O interface. Each processor
address differs from that of each of its n neighbors by exactly one bit
position. The figure below shows the hypercube structure for n = 1, 2, and 3.
Routing messages through an n-cube structure may take from one to n links
from a source node to a destination node. A routing procedure can be
developed by computing the exclusive-OR of the source node address with the
destination node address: the resulting binary value has 1 bits in the
positions corresponding to the axes on which the two nodes differ, and the
message is then sent along any one of those axes. A representative of the
hypercube architecture is the Intel iPSC computer complex. It consists of
128 (n = 7) microcomputers; each node consists of a CPU, a floating-point
processor, local memory, and serial communication interface units.
Fig: Hypercube structures for n=1,2,3
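The exclusive-OR routing procedure is easy to sketch in C (our own illustration; choosing the lowest differing bit at each step is one common convention, often called e-cube routing):

#include <stdio.h>

/* Route a message from src to dst in an n-cube by repeatedly
   flipping one address bit on which the nodes still differ. */
void route(unsigned src, unsigned dst) {
    printf("route %u -> %u:", src, dst);
    unsigned node = src;
    while (node != dst) {
        unsigned diff = node ^ dst;   /* 1 bits mark the axes still to cross */
        unsigned axis = diff & -diff; /* isolate the lowest such axis        */
        node ^= axis;                 /* move to the neighbor on that axis   */
        printf(" %u", node);
    }
    printf("\n");
}

int main(void) {
    route(2, 7);  /* 3-cube: 010 -> 011 -> 111, two links, as the XOR predicts */
    return 0;
}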
Part A
1. What is Parallel Processing?
2. What is pipelining?
3. What is Arithmetic pipeline?
4. What is Instruction pipeline?
5. What is multiprocessing?
6. What are the Characteristics of multiprocessors?
Part B
1. What are data hazards? Explain with examples.
2. Explain the different forms of parallel processing.
3. What are instruction hazards? Explain with examples.
4. What is array processing? Explain its characteristics.
5. Explain interconnection networks with neat diagrams.