Complete CO Notes (5 Units)


JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR

B.Tech II - II Sem (C.S.E)                                T   Tu   C
                                                          3   1    3
(15A05402) COMPUTER ORGANIZATION

Course Objectives:

• To learn the fundamentals of computer organization and its relevance to classical and
modern problems of computer design.
• To make the students understand the structure and behavior of various functional
modules of a computer.
• To understand the techniques that computers use to communicate with I/O devices.
• To study the concepts of pipelining and the way it can speed up processing.
• To understand the basic characteristics of multiprocessors.
Course Outcomes:

• Ability to use memory and I/O devices effectively.
• Ability to explore the hardware requirements for cache memory and virtual memory.
• Ability to design algorithms to exploit pipelining and multiprocessors.

Unit I:
Basic Structure of Computer: Computer Types, Functional Units, Basic operational Concepts,
Bus Structure, Software, Performance, Multiprocessors and Multicomputers.
Machine Instructions and Programs: Numbers, Arithmetic Operations and Programs,
Instructions and Instruction Sequencing, Addressing Modes, Basic Input/output Operations, Stacks
and Queues, Subroutines, Additional Instructions.

Unit II:
Arithmetic: Addition and Subtraction of Signed Numbers, Design of Fast Adders, Multiplication
of Positive Numbers, Signed-operand Multiplication, Fast Multiplication, Integer Division,
Floating-Point Numbers and Operations.
Basic Processing Unit: Fundamental Concepts, Execution of a Complete Instruction, Multiple-
Bus Organization, Hardwired Control, Microprogrammed Control.

Unit III:
The Memory System: Basic Concepts, Semiconductor RAM Memories, Read-Only Memories,
Speed, Size and Cost, Cache Memories, Performance Considerations, Virtual Memories, Memory
Management Requirements, Secondary Storage.
Unit IV:
Input/output Organization: Accessing I/O Devices, Interrupts, Processor Examples, Direct
Memory Access, Buses, Interface Circuits, Standard I/O Interfaces.

Unit V:
Pipelining: Basic Concepts, Data Hazards, Instruction Hazards, Influence on Instruction Sets.
Large Computer Systems: Forms of Parallel Processing, Array Processors, The Structure of
General-Purpose Multiprocessors, Interconnection Networks.

Textbook:
1) “Computer Organization”, Carl Hamacher, Zvonko Vranesic, Safwat Zaky, McGraw Hill
Education, 5th Edition, 2013.

Reference Textbooks:
1. Computer System Architecture, M. Morris Mano, Pearson Education, 3rd Edition.
2. Computer Organization and Architecture: Themes and Variations, Alan Clements, CENGAGE
Learning.
3. Computer Organization and Architecture, Smruti Ranjan Sarangi, McGraw Hill Education.
4. Computer Architecture and Organization, John P. Hayes, McGraw Hill Education.
CONTENTS
1 Unit-I:

1.1 Introduction
1.2 Unit-I notes
1.3 Part A Questions
1.4 Part B Questions

2 Unit-II:

2.1 Introduction
2.2 Unit-II notes
2.3 Part A Questions
2.4 Part B Questions

3 Unit-III:

3.1 Introduction
3.2 Unit-III notes
3.3 Part A Questions
3.4 Part B Questions

4 Unit-IV:

4.1 Introduction
4.2 Unit-IV notes
4.3 Part A Questions
4.4 Part B Questions

5 Unit-V:

5.1 Introduction
5.2 Unit-V notes
5.3 Part A Questions
5.4 Part B Questions
UNIT-1
BASIC STRUCTURE OF COMPUTERS

Computer types
A computer can be defined as a fast electronic calculating machine that
accepts digitized input information (data), processes it according to a list of
internally stored instructions, and produces the resulting output information.
The list of instructions is called a program, and the internal storage is called the
computer memory.

The different types of computers are


1. Personal computers: - This is the most common type, found in homes,
schools, business offices, etc. It is the most common type of desktop computer,
with processing and storage units along with various input and output devices.
2. Notebook computers: - These are compact and portable versions of the PC.
3. Workstations: - These have high-resolution graphics input/output (I/O)
capability, but roughly the same dimensions as a desktop computer.
They are used in engineering applications such as interactive design work.
4. Enterprise systems: - These are used for business data processing in medium
to large corporations that require much more computing power and storage
capacity than workstations. The Internet, together with its associated servers, has
become a dominant worldwide source of all types of information.
5. Supercomputers: - These are used for the large-scale numerical
calculations required in applications such as weather forecasting.
Functional unit
A computer consists of five functionally independent main parts: input,
memory, arithmetic and logic unit (ALU), output, and control units.

Fig : Functional units of computer
The input device accepts coded information, such as a source program written in a
high-level language. This is either stored in the memory or immediately used by the
processor to perform the desired operations. The program stored in the memory
determines the processing steps. Basically, the computer converts a source
program into an object program, i.e., into machine language.
Finally, the results are sent to the outside world through an output device.
All of these actions are coordinated by the control unit.
Input unit: -
The source program, high-level language program, coded information, or simply
data is fed to a computer through input devices; the keyboard is the most common type.
Whenever a key is pressed, the corresponding letter or digit is translated into its
equivalent binary code and transmitted over a cable to either the memory or the processor.
Joysticks, trackballs, mice, scanners, etc. are other input devices.
Memory unit: -
Its function is to store programs and data. It is basically of two types:
1. Primary memory
2. Secondary memory
1. Primary memory: - This is the memory directly associated with the processor, and
it operates at electronic speeds. Programs must be stored in this memory while they
are being executed. The memory contains a large number of semiconductor
storage cells, each capable of storing one bit of information. These cells are handled
in groups of fixed size called words.
To provide easy access to a word in memory, a distinct address is associated
with each word location. Addresses are numbers that identify memory locations.
The number of bits in each word is called the word length of the computer.
Programs must reside in the memory during execution. Instructions and data can be
written into the memory or read out under the control of the processor.
Memory in which any location can be reached in a short and fixed
amount of time after specifying its address is called random-access memory (RAM).
The time required to access one word is called the memory access time.
Memory which is only readable by the user and whose contents cannot be altered
is called read-only memory (ROM); it contains the operating system.
Caches are small, fast RAM units, which are coupled with the processor
and are often contained on the same IC chip to achieve high performance. Although
primary storage is essential, it tends to be expensive.

2. Secondary memory: - This is used where large amounts of data and programs have to
be stored, particularly information that is accessed infrequently.
Examples: magnetic disks and tapes, optical disks (i.e., CD-ROMs), floppies, etc.
Arithmetic logic unit (ALU):-
Most computer operations, such as addition, subtraction, multiplication, and
division, are executed in the ALU of the processor. The operands are brought into the
ALU from memory and stored in high-speed storage elements called registers. Then,
according to the instructions, the operation is performed in the required sequence.
The control unit and the ALU are many times faster than the other devices connected to
a computer system. This enables a single processor to control a number of external
devices such as keyboards, displays, magnetic and optical disks, sensors, and other
mechanical controllers.
Output unit:-
This is the counterpart of the input unit. Its basic function is to send
the processed results to the outside world.
Examples: printer, speakers, monitor, etc.
Control unit:-
It effectively is the nerve center that sends signals to other units and senses
their states. The actual timing signals that govern the transfer of data between
input unit, processor, memory and output unit are generated by the control unit.
Basic operational concepts
To perform a given task, an appropriate program consisting of a list of
instructions is stored in the memory. Individual instructions are brought from the
memory into the processor, which executes the specified operations. Data to be
processed are also stored in the memory.
Example: Add LOCA, R0
This instruction adds the operand at memory location LOCA to the operand in
register R0 and places the sum into register R0. This instruction requires the
performance of several steps:
1. First, the instruction is fetched from the memory into the processor.
2. The operand at LOCA is fetched and added to the contents of R0.
3. Finally, the resulting sum is stored in register R0.
The preceding Add instruction combines a memory access operation with an ALU
operation. In some other types of computers, these two operations are
performed by separate instructions, for performance reasons:
Load LOCA, R1
Add R1, R0

Transfers between the memory and the processor are started by sending
the address of the memory location to be accessed to the memory unit and
issuing the appropriate control signals. The data are then transferred to or from the
memory.
The fig shows how memory & the processor can be connected. In addition to
the ALU & the control circuitry, the processor contains a number of registers used for
several different purposes.
The instruction register (IR):- Holds the instruction that is currently being executed.
Its output is available to the control circuits, which generate the timing signals
that control the various processing elements involved in executing the instruction.
The program counter (PC):-
This is another specialized register that keeps track of the execution of a program.
It contains the memory address of the next instruction to be fetched and executed.
Besides the IR and PC, there are n general-purpose registers, R0 through Rn-1.

Figure: Connections between the processor and the memory (showing the control unit,
ALU, PC, IR, MAR, MDR, and n general-purpose registers R0 through Rn-1).

The other two registers which facilitate communication with memory are: -
1. MAR – (Memory Address Register):- It holds the address of the location to
be accessed.
2. MDR – (Memory Data Register):- It contains the data to be written into or
read out of the addressed location.
Operating steps are
1. Programs reside in the memory and usually enter it through the input unit.
2. Execution of the program starts when the PC is set to point at the first
instruction of the program.
3. Contents of PC are transferred to MAR and a Read Control Signal is sent to
the memory.
4. After the time required to access the memory elapses, the addressed word is read
out of the memory and loaded into the MDR.
5. Now contents of MDR are transferred to the IR & now the instruction is
ready to be decoded and executed.
6. If the instruction involves an operation by the ALU, it is necessary to obtain
the required operands.
7. An operand in the memory is fetched by sending its address to MAR &
Initiating a read cycle.
8. When the operand has been read from the memory to the MDR, it is
transferred from MDR to the ALU.
9. After one or two such repeated cycles, the ALU can perform the
desired operation.
10. If the result of this operation is to be stored in the memory, the result is
sent to MDR.
11. Address of location where the result is stored is sent to MAR & a write cycle
is initiated.
12. The contents of PC are incremented so that PC points to the next instruction
that is to be executed.
Normal execution of a program may be preempted (temporarily
interrupted) if some device requires urgent servicing; to do this, the device raises an
interrupt signal.
An interrupt is a request signal from an I/O device for service by the
processor. The processor provides the requested service by executing an
appropriate interrupt service routine.
Since such a diversion may change the internal state of the processor, its state must be
saved in memory locations before the interrupt is serviced. When the interrupt
service routine is completed, the state of the processor is restored so that the
interrupted program may continue.

Bus structure
A bus is the simplest and most common way of interconnecting the various parts of
a computer. To achieve a reasonable speed of operation, a computer must be
organized so that all its units can handle one full word of data at a given time. A
group of lines that serves as a connecting path for several devices is called a bus.
In addition to the lines that carry the data, the bus must have lines for address
and control purposes. The simplest way to interconnect the units is to use a single bus,
as shown below.

Fig: single bus structure


Since the bus can be used for only one transfer at a time, only two units
can actively use the bus at any given time. Bus control lines are used to arbitrate
multiple requests for use of one bus.
Single bus structure is
• Low cost
• Very flexible for attaching peripheral devices
A multiple-bus structure certainly increases performance, but it also
increases the cost significantly.
The interconnected devices are not all of the same speed, which leads to a
problem. It is solved by using buffer registers. These buffers
are electronic registers of small capacity when compared to the main memory, but
of comparable speed to the processor.
The instructions from the processor are first loaded into these buffers, and
then the complete transfer of data takes place at a fast rate.

Performance
The most important measure of the performance of a computer is how
quickly it can execute programs. The speed with which a computer executes programs
is affected by the design of its hardware. For best performance, it is necessary to
design the compiler, the machine instruction set, and the hardware in a coordinated
way.

The total time required to execute a program, called the elapsed time, is a measure of
the performance of the entire computer system. It is affected by the speed of the
processor, the disk, and the printer. The time needed by the processor to execute the
machine instructions of the program is called the processor time.
Just as the elapsed time for the execution of a program depends on all units in a
computer system, the processor time depends on the hardware involved in the
execution of individual machine instructions. This hardware comprises the
processor and the memory, which are usually connected by the bus as shown in the
fig c.

Fig : The processor cache

The pertinent parts of the fig. c are repeated in fig. d which includes the
cache memory as part of the processor unit.
Let us examine the flow of program instructions and data between the
memory and the processor. At the start of execution, all program instructions and the
required data are stored in the main memory. As the execution proceeds,
instructions are fetched one by one over the bus into the processor, and a copy is
placed in the cache. Later, if the same instruction or data item is needed a second time,
it is read directly from the cache.
The processor and relatively small cache memory can be fabricated on a
single IC chip. The internal speed of performing the basic steps of instruction
processing on chip is very high and is considerably faster than the speed at which
the instruction and data can be fetched from the main memory. A program will
be executed faster if the movement of instructions and data between the main
memory and the processor is minimized, which is achieved by using the cache.
For example:- Suppose a number of instructions are executed repeatedly over a
short period of time as happens in a program loop. If these instructions are
available in the cache, they can be fetched quickly during the period of repeated use.
The same applies to the data that are used repeatedly.

Processor clock: -
Processor circuits are controlled by a timing signal called a clock. The
clock defines regular time intervals called clock cycles. To execute a machine
instruction, the processor divides the action to be performed into a sequence of
basic steps such that each step can be completed in one clock cycle. The length P of one
clock cycle is an important parameter that affects processor performance; the
corresponding clock rate is R = 1/P cycles per second.
Processors used in today's personal computers and workstations have clock rates
that range from a few hundred million to over a billion cycles per second.

Basic performance equation


We now focus our attention on the processor time component of the total
elapsed time. Let T be the processor time required to execute a program that has
been prepared in some high-level language. The compiler generates a machine
language object program that corresponds to the source program. Assume that
complete execution of the program requires the execution of N machine
language instructions. The number N is the actual number of instruction executions,
and is not necessarily equal to the number of machine instructions in the
object program: some instructions may be executed more than once, as is the
case for instructions inside a program loop, and others may not be executed at all,
depending on the input data used.
Suppose that the average number of basic steps needed to execute one
machine instruction is S, where each basic step is completed in one clock cycle.
If the clock rate is R cycles per second, the program execution time is given by

T = (N x S) / R

This is often referred to as the basic performance equation.
We must emphasize that N, S, and R are not independent parameters; changing one
may affect another. Introducing a new feature in the design of a processor will
lead to improved performance only if the overall result is to reduce the value of T.
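
As a quick numeric check of the equation, the following C sketch evaluates T for
made-up values (10^8 instruction executions, 4 basic steps per instruction, a
500-MHz clock); the numbers are illustrative only, not taken from the text:

    #include <stdio.h>

    /* Basic performance equation: T = (N * S) / R, where
       N = machine instructions executed, S = average basic steps
       (clock cycles) per instruction, R = clock rate (cycles/sec). */
    double exec_time(double n, double s, double r) {
        return (n * s) / r;
    }

    int main(void) {
        double t = exec_time(100e6, 4.0, 500e6);  /* hypothetical N, S, R */
        printf("T = %.2f seconds\n", t);          /* prints T = 0.80 seconds */
        return 0;
    }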
Pipelining and super scalar operation: -
We assume that instructions are executed one after the other. Hence the
value of S is the total number of basic steps, or clock cycles, required to execute
one instruction. A substantial improvement in performance can be achieved by
overlapping the execution of successive instructions using a technique called
pipelining.
Consider the instruction Add R1, R2, R3
This adds the contents of R1 & R2 and places the sum into R3.

The contents of R1 and R2 are first transferred to the inputs of the ALU. After
the addition operation is performed, the sum is transferred to R3. The processor can
read the next instruction from the memory while the addition operation is being
performed. If that instruction also uses the ALU, its operand can be transferred
to the ALU inputs at the same time that the result of the Add instruction is being
transferred to R3.
In the ideal case, if all instructions are overlapped to the maximum degree
possible, execution proceeds at the rate of one instruction completed in each
clock cycle. Individual instructions still require several clock cycles to complete, but
for the purpose of computing T, the effective value of S is 1.

A higher degree of concurrency can be achieved if multiple instruction
pipelines are implemented in the processor. This means that multiple functional
units are used, creating parallel paths through which different instructions can be
executed in parallel. With such an arrangement, it becomes possible to start
the execution of several instructions in every clock cycle. This mode of operation is
called superscalar execution. If it can be sustained for a long time during program
execution, the effective value of S can be reduced to less than one. Of course, the
parallel execution must preserve the logical correctness of programs; that is, the results
produced must be the same as those produced by serial execution of the program
instructions. Nowadays many processors are designed in this manner.
Clock rate
There are two possibilities for increasing the clock rate, R:
1. Improving the IC technology makes logic circuits faster, which reduces the
time needed to complete a basic step. This allows the clock period P to be reduced
and the clock rate R to be increased.
2. Reducing the amount of processing done in one basic step also makes it
possible to reduce the clock period P. However, if the actions that have to be
performed by an instruction remain the same, the number of basic steps
needed may increase.
Increases in the value of R that are entirely caused by improvements in IC
technology affect all aspects of the processor's operation equally, with the
exception of the time it takes to access the main memory. In the presence of a cache,
the percentage of accesses to the main memory is small. Hence much of the
performance gain expected from the use of faster technology can be realized.
Instruction set CISC & RISC:-
Simple instructions require a small number of basic steps to execute.
Complex instructions involve a large number of steps. For a processor that has
only simple instructions, a large number of instructions may be needed to
perform a given programming task. This could lead to a large value of N and a small
value of S. On the other hand, if individual instructions perform more complex
operations, fewer instructions will be needed, leading to a lower value of N and a
larger value of S. It is not obvious whether one choice is better than the other.
Complex instructions combined with pipelining (effective value of S = 1)
would achieve the best performance. However, it is much easier to implement
efficient pipelining in processors with simple instruction sets.
Performance measurements
It is very important to be able to assess the performance of a computer: computer
designers use performance estimates to evaluate the effectiveness of new features.
The previous discussion suggests that the performance of a computer is
given by the execution time T for the program of interest.

In spite of the performance equation being so simple, the evaluation of T
is highly complex. Moreover, parameters such as the clock speed and various
architectural features are not reliable indicators of the expected performance.
Hence computer performance is measured using benchmark programs,
and to make comparisons possible, standardized programs must be used.

The performance measure is the time taken by the computer to execute a
given benchmark. Initially, some attempts were made to create artificial programs that
could be used as benchmarks, but such synthetic programs do not properly
predict the performance obtained when real application programs are run.

A nonprofit organization called SPEC (System Performance Evaluation
Corporation) selects and publishes benchmarks.

The programs selected range from game playing, compilers, and database
applications to numerically intensive programs in astrophysics and quantum
chemistry. In each case, the program under test is compiled, and its running time on a
real computer is measured. The same program is also compiled and run on one
computer selected as a reference.
The SPEC rating is computed as follows:

                  Running time on the reference computer
    SPEC rating = ----------------------------------------
                  Running time on the computer under test

A SPEC rating of 50 means that the computer under test is 50 times as fast as the
reference computer (an UltraSPARC 10 workstation) for that program. This is repeated
for all the programs in the SPEC suite, and the geometric mean of
the results is computed.
Let SPECi be the rating for program i in the suite. The overall SPEC rating
for the computer is given by

    SPEC rating = (SPEC1 x SPEC2 x ... x SPECn)^(1/n)

where n is the number of programs in the suite.

Since actual execution time is measured, the SPEC rating is a measure of the
combined effect of all factors affecting performance, including the compiler, the OS,
the processor, and the memory of the computer being tested.
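
The geometric mean described above is easy to compute; here is a minimal C
sketch (the four ratings are invented values, not actual SPEC results):

    #include <stdio.h>
    #include <math.h>

    /* Overall SPEC rating = n-th root of the product of the n
       individual program ratings SPECi (the geometric mean). */
    double overall_spec(const double ratings[], int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(ratings[i]);   /* summing logs avoids overflow */
        return exp(log_sum / n);
    }

    int main(void) {
        double ratings[] = {50.0, 40.0, 60.0, 45.0};  /* hypothetical SPECi values */
        printf("Overall SPEC rating = %.2f\n", overall_spec(ratings, 4));
        return 0;
    }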

Multiprocessors and multicomputers:-

• Large computers that contain a number of processor units
are called multiprocessor systems.
• These systems either execute a number of different application tasks in
parallel, or execute subtasks of a single large task in parallel.
• All processors usually have access to all memory locations in such
systems, and hence they are called shared-memory multiprocessor systems.
• The high performance of these systems comes with much increased
complexity and cost.
• In contrast to multiprocessor systems, it is also possible to use an
interconnected group of complete computers to achieve high total
computational power. These computers normally have access only to their
own memory units. When the tasks they are executing need to
communicate data, they do so by exchanging messages over a
communication network. This property distinguishes them from shared-
memory multiprocessors, leading to the name message-passing
multicomputers.

Number Representation
We obviously need to represent both positive and negative numbers. Three
systems are used for representing such numbers:
• Sign-and-magnitude
• 1’s-complement
• 2’s-complement
In all three systems, the leftmost bit is 0 for positive numbers and 1 for negative
numbers. Fig 2.1 illustrates all three representations using 4-bit numbers. Positive
values have identical representations in all systems, but negative values have
different representations. In the sign-and-magnitude system, negative values are
represented by changing the most significant bit (b3 in figure 2.1) from 0 to 1 in the
B vector of the corresponding positive value. For example, +5 is represented by
0101, and -5 is represented by 1101. In 1's-complement representation, negative
values are obtained by complementing each bit of the corresponding positive
number. Thus, the representation for -3 is obtained by complementing each bit in the
vector 0011 to yield 1100. Clearly, the same operation, bit complementing, is done in
converting a negative number to the corresponding positive value. Converting
either way is referred to as forming the 1's-complement of a given number.
Finally, in the 2's-complement system, forming the 2's-complement of an n-bit number
is done by subtracting that number from 2^n.

Hence, the 2's-complement of a number is obtained by adding 1 to the 1's-complement
of that number.
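
These relationships can be checked with C's bitwise operators. A minimal sketch for
4-bit values, using the low four bits of an unsigned int to stand for the vector
b3 b2 b1 b0:

    #include <stdio.h>

    int main(void) {
        unsigned int plus5   = 0x5;                  /* 0101 = +5                   */
        unsigned int ones    = (~plus5) & 0xF;       /* 1010 = -5 in 1's complement */
        unsigned int twos    = (ones + 1) & 0xF;     /* 1011 = -5 in 2's complement,
                                                        i.e. 2^4 - 5 = 11           */
        unsigned int signmag = (plus5 | 0x8) & 0xF;  /* 1101 = -5 in sign-and-magnitude */
        printf("1's: %X  2's: %X  sign-magnitude: %X\n", ones, twos, signmag);
        return 0;
    }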

Addition of Positive numbers:-


Consider adding two 1-bit numbers. The results are shown in figure 2.2.
Note that the sum of 1 and 1 requires the 2-bit vector 10 to represent the value 2.
We say that the sum is 0 and the carry-out is 1. In order to add multiple-bit numbers,
we use a method analogous to that used for manual computation with decimal
numbers: we add bit pairs starting from the low-order (right) end of the bit vectors,
propagating carries toward the high-order (left) end.

0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 10 (sum 0, carry-out 1)

Figure 2.2 Addition of 1-bit numbers.

1.12 Memory locations and addresses


Number and character operands, as well as instructions, are stored in the
memory of a computer. The memory consists of many millions of storage cells, each of
which can store a bit of information having the value 0 or 1. Because a single bit
represents a very small amount of information, bits are seldom handled individually.
The usual approach is to deal with them in groups of fixed size. For this purpose,
the memory is organized so that a group of n bits can be stored or retrieved in a

single, basic operation. Each group of n bits is referred to as a word of
information, and n is called the word length. The memory of a computer can be
schematically represented as a collection of words as shown in figure (a).

Modern computers have word lengths that typically range from 16 to 64
bits. If the word length of a computer is 32 bits, a single word can store a 32-bit 2's-
complement number or four ASCII characters, each occupying 8 bits. A unit of 8 bits is
called a byte.

Accessing the memory to store or retrieve a single item of information,
either a word or a byte, requires distinct names or addresses for each item
location. It is customary to use the numbers from 0 through 2^k - 1, for some suitable
value of k, as the addresses of successive locations in the memory. The 2^k
addresses constitute the address space of the computer, and the memory can have
up to 2^k addressable locations. For example, a 24-bit address generates an address
space of 2^24 (16,777,216) locations, and a 32-bit address creates an address space of
2^32, or 4G (4 giga), locations.

BYTE ADDRESSABILITY:-
We now have three basic information quantities to deal with: the bit, byte
and word. A byte is always 8 bits, but the word length typically ranges from 16 to
64 bits. The most practical assignment is to have successive addresses refer to
successive byte locations in the memory (see Fig a, memory words, and Fig b, four
characters). This is the assignment used in most modern computers, and is the one
we will normally use here. The term byte-addressable memory is used for this
assignment. Byte locations have addresses 0, 1, 2, .... Thus, if the word length of the
machine is 32 bits, successive words are located at addresses 0, 4, 8, ...., with each
word consisting of four bytes.

BIG-ENDIAN AND LITTLE-ENDIAN ASSIGNMENTS:-


There are two ways that byte addresses can be assigned across words, as
shown in fig b. The name big-endian is used when lower byte addresses are used
for the more significant bytes (the leftmost bytes) of the word. The name little-
endian is used for the opposite ordering, where the lower byte addresses are used
for the less significant bytes (the rightmost bytes) of the word.
In addition to specifying the address ordering of bytes within a word, it is
also necessary to specify the labeling of bits within a byte or a word. The bits of a
word are labeled from the most significant to the least significant, and the same
ordering is used for labeling bits within a byte, that is, b7, b6, ...., b0, from left
to right.
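
The byte ordering of a real machine can be observed directly by storing a
multi-byte word and inspecting the byte at its lowest address; a minimal C sketch:

    #include <stdio.h>

    int main(void) {
        unsigned int word = 0x01020304;              /* a 32-bit word              */
        unsigned char *p = (unsigned char *)&word;   /* byte at the lowest address */
        if (p[0] == 0x01)
            printf("big-endian: most significant byte at the lowest address\n");
        else
            printf("little-endian: least significant byte at the lowest address\n");
        return 0;
    }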

WORD ALIGNMENT:-
In the case of a 32-bit word length, natural word boundaries occur at
addresses 0, 4, 8, ..., as shown in the figure above. We say that the word locations have
aligned addresses. In general, words are said to be aligned in memory if they begin at a
byte address that is a multiple of the number of bytes in a word. The number of bytes
in a word is a power of 2. Hence, if the word length is 16 (2^1 bytes), aligned words
begin at byte addresses 0, 2, 4, ..., and for a word length of 64 (2^3 bytes), aligned
words begin at byte addresses 0, 8, 16, ....
There is no fundamental reason why words cannot begin at an arbitrary
byte address. In that case, words are said to have unaligned addresses. While
the most common case is to use aligned addresses, some computers allow the
use of unaligned word addresses.
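
In code, the alignment test is simply a remainder check: an address is aligned if
dividing it by the number of bytes in a word leaves no remainder. A small sketch:

    #include <stdio.h>

    /* An address is aligned if it is a multiple of the word size in bytes. */
    int is_aligned(unsigned long addr, unsigned long word_bytes) {
        return addr % word_bytes == 0;
    }

    int main(void) {
        /* For 32-bit (4-byte) words: 0 and 8 are aligned, 6 is not */
        printf("%d %d %d\n", is_aligned(0, 4), is_aligned(8, 4), is_aligned(6, 4));
        return 0;
    }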

ACCESSING NUMBERS, CHARACTERS, AND CHARACTER STRINGS:-


A number usually occupies one word. It can be accessed in the memory
by specifying its word address. Similarly, individual characters can be accessed by their
byte address.
In many applications, it is necessary to handle character strings of variable
length. The beginning of the string is indicated by giving the address of the byte
containing its first character. Successive byte locations contain successive

characters of the string. There are two ways to indicate the length of the string.
A special control character with the meaning “end of string” can be used as the last
character in the string, or a separate memory word location or processor register can
contain a number indicating the length of the string in bytes.
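
Both conventions survive in practice: C strings mark the end with a special
terminator character (the null byte), while length-prefixed strings keep an explicit
count in a separate word. A brief sketch of the two:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Convention 1: terminator character; stored as H E L L O \0 */
        const char terminated[] = "HELLO";
        printf("length found by scanning for the terminator: %zu\n",
               strlen(terminated));

        /* Convention 2: a separate word holds the length in bytes */
        struct { size_t len; char data[5]; } counted = { 5, {'H','E','L','L','O'} };
        printf("length read from the stored count: %zu\n", counted.len);
        return 0;
    }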

Instructions and instruction sequencing


A computer must have instructions capable of performing four types
of operations.
• Data transfers between the memory and the processor registers
• Arithmetic and logic operations on data
• Program sequencing and control
• I/O transfers
REGISTER TRANSFER NOTATION:-
We frequently need to describe the transfer of information from one location in the
computer to another. Possible locations that may be involved in such transfers are
memory locations, processor registers, or registers in the I/O subsystem. Most of the
time, we identify a location by a symbolic name standing for its hardware binary
address.
For example, names for the addresses of memory locations may be LOC, PLACE, A,
VAR2; processor registers names may be R0, R5; and I/O register names may be
DATAIN, OUTSTATUS, and so on. The contents of a location are denoted by
placing square brackets around the name of the location. Thus, the expression

R1 <= [LOC]

means that the contents of memory location LOC are transferred into processor
register R1. As another example, consider the operation that adds the contents of
registers R1 and R2, and then places their sum into register R3. This action is indicated
as

R3 <= [R1] + [R2]

This type of notation is known as Register Transfer Notation (RTN). Note
that the right-hand side of an RTN expression always denotes a value, and the left-
hand side is the name of a location where the value is to be placed, overwriting the old
contents of that location.

ASSEMBLY LANGUAGE NOTATION:-


We use another type of notation, assembly language format, to represent machine
instructions and programs. For example, an instruction that
causes the transfer described above, from memory location LOC to processor
register R1, is specified by the statement
Move LOC, R1

The contents of LOC are unchanged by the execution of this instruction, but
the old contents of register R1 are overwritten.
The second example of adding two numbers contained in processor
registers R1 and R2 and placing their sum in R3 can be specified by the assembly
language statement
Add R1, R2, R3

BASIC INSTRUCTIONS:-
The operation of adding two numbers is a fundamental capability in
any computer. The statement
C=A+B
In a high-level language program is a command to the computer to add
the current values of the two variables called A and B, and to assign the sum to
a third variable, C. When the program containing this statement is compiled, the
three variables, A, B, and C, are assigned to distinct locations in the memory. We
will use the variable names to refer to the corresponding memory location
addresses. The contents of these locations represent the values of the three
variables. Hence, the above high-level language statement requires the action.
C <= [A] + [B]
To carry out this action, the contents of memory locations A and B are
fetched from the memory and transferred into the processor where their sum is
computed. This result is then sent back to the memory and stored in location C.

Let us first assume that this action is to be accomplished by a single


machine instruction. Furthermore, assume that this instruction contains the
memory addresses of the three operands – A, B, and C. This three-address
instruction can be represented symbolically as
Add A, B, C
Operands A and B are called the source operands, C is called the
destination operand, and Add is the operation to be performed on the operands. A
general instruction of this type has the format.
Operation Source1, Source 2, Destination
If k bits are needed to specify the memory address of each operand, the
encoded form of the above instruction must contain 3k bits for addressing purposes,
in addition to the bits needed to denote the Add operation.

An alternative approach is to use a sequence of simpler instructions to
perform the same task, with each instruction having only one or two operands.
Suppose that two-address instructions of the form

Operation Source, Destination

are available. An Add instruction of this type is

Add A, B

which performs the operation B <= [A] + [B].
A single two-address instruction cannot be used to solve our original problem,
which is to add the contents of locations A and B, without destroying either of
them, and to place the sum in location C. The problem can be solved by using another
two-address instruction that copies the contents of one memory location into
another. Such an instruction is

Move B, C

which performs the operation C <= [B], leaving the contents of location B unchanged.
Using only one-address instructions, the operation C< = [A] + [B] can
be performed by executing the sequence of instructions
Load A
Add B
Store C
Some early computers were designed around a single accumulator
structure. Most modern computers have a number of general-purpose processor
registers – typically 8 to 32, and even considerably more in some cases. Access to
data in these registers is much faster than to data stored in memory locations because
the registers are inside the processor.
Let Ri represent a general-purpose register. The instructions

Load A, Ri
Store Ri, A
Add A, Ri

are generalizations of the Load, Store, and Add instructions for the single-
accumulator case, in which register Ri performs the function of the accumulator.
When a processor has several general-purpose registers, many
instructions involve only operands that are in registers. In fact, in many modern
processors, computations can be performed directly only on data held in
processor registers. Instructions such as

Add Ri, Rj
or
Add Ri, Rj, Rk

are of this kind.
In both of these instructions, the source operands are the contents of
registers Ri and Rj. In the first instruction, Rj also serves as the destination register,
whereas in the second instruction, a third register, Rk, is used as the destination.
It is often necessary to transfer data between different locations. This is achieved with
the instruction
Move Source, Destination
When data are moved to or from a processor register, the Move instruction
can be used rather than the Load or Store instructions because the order of the
source and destination operands determines which operation is intended.
In processors where arithmetic operations are allowed only on operands that are
processor registers, the C = A + B task can be performed by the instruction sequence
Move A, Ri
Move B, Rj
Add Ri, Rj
Move Rj, C
In processors where one operand may be in the memory but the other must
be in register, an instruction sequence for the required task would be
Move A, Ri
Add B, Ri
Move Ri, C
The speed with which a given task is carried out depends on the time it takes to
transfer instructions from memory into the processor and to access the
operands referenced by these instructions. Transfers that involve the memory are
much slower than transfers within the processor.
We have discussed three-, two-, and one-address instructions. It is also
possible
to use instructions in which the locations of all operands are defined implicitly.
Such instructions are found in machines that store operands in a structure called a
pushdown stack. In this case, the instructions are called zero-address instructions.

INSTRUCTION EXECUTION AND STRAIGHT-LINE SEQUENCING:-


In the preceding discussion of instruction formats, we used the task
C <= [A] + [B]

Fig 2.8: A program for C <= [A] + [B]

Fig 2.8 shows a possible program segment for this task as it appears in the memory of
a computer. We have assumed that the computer allows one memory operand


per instruction and has a number of processor registers. The three instructions of the
program are in successive word locations, starting at location i. Since each
instruction is 4 bytes long, the second and third instructions start at addresses i + 4
and i + 8.
Let us consider how this program is executed. The processor contains a
register called the program counter (PC), which holds the address of the
instruction to be executed next. To begin executing a program, the address of its first
instruction (i in our example) must be placed into the PC. Then, the processor
control circuits use the information in the PC to fetch and execute instructions,
one at a time, in the order of increasing addresses. This is called straight-line
sequencing. During the execution of each instruction, the PC is incremented by 4 to
point to the next instruction. Thus, after the Move instruction at location i + 8 is
executed, the PC contains the value i + 12, which is the address of the first
instruction of the next program segment.

Executing a given instruction is a two-phase procedure. In the first phase,


called instruction fetch, the instruction is fetched from the memory location whose
address is in the PC.

This instruction is placed in the instruction register (IR) in the processor.
The instruction in IR is examined to determine which operation is to be
performed. The specified operation is then performed by the processor. This
often involves fetching operands from the memory or from processor registers,
performing an arithmetic or logic operation, and storing the result in the destination
location.

BRANCHING:-
Consider the task of adding a list of n numbers. Instead of using a long list of
add instructions, it is possible to place a single add instruction in a program loop, as
shown in fig b. The loop is a straight-line sequence of instructions executed as
many times as needed. It starts at location LOOP and ends at the instruction
Branch > 0. During each pass through this loop, the address of the next list entry is
determined, and that entry is fetched and added to R0.

Fig b: Using a loop to add n numbers

Assume that the number of entries in the list, n, is stored in memory
location N, as shown. Register R1 is used as a counter to determine the number of
times the loop is executed. Hence, the contents of location N are loaded into register
R1 at the beginning of the program. Then, within the body of the loop, the
instruction

Decrement R1

reduces the contents of R1 by 1 each time through the loop.
This type of instruction loads a new value into the program counter. As a
result, the processor fetches and executes the instruction at this new address,
called the branch target, instead of the instruction at the location that follows the
branch instruction in sequential address order. A conditional branch instruction
causes a branch only if a specified condition is satisfied. If the condition is not
satisfied, the PC is incremented in the normal way, and the next instruction in
sequential address order is fetched and
executed.
Branch > 0 LOOP

(branch if greater than 0) is a conditional branch instruction that causes a
branch to location LOOP if the result of the immediately preceding instruction, which is
the decremented value in register R1, is greater than zero. This means that the
loop is repeated as long as there are entries in the list that are yet to be added to
R0. At the end of the nth pass through the loop, the Decrement instruction
produces a value of zero, and hence, branching does not occur.
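
For comparison, the same counter-controlled loop written in C; NUM and N here are
stand-ins for the memory locations in Fig b, and the list values are invented:

    #include <stdio.h>

    int main(void) {
        int NUM[] = {3, 7, -2, 9, 5};   /* the list of numbers            */
        int N = 5;                      /* number of entries (location N) */
        int R0 = 0;                     /* running sum                    */
        int R1 = N;                     /* counter loaded from N          */
        while (R1 > 0) {                /* Branch > 0 LOOP                */
            R0 += NUM[N - R1];          /* fetch next entry, add to R0    */
            R1--;                       /* Decrement R1                   */
        }
        printf("SUM = %d\n", R0);       /* prints SUM = 22                */
        return 0;
    }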

CONDITION CODES:-
The processor keeps track of information about the results of various
operations for use by subsequent conditional branch instructions. This is
accomplished by recording the required information in individual bits, often called
condition code flags. These flags are usually grouped together in a special
processor register called the condition code register or status register. Individual
condition code flags are set to 1 or cleared to 0, depending on the outcome of the
operation performed.
Four commonly used flags are
N(negative) Set to 1 if the result is negative; otherwise, cleared to 0
Z(zero) Set to 1 if the result is 0; otherwise, cleared to 0
V(overflow) Set to 1 if arithmetic overflow occurs; otherwise, cleared to 0
C(carry) Set to 1 if a carry-out results from the operation; otherwise, cleared to 0
The instruction Branch > 0, discussed in the previous section, is an example of
a branch instruction that tests one or more of the condition flags. It causes a
branch if the value tested is neither negative nor equal to zero. That is, the branch is
taken if neither N nor Z is 1. The conditions are given as logic expressions
involving the condition code flags.
In some computers, the condition code flags are affected automatically by
instructions that perform arithmetic or logic operations. However, this is not

always the case. A number of computers have two versions of an Add instruction.
GENERATING MEMORY ADDRESSES:-
Let us return to fig b. The purpose of the instruction block at LOOP is to add a
different number from the list during each pass through the loop. Hence, the
Add instruction in the block must refer to a different address during each pass.
How are the addresses to be specified ? The memory operand address cannot be
given directly in a single Add instruction in the loop. Otherwise, it would need to be
modified on each pass through the loop.
The instruction set of a computer typically provides a number of such
methods, called addressing modes. While the details differ from one computer to
another, the underlying concepts are the same.

MACHINE INSTRUCTIONS AND PROGRAMS


1.15.1 Addressing modes:
In general, a program operates on data that reside in the computer’s
memory. These data can be organized in a variety of ways. If we want to keep
track of students’ names, we can write them in a list. Programmers use
organizations called data structures to represent the data used in computations.
These include lists, linked lists, arrays, queues, and so on.

Programs are normally written in a high-level language, which enables the


programmer to use constants, local and global variables, pointers, and arrays.
The different ways in which the location of an operand is specified in an
instruction are referred to as addressing modes.

Name                         Assembler syntax   Addressing function

Immediate                    #Value             Operand = Value
Register                     Ri                 EA = Ri
Absolute (Direct)            LOC                EA = LOC
Indirect                     (Ri)               EA = [Ri]
                             (LOC)              EA = [LOC]
Index                        X(Ri)              EA = [Ri] + X
Base with index              (Ri, Rj)           EA = [Ri] + [Rj]
Base with index and offset   X(Ri, Rj)          EA = [Ri] + [Rj] + X
Relative                     X(PC)              EA = [PC] + X
Autoincrement                (Ri)+              EA = [Ri]; Increment Ri
Autodecrement                -(Ri)              Decrement Ri; EA = [Ri]

Table 2.1 Generic addressing modes

EA = effective address
Value = a signed number

IMPLEMENTATION OF VARIABLES AND CONSTANTS:-


Variables and constants are the simplest data types and are found in almost every
computer program. In assembly language, a variable is represented by allocating a
register or memory location to hold its value. Thus, the value can be changed as
needed using appropriate instructions.

Register mode - The operand is the contents of a processor register; the name
(address) of the register is given in the instruction.

Absolute mode – The operand is in a memory location; the address of this location
is given explicitly in the instruction. (In some assembly languages, this mode is called
Direct)
The instruction

Move LOC, R2

uses these two modes: the operand at location LOC is specified in the Absolute mode,
and R2 in the Register mode. Processor registers are used as temporary storage
locations; the data in a register are accessed using the Register mode. The Absolute
mode can represent global variables in a program. A declaration such as

Integer A, B;

in a high-level language program causes the compiler to allocate memory locations
to the variables A and B, which can then be accessed using the Absolute mode.

Immediate mode – The operand is given explicitly in the instruction.


For example, the instruction
Move 200immediate, R0

Places the value 200 in register R0. Clearly, the Immediate mode is only used to
specify the value of a source operand. Using a subscript to denote the Immediate mode
is not appropriate in assembly languages. A common convention is to use the sharp sign
(#) in front of the value to indicate that this value is to be used as an immediate
operand. Hence, we write the instruction above in the form
Move #200, R0

INDIRECTION AND POINTERS:-


In the addressing modes that follow, the instruction does not give the operand
or its address explicitly. Instead, it provides information from which the memory address
of the operand can be determined. We refer to this address as the effective address
(EA) of the operand.

Indirect mode – The effective address of the operand is the contents of a register
or memory location whose address appears in the instruction.

To execute the Add instruction in fig (a), the processor uses the value which is in
register R1 as the effective address of the operand. It requests a read operation from the
memory to read the contents of location B. the value read is the desired operand,
which the processor adds to the contents of register R0. Indirect addressing through a
memory location is also possible as shown in fig (b). In this case, the processor first
reads the contents of memory location A, then requests a second read operation using
the value B as an address to obtain the operand

Fig (a) Through a general-purpose register. (b) Through a memory location.

Move N, R1
Move #NUM1, R2
Clear R0
LOOP  Add (R2), R0
      Add #4, R2
      Decrement R1
      Branch > 0 LOOP
      Move R0, SUM

Fig : use of indirect addressing in the program


The register or memory location that contains the address of an operand is
called a pointer. Indirection and the use of pointers are important and powerful
concepts in programming.
In the program shown, register R2 is used as a pointer to the numbers in the
list, and the operands are accessed indirectly through R2. The initialization section
of the program loads the counter value n from memory location N into R1 and
uses the immediate addressing mode to place the address value NUM1, which is the
address of the first number in the list, into R2. Then it clears R0 to 0. The first two
instructions in the loop implement the unspecified instruction block starting at
LOOP. The first time through the loop, the instruction Add (R2), R0 fetches the
operand at location NUM1 and adds it to R0. The second Add instruction adds 4 to
the contents of the pointer R2, so that it will contain the address value NUM2 when
the above instruction is executed in the second pass through the loop.
Consider, for example, a high-level language statement such as A = *B,
where B is a pointer variable. This statement may be compiled into

Move B, R1
Move (R1), A

Using indirect addressing through memory, the same action can be achieved with

Move (B), A
Indirect addressing through registers is used extensively. The above
program shows the flexibility it provides. Also, when absolute addressing is not
available, indirect addressing through registers makes it possible to access global
variables by first loading the operand’s address in a register.
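
The Indirect mode maps directly onto pointer dereferencing in C. A minimal sketch
of the Move B, R1 / Move (R1), A sequence above (the variable names mirror the text):

    #include <stdio.h>

    int main(void) {
        int operand = 42;       /* the operand, stored somewhere in memory         */
        int *B = &operand;      /* B is a pointer variable: its contents are the
                                   address of the operand                          */
        int *R1 = B;            /* Move B, R1   : load the pointer into a register */
        int A = *R1;            /* Move (R1), A : indirect access through R1       */
        printf("A = %d\n", A);  /* prints A = 42 */
        return 0;
    }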

INDEXING AND ARRAYS:-


A different kind of flexibility for accessing operands is useful in dealing
with lists and arrays.

Index mode – the effective address of the operand is generated by adding a
constant value to the contents of a register.
The register used may be either a special register provided for this
purpose or, more commonly, any one of a set of general-purpose registers
in the processor. In either case, it is referred to as an index register. We indicate the
Index mode symbolically as

X(Ri)

where X denotes the constant value contained in the instruction and Ri is
the name of the register involved. The effective address of the operand is given by

EA = X + [Ri]

The contents of the index register are not changed in the process of
generating the effective address. In an assembly language program, the constant X
may be given either as an explicit number or as a symbolic name representing a
numerical value.
Fig a illustrates two ways of using the Index mode. In fig a, the index
register, R1, contains the address of a memory location, and the value X defines
an offset (also called a displacement) from this address to the location where the
operand is found. An alternative use is illustrated in fig b. Here, the constant X
corresponds to a memory address, and the contents of the index register define
the offset to the operand. In either case, the effective address is the sum of two
values; one is given explicitly in the instruction, and the other is stored in a register.

Fig (a) Offset is given as a constant

Fig (b) Offset is in the index register

Several variations of this basic form of indexed addressing provide very efficient
access to memory operands in practical programming situations. For example, a
second register may be used to contain the offset X, in which case we can write the
Index mode as

(Ri, Rj)

The effective address is the sum of the contents of registers Ri and Rj. The second
register is usually called the base register. This form of indexed addressing provides
more flexibility in accessing operands, because both components of the effective
address can be changed.
Another version of the Index mode uses two registers plus a constant, which
can be denoted as
X(Ri, Rj)

In this case, the effective address is the sum of the constant X and the contents
of registers Ri and Rj. This added flexibility is useful in accessing multiple
components inside each item in a record, where the beginning of an item is
specified by the (Ri, Rj) part of the addressing mode. In other words, this mode
implements a three-dimensional array.
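
In C terms, the Index mode is how an array element or a record field is reached: a
base address (in a register) plus a constant offset (in the instruction). A small
sketch, with invented record contents:

    #include <stdio.h>

    /* 'score' sits at a fixed offset X inside every record, so accessing
       it through a base pointer is exactly X(Ri) indexed addressing. */
    struct record { int id; int score; };

    int main(void) {
        struct record list[3] = {{1, 90}, {2, 75}, {3, 88}};
        struct record *ri = list;   /* base address, as if held in a register */
        for (int i = 0; i < 3; i++)
            printf("score %d = %d\n", i, (ri + i)->score);  /* EA = X + [Ri] */
        return 0;
    }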

RELATIVE ADDRESSING:-
We have defined the Index mode using general-purpose processor
registers. A useful version of this mode is obtained if the program counter, PC, is
used instead of a general purpose register. Then, X(PC) can be used to address a
memory location that is X bytes away from the location presently pointed to by the
program counter.

Relative mode – The effective address is determined by the Index mode using
the program counter in place of the general-purpose register Ri.
This mode can be used to access data operands. But, its most common use is
to specify the target address in branch instructions. An instruction such as

Branch > 0 LOOP

causes program execution to go to the branch target location identified by the name
LOOP if the branch condition is satisfied. This location can be computed by
specifying it as an offset from the current value of the program counter. Since the
branch target may be either before or after the branch instruction, the offset is given
as a signed number.

Autoincrement mode – the effective address of the operand is the contents of a
register specified in the instruction. After accessing the operand, the contents of this
register are automatically incremented to point to the next item in a list. We denote
this mode as
(Ri)+
Autodecrement mode – the contents of a register specified in the instruction are
first automatically decremented and are then used as the effective address of the
operand.
-(Ri)
Move N, R1
Move #NUM1, R2
Clear R0
LOOP Add (R2)+, R0
Decrement R1
Branch>0 LOOP
Move R0, SUM

Fig c The Autoincrement addressing mode used in the program

Basic input/output operations
We now examine the means by which data are transferred between the
memory of a computer and the outside world. Input/Output (I/O) operations are
essential, and the way they are performed can have a significant effect on the
performance of the computer.
Consider a task that reads in character input from a keyboard and
produces character output on a display screen. A simple way of performing such I/O
tasks is to use a method known as program-controlled I/O. The rate of data transfer
from the keyboard to a computer is limited by the typing speed of the user, which is
unlikely to exceed a few characters per second. The rate of output transfers from
the computer to the display is much higher. It is determined by the rate at which
characters can be transmitted over the link between the computer and the display
device, typically several thousand characters per second. However, this is still
much slower than the speed of a processor that can execute many millions of
instructions per second. The difference in speed between the processor and I/O
devices creates the need for mechanisms to synchronize the transfer of data
between them.

Fig a Bus connection for processor, keyboard, and display

The keyboard and the display are separate devices, as shown in fig a. The action
of striking a key on the keyboard does not automatically cause the corresponding
character to be displayed on the screen. One block of instructions in the I/O program
transfers the character into the processor, and another associated block of
instructions causes the character to be displayed.

Striking a key stores the corresponding character code in an 8-bit buffer register
associated with the keyboard. Let us call this register DATAIN, as shown in fig a. To
inform the processor that a valid character is in DATAIN, a status control flag, SIN, is
set to 1. A program monitors SIN, and when SIN is set to 1, the processor reads the
contents of DATAIN. When the character is transferred to the processor, SIN is
automatically cleared to 0. If a second character is entered at the keyboard, SIN is again
set to 1, and the process repeats.

An analogous process takes place when characters are transferred from


the processor to the display. A buffer register, DATAOUT, and a status control flag,
SOUT, are used for this transfer. When SOUT equals 1, the display is ready to
receive a character.

In order to perform I/O transfers, we need machine instructions that can


check the state of the status flags and transfer data between the processor and
the I/O device. These instructions are similar in format to those used for moving
data between the processor and the memory. For example, the processor can
monitor the keyboard status flag SIN and transfer a character from DATAIN to
register R1 by the following sequence of operations.
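Following the notation used in the textbook, the sequence has the following form; the first operation repeatedly tests the SIN flag, and the transfer from DATAIN to R1 takes place only when SIN becomes equal to 1:

READWAIT    Branch to READWAIT if SIN = 0
            Input from DATAIN to R1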

Stacks and queues

A computer program often needs to perform a particular subtask using


the familiar subroutine structure. In order to organize the control and information
linkage between the main program and the subroutine, a data structure called a stack
is used. This section will describe stacks, as well as a closely related data structure
called a queue.

Data operated on by a program can be organized in a variety of ways. We have


already encountered data structured as lists. Now, we consider an important data
structure known as a stack. A stack is a list of data elements, usually words or
bytes, with the accessing restriction that elements can be added or removed at
one end of the list only.

This end is called the top of the stack, and the other end is called the bottom.
Another descriptive phrase, last-in-first-out (LIFO) stack, is also used to describe this
type of storage mechanism; the last data item placed on the stack is the first one
removed when retrieval begins. The terms push and pop are used to describe placing
a new item on the stack and removing the top item from the stack, respectively.

Fig b shows a stack of word data items in the memory of a computer. It contains numerical values, with 43 at the bottom and -28 at the top.
A processor register is used to keep track of the address of the element of the stack
that is at the top at any given time. This register is called the stack pointer (SP). It
could be one of the general-purpose registers or a register dedicated to this function.
Fig b A stack of words in the memory
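To make the pointer manipulation concrete, the following is a minimal Python sketch (an illustration, not the textbook's code; the memory size is an arbitrary assumption) of push and pop using a stack pointer SP, with the stack growing toward lower addresses:

memory = [0] * 100        # a small word-addressable memory
sp = 100                  # SP initially points just past the stack area

def push(item):
    global sp
    sp -= 1               # the effect of  Move NEWITEM, -(SP)
    memory[sp] = item

def pop():
    global sp
    item = memory[sp]     # the effect of  Move (SP)+, ITEM
    sp += 1
    return item

push(43)
push(-28)
print(pop(), pop())       # prints: -28 43 (last in, first out)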

Another useful data structure that is similar to the stack is called a queue.
Data are stored in and retrieved from a queue on a first-in-first-out (FIFO) basis. Thus,
if we assume that the queue grows in the direction of increasing addresses in the
memory, which is a common practice, new data are added at the back (high-
address end) and retrieved from the front (low-address end) of the queue.
There are two important differences between how a stack and a queue are implemented. One end of the stack is fixed (the bottom), while the other end rises and falls as data are pushed and popped; a single pointer is needed to point to the top of the stack at any given time. In contrast, both ends of a queue move to higher addresses as data are added at the back and removed from the front, so two pointers are needed to keep track of the two ends of the queue.

Another difference between a stack and a queue is that, without further control, a
queue would continuously move through the memory of a computer in the
direction of higher addresses. One way to limit the queue to a fixed region in
memory is to use a circular buffer. Let us assume that memory addresses from
BEGINNING to END are assigned to the queue.

The first entry in the queue is entered into location BEGINNING, and successive
entries are appended to the queue by entering them at successively higher addresses.
By the time the back of the queue reaches END, space will have been created at the
beginning if some items have been removed from the queue. Hence, the back
pointer is reset to the value BEGINNING and the process continues. As in the case of
a stack, care must be taken to detect when the region assigned to the data structure is
either completely full or completely empty.
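The following is a minimal Python sketch of such a circular buffer (illustrative only; the region bounds and the count variable used to distinguish a full queue from an empty one are assumptions of this example):

BEGINNING, END = 0, 8          # memory region assigned to the queue (assumed)
buf = [None] * END
front = back = BEGINNING       # two pointers track the two ends
count = 0                      # distinguishes a full queue from an empty one

def enqueue(item):
    global back, count
    assert count < END - BEGINNING, "queue is full"
    buf[back] = item
    back = BEGINNING + (back + 1 - BEGINNING) % (END - BEGINNING)  # wrap past END
    count += 1

def dequeue():
    global front, count
    assert count > 0, "queue is empty"
    item = buf[front]
    front = BEGINNING + (front + 1 - BEGINNING) % (END - BEGINNING)
    count -= 1
    return item

for ch in "ABC":
    enqueue(ch)
print(dequeue(), dequeue())    # prints: A B (first in, first out)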

Subroutines
In a given program, it is often necessary to perform a particular subtask
many times on different data values. Such a subtask is usually called a subroutine. For
example, a subroutine may evaluate the sine function or sort a list of values into
increasing or decreasing order.
It is possible to include the block of instructions that constitute a
subroutine at every place where it is needed in the program. However, to save
space, only one copy of the instructions that constitute the subroutine is placed in
the memory, and any program that requires the use of the subroutine simply
branches to its starting location. When a program branches to a subroutine we say
that it is calling the subroutine.
The instruction that performs this branch operation is named a Call instruction.
After a subroutine has been executed, the calling program must
resume execution, continuing immediately after the instruction that called the
subroutine. The subroutine is said to return to the program that called it by executing
a Return instruction.

The way in which a computer makes it possible to call and return from
subroutines is referred to as its subroutine linkage method. The simplest
subroutine linkage method is to save the return address in a specific location,
which may be a register dedicated to this function. Such a register is called the
link register. When the subroutine completes its task, the Return instruction returns to
the calling program by branching indirectly through the link register.

The Call instruction is just a special branch instruction that performs the
following operations

• Store the contents of the PC in the link register


• Branch to the target address specified by the instruction
The Return instruction is a special branch instruction that performs the operation

Branch to the address contained in the link register

SUBROUTINE NESTING AND THE PROCESSOR STACK:-


A common programming practice, called subroutine nesting, is to have one
subroutine call another. In this case, the return address of the second call is also
stored in the link register, destroying its previous contents. Hence, it is essential
to save the contents of the link register in some other location before calling another
subroutine. Otherwise, the return address of the first subroutine will be lost.
Subroutine nesting can be carried out to any depth. Eventually, the
last subroutine called completes its computations and returns to the subroutine
that called it. The return address needed for this first return is the last one
generated in the nested call sequence. That is, return addresses are generated and
used in a last-in-first-out order. This suggests that the return addresses associated
with subroutine calls should be pushed onto a stack. A particular register is
designated as the stack pointer, SP, to be used in this operation. The stack
pointer points to a stack called the processor stack. The Call instruction pushes
the contents of the PC onto the processor stack and loads the subroutine address
into the PC. The Return instruction pops the return address from the processor stack
into the PC.
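A minimal Python sketch of this linkage (illustrative only; the return addresses 1004 and 2012 are made-up values) shows how the processor stack returns addresses in last-in-first-out order:

processor_stack = []

def call(return_address, subroutine):
    processor_stack.append(return_address)    # Call pushes the PC contents
    subroutine()                              # and branches to the subroutine

def ret():
    return processor_stack.pop()              # Return pops the address into the PC

def second():
    pass                                      # the deepest subroutine returns first

def first():
    call(2012, second)
    print("first resumes at", ret())          # prints: first resumes at 2012

call(1004, first)
print("main resumes at", ret())               # prints: main resumes at 1004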

PARAMETER PASSING:-
When calling a subroutine, a program must provide to the subroutine the
parameters, that is, the operands or their addresses, to be used in the computation.
Later, the subroutine returns other parameters, in this case, the results of the
computation. This exchange of information between a calling program and a subroutine
is referred to as parameter passing. Parameter passing may be accomplished in
several ways. The parameters may be placed in registers or in memory locations,
where they can be accessed by the subroutine. Alternatively, the parameters may be
placed on the processor stack used for saving the return address.

The purpose of the subroutine is to add a list of numbers. Instead of passing the actual list entries, the calling program passes the address of the first number in the list. This technique is called passing by reference. The second parameter is passed by value, that is, the actual number of entries, n, is passed to the subroutine.

THE STACK FRAME:-


Now, observe how space is used in the stack in the example. During execution of the subroutine, six locations at the top of the stack contain entries that are needed by the subroutine. These locations constitute a private workspace for the subroutine, created at the time the subroutine is entered and freed up when the subroutine returns control to the calling program. Such space is called a stack frame.

In addition to the stack pointer SP, it is useful to have another pointer


register, called the frame pointer (FP), for convenient access to the parameters
passed to the subroutine and to the local memory variables used by the
subroutine. These local variables are only used within the subroutine, so it is
appropriate to allocate space for them in the stack frame associated with the
subroutine. We assume that four parameters are passed to the subroutine, three local
variables are used within the subroutine, and registers R0 and R1 need to be saved
because they will also be used within the subroutine.

The pointers SP and FP are manipulated as the stack frame is built, used, and
dismantled for a particular invocation of the subroutine. We begin by assuming that SP points to
the old top-of-stack (TOS) element in fig b. Before the subroutine is called, the
calling program pushes the four parameters onto the stack. The call instruction is then
executed, resulting in the return address being pushed onto the stack. Now, SP points to
this return address, and the first instruction of the subroutine is about to be
executed. This is the point at which the frame pointer FP is set to contain the proper
memory address. Since FP is usually a general-purpose register, it may contain
information of use to the Calling program. Therefore, its contents are saved by pushing
them onto the stack. Since the SP now points to this position, its contents are copied
into FP.

Thus, the first two instructions executed in the subroutine are

Move FP, -(SP)


Move SP, FP

After these instructions are executed, both SP and FP point to the saved FP contents. Space for the three local variables is now allocated on the stack by executing the instruction

Subtract #12, SP

Finally, the contents of processor registers R0 and R1 are saved by pushing them
onto the stack. At this point, the stack frame has been set up as shown in the fig.

The subroutine now executes its task. When the task is completed, the subroutine
pops the saved values of R1 and R0 back into those registers, removes the local variables
from the stack frame by executing the instruction

Add #12, SP

and pops the saved old value of FP back into FP. At this point, SP points to the
return address, so the Return instruction can be executed, transferring control back to the
calling program.
1.19 Logic instructions

Logic operations such as AND, OR, and NOT, applied to individual bits, are the basic
building blocks of digital circuits, as described earlier. It is also useful to be able to perform logic operations in software, which is done using instructions that apply these operations to all bits of a word or byte independently and in parallel. For example, the instruction

Not dst

complements all bits contained in the destination operand dst.

SHIFT AND ROTATE INSTRUCTIONS:-


There are many applications that require the bits of an operand to be shifted
right or left some specified number of bit positions. The details of how the shifts are
performed depend on whether the operand is a signed number or some more general
binary-coded information. For general operands, we use a logical shift. For a
number, we use an arithmetic shift, which preserves the sign of the number.

Logical shifts:-
Two logical shift instructions are needed, one for shifting left (LShiftL) and
another for shifting right (LShiftR). These instructions shift an operand over a number
of bit positions specified in a count operand contained in the instruction. The
general form of a logical left shift instruction is

LShiftL count, dst

Rotate Operations:-
In the shift operations, the bits shifted out of the operand are lost, except
for the last bit shifted out which is retained in the Carry flag C. To preserve all
bits, a set of rotate instructions can be used. They move the bits that are shifted out
of one end of the operand back into the other end. Two versions of both the left and right rotate instructions are usually provided: one rotates only the bits of the operand, and the other includes the Carry flag C in the rotation.
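A minimal Python sketch of these operations (8-bit operands and the function names are assumptions of this example): the logical shift delivers the last bit shifted out into a modeled Carry flag, while the plain rotate feeds the bits shifted out of one end back into the other end:

MASK = 0xFF                                # 8-bit operands assumed

def lshiftl(x, count):                     # LShiftL count, dst
    carry = (x >> (8 - count)) & 1         # last bit shifted out goes to C
    return (x << count) & MASK, carry

def rotl(x, count):                        # rotate left (without the C flag)
    count %= 8
    return ((x << count) | (x >> (8 - count))) & MASK

x = 0b10011010
value, carry = lshiftl(x, 2)
print(bin(value), carry)                   # prints: 0b1101000 0
print(bin(rotl(x, 2)))                     # prints: 0b1101010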

Part-A
2 MARKS QUESTIONS WITH ANSWERS
1. Define a BUS and its types?
Ans: These are the wires or electronic pathways that join various components together so that they can communicate with each other. This network of wires or electronic pathways is known as a 'BUS'. Thus, a BUS is simply a set of wires or lines that connects various components inside a computer.
A computer's BUS can be divided into two types:
• Internal Bus
• External Bus

2. Define Register and its types?


Ans: Registers are used to quickly accept, store, and transfer data and instructions that are being used immediately by the CPU. There are various types of registers, each used for a specific purpose. Among the most commonly used registers are the Accumulator (AC), the Data Register (DR), the Address Register (AR), the Program Counter (PC), the Memory Data Register (MDR), the Index Register, and the Memory Buffer Register (MBR).
3. What is RAM?
Ans: A random-access memory device allows data items to be accessed (read or written) in
almost the same amount of time irrespective of the physical location of data inside the memory.
In contrast, with other direct-access data storage media such as hard disks, CD-RWs, DVD-
RWs and the older drum memory, the time required to read and write data items varies
significantly depending on their physical locations on the recording medium, due to mechanical
limitations such as media rotation speeds and arm movement.
4. What is ROM?
Ans: Read-only memory (ROM) is a type of non-volatile memory used in computers and other
electronic devices. Data stored in ROM can only be modified slowly, with difficulty, or not at all,
so it is mainly used to store firmware (software that is closely tied to specific hardware and
unlikely to need frequent updates).
5. What is the difference between RAM & ROM?
Ans: RAM is Random Access Memory.
ROM is Read Only Memory.
RAM is the memory available for the operating system, programs and processes to
use when the computer is running.
ROM is the memory that comes with your computer that is pre-written to hold the
instructions for booting-up the computer.

RAM requires a flow of electricity to retain data (e.g. the computer powered on).
ROM will retain data without the flow of electricity (e.g. when computer is powered
off).
RAM is a type of volatile memory. Data in RAM is not permanently written. When you
power off your computer the data stored in RAM is deleted.
ROM is a type of non- volatile memory. Data in ROM is permanently written and is not
erased when you power off your computer.
6. What are the types of RAM?
Ans: There are different types of RAM, including DRAM (Dynamic Random Access Memory)
and SRAM (Static Random Access Memory).
Dynamic RAM: loses its stored information in a very short time (a few milliseconds) even when the power supply is on, unless it is periodically refreshed. DRAMs are cheaper and slower.
Static RAM uses a completely different technology. SRAM retains stored information only as long as the power supply is on. Static RAMs are costlier and consume more power.
7. What is Flash EEPROM?
Ans: EEPROM (Electrically Erasable PROM): the chip can be erased and reprogrammed on the board easily, byte by byte. It can be erased within a few milliseconds. There is a limit on the number of times EEPROMs can be reprogrammed, usually around 10,000 times.
8. Define addressing Mode and its types?
Ans: The various types of addressing modes for microprocessors are as follows:

- Direct Mode: In this mode the instruction itself contains the memory address of the operand, which the CPU can access directly.

- Indirect Mode: In this mode the address specified within the instruction contains the address of the operand.

- Register direct and indirect modes: In these modes, as the name suggests, no memory addresses are specified; registers are specified instead.

- Immediate Mode: In this mode the operand itself is the data, contained within the instruction.

- Relative Mode: In this mode the operand specified in the instruction is an offset from the program counter, not the actual address.
9. What are the functional units of a computer?
Ans: A computer consists of five main functional units:
1) input unit
2) memory unit
3) arithmetic and logic unit (ALU)
4) output unit
5) control unit.

10. What is control unit?


Ans: The control unit (CU) is a component of a computer's central processing unit (CPU) that
directs operation of the processor. It tells the computer's memory, arithmetic/logic unit and
input and output devices how to respond to a program's instructions.

11. What are the examples for input and output devices?
Input Devices Output Devices
1.Digital camera 1. Monitor
2. Gamepad 2. Printer
3. Joystick 3. Plotters
4. Keyboard 4. Projector
5. Microphone 5. LCD Projection Panels

12. Write short notes on Memory


Ans: Memory is an essential element of a computer. Without its memory, a computer is of
hardly any use. Memory plays an important role in saving and retrieving data. The performance
of the computer system depends upon the size of the memory. Memory is of the following types:
1. Primary Memory / Volatile Memory.
2. Secondary Memory / Non Volatile Memory.

IMPORTANT QUESTIONS
2 MARKS
1. Differentiate between computer organization and architecture? [Jun 2015]
2. List any five assembly language instructions? [Jun 2015]
3. What are the phases in instruction cycle? [Jun 2015]
4. Describe basic function of assembler and compiler? [Dec 2015]
5. What is use of program counter? [Dec 2015]

Part-B
10 MARKS
1. List and explain the functional units of a computer with a neat diagram? [Jun 2015]
2. Explain the computer levels of programming languages? [Jun 2015]
3. Draw and explain Von Neumann Architecture? [Dec 2015]
4. Describe following instructions with suitable examples? [Dec 2015]
(a) RAR (b) ADC (c) JZ (d) PUSH (e) SHL
5. (A). How the data can be read from memory? Explain with timing diagrams for memory read
and memory write operations? [Dec 2015]
(B). Explain briefly about three categories of computer programming languages?
6. (A). Explain briefly about the two commonly used organizations for multi-byte data organization?
(B). What are the different addressing modes used in assembly language instructions? [Dec 2015]

UNIT-2
ARITHMETIC AND BASIC PROCESSING UNIT

2.1 ADDITION AND SUBTRACTION OF SIGNED NUMBERS:

In figure-1, the function table of a full adder is shown; sum and carry-out are the outputs for adding equally weighted bits xi and yi of two numbers X and Y. The logic expressions for these functions are also shown, along with an example of addition of the 4-bit unsigned numbers 7 and 6. Note that each stage of the addition process must accommodate a carry-in bit. We use ci to represent the carry-in to the ith stage, which is the same as the carry-out from the (i-1)th stage.
The logic expression for si in Figure-1 can be implemented with a 3-input XOR gate.
The carryout function, ci +1 is implemented with a two-level AND-OR logic
circuit. A convenient symbol for the complete circuit for a single stage of addition,
called a full adder (FA), is as shown in the figure-1a.
A cascaded connection of n such full-adder blocks, as shown in Figure 1b, forms a parallel adder and can be used to add two n-bit numbers. Since the carries must propagate, or ripple, through this cascade, the configuration is called an n-bit ripple-carry adder.
The carry-in, c0, into the least-significant-bit (LSB) position [1st stage] provides a convenient means of adding 1 to a number. For instance, forming the 2's-complement of a number involves adding 1 to the 1's-complement of the number. The carry signals are also useful for interconnecting k adders to form an adder capable of handling input numbers that are kn bits long, as shown in Figure-1c.

FIG: Addition of binary vectors.

FIG: 4 - Bit parallel Adder.

DESIGN OF FAST ADDERS:

In an n-bit parallel adder (ripple-carry adder), there is too much delay in developing the outputs, s0 through sn-1 and cn. On many occasions this delay is not acceptable in comparison with the speed of other processor components and the speed of data transfer between registers and cache memories.
through a network depends on the integrated circuit technology used in fabricating
the network and on the number of gates in the paths from inputs to outputs
(propagation delay). The delay through any combinational logic network
constructed from gates in a particular technology is determined by adding up the
number of logic-gate delays along the longest signal propagation path through the
network. In the case of the n-bit ripple-carry adder, the longest path is from inputs
x0, y0, and c0 at the least-significant-bit (LSB) position to outputs cn and sn-1 at the
most-significant-bit (MSB) position.
Using the logic implementation indicated in Figure-1, cn-1 is available in 2(n-1) gate delays, and sn-1 is one XOR gate delay later. The final carry-out, cn, is available after 2n gate delays. Therefore, if a ripple-carry adder is used to implement the addition/subtraction unit shown in Figure-3, all sum bits are available in 2n gate delays, including the delay through the XOR gates on the Y input. Using the implementation cn ⊕ cn-1 for overflow, this indicator is available after 2n+2 gate delays. In summary, in a

parallel adder an nth stage adder can not complete the addition process before
all its previous stages have completed the addition even with input bits ready. This
is because, the carry bit from previous stage has to be made available for
addition of the present stage.

In practice, a number of design techniques have been used to implement


high- speed adders. In order to reduce this delay in adders, an augmented logic gate
network structure may be used. One such method is to use circuit designs for fast
propagation of carry signals (carry prediction).

Carry-Look ahead Addition:

As it is clear from the previous discussion that a parallel adder is


considerably slow & a fast adder circuit must speed up the generation of the
carry signals, it is necessary to make the carry input to each stage readily available
along with the input bits. This can be achieved either by propagating the previous carry
or by generating a carry depending on the input bits and the previous carry. The logic expressions for si (sum) and ci+1 (carry-out) of stage i are

si = xi ⊕ yi ⊕ ci
ci+1 = xi yi + xi ci + yi ci = Gi + Pi ci,  where Gi = xi yi and Pi = xi + yi
The above expressions Gi and Pi are called carry generate and propagate
functions for stage i. If the generate function for stage i is equal to 1, then ci+1 = 1,
independent of the input carry, ci. This occurs when both xi and yi are 1. The
propagate function means that an input carry will produce an output carry when
either xi or yi or both equal to 1. Now, using Gi & Pi functions we can decide carry
for ith stage even before its previous stages have completed their addition
operations. All Gi and Pi functions can be formed independently and in parallel in
only one gate delay after the Xi and Yi inputs are applied to an n-bit adder. Each bit
stage contains an AND gate to form Gi, an OR gate to form Pi
and a three-input XOR gate to form si. However, a much simpler circuit can be derived by considering the propagate function Pi = xi ⊕ yi, which differs from Pi = xi + yi only when xi = yi = 1, where Gi = 1 (so it does not matter whether Pi is 0 or 1). Then, the basic diagram in Figure-5 can be used in each bit stage to predict the carry ahead of any stage completing its addition.
Consider the ci+1 expression:

ci+1 = Gi + Pi Gi-1 + Pi Pi-1 ci-1

This is because ci = Gi-1 + Pi-1 ci-1.

Further, ci-1 = Gi-2 + Pi-2 ci-2, and so on. Expanding in this fashion, the final carry expression can be written as

ci+1 = Gi + Pi Gi-1 + Pi Pi-1 Gi-2 + ... + Pi Pi-1 ... P1 G0 + Pi Pi-1 ... P0 c0

Thus, all carries can be obtained in three gate delays after the input signals xi, yi, and c0 are applied at the inputs. This is because only one gate delay is needed to develop all Pi and Gi signals, followed by two gate delays in the AND-OR circuit (SOP expression) for ci+1. After a further XOR gate delay, all sum bits are available.

Therefore, independent of n, the number of stages, the n-bit addition process
requires only four gate delays.

Now, consider the design of a 4-bit parallel adder. The carries can be implemented as

c1 = G0 + P0c0                                          ; i = 0
c2 = G1 + P1G0 + P1P0c0                                 ; i = 1
c3 = G2 + P2G1 + P2P1G0 + P2P1P0c0                      ; i = 2
c4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0         ; i = 3
The complete 4-bit adder is shown in Figure 5b, where the B cell indicates the Gi, Pi and si generator. The carries are implemented in the block labeled carry look-ahead logic. An adder implemented in this form is called a carry look-ahead adder. Delay through the adder is 3 gate delays for all carry bits and 4 gate delays for all sum bits. In comparison, note that a 4-bit ripple-carry adder requires 7 gate delays (2n-1) for s3 and 8 gate delays (2n) for c4.

If we try to extend the carry look-ahead adder of Figure 5b to longer operands, we run into a problem of gate fan-in constraints. From the final expression for ci+1 and the carry expressions for the 4-bit adder, we see that the last AND gate and the OR gate require a fan-in of i + 2 in generating ci+1. For c4 (i = 3) in the 4-bit adder, a fan-in of 5 is required. This puts a limit on the practical implementation, so the adder design shown in Figure 5b cannot be directly extended to longer operand sizes. However, if we cascade a number of 4-bit adders, it is possible to build longer adders without the practical problems of fan-in. An example of a 16-bit carry look-ahead adder is shown in Figure 6. Eight 4-bit carry look-ahead adders can be connected in the same manner to form a 32-bit adder.

FIG: 16 bit carry-look ahead adder.
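A minimal Python sketch of the 4-bit carry-lookahead idea (illustrative; bit lists are ordered least-significant first): all Gi and Pi are formed first, and each carry is then computed from them. Note that hardware expands the recurrence into the two-level SOP form given above rather than evaluating it sequentially as this loop does:

def cla_4bit(x_bits, y_bits, c0=0):
    G = [x & y for x, y in zip(x_bits, y_bits)]     # generate:  Gi = xi.yi
    P = [x | y for x, y in zip(x_bits, y_bits)]     # propagate: Pi = xi + yi
    c = [c0]
    for i in range(4):
        c.append(G[i] | (P[i] & c[i]))              # ci+1 = Gi + Pi.ci
    s = [x ^ y ^ ci for x, y, ci in zip(x_bits, y_bits, c)]
    return s, c[4]                                  # sum bits and final carry c4

# 7 + 6 = 13, with bits listed LSB first:
print(cla_4bit([1, 1, 1, 0], [0, 1, 1, 0]))         # prints: ([1, 0, 1, 1], 0)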

MULTIPLICATION OF POSITIVE NUMBERS:

Consider the multiplication of two integers as in Figure-6a in binary


number system. This algorithm applies to unsigned numbers and to positive signed
numbers. The product of two n-digit numbers can be accommodated in 2n digits, so
the product of the two 4-bit numbers in this example fits into 8 bits. In the binary
system, multiplication by the multiplier bit ‘1’ means the multiplicand is entered in
the appropriate position to be added to the partial product. If the multiplier bit is
‘0’, then 0s are entered, as indicated in the third row of the shown example.

Binary multiplication of positive operands can be implemented in a


combinational (speed up) two-dimensional logic array, as shown in Figure 7. Here, M-
indicates multiplicand, Q- indicates multiplier & P- indicates partial product.
The basic component in each cell is a full adder FA. The AND gate in each cell
determines whether a multiplicand bit mj, is added to the incoming partial-product
bit, based on the value of the multiplier bit, qi. For i in the range of 0 to 3, if qi
= 1, add the multiplicand (appropriately shifted) to the incoming partial product,
PPi, to generate the outgoing partial product, PP(i+ 1) & if qi = 0, PPi is passed
vertically downward unchanged. The initial partial product PPO is all 0s. PP4 is the
desired product. The multiplicand is shifted left one position per row by the
diagonal signal path. Since the multiplicand is shifted and added to the partial
product depending on the multiplier bit, the method is referred as SHIFT & ADD

51
method. The multiplier array & the components of each bit cell are indicated in
the diagram, while the flow diagram shown explains the multiplication
procedure.

FIG-7a P7, P6, P5,…,P0 – product.

FIG-7b

The following SHIFT & ADD method flow chart depicts the multiplication logic for
unsigned numbers.

Despite the use of a combinational network, there is a considerable
amount of delay associated with the arrangement shown. Although the preceding
combinational multiplier is easy to understand, it uses many gates for multiplying
numbers of practical size, such as 32- or 64-bit numbers. The worst case signal
propagation delay path is from the upper right corner of the array to the high-order
product bit output at the bottom left corner of the array. The path includes the two
cells at the right end of each row, followed by all the cells in the bottom row.
Assuming that there are two gate delays from the inputs to the outputs of a full
adder block, the path has a total of 6(n - 1) - 1 gate delays, including the initial AND
gate delay in all cells, for the n x n array. The factor (n-1) appears in the delay expression because only the AND gates are actually needed in the first row of the array, since the incoming (initial) partial product PP0 is zero.

Multiplication can also be performed using a mixture of combinational array
techniques (similar to those shown in Figure 7) and sequential techniques requiring less
combinational logic. Multiplication is usually provided as an instruction in the
machine instruction set of a processor. High-performance processor chips (for example, DSPs) use an appreciable area of the chip to perform arithmetic functions on both integer and floating-point operands. Sacrificing area on-chip for these arithmetic circuits increases the speed of processing. Generally, processors built for real-time applications have an on-chip multiplier.

Another simplest way to perform multiplication is to use the adder circuitry in
the ALU for a number of sequential steps. The block diagram in Figure 8a
shows the hardware arrangement for sequential multiplication. This circuit
performs multiplication by using single n-bit adder n times to implement the spatial
addition performed by the n rows of ripple-carry adders. Registers A and Q are combined to hold PPi, while the multiplier
bit qi generates the signal Add/No-add. This signal controls the addition of the
multiplicand M to PPi to generate PP(i + 1). The product is computed in n cycles. The
partial product grows in length by one bit per cycle from the initial vector, PPO, of
n 0s in register A. The carry-out from the adder is stored in flip-flop C. To begin
with, the multiplier is loaded into register Q, the multiplicand into register M and
registers C and A are cleared to 0. At the end of each cycle C, A, and Q are shifted
right one bit position to allow for growth of the partial product as the multiplier is
shifted out of register Q. Because of this shifting, multiplier bit qi, appears at the LSB
position of Q to generate the Add/No-add signal at the correct time, starting with
qo during the first cycle, q1 during the second cycle, and so on. After they are used,
the multiplier bits are discarded by the right-shift operation. Note that the carry-out
from the adder is the leftmost bit of PP(i + 1), and it must be held in the C flip-flop
to be shifted right with the contents of A and Q. After n cycles, the high-order half of the product is held in register A and the low-order half is in register Q. The
multiplication example used above is shown in Figure 8b as it would be performed
by this hardware arrangement.
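A minimal Python sketch of this register arrangement (illustrative; m, q, and n stand for the multiplicand, the multiplier, and the word length), performing n Add/No-add and shift-right cycles on C, A, and Q:

def sequential_multiply(m, q, n=4):
    a, c = 0, 0                                # registers A and C start cleared
    for _ in range(n):
        if q & 1:                              # q0 generates the Add/No-add signal
            total = a + m
            a, c = total & ((1 << n) - 1), total >> n
        q = (q >> 1) | ((a & 1) << (n - 1))    # shift C, A, and Q right one position
        a = (a >> 1) | (c << (n - 1))
        c = 0
    return (a << n) | q                        # high half in A, low half in Q

print(sequential_multiply(13, 11))             # prints: 143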

Using this sequential hardware structure, it is clear that a multiply instruction


takes much more time to execute than an Add instruction. This is because of the
sequential circuits involved in the multiplier arrangement. Several techniques have been
used to speed up multiplication; bit pair recoding, carry save addition, repeated addition,
etc.

SIGNED-OPERAND MULTIPLICATION:

Multiplication of 2's-complement signed operands, generating a double-length


product is still achieved by accumulating partial products by adding versions of the
multiplicand as decided by the multiplier bits. First, consider the case of a
positive multiplier and a negative multiplicand. When we add a negative
multiplicand to a partial product, we must extend the sign-bit value of the multiplicand
to the left as far as the product will extend. In Figure 9, for example, the 5-bit signed operand -13 is the multiplicand, +11 is the 5-bit multiplier, and the expected 10-bit product is -143. The sign extension of the multiplicand is shown in red. Thus, the hardware discussed earlier can be used for negative multiplicands if it provides for sign extension of the partial products.

For a negative multiplier, a straightforward solution is to form the 2's-
complement of both the multiplier and the multiplicand and proceed as in the case of a
positive multiplier. This is possible because complementation of both operands does
not change the value or the sign of the product. In order to take care of both
negative and positive multipliers, BOOTH algorithm can be used.

          1 0 0 1 1   (-13)
        x 0 1 0 1 1   (+11)
1 1 1 1 1 1 0 0 1 1
1 1 1 1 1 0 0 1 1
0 0 0 0 0 0 0 0
1 1 1 0 0 1 1
0 0 0 0 0 0
1 1 0 1 1 1 0 0 0 1   (-143)

Booth Algorithm
The Booth algorithm generates a 2n-bit product and both positive and negative
2's-complement n-bit operands are uniformly treated. To understand this
algorithm, consider a multiplication operation in which the multiplier is positive
and has a single block of 1s, for example, 0011110(+30). To derive the product, as in
the normal standard procedure, we could add four appropriately shifted
versions of the multiplicand,. However, using the Booth algorithm, we can reduce
the number of required operations by
regarding this multiplier as the difference between numbers 32 & 2 as shown
below;

0 1 0 0 0 0 0 (32)
0 0 0 0 0 1 0 (-2)
0 0 1 1 1 1 0 (30)

This suggests that the product can be generated by adding 2^5 times the multiplicand to the 2's-complement of 2^1 times the multiplicand. For convenience, we can describe the sequence of required operations by recoding the preceding multiplier as 0 +1 0 0 0 -1 0. In general, in the Booth scheme, -1 times the shifted multiplicand is selected when moving from 0 to 1, and +1 times the shifted multiplicand is selected when moving from 1 to 0, as the multiplier is scanned from right to left.

Figure 10 illustrates the normal and the Booth algorithms for the said
example. The Booth algorithm clearly extends to any number of blocks of 1s in a
multiplier, including the situation in which a single 1 is considered a block. See
Figure 11a for another example of recoding a multiplier. The case when the least
significant bit of the multiplier is 1 is handled by assuming that an implied 0 lies to
its right. The Booth algorithm can also be used directly for negative multipliers, as
shown in Figure 11a.

The correctness of the Booth algorithm for negative multipliers can be verified by using the properties of negative-number representations in the 2's-complement system.
The Booth technique for recoding multipliers is summarized in Figure 12. The transformation 011...110 => +100...0-10 is called skipping over 1s. This term is derived from the case in which the multiplier has its 1s grouped into a few contiguous blocks. Only a few versions of the shifted multiplicand (the summands) must be added to generate the product, thus speeding up the multiplication operation. However, in the worst case, that of alternating 1s and 0s in the multiplier, each bit of the multiplier selects a summand. In fact, this results in more summands than if the Booth algorithm were not used. A 16-bit worst-case multiplier, an ordinary multiplier, and a good multiplier are shown in Figure 12a.

Fig 12: Booth multiplier recoding table.

Fig 12a: Booth-recoded multipliers.

The Booth algorithm has two attractive features. First, it handles both positive and negative multipliers uniformly. Second, it achieves some efficiency in the number of additions required when the multiplier has a few large blocks of 1s. The speed gained by skipping over 1s depends on the data. On average, the speed of doing multiplication with the Booth algorithm is the same as with the normal algorithm.
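A minimal Python sketch of the recoding step (illustrative): the multiplier is scanned from right to left with an implied 0 to the right of the LSB; a 1-to-0 boundary selects +1 and a 0-to-1 boundary selects -1:

def booth_recode(bits):                    # multiplier bits, listed MSB first
    recoded, prev = [], 0                  # implied 0 to the right of the LSB
    for b in reversed(bits):
        recoded.append(prev - b)           # (0 to 1) gives -1, (1 to 0) gives +1
        prev = b
    return list(reversed(recoded))

print(booth_recode([0, 0, 1, 1, 1, 1, 0]))     # prints: [0, 1, 0, 0, 0, -1, 0] (+30)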

FAST MULIPLICATION:

There are two techniques for speeding up the multiplication operation. The first
technique guarantees that the maximum number of summands (versions of the
multiplicand) that must be added is n/2 for n-bit operands. The second technique reduces
the time needed to add the summands (carry-save addition of summands
method).

Bit-Pair Recoding of Multipliers:

This bit-pair recoding technique halves the maximum number of summands. It is derived from the Booth algorithm. Group the Booth-recoded multiplier bits in pairs, and observe the following: the pair (+1 -1) is equivalent to the pair (0 +1). That is, instead of adding -1 times the multiplicand M at shift position i to +1 x M at position i+1, the same result is obtained by adding +1 x M at position i. Other examples are: (+1 0) is equivalent to (0 +2), (-1 +1) is equivalent to (0 -1), and so on. Thus, if the Booth-recoded multiplier is examined two bits at a time, starting from the right, it can be rewritten in a form that requires at most one version of the multiplicand to be added to the partial product for each pair of multiplier bits.

A table can be drawn up giving the multiplicand selection decisions for all possibilities. The multiplication operation of Figure 11a is shown in Figure 15. It is clear from the example that the bit-pair recoding method requires only n/2 summands, as against n summands in Booth's algorithm.

FIG – 15
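A minimal Python sketch of the pairing step (illustrative): each pair (hi, lo) of Booth-recoded digits, taken two at a time starting from the left, collapses into the single radix-4 multiplier 2*hi + lo, which selects 0, +/-1, or +/-2 times the multiplicand:

def bit_pair_recode(booth_digits):         # Booth digits, MSB first, even length
    pairs = []
    for i in range(0, len(booth_digits), 2):
        hi, lo = booth_digits[i], booth_digits[i + 1]
        pairs.append(2 * hi + lo)          # each pair selects 0, +/-1, or +/-2 times M
    return pairs

# The Booth-recoded form of 110110 (-10) is 0 -1 +1 0 -1 0:
print(bit_pair_recode([0, -1, +1, 0, -1, 0]))    # prints: [-1, 2, -2], i.e. n/2 summands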

INTEGER DIVISION:

Positive-number multiplication was discussed both as it is done manually and as it is done in a logic circuit. A similar approach can be used to discuss integer division. First, consider positive-number division. Figure 16 shows an example of decimal division, 274 divided by 13, and its binary counterpart. First, we try to divide 13 into 2, and it does not work. Next, we try to divide 13 into 27. Going through the trials, we enter 2 as the first quotient digit and perform the required subtraction. The next digit of the dividend, 4, is brought down, and we finish by deciding that 13 goes into 14 once, giving the quotient 21 with remainder 1. Binary division is similar, with the quotient bits restricted to 0 and 1.
A circuit that implements division by this longhand method operates as follows: It positions the divisor appropriately with respect to the dividend and performs a subtraction. If the remainder is zero or positive, a quotient bit of 1 is determined, the remainder is extended by another bit of the dividend, the divisor is repositioned, and another subtraction is performed. On the other hand, if the remainder is negative, a quotient bit of 0 is determined, the dividend is restored by adding back the divisor, and the divisor is repositioned for another subtraction.

FIG – 17: Binary Division

Restoring Division:

Figure 17 shows a logic circuit arrangement that implements restoring division. Note
its similarity to the structure for multiplication that was shown in Figure 8. An n-bit
positive divisor is loaded into register M and an n-bit positive dividend is loaded into
register Q at the start of the operation. Register A is set to 0. After the division is
complete, the n-bit quotient is in register Q and the remainder is in register A. The
required subtractions are facilitated by using 2's-complement arithmetic. The extra
bit position at the left end of both A and M accommodates the sign bit during
subtractions. The following algorithm performs restoring division.
Do the following n times:
1. Shift A and Q left one binary position.
2. Subtract M from A, and place the answer back in A.
3. If the sign of A is 1, set q0 to 0 and add M back to A (that is, restore A); otherwise, set q0 to 1.

FIG – 18: Restoring Division
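A minimal Python sketch of the restoring algorithm above (illustrative; the operands are small positive integers and n is the word length):

def restoring_divide(dividend, divisor, n=4):
    a, q, m = 0, dividend, divisor         # registers A, Q, and M
    for _ in range(n):
        a = (a << 1) | (q >> (n - 1))      # 1. shift A and Q left one position
        q = (q << 1) & ((1 << n) - 1)
        a -= m                             # 2. subtract M from A
        if a < 0:                          # 3. sign of A is 1: restore A, q0 = 0
            a += m
        else:
            q |= 1                         #    otherwise set q0 = 1
    return q, a                            # quotient in Q, remainder in A

print(restoring_divide(8, 3))              # prints: (2, 2), since 8 = 3*2 + 2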

Non-restoring Division:

The restoring-division algorithm can be improved by avoiding the need for


restoring A after an unsuccessful subtraction. Subtraction is said to be unsuccessful if
the result is negative. Consider the sequence of operations that takes place after
the subtraction operation in the preceding algorithm. If A is positive, we shift left and
subtract M, that is, we perform 2A - M. If A is negative, we restore it by performing
A + M, and then we shift it left and subtract M. This is equivalent to performing
2A + M. The q0 bit is appropriately set to 0 or 1 after the correct operation
has been performed. We can summarize this in the following algorithm for no
restoring division.
Step 1: Do the following n times:
1. If the sign of A is 0, shift A and Q left one bit position and subtract M from A; otherwise, shift A and Q left and add M to A.
2. Now, if the sign of A is 0, set q0 to 1; otherwise, set q0 to 0.
Step 2: If the sign of A is 1, add M to A.

Step 2 is needed to leave the proper positive remainder in A at the end of
the n cycles of Step 1. The logic circuitry in Figure 17 can also be used to
perform this algorithm. Note that the Restore operations are no longer needed,
and that exactly one Add or Subtract operation is performed per cycle. Figure 19 shows how the division example in Figure 18 is executed by the non-restoring division algorithm.
There are no simple algorithms for directly performing division on signed operands
that are comparable to the algorithms for signed multiplication. In division, the
operands can be preprocessed to transform them into positive values. After using
one of the algorithms
just discussed, the results are transformed to the correct signed values, as necessary.

FIG – 19: Non-restoring Division
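A minimal Python sketch of the non-restoring algorithm (illustrative, with the same register naming as the restoring sketch); exactly one add or subtract happens per cycle, with a single corrective add at the end if A is negative:

def nonrestoring_divide(dividend, divisor, n=4):
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        msb = q >> (n - 1)                 # bit shifted from Q into A
        q = (q << 1) & ((1 << n) - 1)
        a = (a << 1) | msb
        a = a - m if a >= 0 else a + m     # subtract if the sign of A is 0, else add
        if a >= 0:
            q |= 1                         # set q0 = 1 when the new A is non-negative
    if a < 0:                              # Step 2: one final restoring add
        a += m
    return q, a

print(nonrestoring_divide(8, 3))           # prints: (2, 2)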

FLOATING-POINT NUMBERS AND OPERATIONS:

Floating – point arithmetic is an automatic way to keep track of the radix


point. The discussion so far was exclusively with fixed-point numbers which are
considered as integers, that is, as having an implied binary point at the right end
of the number. It is also possible to assume that the binary point is just to the right of
the sign bit, thus representing a fraction, or anywhere else, resulting in real numbers. In
the 2's-complement system, the signed value F, represented by the n-bit binary fraction

B = b0 . b-1 b-2 ... b-(n-1)

is given by

F(B) = -b0 x 2^0 + b-1 x 2^-1 + b-2 x 2^-2 + ... + b-(n-1) x 2^-(n-1)

where the range of F is

-1 ≤ F ≤ 1 - 2^-(n-1)

Consider the range of values representable in a 32-bit, signed, fixed-point format. Interpreted as integers, the value range is approximately 0 to ±2.15 x 10^9. If we consider them to be fractions, the range is approximately ±4.55 x 10^-10 to ±1.


Neither of these ranges is sufficient for scientific calculations, which might involve parameters like Avogadro's number (6.0247 x 10^23 mole^-1) or Planck's constant (6.6254 x 10^-27 erg s). Hence, we need to easily accommodate both very large
integers and very small fractions. To do this, a computer must be able to represent
numbers and operate on them in such a way that the position of the binary point is
variable and is automatically adjusted as computation proceeds. In such a case, the
binary point is said to float, and the numbers are called floating-point numbers. This
distinguishes them from fixed-point numbers, whose binary point is always in the
same position.
Because the position of the binary point in a floating-point number is
variable, it must be given explicitly in the floating-point representation. For example,
in the familiar decimal scientific notation, numbers may be written as 6.0247 x 10^23, 6.6254 x 10^-27, -1.0341 x 10^2, -7.3000 x 10^-14, and so on. These numbers are said to be given to five significant digits. The scale factors (10^23, 10^-27, and so on) indicate the position of the decimal point with respect to the significant digits.
By convention, when the decimal point is placed to the right of the first (nonzero)
significant digit, the number is said to be normalized. Note that the base, 10, in the
scale factor is fixed and does not need to appear explicitly in the machine
representation of a floating-point number. The sign, the significant digits, and the
exponent in the scale factor constitute the representation. We are thus motivated

to define a floating-point number representation as one in which a number is
represented by its sign, a string of significant digits, commonly called the
mantissa, and an exponent to an implied base for the scale factor.
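As a small illustration (a Python sketch using the standard-library frexp function; the value chosen is arbitrary), any nonzero value can be split into exactly these components: a sign, a mantissa, and an exponent for an implied base of 2:

import math

def decompose(x):
    sign = 0 if math.copysign(1.0, x) > 0 else 1
    mantissa, exponent = math.frexp(abs(x))   # abs(x) = mantissa * 2**exponent
    return sign, mantissa, exponent           # with 0.5 <= mantissa < 1

print(decompose(-6.625))    # prints: (1, 0.828125, 3), i.e. -0.828125 x 2^3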

BASIC PROCESSING UNIT

The heart of any computer is the central processing unit (CPU). The CPU
executes all the machine instructions and coordinates the activities of all other
units during the execution of an instruction. This unit is also called as the
Instruction Set Processor (ISP). By looking at its internal structure, we can
understand how it performs the tasks of fetching, decoding, and executing
instructions of a program. The processor is generally called as the central processing
unit (CPU) or micro processing unit (MPU).An high-performance processor can be built
by making various functional units operate in parallel. High-performance processors
have a pipelined organization where the execution of one instruction is started
before the execution of the preceding instruction is completed. In another
approach, known as superscalar operation, several instructions are fetched and
executed at the same time. Pipelining and superscalar architectures provide a very
high performance for any processor.
A typical computing task consists of a series of steps specified by a sequence of
machine instructions that constitute a program. A program is a set of instructions performing a meaningful task. An instruction is a command to the processor and is executed by carrying out a sequence of sub-operations called micro-operations. Figure 1 indicates various blocks of a typical processing unit. It consists of
the PC, IR, ID, MAR, MDR, a set of register arrays for temporary storage, and a Timing and Control unit as its main units.

FUNDAMENTAL CONCEPTS:

Execution of a program by the processor starts with the fetching of


instructions one at a time, decoding the instruction and performing the operations
specified. From memory, instructions are fetched from successive locations until a
branch or a jump instruction is encountered. The processor keeps track of the
address of the memory location containing the next instruction to be fetched using
the program counter (PC) or Instruction Pointer (IP). After fetching an instruction, the
contents of the PC are updated to point to the next instruction in the sequence. But,
when a branch instruction is to be executed, the PC will be loaded with a different address (the jump/branch address).

Fig-1

Instruction register, IR is another key register in the processor, which is


used to hold the op-codes before decoding. IR contents are then transferred to
an instruction decoder (ID) for decoding. The decoder then informs the control unit
about the task to be executed. The control unit along with the timing unit
generates all necessary control signals needed for the instruction execution.
Suppose that each instruction comprises 2 bytes, and that it is stored in one
memory word. To execute an instruction, the processor has to perform the following steps:
1. Fetch the contents of the memory location pointed to by the PC. The contents of this location are interpreted as an instruction code to be executed. Hence, they are loaded into the IR/ID. Symbolically, this operation can be written as

IR ← [[PC]]

2. Assuming that the memory is byte addressable, increment the contents of the PC by 2, that is,

PC ← [PC] + 2

3. Decode the instruction to understand the operation & generate the


control signals necessary to carry out the operation.

4. Carry out the actions specified by the instruction in the IR.

In cases where an instruction occupies more than one word, steps 1 and 2 must
be repeated as many times as necessary to fetch the complete instruction. These two
steps together are usually referred to as the fetch phase; step 3 constitutes the
decoding phase; and step 4 constitutes the execution phase.
To study these operations in detail, let us examine the internal organization of
the processor. The main building blocks of a processor are interconnected in a
variety of ways. A very simple organization is shown in Figure 2. A more complex
structure that provides high performance will be presented at the end.

Figure shows an organization in which the arithmetic and logic unit (ALU) and
all the registers are interconnected through a single common bus, which is
internal to the processor. The data and address lines of the external memory bus are
shown in Figure 2 connected to the internal processor bus via the memory data
register, MDR, and the memory address register, MAR, respectively. Register MDR
has two inputs and two outputs. Data may be loaded into MDR either from the
memory bus or from the internal processor bus. The data stored in MDR may be
placed on either bus. The input of MAR is connected to the internal bus, and its
output is connected to the external bus. The control lines of the memory bus are

connected to the instruction decoder and control logic block. This unit is responsible
for issuing the signals that control the operation of all the units inside the processor
and for interacting with the memory bus.
The number and use of the processor registers R0 through R(n-1) vary
considerably from one processor to another. Registers may be provided for general-
purpose use by the programmer. Some may be dedicated as special-purpose
registers, such as index registers or stack pointers. Three registers, Y, Z, and TEMP in
Figure 2, have not been mentioned before. These registers are transparent to the
programmer, that is, the programmer need not be concerned with them because
they are never referenced explicitly by any instruction. They are used by the
processor for temporary storage during execution of some instructions. These
registers are never used for storing data generated by one instruction for later
use by another instruction.
The multiplexer MUX selects either the output of register Y or a constant value 4 to be
provided as input A of the ALU. The constant 4 is used to increment the contents of
the program counter. We will refer to the two possible values of the MUX
control input Select as Select4 and SelectY, for selecting the constant 4 or register Y,
respectively.
As instruction execution progresses, data are transferred from one register to
another, often passing through the ALU to perform some arithmetic or logic
operation. The instruction decoder and control logic unit is responsible for
implementing the actions specified by the instruction loaded in the IR register. The
decoder generates the control signals needed to select the registers involved and
direct the transfer of data. The registers, the ALU, and the interconnecting bus
are collectively referred to as the data path.
With few exceptions, an instruction can be executed by performing one or more of the
following operations in some specified sequence:
1. Transfer a word of data from one processor register to another or to the ALU

2. Perform an arithmetic or a logic operation and store the result in a


processor register
3. Fetch the contents of a given memory location and load them into a
processor register
4. Store a word of data from a processor register into a given memory location

We now consider in detail how each of these operations is implemented, using the
simple processor model in Figure.
Instruction execution involves a sequence of steps in which data are transferred from
one register to another. For each register, two control signals are used to place the
contents of that register on the bus or to load the data on the bus into the register.

This is represented symbolically in Figure 3. The input and output of register Ri are
connected to the bus via switches controlled by the signals Riin and Riout
respectively. When Riin is set to 1, the data on the bus are loaded into Ri.
Similarly, when Riout is set to 1, the contents of register Ri are placed on the bus. While Riout is equal to 0, the bus can be used for transferring data from
other registers.
Suppose that we wish to transfer the contents of register R1 to register R4. This can be accomplished as follows:
1. Enable the output of register R1 by setting R1out to 1. This places the contents of R1 on the processor bus.
2. Enable the input of register R4 by setting R4in to 1. This loads data from the processor bus into register R4.
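A minimal Python sketch of this two-signal protocol (the register contents are made-up values): asserting R1out drives the bus, and asserting R4in captures the bus data at the active clock edge:

registers = {"R1": 25, "R4": 0}
bus = None

def clock_cycle(out_reg, in_reg):
    global bus
    bus = registers[out_reg]      # Riout = 1: register contents drive the bus
    registers[in_reg] = bus       # Riin = 1: bus data loaded at the clock edge

clock_cycle("R1", "R4")           # R1out and R4in are set to 1 for one cycle
print(registers)                  # prints: {'R1': 25, 'R4': 25}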
All operations and data transfers within the processor take place within time
periods defined by the processor clock. The control signals that govern a particular
transfer are asserted at the start of the clock cycle. In our example, R1out and R4in
are set to 1. The registers consist of edge-triggered flip-flops. Hence, at the next
active edge of the clock, the flip-flops that constitute R4 will load the data present
at their inputs. At the same time, the control signals R1out and R4in will return to
0. We will use this simple model of the timing of data transfers for the rest of this
chapter. However, we should point out that other schemes are possible. For
example, data transfers may use both the rising and falling edges of the clock. Also,
when edge-triggered flip-flops are not used, two or more clock signals may be
needed to guarantee proper transfer of data. This is known as multiphase
clocking.
An implementation for one bit of register Ri is shown in Figure 3 as an
example. A two-input multiplexer is used to select the data applied to the input of
an edge-triggered D flip-flop. When the control input Riin is equal to 1, the
multiplexer selects the data on the bus. This data will be loaded into the flip-flop at
the rising edge of the clock. When Riin is equal to 0, the multiplexer feeds back the
value currently stored in the flip-flop.
The Q output of the flip-flop is connected to the bus via a tri-state gate. When
Riout is equal to 0, the gate's output is in the high-impedance (electrically disconnected) state. This corresponds to the open-circuit state of a switch. When Riout = 1, the gate drives the bus to 0 or 1, depending on the value of Q.

EXECUTION OF A COMPLETE INSTRUCTION:
Let us now put together the sequence of elementary operations required to execute
one instruction. Consider the instruction

Add (R3), R1

which adds the contents of a memory location pointed to by R3 to register R1.


Executing this instruction requires the following actions:
1. Fetch the instruction.
2. Fetch the first operand (the contents of the memory location pointed to by R3).
3. Perform the addition.
4. Load the result into R1.
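Reconstructed from the textbook, the control sequence for this instruction on the single-bus organization is:

1. PCout, MARin, Read, Select4, Add, Zin
2. Zout, PCin, Yin, WMFC
3. MDRout, IRin
4. R3out, MARin, Read
5. R1out, Yin, WMFC
6. MDRout, SelectY, Add, Zin
7. Zout, R1in, End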

The listing shown above indicates the sequence of control steps required to perform these operations for the single-bus architecture of Figure 2. Instruction execution proceeds as follows. In step 1, the instruction fetch
operation is initiated by loading the contents of the PC into the MAR and sending a
Read request to the memory. The Select signal is set to Select4, which causes the
multiplexer MUX to select the constant 4. This value is added to the operand at
input B, which is the contents of the PC, and the result is stored in register Z. The
updated value is moved from register Z back into the PC during step 2, while waiting
for the memory to respond. In step 3, the word fetched from the memory is loaded
into the IR. Steps 1 through 3 constitute the instruction fetch phase, which is the same
for all instructions. The instruction decoding circuit interprets the contents of the
IR at the beginning of step 4. This enables the control circuitry to activate the
control signals for steps 4 through 7, which constitute the execution phase. The
contents of register R3 are transferred to the MAR in step 4, and a memory read
operation is initiated.

Then the contents of R1 are transferred to register Y in step 5, to prepare for the
addition operation. When the Read operation is completed, the memory
operand is available in register MDR, and the addition operation is performed in step
6. The contents of MDR are gated to the bus, and thus also to the B input of the ALU,
and register Y is selected as the second input to the ALU by choosing Select Y.
The sum is stored in register Z, then transferred to R1 in step 7. The End signal
causes a new instruction fetch cycle to begin by returning to step 1.
This discussion accounts for all the control signals in the sequence except Yin in step 2. There is no need to copy the updated contents of the PC into register Y when
executing the Add instruction. But, in Branch instructions the updated value of the
PC is needed to compute the Branch target address. To speed up the execution of
Branch instructions, this value is copied into register Y in step 2. Since step 2 is part
of the fetch phase, the same action will be performed for all instructions. This
does not cause any harm because register Y is not used for any other purpose at
that time.

Branch Instructions:
A branch instruction replaces the contents of the PC with the branch
target address. This address is usually obtained by adding an offset X, which is
given in the branch instruction, to the updated value of the PC. Listing in figure 8
below gives a control sequence that implements an unconditional branch
instruction. Processing starts, as usual, with the fetch phase. This phase ends when
the instruction is loaded into the IR in step 3. The offset value is extracted from
the IR by the instruction decoding circuit, which will also perform sign extension if
required. Since the value of the updated PC is already available in register Y, the
offset X is gated onto the bus in step 4, and an addition operation is performed.
The result, which is the branch target address, is loaded into the PC in step 5.
The offset X used in a branch instruction is usually the difference between the
branch target address and the address immediately following the branch instruction.

Fig 8

For example, if the branch instruction is at location 2000 and if the branch
target address is 2050, the value of X must be 46. The reason for this can be readily
appreciated from the control sequence in Figure 7. The PC is incremented during
the fetch phase, before knowing the type of instruction being executed. Thus, when
the branch address is computed in step 4, the PC value used is the updated value,
which points to the instruction following the branch instruction in the memory.
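A short sketch of this offset rule, using a hypothetical helper name (branch_offset is just an illustration, not a standard routine):

def branch_offset(branch_addr, target_addr, instr_len=4):
    # The offset is taken relative to the updated PC, i.e. branch_addr + 4.
    return target_addr - (branch_addr + instr_len)

print(branch_offset(2000, 2050))   # 46, as in the example above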
Consider now a conditional branch. In this case, we need to check the status of
the condition codes before loading a new value into the PC. For example, for a Branch-
on- negative (Branch<0) instruction, step 4 is replaced with

Offset-field-of-IRout, Add, Zin, If N = 0 then End


Thus, if N = 0 the processor returns to step 1 immediately after step 4. If N = 1, step 5
is performed to load a new value into the PC, thus performing the branch operation.

MULTIPLE-BUS ORGANIZATION:
The resulting control sequences shown are quite long because only one data
item can be transferred over the bus in a clock cycle. To reduce the number of
steps needed, most commercial processors provide multiple internal paths that
enable several transfers to take place in parallel.
Figure 7 depicts a three-bus structure used to connect the registers and the ALU
of a processor. All general-purpose registers are combined into a single block
called the register file. In VLSI technology, the most efficient way to implement
a number of registers is in the form of an array of memory cells similar to those used
in the implementation of random-access memories (RAMs) described in Chapter 5.
The register file in Figure 9 is said to have three ports. There are two outputs, allowing
the contents of two different registers to be accessed simultaneously and have their
contents placed on buses A and B. The third port allows the data on bus C to be
loaded into a third register during the same clock cycle.
Buses A and B are used to transfer the source operands to the A and B
inputs of the ALU, where an arithmetic or logic operation may be performed. The
result is transferred to the destination over bus C. If needed, the ALU may simply pass
one of its two input operands unmodified to bus C. We will call the ALU control
signals for such an operation R=A or R=B. The three-bus arrangement obviates the
need for registers Y and Z in Figure 2.

A second feature in Figure 9 is the introduction of the Incrementer unit, which is
used to increment the PC by 4. The source for the constant 4 at the ALU input
multiplexer is still useful. It can be used to increment other addresses, such as the
memory addresses in Load Multiple and Store Multiple instructions.

Fig: multibus organization

Consider the instruction Add R4, R5, R6, which adds the contents of registers R4 and R5 and stores the sum in R6.

The control sequence for executing this instruction is given in Figure. In step
1, the contents of the PC are passed through the ALU, using the R=B control signal, and
loaded into the MAR to start a memory read operation. At the same time the
PC is incremented by 4. Note that the value loaded into MAR is the original contents
of the PC. The incremented value is loaded into the PC at the end of the clock
cycle and will not affect the contents of MAR. In step 2, the processor waits for
MFC and loads the data received into MDR, then transfers them to IR in step 3.
Finally, the execution phase of the instruction requires only one control step to
complete, step 4.
By providing more paths for data transfer a significant reduction in the
number of clock cycles needed to execute an instruction is achieved.
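A minimal sketch of the four-step three-bus sequence for Add R4, R5, R6, under the same simplifying assumptions as the earlier single-bus sketch (toy memory and register values, MFC waiting omitted); compare the four steps here with the seven steps needed on the single bus:

memory = {0: "Add R4,R5,R6"}
reg = {"PC": 0, "R4": 10, "R5": 32, "R6": 0, "MAR": 0, "MDR": 0, "IR": None}

reg["MAR"] = reg["PC"]; reg["PC"] += 4   # step 1: PCout, R=B, MARin, Read, IncPC
reg["MDR"] = memory[reg["MAR"]]          # step 2: WMFC (memory responds)
reg["IR"] = reg["MDR"]                   # step 3: MDRoutB, R=B, IRin
reg["R6"] = reg["R4"] + reg["R5"]        # step 4: R4outA, R5outB, SelectA, Add, R6in, End
print(reg["R6"])                         # 42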

HARDWIRED CONTROL:
To execute instructions, the processor must have some means of generating the
control signals needed in the proper sequence. Computer designers use a wide
variety of techniques to solve this problem. The approaches used fall into one of
two categories: hardwired control and micro programmed control. We discuss each of
these techniques in detail, starting with hardwired control in this section.
Consider the sequence of control signals given in Figure 7. Each step in
this sequence is completed in one clock period. A counter may be used to keep
track of the control steps, as shown in Figure 11. Each state, or count, of this
counter corresponds to one control step. The required control signals are
determined by the following information:
1. Contents of the control step counter
2. Contents of the instruction register
3. Contents of the condition code flags
4. External input signals, such as MFC and interrupt requests

Fig :Hardwired control

To gain insight into the structure of the control unit, we start with a simplified
view of the hardware involved. The decoder/encoder block in Figure is a
combinational circuit that generates the required control outputs, depending on the
state of all its inputs. By separating the decoding and encoding functions, we obtain the
more detailed block diagram in Figure 12. The step decoder provides a separate signal
line for each step, or time slot, in the control sequence. Similarly, the output of the
instruction decoder consists of a separate line for each machine instruction. For any
instruction loaded in the IR, one of the output lines INS1 through INSm is set to 1, and
all other lines are set to 0. (For design details of decoders, refer to Appendix A.) The
input signals to the encoder block in Figure 12 are combined to generate the
individual control signals Yin, PCout, Add, End, and so on. An example of how the
encoder generates the Zin control signal for the processor organization in Figure 2
is given in Figure 13. This circuit implements the logic function

Zin = T1 + T6 · ADD + T4 · BR + · · ·

This signal is asserted during time slot T1 for all instructions, during T6 for an
Add instruction, during T4 for an unconditional branch instruction, and so on.
The logic function for Zin is derived from the control sequences in Figures 7 and
8. As another example, Figure 14 gives a circuit that generates the End control
signal from the logic function

End = T7 · ADD + T5 · BR + (T5 · N + T4 · N̄) · BRN + · · ·

The End signal starts a new instruction fetch cycle by resetting the control step
counter to its starting value. Figure 12 contains another control signal called RUN.

Fig 12

When set to 1, RUN causes the counter to be incremented by one at the end of every
clock cycle. When RUN is equal to 0, the counter stops counting. This is needed
whenever the WMFC signal is issued, to cause the processor to wait for the reply from
the memory.
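The two encoder equations above can be read directly as combinational logic. The Python functions below are only a truth-table sketch of that logic: T stands for the one-hot outputs of the step decoder, ins for the decoded instruction line, and N for the condition-code flag.

def z_in(T, ins):
    return T[1] or (T[6] and ins == "ADD") or (T[4] and ins == "BR")

def end_signal(T, ins, N):
    return ((T[7] and ins == "ADD") or
            (T[5] and ins == "BR") or
            (((T[5] and N) or (T[4] and not N)) and ins == "BRN"))

T = {i: False for i in range(1, 8)}
T[4] = True                                   # time slot T4
print(z_in(T, "BR"))                          # True: branch adds the offset in T4
print(end_signal(T, "BRN", N=False))          # True: untaken Branch<0 ends at T4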

Fig 13a
The control hardware shown can be viewed as a state machine that changes
from one state to another in every clock cycle, depending on the contents of the
instruction register, the condition codes, and the external inputs. The outputs of the
state machine are the control signals. The sequence of operations carried out by this
machine is determined by the wiring of the logic elements, hence the name
"hardwired." A controller that uses this approach can operate at high speed.
However, it has little flexibility, and the complexity of the instruction set it can
implement is limited.

Fig 13b

78
MICROPROGRAMMED CONTROL

ALU is the heart of any computing system, while Control unit is its brain. The
design of a control unit is not unique; it varies from designer to designer.
Some of the commonly used control logic design methods are:
• Sequence Reg & Decoder method

• Hard-wired control method

• PLA control method

• Micro-program control method

Fig 15

The control signals required inside the processor can be generated using
a control step counter and a decoder/ encoder circuit. Now we discuss an alternative
scheme, called micro programmed control, in which control signals are generated
by a program similar to machine language programs.

First, we introduce some common terms. A control word (CW) is a word whose
individual bits represent the various control signals in Figure 12. Each of the control
steps in the control sequence of an instruction defines a unique combination of 1s and
0s in the CW. The CWs corresponding to the 7 steps of Figure 7 are shown in Figure 15.
We have assumed that Select Y is represented by Select = 0 and Select4 by Select = 1.
A sequence of CWs corresponding to the control sequence of a machine
instruction constitutes the micro routine for that instruction, and the individual
control words in this micro routine are referred to as microinstructions.

The micro routines for all instructions in the instruction set of a computer
are stored in a special memory called the control store. The control unit can
generate the control signals for any instruction by sequentially reading the CWs of the
corresponding micro routine from the control store. This suggests organizing the
control unit as shown in Figure 16. To read the control words sequentially from
the control store, a micro program counter (µPC) is used. Every time a new
instruction is loaded into the IR, the output of the block labeled "starting address
generator" is loaded into the µPC. The µPC is then automatically incremented by
the clock, causing successive microinstructions to be read from the control store.
Hence, the control signals are delivered to various parts of the processor in the
correct sequence.
One important function of the control unit cannot be implemented by the simple
organization in Figure 16. This is the situation that arises when the control unit
is required to check the status of the condition codes or external inputs to choose
between alternative courses of action. In the case of hardwired control, this situation
is handled by including an appropriate logic function, in the encoder circuitry. In
micro programmed control, an alternative approach is to use conditional
branch microinstructions. In addition to the branch address, these
microinstructions specify which of the external inputs, condition codes, or, possibly,
bits of the instruction register should be checked as a condition for branching to take
place.
Fig 16

Fig 17

The instruction Branch<0 may now be implemented by a micro routine such as that
shown in Figure 17. After loading this instruction into the IR, a branch microinstruction
transfers control to the corresponding micro routine, which is assumed to start at
location 25 in the control store. This address is the output of the starting address
generator block. The microinstruction at location 25 tests the N bit of the condition
codes. If this bit is equal to 0, a branch takes place to location 0 to fetch a new machine
instruction. Otherwise, the microinstruction at location 27 loads the branch target
address into the PC.

Fig 18

To support micro program branching, the organization of the control unit
should be modified as shown in Figure 18. The starting address generator block of
Figure 16 becomes the starting and branch address generator. This block loads a
new address into the µPC when a microinstruction instructs it to do so. To allow
implementation of a conditional branch, inputs to this block consist of the external
inputs and condition codes as well as the contents of the instruction register. In
this control unit, the µPC is incremented every time a new microinstruction is
fetched from the micro program memory, except in the following situations:
1. When a new instruction is loaded into the IR, the µPC is loaded with the
starting address of the micro routine for that instruction.
2. When a Branch microinstruction is encountered and the branch condition is
satisfied, the µPC is loaded with the branch address.
3. When an End microinstruction is encountered, the µPC is loaded with the
address of the first CW in the micro routine for the instruction fetch cycle
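These three rules can be condensed into a small next-address function. The sketch below is illustrative only; the control-store encoding ("End", "Branch <addr>") and the starting address are invented placeholders, not the book's actual values.

FETCH = 0                                  # microroutine for the instruction fetch cycle

def next_upc(upc, micro, start_addr=None, branch_taken=False):
    if start_addr is not None:             # rule 1: new instruction loaded into IR
        return start_addr
    if micro == "End":                     # rule 3: back to the fetch microroutine
        return FETCH
    if micro.startswith("Branch") and branch_taken:
        return int(micro.split()[1])       # rule 2: branch condition satisfied
    return upc + 1                         # otherwise the µPC simply increments

print(next_upc(25, "Branch 0", branch_taken=True))   # 0
print(next_upc(26, "Offset-field-of-IRout ..."))     # 27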

Microinstructions
Having described a scheme for sequencing microinstructions, we now take a
closer look at the format of individual microinstructions. A straightforward way to
structure Microinstruction is to assign one bit position to each control signal, as in
Figure 15.
However, this scheme has one serious drawback — assigning individual bits to
each control signal results in long microinstructions because the number of required
signals is usually large. Moreover, only a few bits are set to 1 (to be used for active
gating) in any given microinstruction, which means the available bit space is poorly
used. Consider again the simple processor of Figure 2, and assume that it contains
only four general-purpose registers, R0, R1, R2, and R3. Some of the connections
in this processor are permanently enabled, such as the output of the IR to the decoding
circuits and both inputs to the ALU. The remaining connections to various registers require a
total of 20 gating signals. Additional control signals not shown in the figure are also
needed, including the Read, Write, Select, WMFC, and End signals. Finally, we must specify
the function to be performed by the ALU. Let us assume that 16 functions are provided,
including Add, Subtract, AND, and XOR. These functions depend on the particular ALU used
and do not necessarily have a one-to-one correspondence with the machine instruction OP
codes. In total, 42 control signals are needed.
If we use the simple encoding scheme described earlier, 42 bits would be needed in
each microinstruction. Fortunately, the length of the microinstructions can be reduced
easily. Most signals are not needed simultaneously, and many signals are mutually
exclusive. For example, only one function of the ALU can be activated at a time. The
source for a data transfer must be unique because it is not possible to gate the contents of
two different registers onto the bus at the same time. Read and Write signals to the
memory cannot be active simultaneously. This suggests that signals can be grouped so that
all mutually exclusive signals are placed in the same group. Thus, at most one micro
operation per group is specified in any microinstruction. Then it is possible to use a binary
coding scheme to represent the signals within a group. For example, four bits suffice to
represent the 16 available functions in the ALU. Register output control signals can be
placed in a group consisting of PCout, MDRout, Zout, Offsetout, R0out, R1out, R2out, R3out,
and TEMPout. Any one of these can be selected by a unique 4-bit code.

Further natural groupings can be made for the remaining signals. Figure 19 shows an
example of a partial format for the microinstructions, in which each group occupies a
field large enough to contain the required codes. Most fields must include one inactive code
for the case in which no action is required. For example, the all-zero pattern in F1 indicates
that none of the registers that may be specified in this field should have its contents placed
on the bus. An inactive code is not needed in all fields.

For example, F4 contains 4 bits that specify one of the 16 operations performed in the
ALU. Since no spare code is included, the ALU is active during the execution of every
microinstruction. However, its activity is monitored by the rest of the machine through
register Z, which is loaded only when the Zin signal is activated.
Grouping control signals into fields requires a little more hardware because decoding
circuits must be used to decode the bit patterns of each field into individual control
signals. The cost of this additional hardware is more than offset by the reduced
number of bits in each microinstruction, which results in a smaller
control store. In Figure 19, only 20 bits are needed to store the patterns for the 42
signals.
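The idea of field encoding can be sketched in a few lines of Python. The field layout below is illustrative (a 4-bit register-out field and a 4-bit ALU field packed into one word), not the exact layout of Figure 19:

REG_OUT = {"none": 0, "PCout": 1, "MDRout": 2, "Zout": 3, "R0out": 4,
           "R1out": 5, "R2out": 6, "R3out": 7}          # one 4-bit field (F1)
ALU_FN = {"Add": 0, "Sub": 1, "AND": 2, "XOR": 3}       # one 4-bit field (F4)

def encode(reg_out, alu_fn):
    return (REG_OUT[reg_out] << 4) | ALU_FN[alu_fn]     # pack the two fields

def decode(cw):
    f1, f4 = (cw >> 4) & 0xF, cw & 0xF                  # decoders in hardware
    return ([k for k, v in REG_OUT.items() if v == f1][0],
            [k for k, v in ALU_FN.items() if v == f4][0])

cw = encode("MDRout", "Add")
print(bin(cw), decode(cw))       # 0b100000 ('MDRout', 'Add')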

So far we have considered grouping and encoding only mutually exclusive
control signals. We can extend this idea by enumerating the patterns of required
signals in all possible microinstructions. Each meaningful combination of active
control signals can then be assigned a distinct code that represents the
microinstruction

Fig 19

Such full encoding is likely to further reduce the length of micro words but also to
increase the complexity of the required decoder circuits.

Part-A
2 MARKS QUESTIONS WITH ANSWERS
1. What is instruction cycle?
Ans: An instruction cycle (sometimes called fetch-decode-execute cycle) is the basic
operation cycle of a computer. It is the process by which a computer retrieves a program instruction
from its memory, determines what actions the instruction requires, and carries out those actions.
2. What are the lists of memory reference instructions?
Ans: Memory reference instructions
For instructions to be carried out in a sequential manner, we need a proper definition of the
micro-operations to be executed under each of them, i.e., a precisely defined form. Since
instructions are read from the memory into the registers, the term memory reference
instructions came into the picture.
1. AND to AC
2. ADD to AC
3. LDA: Load to AC
4. STA: Store AC
5. BUN: Branch Unconditionally
6. BSA: Branch and Save Return Address
7. ISZ: Increment and Skip if Zero
3. What are the types of interrupts?
Ans: When a user requests another process, this creates a disturbance for the running
process. This event is also called an interrupt.
Types of Interrupts
Generally there are three types of interrupts:
1) Internal Interrupt
2) Software Interrupt.
3) External Interrupt.

4. What are program control instructions?


Ans: The program control instructions direct the flow of a program and allow the flow of the
program to change.
1. THE JUMP GROUP
2. Unconditional Jump
3. Conditional Jumps
4. LOOP
5. What is I/O interrupt?
Ans: Interrupt I/O : A way of controlling input/output activity in which a peripheral or terminal that
needs to make or receive a data transfer sends a signal that causes a program interrupt to be set. At a
time appropriate to the priority level of the I/O interrupt, relative to the total interrupt system, the
processor enters an interrupt service routine (ISR). The function of the routine will depend upon the
system of interrupt levels and priorities that is implemented in the processor.
6. What are the types of registers used in instruction cycle?
Ans: Registers are used to quickly accept, store, and transfer data and instructions that are being used
immediately by the CPU; there are various types of registers used for various purposes.
Among the most commonly used registers are the accumulator (AC), the data register (DR), the
address register (AR), the program counter (PC), the memory data register (MDR), the index
register, and the memory buffer register (MBR).
7. What is processor register?
Ans: In computer architecture, a processor register is a small amount of storage available as part of a
digital processor, such as a central processing unit (CPU). Such registers are typically addressed by
mechanisms other than main memory and can be accessed faster.
8. What is program counter?
Ans: A program counter is a register in a computer processor that contains the address (location) of
the instruction being executed at the current time. As each instruction gets fetched, the program
counter increases its stored value by 1.
9. What are the lists of data transfer instructions?
Ans: Data Transfer Instructions
Data transfer instructions move data from one place in the computer to another without changing the
data content. The most common transfers are between memory and processor registers, between
processor registers and input or output, and between the processor registers themselves.
10. What is Interrupt cycle?
Ans: An instruction cycle (sometimes called fetch-and-execute cycle, fetch-decode-execute cycle, or
FDX) is the basic operation cycle of a computer. It is the process by which a computer retrieves a
program instruction from its memory, determines what actions the instruction requires, and carries
out those actions. This cycle is repeated continuously by the central processing unit (CPU), from
boot-up until the computer is shut down.
11. What is software interrupt?
Ans: A software interrupt is a type of interrupt that is caused either by a special instruction in the
instruction set or by an exceptional condition in the processor itself. A software interrupt is invoked
by software, unlike a hardware interrupt, and is considered one of the ways to communicate with the
kernel or to invoke system calls, especially during error or exception handling.
12. What are the types of Hardware Interrupts?
Ans: Hardware Interrupts: If the signal to the processor comes from an external device or hardware, it is
called a hardware interrupt. Example: pressing a key on the keyboard generates a signal that is given
to the processor to perform some action; such interrupts are called hardware interrupts. Hardware
interrupts can be classified into two types:
• Maskable Interrupt: A hardware interrupt that can be delayed when a higher-priority
interrupt has occurred to the processor.
• Non-Maskable Interrupt: A hardware interrupt that cannot be delayed and must be processed
by the processor immediately.

IMPORTANT QUESTIONS
2 MARKS
1. Define Interrupt? List the types of interrupts? [Jun 2015]
2. Write about register transfer notations? [Jun 2015]
3. Define the terms guard bits and rounding with respect to floating point operations? [Dec 2015]
4. Design an adder to add two 4 bit numbers? [Dec 2015]
5. Explain software and hardware interrupts? [III-Dec 2015]
6. Define effective address? [III-Dec 2015]
7. Write any two main features of Booth’s algorithm? [III-Dec 2015]
Part- B

10 MARKS
1. Draw the flow chart for Booth’s algorithm for multiplication of signed 2’s complement numbers
and explain with an example?
[Jun 2015]

2. What is an addressing mode? A two word instruction is stored on memory at an address designated
by symbol W. The address field of instruction (stored at W+1) is designated by the symbol Y. The
operand used during the execution of the instruction is stored at an address symbolized by Z. An
index register contains the value X. State how Z is calculated from the other addresses if the
addressing mode of the instruction is (i) Direct (ii) Indirect (iii) Relative (iv) Indexed
[Jun 2015]
3. Multiply 7 and 3 using Booth’s Algorithm? [Dec 2015]
4. Perform division of 1000 and 0011 using restoring division algorithm? [Dec 2015]
5. Explain the way that the interrupts are handled by the computer by means of flow chart? [III-Dec
2015]
6. (A). Explain LDA and STA memory reference instructions with suitable examples?
(B). Describe briefly about the register organization for floating point operations? [III-Dec 2015]

UNIT 3
MEMORY SYSTEM

BASIC CONCEPTS

The maximum size of the memory that can be used in any computer is determined by
the addressing scheme.
For example, a 16-bit computer that generates 16-bit addresses is capable of addressing
up to 2^16 = 64K memory locations. If a machine generates 32-bit addresses, it can
access up to 2^32 = 4G memory locations. This number represents the size of the address
space of the computer.

Address bits      Memory locations

16 bit            2^16 = 64K
32 bit            2^32 = 4G (giga)
40 bit            2^40 = 1T (tera)

If the smallest addressable unit of information is a memory word, the machine is
called word-addressable. If individual memory bytes are assigned distinct addresses, the
computer is called byte-addressable. Most of the commercial machines are byte-
addressable. For example in a byte-addressable 32-bit computer, each memory word
contains 4 bytes. A possible word-address assignment would be:

Word Address      Byte Addresses

0                 0  1  2  3
4                 4  5  6  7
8                 8  9  10  11
...               .....

With the above structure a READ or WRITE may involve an entire memory word
or it may involve only a byte. In the case of byte read, other bytes can also be read but
ignored by the CPU. However, during a write cycle, the control circuitry of the MM must
ensure that only the specified byte is altered. In this case, the higher-order 30 bits can
specify the word and the lower-order 2 bits can specify the byte within the word.
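This split amounts to simple bit slicing, as the following sketch for a byte-addressable 32-bit machine shows (split_address is a hypothetical helper, not a standard routine):

def split_address(addr):
    word = addr >> 2          # higher-order 30 bits select the word
    byte = addr & 0b11        # lower-order 2 bits select the byte in the word
    return word * 4, byte     # word addresses are multiples of 4

for a in (0, 5, 10):
    print(a, split_address(a))   # 0 (0, 0)   5 (4, 1)   10 (8, 2)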

CPU-Main Memory Connection – A block schematic: -

Figure: The processor's MAR and MDR registers connect to the memory over a k-bit
address bus, an n-bit data bus, and control lines (Read, Write, MFC, byte specifiers),
giving up to 2^k addressable locations with a word length of n bits.

From the system standpoint, the Main Memory (MM) unit can be viewed as a “black
box”. Data transfer between CPU and MM takes place through the use of two CPU
registers, usually called MAR (Memory Address Register) and MDR (Memory Data
Register). If MAR is k bits long and MDR is n bits long, then the MM unit may contain
up to 2^k addressable locations and each location will be n bits wide, while the word
length is equal to n bits. During a “memory cycle”, n bits of data may be transferred
between the MM and CPU. This transfer takes place over the processor bus, which has k
address lines (address bus), n data lines (data bus) and control lines like Read, Write,
Memory Function Completed (MFC), byte specifiers etc. (control bus). For a read
operation, the CPU loads the address into MAR, sets READ to 1 and sets other control
signals if required. The data from the MM is loaded into MDR and MFC is set to 1. For a
write operation, MAR and MDR are suitably loaded by the CPU, Write is set to 1 and other
control signals are set suitably. The MM control circuitry loads the data into the appropriate
locations and sets MFC to 1. This organization is shown in the block schematic above.

Some Basic Concepts

Memory Access Time: -

It is a useful measure of the speed of the memory unit. It is the time that elapses
between the initiation of an operation and the completion of that operation (for
example, the time between READ and MFC).
Memory Cycle Time :-
It is an important measure of the memory system. It is the minimum time delay
required between the initiations of two successive memory operations (for example, the
time between two successive READ operations). The cycle time is usually slightly longer
than the access time.

RANDOM ACCESS MEMORY (RAM):


A memory unit is called a Random Access Memory if any location can be accessed
for a READ or WRITE operation in some fixed amount of time that is independent of the
location’s address. Main memory units are of this type. This distinguishes them from
serial or partly serial access storage devices such as magnetic tapes and disks which are
used as the secondary storage device.

Cache Memory:-
The CPU of a computer can usually process instructions and data faster than they
can be fetched from a compatibly priced main memory unit. Thus the memory cycle time
becomes the bottleneck in the system. One way to reduce the memory access time is to
use a cache memory. This is a small and fast memory that is inserted between the larger,
slower main memory and the CPU. It holds the currently active segments of a program
and its data. Because of the locality of address references, the CPU can, most of the
time, find the relevant information in the cache memory itself (a cache hit) and only
infrequently needs access to the main memory (a cache miss). With a suitable size of
cache memory, cache hit rates of over 90% are possible, leading to a cost-effective
increase in the performance of the system.
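The effect of the hit rate can be seen with the usual average-access-time formula t_avg = h*C + (1 - h)*M, where h is the hit rate, C the cache access time, and M the miss penalty. The cycle counts below are assumed for illustration, not measured values:

def t_avg(hit_rate, cache_time=1, miss_time=100):
    return hit_rate * cache_time + (1 - hit_rate) * miss_time

for h in (0.0, 0.90, 0.99):
    print(h, t_avg(h))     # 100.0, 10.9 and 1.99 cycles on average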

Memory Interleaving: -
This technique divides the memory system into a number of memory modules
and arranges addressing so that successive words in the address space are placed in
different modules. When requests for memory access involve consecutive addresses,
the access will be to different modules. Since parallel access to these modules is
possible, the average rate of fetching words from the Main Memory can be increased.
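A sketch of the module-selection rule, assuming four modules (the count M is arbitrary here):

M = 4                          # assumed number of memory modules

def module_of(word_addr):
    return word_addr % M       # consecutive words fall in different modules

print([module_of(w) for w in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]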
Virtual Memory: -
In a virtual memory System, the address generated by the CPU is referred to as
a virtual or logical address. The corresponding physical address can be different and the
required mapping is implemented by a special memory control unit, often called
the memory management unit. The mapping function itself may be changed during
program execution according to system requirements.

Because of the distinction made between the logical (virtual) address space and
the physical address space; while the former can be as large as the addressing
capability of the CPU, the actual physical memory can be much smaller. Only the active
portion of the virtual address space is mapped onto the physical memory and the rest
of the virtual address space is mapped onto the bulk storage device used. If the
addressed information is in the Main Memory (MM), it is accessed and execution
proceeds. Otherwise, an exception is generated, in response to which the memory
management unit transfers a contigious block of words containing the desired word
from the bulk storage unit to the MM, displacing some block that is currently inactive. If
the memory is managed in such a way that, such transfers are required relatively
infrequency (ie the CPU will generally find the required information in the MM), the
virtual memory system can provide a reasonably good performance and succeed in
creating an illusion of a large memory with a small, in expensive MM.
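A minimal paging sketch of this mapping, assuming 4 KB pages and a toy page table (real memory management units keep far more state, e.g. access rights and dirty bits):

PAGE = 4096
page_table = {0: 5, 1: 2}              # virtual page number -> physical frame

def translate(vaddr):
    vpage, offset = divmod(vaddr, PAGE)
    if vpage not in page_table:        # the exception described above
        raise LookupError("page fault: bring page %d in from bulk storage" % vpage)
    return page_table[vpage] * PAGE + offset

print(translate(4100))                 # page 1, offset 4 -> 2*4096 + 4 = 8196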

INTERNAL ORGANIZATION OF MEMORY CHIPS:

Memory cells are usually organized in the form of array, in which each cell is
capable of storing one bit of information. Each row of cells constitutes a memory word
and all cells of a row are connected to a common line called as word line. The cells in
each column are connected to a Sense / Write circuit by two bit lines. Figure 3.1 shows
the possible arrangements of memory cells.

The Sense / Write circuits are connected to data input or output lines of the chip.
During a write operation, the sense / write circuit receives input information and stores
it in the cells of the selected word. The data input and data output of each sense / write
circuits are connected to a single bidirectional data line that can be connected to a data
bus of the CPU.

R / W → specifies the required operation.

CS → Chip Select input selects a given chip in the multi-chip memory system

SEMI CONDUCTOR RAM MEMORIES:
Semi-conductor memories are available in a wide range of speeds. Their cycle
time ranges from 100ns to 10ns. When first introduced in the late 1960s, they were
much more expensive. But now they are very cheap, and used almost exclusively in
implementing main memories.
STATIC MEMORIES:
Memories that consist of circuits capable of retaining their state as long as power is
applied are known as static memory.
Static random-access memory (SRAM) is a type of semiconductor memory that uses
bistable latching circuitry to store each bit. The term static differentiates it from dynamic
RAM (DRAM) which must be periodically refreshed. SRAM exhibits data remanence, but
is still volatile in the conventional sense that data is eventually lost when the memory is
not powered. Figure 3.2 shows the implementation of static RAM.

fig: 3.2Static RAM cell

Two inverters are cross connected to form a latch. The latch is connected to two bit lines
by transistors T1 and T2. These transistors act as switches that can be opened / closed
under the control of the word line. When the word line is at ground level, the transistors
are turned off and the latch retains its state.
Read Operation:
In order to read the state of the SRAM cell, the word line is activated to close switches T1
and T2. If the cell is in state 1, the signal on bit line b is high and the signal on bit line
bꞌ is low. Thus b and bꞌ are complements of each other. The Sense / Write circuit at the end
of the bit lines monitors the states of b and bꞌ and sets the output accordingly.
Write Operation:
The state of the cell is set by placing the appropriate value on bit line b and its
complement on bꞌ and then activating the word line. This forces the cell into the
corresponding state. The required signals on the bit lines are generated by the Sense / Write
circuit.
CMOS RAM CELL
Transistor pairs (T3, T5) and (T4, T6) form the inverters in the latch. In state 1, the voltage
at point X is maintained high by having T3 and T6 on, while T4 and T5 are OFF. Thus, when
T1 and T2 are turned ON (closed), bit lines b and bꞌ will have high and low signals
respectively. The CMOS cell requires 5V (in older versions) or 3.3V (in newer versions) of
power supply voltage.

Figure3.3 : CMOS cell (Complementary Metal oxide Semi-Conductor):

Merit:
• It has low power consumption because the current flows in the cell only when the
cell is being accessed.
• Static RAMs can be accessed quickly. Their access time is a few nanoseconds.
Demerit:
• SRAMs are said to be volatile memories because their contents are lost when the
power is interrupted.
Asynchronous DRAMS:
Less expensive RAMs can be implemented if simpler cells are used. Such cells cannot
retain their state indefinitely; hence they are called dynamic RAMs (DRAMs). The
information is stored in a dynamic memory cell in the form of a charge on a capacitor, and
this charge can be maintained only for a few milliseconds. The contents must be
periodically refreshed by restoring the capacitor charge to its full value.

Figure 3.4: A single transistor dynamic Memory cell

In order to store information in the cell, the transistor T is turned on and the appropriate
voltage is applied to the bit line, which charges the capacitor. After the transistor is
turned off, the capacitor begins to discharge, which is caused by the capacitor's own
leakage resistance. Hence the information stored in the cell can be retrieved correctly
only if it is read before the charge on the capacitor drops below a threshold value, as
shown in Figure 3.4.
If charge on capacitor > threshold value → the bit line will have logic value 1.
If charge on capacitor < threshold value → the bit line will be set to logic value 0.
During a read operation, the transistor is turned on and a sense amplifier connected to
the bit line detects whether the charge on the capacitor is above the threshold value.
A 16-megabit DRAM chip, configured as 2M x 8, is shown in Figure 3.5.

Figure 3.5: Internal organization of a 2M X 8 dynamic Memory chip.

DESCRIPTION:
The 4096 cells in each row are divided into 512 groups of 8, so a 21-bit address is
needed to access a byte in the memory (12 bits to select a row, and 9 bits to specify the
group of 8 bits in the selected row).
A(20-9) → Row address of a byte.
A(8-0) → Column address of a byte.
During Read/ Write operation, the row address is applied first. It is loaded into the row
address latch in response to a signal pulse on Row Address Strobe (RAS) input of the
chip. When a Read operation is initiated, all cells on the selected row are read and
refreshed. Shortly after the row address is loaded, the column address is applied to the
address pins and loaded into the column address latch under control of the Column
Address Strobe (CAS) signal. The information in this latch is decoded and the appropriate
group of 8 Sense/Write circuits is selected.
R/W = 1 (read operation): the output values of the selected circuits are transferred to
the data lines D0 - D7.
R/W = 0 (write operation): the information on D0 - D7 is transferred to the selected
circuits.
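The 21-bit address split described above is plain bit slicing; the following sketch assumes the 2M x 8 organization of Figure 3.5 (12 row bits A20-A9 latched with RAS, 9 column bits A8-A0 latched with CAS):

def split_dram_address(addr):
    row = (addr >> 9) & 0xFFF      # A20-A9: selects one of 4096 rows
    col = addr & 0x1FF             # A8-A0: selects one of 512 byte groups
    return row, col

print(split_dram_address(0x1FFFFF))    # (4095, 511): the last byte on the chip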
RAS and CAS are active low, so they cause the latching of addresses when they change
from high to low. To ensure that the contents of a DRAM are maintained, each row of
cells must be accessed periodically. A refresh operation usually performs this function
automatically. A specialized memory controller circuit provides the necessary control
signals RAS and CAS that govern the timing. The processor must take into account the
delay in the response of the memory. Such memories are referred to as asynchronous
DRAMs.
Fast Page Mode:
Transferring the bytes in sequential order is achieved by applying the consecutive
sequence of column address under the control of successive CAS signals. This scheme
allows transferring a block of data at a faster rate. This block-transfer capability is
called Fast Page Mode.
Synchronous DRAM:
Here the operations are directly synchronized with clock signal. The address and data
connections are buffered by means of registers. The output of each sense amplifier is
connected to a latch. A Read operation causes the contents of all cells in the selected
row to be loaded in these latches. The Figure 3.6 shows the structure of SDRAM.

Figure 3.6: Synchronous DRAM

Data held in the latches that correspond to the selected columns are transferred into
the data output register, thus becoming available on the data output pins.

Figure 3.7: Timing Diagram Burst Read of Length 4 in an SDRAM

First, the row address is latched under control of RAS signal. The memory
typically takes 2 or 3 clock cycles to activate the selected row. Then the column address is
latched under the control of CAS signal. After a delay of one clock cycle, the first set of data bits
is placed on the data lines. The SDRAM automatically increments the column address to access
the next 3 sets of bits in the selected row, which are placed on the data lines in the next 3 clock
cycles. A timing diagram for a typical burst of read of length 4 is shown in Figure 3.7.

Latency and Bandwidth:


A good indication of performance is given by two parameters. They are,

• Latency
• Bandwidth
Latency refers to the amount of time it takes to transfer a word of data to or from the
memory. For a transfer of single word, the latency provides the complete indication of
memory performance. For a block transfer, the latency denotes the time it takes to
transfer the first word of data.
Bandwidth is defined as the number of bits or bytes that can be transferred in one
second. Bandwidth mainly depends upon the speed of access to the stored data and on
the number of bits that can be accessed in parallel.
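A back-of-the-envelope use of the two parameters, with assumed numbers (a 100 MHz memory clock, 5 cycles to the first word, and a burst of 4 words of 32 bits, one word per cycle thereafter):

clock = 100e6                          # Hz, assumed
latency_cycles, burst_length = 5, 4    # cycles to first word, words per burst
bits_per_word = 32

latency = latency_cycles / clock
transfer_time = (latency_cycles + burst_length - 1) / clock
bandwidth = burst_length * bits_per_word / transfer_time

print(latency * 1e9, "ns to the first word")   # 50.0 ns
print(bandwidth / 1e6, "Mbit/s")               # 1600.0 Mbit/s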

Double Data Rate SDRAM (DDR-SDRAM):
The standard SDRAM performs all actions on the rising edge of the clock signal.
The double data rate SDRAM transfers data on both edges of the clock (the leading edge
and the trailing edge). The bandwidth of DDR-SDRAM is doubled for long burst transfers. To make it
possible to access the data at high rate, the cell array is organized into two banks. Each
bank can be accessed separately. Consecutive words of a given block are stored in
different banks. Such interleaving of words allows simultaneous access to two words
that are transferred on successive edge of the clock.
Larger Memories: Dynamic Memory System:
The physical implementation is done in the form of Memory Modules. If a large
memory is built by placing DRAM chips directly on the main system printed circuit board
that contains the processor, often referred to as Motherboard; it will occupy large
amount of space on the board. These packaging considerations have led to the
development of larger memory units known as SIMMs and DIMMs.

SIMM - Single Inline Memory Module

DIMM - Dual Inline Memory Module

MEMORY SYSTEM CONSIDERATION:
To reduce the number of pins, the dynamic memory chips use multiplexed
address inputs. The address is divided into two parts. They are,

High Order Address Bit (Select a row in cell array and it is provided first and latched
into memory chips under the control of RAS signal).
Low Order Address Bit (Selects a column and they are provided on same address
pins and latched using CAS signals).
The Multiplexing of address bit is usually done by Memory Controller Circuit, as
shown in Figure 3.8.

Figure 3.8: Use of Memory Controller

The Controller accepts a complete address and R/W signal from the processor, under
the control of a request signal which indicates that a memory access operation is
needed. The Controller then forwards the row and column portions of the address to the
memory and generates RAS and CAS signals. It also sends R/W and CS signals to the
memory. The CS signal is usually active low; this is conventionally indicated by writing the signal name with an overbar.

Refresh Overhead:
All dynamic memories have to be refreshed. In a DRAM, the period for refreshing all
rows is 16 ms, whereas it is 64 ms in an SDRAM.

Rambus Memory:
The usage of a wide bus is expensive. Rambus developed the implementation of a
narrow bus. Rambus technology is a fast signaling method used to transfer information
between chips. Instead of using signals that have voltage levels of either 0 or Vsupply to
represent the logical values, the signals consist of much smaller voltage swings around a
reference voltage Vref. The reference Voltage is about 2V and the two logical values are
represented by 0.3V swings above and below Vref.

This type of signaling is generally known as Differential Signaling. Rambus provides


a complete specification for the design of communication links (Special Interface circuits)
called a Rambus channel. Rambus memory has a clock frequency of 400 MHz. The data
are transmitted on both edges of the clock so that the effective data transfer rate is
800 MHz.
The circuitry needed to interface to the Rambus channel is included on the chip. Such
chips are known as Rambus DRAMs (RDRAM). Rambus channel has,
• 9 Data lines (1-8 Transfer the data, 9th line → Parity checking).
• Control line
• Power line

A two-channel Rambus has 18 data lines and no separate address lines. It is also
called a Direct RDRAM. Communication between the processor (or some other device
that can serve as a master) and RDRAM modules (which serve as slaves) is carried out by
means of packets transmitted on the data lines.
There are 3 types of packets. They are,

• Request
• Acknowledge
• Data

READ ONLY MEMORY (ROM):


Both SRAM and DRAM chips are volatile, which means that they lose the stored
information if power is turned off. Many applications require Non-volatile memory
(which retains the stored information if power is turned off).
E.g.: Operating System software has to be loaded from disk into memory, which requires
a boot program that is available as soon as power is turned on; i.e., it requires non-volatile
memory. Non-volatile memory is also used in embedded systems. Since the normal
operation involves only reading of stored data, a memory of this type is called a ROM.

Figure 3.9: ROM cell
At logic value 0 → the transistor (T) is connected to the ground point (P). The transistor
switch is closed and the voltage on the bit line drops nearly to zero.
At logic value 1 → the transistor switch is open. The bit line remains at high voltage. To
read the state of the cell, the word line is activated. A sense circuit at the end of the bit
line generates the proper output value.
Different types of non-volatile memory are:

• PROM
• EPROM
• EEPROM
• Flash Memory

PROM: Programmable ROM:


PROM allows the data to be loaded by the user. Programmability is achieved by
inserting a fuse at point P in a ROM cell. Before it is programmed, the memory contains
all 0s. The user can insert 1s at the required locations by burning out the fuses at these
locations using high-current pulses. This process is irreversible.

Merits:
• It provides flexibility.
• It is faster.
• It is less expensive because it can be programmed directly by the user.

EPROM - Erasable reprogrammable ROM:


EPROM allows the stored data to be erased and new data to be loaded. In an
EPROM cell, a connection to ground is always made at point P, and a special transistor is
used, which has the ability to function either as a normal transistor or as a disabled,
permanently open switch, by injecting charge into it that becomes trapped inside.
Erasure requires dissipating the charges trapped in the transistors of the memory
cells. This can be done by exposing the chip to ultraviolet light, which is why EPROM chips
are mounted in packages that have transparent windows.
Merits:
• It provides flexibility during the development phase of digital system.
• It is capable of retaining the stored information for a long time.
Demerits:
• The chip must be physically removed from the circuit for reprogramming and its
entire contents are erased by UV light.

EEPROM:- Electrically Erasable ROM:


EEPROM (also written E2PROM and pronounced "e-e-prom," "double-e prom,"
"e-squared," or simply "e-prom") stands for Electrically Erasable Programmable Read-
Only Memory and is a type of non-volatile memory used in computers and other
electronic devices to store small amounts of data that must be saved when power is
removed, e.g., calibration tables or device configuration.

When larger amounts of static data are to be stored (such as in USB flash drives) a
specific type of EEPROM such as flash memory is more economical than traditional
EEPROM devices. EEPROMs are realized as arrays of floating-gate transistors. EEPROM is
user modifiable read only memory (ROM) that can be erased and reprogrammed
(written to) repeatedly through the application of higher than normal electrical voltage
generated externally or internally in the case of modern EEPROMs.
EPROM usually must be removed from the device for erasing and programming, whereas
EEPROMs can be programmed and erased in circuit. Originally, EEPROMs were limited to
single byte operations which made them slower, but modern EEPROMs allow multi-byte
page operations. EEPROM also has a limited life; that is, the number of times it can be
reprogrammed is limited to tens or hundreds of thousands of times.
That limitation has been extended to a million write operations in modern EEPROMs.
In an EEPROM that is frequently reprogrammed while the computer is in use, the life of
the EEPROM can be an important design consideration. It is for this reason that
EEPROMs were used for configuration information, rather than random access memory.
Merits:
• It can be both programmed and erased electrically.
• It allows cell contents to be erased selectively.
Demerits:
• It requires different voltages for erasing, writing and reading the stored data.

FLASH MEMORY:
In EEPROM, it is possible to read and write the contents of a single cell. In Flash
device, it is possible to read the contents of a single cell but it is only possible to write
the entire contents of a block. Prior to writing, the previous contents of the block are
erased.
E.g.: In MP3 player, the flash memory stores the data that represents sound.
Single flash chips cannot provide sufficient storage capacity for embedded system
application. There are 2 methods for implementing larger memory modules consisting of
number of chips. They are,
• Flash Cards
• Flash Drives.
Merits:
• Flash drives have greater density which leads to higher capacity and low cost per
bit.
• It requires single power supply voltage and consumes less power in their
operation.
Flash Cards:
One way of constructing a larger module is to mount flash chips on a small card.
Such flash cards have a standard interface. The card is simply plugged into a conveniently
accessible slot. Typical memory sizes are 8, 32 or 64 MB. E.g.: a minute of music can be
stored in about 1 MB of memory, so a 64 MB flash card can store an hour of music.
Flash Drives:
Larger flash memory modules can be developed to replace a hard disk drive. Such
flash drives are designed to fully emulate the hard disk. Flash drives are solid-state
electronic devices that have no movable parts.
Merits:
• They have shorter seek and access times, which results in faster response.
• They have low power consumption, which makes them attractive for battery-driven
applications.
• They are insensitive to vibration.


Demerits:

• The capacity of a flash drive (<1 GB) is less than that of a hard disk (>1 GB).
• It has a higher cost per bit.
• Flash memory deteriorates after it has been written a number of times
(typically at least 1 million times).

SPEED, SIZE AND COST:

Magnetic Disk:

A huge amount of cost-effective storage can be provided by magnetic disks. The main
memory can be built with DRAMs, which leaves SRAMs to be used in smaller units where
speed is of the essence.

MEMORY HIERARCHY

To this point in our study of systems, we have relied on a simple model of a computer
system as a CPU that executes instructions and a memory system that holds instructions
and data for the CPU. In our simple model, the memory system is a linear array of bytes,
and the CPU can access each memory location in a constant amount of time. While this is an
effective model as far as it goes, it does not reflect the way that modern systems really
work. In practice, a memory system is a hierarchy of storage devices with different
capacities, costs, and access times. CPU registers hold the most frequently used data. Small,
fast cache memories nearby the CPU act as staging areas for a subset of the data and
instructions stored in the relatively slow main memory.

The main memory stages data stored on large, slow disks, which in turn often serve as
staging areas for data stored on the disks or tapes of other machines connected by
networks.

Memory hierarchies work because well written programs tend to access the storage at
any particular level more frequently than they access the storage at the next lower level.
So the storage at the next level can be slower, and thus larger and cheaper per bit. The
overall effect is a large pool of memory that costs as much as the cheap storage near the
bottom of the hierarchy, but that serves data to programs at the rate of the fast storage
near the top of the hierarchy.
As a programmer, you need to understand the memory hierarchy because it has a big
impact on the performance of your applications. If the data your program needs are
stored in a CPU register, then they can be accessed in zero cycles during the execution of
the instruction. If stored in a cache, 1 to 30 cycles. If stored in main memory, 50 to 200
cycles.
And if stored on disk, tens of millions of cycles! The entire computer memory can be
realized as the hierarchy shown in the figure below.
Here, then, is a fundamental and enduring idea in computer systems: If you understand
how the system moves data up and down the memory hierarchy, then you can write
your application programs so that their data items are stored higher in the hierarchy,
where the CPU can access them more quickly. This idea centers on a fundamental
property of computer programs known as locality.

Programs with good locality tend to access the same set of data items over and over
again, or they tend to access sets of nearby data items. Programs with good locality tend
to access more data items from the upper levels of the memory hierarchy than programs
with poor locality, and thus run faster.

For example, the running times of different matrix multiplication kernels that perform
the same number of arithmetic operations, but have different degrees of locality, can
vary by a factor of 20!
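The factor comes from loop order alone. The two Python kernels below perform identical arithmetic; with row-major storage, the ijk order walks B down a column (poor spatial locality), while the ikj order walks both B and C along rows (good locality). They are a sketch of the idea, not tuned code:

def matmul_ijk(A, B, C, n):
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]     # B accessed column-wise

def matmul_ikj(A, B, C, n):
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]           # B and C accessed row-wise

n = 2
A = [[1, 2], [3, 4]]; B = [[5, 6], [7, 8]]
C = [[0] * n for _ in range(n)]
matmul_ikj(A, B, C, n)
print(C)                                         # [[19, 22], [43, 50]]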

Figure : Memory Hierarchy
Types of Cache Memory:
The Cache memory is of 2 types. They are:
• Primary /Processor Cache (Level1 or L1 cache)
• Secondary Cache (Level2 or L2 cache)
Primary Cache → It is always located on the processor chip.
Secondary Cache → It is placed between the primary cache and the rest of the
memory.
The main memory is implemented using the dynamic components (SIMM, RIMM,
and DIMM). The access time for main memory is about 10 times longer than the access
time for L1 cache.

CACHE MEMORY
Cache memory is an integral part of every system now. Cache memory is random
access memory (RAM) that a computer microprocessor can access more quickly than it
can access regular RAM. As the microprocessor processes data, it looks first in the cache
memory and if it finds the data there (from a previous reading of data), it does not have
to do the more time-consuming reading of data from the larger memory.

The effectiveness of the cache mechanism is based on the property of Locality of
reference.
Locality of Reference:

Analysis of programs shows that many instructions in localized areas of the program are
executed repeatedly during some time period, while the remainder of the program is
accessed relatively infrequently. This property is called locality of reference, and it
manifests itself in two ways: temporal (the recently executed instructions are likely to be
executed again very soon) and spatial (the instructions in close proximity to a recently
executed instruction are likely to be executed soon). If the active segment of the program
is placed in cache memory, then the total execution time can be reduced significantly.

The operation of a cache memory is
very simple. The memory control circuitry is designed to take advantage of the property
of locality of reference. The term Block refers to the set of contiguous address locations
of some size. The cache line is used to refer to the cache block.


Figure : Use of Cache Memory

The figure shows the arrangement of the cache between the processor and the main memory.
The Cache memory stores a reasonable number of blocks at a given time but this
number is small compared to the total number of blocks available in Main Memory. The
correspondence between main memory block and the block in cache memory is
specified by a mapping function. The Cache control hardware decides that which block
should be removed to create space for the new block that contains the referenced word.
The collection of rule for making this decision is called the replacement algorithm. The
cache control circuit determines whether the requested word currently exists in the
cache. If it exists, then Read/Write operation will take place on appropriate cache
location. In this case Read/Write hit will occur. In a Read operation, the memory will not
be involved.

The write operation proceeds in 2 ways. They are:


• Write-through protocol
• Write-back protocol

Write-through protocol:
Here the cache location and the main memory locations are updated
simultaneously.

Write-back protocol:
This technique is to update only the cache location and to mark it as updated with an
associated flag bit, called the dirty/modified bit. The word in the main memory will be
updated later, when the block containing this marked word is to be removed from the
cache to make room for a new block. If the requested word does not currently exist in
the cache during a read operation, then a read miss will occur. To overcome the read
miss, the load-through / early-restart protocol is used.
Read Miss:
The block of words that contains the requested word is copied from the main
memory into cache.

Load-through:
After the entire block is loaded into the cache, the particular word requested is
forwarded to the processor. If the requested word does not exist in the cache during a
write operation, then a Write Miss will occur. If the write-through protocol is used, the
information is written directly into the main memory. If the write-back protocol is used,
then the block containing the addressed word is first brought into the cache, and then the
desired word in the cache is overwritten with the new information.
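A toy single-line cache showing the two write policies side by side; the mapping and replacement machinery is deliberately omitted, and the "dirty" flag is the modified bit mentioned above:

class Line:
    def __init__(self):
        self.data, self.dirty = {}, False

def write(line, memory, addr, value, policy):
    line.data[addr] = value
    if policy == "write-through":
        memory[addr] = value          # cache and main memory updated together
    else:                             # write-back: only mark the line; memory
        line.dirty = True             # is updated when the block is evicted

def evict(line, memory):
    if line.dirty:
        memory.update(line.data)      # flush the marked (dirty) block
    line.data.clear(); line.dirty = False

mem = {0: 0}
line = Line()
write(line, mem, 0, 7, "write-back")
print(mem[0])                         # still 0: memory not yet updated
evict(line, mem)
print(mem[0])                         # 7: updated at eviction time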

CACHE MEMORY DESIGN PARAMETERS

There are two cache design parameters that dramatically influence the cache
performance: the block size and the cache associativity. There are also many other
implementation techniques, both hardware and software, that improve the cache
performance, but they are not discussed here.

The simplest way to reduce the miss rate is to increase the block size. However,
increasing the block size also increases the miss penalty (which is the time to load a block
from main memory into the cache), so there is a trade-off between the block size and the
miss penalty. We can increase the block size up to a level at which the miss rate is
decreasing, but we also have to be sure that this benefit will not be exceeded by the
increased miss penalty.

The second cache design parameter that reduces cache misses is the associativity. There
is an empirical result called the 2:1 rule of thumb, which states that a direct-mapped
cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.
Unfortunately, an increased associativity comes with a bigger hit time. More time is
taken to retrieve a block inside an associative cache than in a direct-mapped cache. To
retrieve a block in an associative cache, the block must be searched for inside an entire
set, since there is more than one place where the block can be stored.

Based on the cause that determines a cache miss, we can classify cache misses as
compulsory, capacity and conflict misses. This classification is called the 3C model.
Compulsory misses are issued when a first access is made to a block that is not in the
cache, so the block must be brought into the cache. Increasing the block size can reduce
compulsory misses due to prefetching of the other elements in the block. If the cache
cannot contain all the blocks needed during the execution of a program, capacity misses
will occur due to blocks being discarded and later retrieved. If the block-placement
strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory
and capacity misses) will occur, because a block can be discarded and later retrieved if
too many blocks map to its set. Increasing the associativity in general reduces the number
of conflict misses and implicitly the runtime of the programs. However, this is not true all
the time. Minimizing the cache misses does not necessarily minimize the runtime. For
example, there can be fewer cache misses with more memory accesses.

Mapping Function:
Direct Mapping:
It is the simplest technique, in which block j of the main memory maps onto block
j modulo 128 of the cache. Thus, whenever one of the main memory blocks 0, 128, 256, …
is loaded into the cache, it is stored in cache block 0; blocks 1, 129, 257, … are stored in
cache block 1, and so on. Contention may arise for a cache position even when the cache
is not full, since more than one memory block maps onto each cache block position. The
contention is resolved by allowing the new block to overwrite the currently resident
block. Placement of a block in the cache is determined from the memory address.
The memory address is divided into 3 fields, namely,
• Low-order 4-bit field (word) → selects one of 16 words in a block.
• 7-bit cache block field → when a new block enters the cache, this field determines the
cache position in which the block must be stored.
• 5-bit tag field → the high-order 5 bits of the memory address of the block are
stored in the 5 tag bits associated with its location in the cache.
As execution proceeds, the high order 5 bits of the address is compared with tag bits
associated with that cache location. If they match, then the desired word is in that block
of the cache. If there is no match, then the block containing the required word must be
first read from the main memory and loaded into the cache. The direct mapping is shown
in Figure 15.2.

Figure 15.2: Direct Mapped Cache
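The field extraction just described can be illustrated with a short C fragment. The 16-bit address and the 5/7/4 field widths follow the example above; the variable names are illustrative.

#include <stdint.h>
#include <stdio.h>

/* 16-bit address split as: 5-bit tag | 7-bit cache block | 4-bit word. */
int main(void)
{
    uint16_t addr  = 0xABCD;              /* any 16-bit memory address */
    unsigned word  =  addr        & 0xF;  /* low-order 4 bits          */
    unsigned block = (addr >> 4)  & 0x7F; /* next 7 bits               */
    unsigned tag   = (addr >> 11) & 0x1F; /* high-order 5 bits         */

    printf("tag=%u block=%u word=%u\n", tag, block, word);
    return 0;
}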

Merit:
• It is easy to implement.
Demerit:
• It is not very flexible.
Associative Mapping:
Here, a main memory block can be placed into any cache block position. 12 tag
bits are needed to identify a memory block when it is resident in the cache. The tag bits of an
address received from the processor are compared to the tag bits of each block of the
cache to see if the desired block is present. This is called associative mapping. It gives
complete freedom in choosing the cache location. A new block that has to be brought
into the cache has to replace (eject) an existing block if the cache is full.

Figure 15.3: Associative Mapped Cache.

In this method, the cache must determine whether a given block is present in it.
A search of this kind is called an associative search. The associative-mapped cache
is shown in Figure 15.3.

Merit:
• It is more flexible than direct mapping technique.
Demerit:
• Its cost is high.
Set-Associative Mapping:
It is the combination of direct and associative mapping. The blocks of the cache are
grouped into sets, and the mapping allows a block of the main memory to reside in any
block of a specific set. In this case, the cache has two blocks per set, so the memory
blocks 0, 64, 128, …, 4032 map into cache set 0, and they can occupy either of the two
block positions within the set.
6-bit set field → determines which set of the cache contains the desired block.
6-bit tag field → the tag field of the address is compared to the tags of the two
blocks of the set to check if the desired block is present.

Figure 15.4: Set-Associative Mapping:

No. of blocks per set        No. of set-field bits

        2                              6
        4                              5
        8                              4
      128                   no set field (fully associative)

A cache that contains 1 block per set is a direct-mapped cache, and a cache that
has 'k' blocks per set is called a k-way set-associative cache. Each block contains a
control bit called a valid bit, which indicates whether the block contains valid
data. The dirty bit indicates whether the block has been modified during its cache
residency. The set-associative mapping is shown in Figure 15.4.
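A corresponding sketch for the 2-way set-associative organization above (6-bit tag, 6-bit set, 4-bit word) is given below; the names are again illustrative.

#include <stdint.h>
#include <stdio.h>

/* Block j of the main memory maps to set (j mod 64), so blocks
   0, 64, 128, ... all compete for the two positions of set 0. */
int main(void)
{
    uint16_t addr = 0xABCD;
    unsigned word =  addr        & 0xF;   /* word within the block           */
    unsigned set  = (addr >> 4)  & 0x3F;  /* which set of the cache          */
    unsigned tag  = (addr >> 10) & 0x3F;  /* compared with both blocks' tags */

    printf("tag=%u set=%u word=%u\n", tag, set, word);
    return 0;
}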

Valid bit = 0 → when power is initially applied to the system.

Valid bit = 1 → when the block is loaded from the main memory for the first
time.

If a main memory block is updated by a source (such as a DMA transfer directly into
memory) while a copy of the block resides in the cache, the valid bit of that cache entry
is cleared to 0. Keeping the copies of data seen by the processor and by DMA transfers
consistent is called the cache-coherence problem.

Merit:
• The contention problem of direct mapping is eased by having a few choices for block
placement.
• The hardware cost is decreased by reducing the size of the associative search.

REPLACEMENT ALGORITHM:
In a direct-mapped cache, the position of each block is predetermined: hence, no
replacement strategy exists. In associative and set-associative caches there exists some
flexibility. When a new block is to be brought into the cache and all the positions that it
may occupy are full, the cache controller must decide which of the old blocks to
overwrite. This is an important issue, because the decision can be a strong determining
factor in system performance. In general, the objective is to keep blocks in the cache that
are likely to be referenced in the near future. However, it is not easy to determine which
blocks are about to be referenced.

The property of locality of reference in programs gives a clue to a reasonable strategy.


Because programs usually stay in localized areas for reasonable periods of time, there is
a high probability that the blocks that have been referenced recently will be referenced
again soon. Therefore, when a block is to be overwritten, it is sensible to overwrite the
one that has gone the longest time without being referenced. This block is called the
least recently used (LRU) block, and the technique is called the LRU replacement
algorithm. To use the LRU algorithm, the cache controller must track references to all
blocks as computation proceeds.

Suppose it is required to track the LRU block of a four block set in a set-associative
cache. A 2-bit counter can be used for each block. When a hit occurs, the counter of the
block that is referenced is set to 0. Counters with values originally lower than the
referenced one are incremented by one, and all others remain unchanged. When a miss
occurs and the set is not full, the counter associated with the new block loaded from the
main memory is set to 0, and the values of all other counters are increased by one. When
a miss occurs and the set is full, the block with the counter value 3 is removed, the new
block is put in its place, and its counter is set to 0.

The other three block counters are incremented by one (It can be easily verified that the
counter values of occupied blocks are always distinct). The LRU algorithm has been used
extensively. Although it performs well for many access patterns, it can lead to poor
performance in some cases. For example, it produces disappointing results when
accesses are made to sequential elements of an array that is slightly too large to fit into
the cache. Performance of the LRU algorithm can be improved by introducing a small
amount of randomness in deciding which block to replace.

Several other replacement algorithms are also used in practice. An intuitively reasonable
rule would be to remove the "oldest" block from a full set when a new block must be
brought in. However, because this algorithm does not take into account the recent
pattern of access to blocks in the cache, it is generally not as effective as the LRU
algorithm in choosing the best blocks to remove. The simplest algorithm is to randomly
choose the block to be overwritten. Interestingly enough, this simple algorithm has been
found to be quite effective in practice.

Example: Consider 4 blocks per set in a set-associative cache, where a 2-bit counter
is used for each block. When a hit occurs, the referenced block's counter is set to 0;
counters with values originally lower than the referenced one are incremented by 1,
and all others remain unchanged. When a miss occurs and the set is full, the block with
counter value 3 is removed, the new block is put in its place with its counter set to 0,
and the other block counters are incremented by 1.
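The counter manipulation can be written out as a small C sketch of the 2-bit counter scheme for one four-block set; a real controller implements this in hardware, and the function names are illustrative.

#define WAYS 4

/* counters[i] == 0 marks the most recently used block; 3 the least recent. */
void lru_on_hit(unsigned counters[WAYS], int hit_way)
{
    unsigned ref = counters[hit_way];
    for (int i = 0; i < WAYS; i++)
        if (counters[i] < ref)        /* only counters originally lower than */
            counters[i]++;            /* the referenced one are incremented  */
    counters[hit_way] = 0;
}

/* On a miss with a full set, the block whose counter is 3 is the victim. */
int lru_victim(const unsigned counters[WAYS])
{
    for (int i = 0; i < WAYS; i++)
        if (counters[i] == 3)
            return i;
    return 0;                         /* unreachable: counters stay distinct */
}

void lru_on_fill(unsigned counters[WAYS], int new_way)
{
    for (int i = 0; i < WAYS; i++)    /* all other counters are incremented */
        if (i != new_way && counters[i] < 3)
            counters[i]++;
    counters[new_way] = 0;            /* the new block is most recently used */
}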

Merit:
The performance of the LRU algorithm can be improved by introducing a small amount
of randomness in deciding which block is to be overwritten.

PERFORMANCE CONSIDERATION:

Two key factors in commercial success are performance and cost: the best machine
delivers the highest possible performance at the lowest cost. A common measure of
success is the price/performance ratio.
Performance depends on how fast machine instructions can be brought to the processor
and how fast they can be executed. To allow several memory accesses to proceed in
parallel, interleaving is used.

Interleaving:

If the main memory is structured as a collection of physically separate modules, each
with its own ABR (address buffer register) and DBR (data buffer register), memory
access operations may proceed in more than one module at the same time. Thereby the
aggregate rate of transmission of words to and from the main memory system can be
increased.
Two methods of address layout are indicated in the figure. In the first case, the memory
address generated by the processor is decoded as shown in part (a) of the figure: the
high-order k bits name one of n modules, and the low-order m bits name a particular
word in that module. When consecutive locations are accessed, only one module is
involved. At the same time, devices with DMA ability may be accessing information in
other modules.
In the second case, shown in part (b) of the figure and called memory interleaving,
the low-order k bits of the memory address select a module, and the high-order m bits
name a location within the module. Thus, any component of the system that
generates requests for access to consecutive memory locations can keep several
modules busy at any one time, which results in both faster access to a block of data and
higher average utilization of the memory system as a whole.

Figure : Addressing multiple-module memory system
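A short C fragment makes the interleaved layout of part (b) concrete; the choice of k = 2 (four modules) is an assumption for illustration.

#include <stdint.h>
#include <stdio.h>

#define K 2   /* low-order bits naming the module: 2^K = 4 modules */

int main(void)
{
    /* Consecutive addresses fall in successive modules, so a block
       transfer can keep several modules busy at the same time. */
    for (uint32_t addr = 0; addr < 8; addr++) {
        unsigned module = addr & ((1u << K) - 1); /* low-order k bits   */
        unsigned word   = addr >> K;              /* word within module */
        printf("address %u -> module %u, word %u\n", addr, module, word);
    }
    return 0;
}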

HIT RATE AND MISS PENALTY

An excellent indicator of the effectiveness of a particular implementation of the
memory hierarchy is the success rate in accessing information at various levels of the
hierarchy. Recall that a successful access to data in a cache is called a hit. The number of
hits stated as a fraction of all attempted accesses is called the hit rate, and the miss rate
is the number of misses stated as a fraction of attempted accesses. Ideally, the entire
memory hierarchy would appear to the CPU as a single memory unit that has the access
time of a cache on the CPU chip and the size of a magnetic disk. How close we get to this
ideal depends largely on the hit rate at the different levels of the hierarchy. High hit
rates, over 0.9, are essential for high-performance computers. Performance is adversely
affected by the actions that must be taken after a miss. The extra time needed to bring
the desired information into the cache is called the miss penalty. This penalty is

ultimately reflected in the time that the CPU is stalled because the required instructions
or data are not available for execution. In general, the miss penalty is the time needed to
bring a block of data from a slower unit in the memory hierarchy to a faster unit. The
miss penalty is reduced if efficient mechanisms for transferring data between the various
units of the hierarchy are implemented. The previous section shows how an interleaved
memory can reduce the miss penalty substantially. Consider now the impact of the cache
on the overall performance of the computer. Let h be the hit rate, M the miss penalty,
that is, the time to access information in the main memory, and C the time to access
information in the cache. The average access time experienced by the CPU is hC + (1 – h)
M.
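The formula can be evaluated for a few hit rates with a short C program; the cycle counts C = 1 and M = 17 are illustrative values, not taken from the text.

#include <stdio.h>

int main(void)
{
    double C = 1.0, M = 17.0;               /* cache and memory access times */
    for (int i = 90; i <= 99; i += 3) {
        double h = i / 100.0;               /* hit rate                      */
        double t_avg = h * C + (1.0 - h) * M;
        printf("hit rate %.2f -> average access time %.2f cycles\n", h, t_avg);
    }
    return 0;
}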

CACHES ON PROCESSING CHIPS

When information is transferred between different chips, considerable delays are
introduced in the driver and receiver gates on the chips. Thus, from the speed point of view,
the optimal place for a cache is on the CPU chip. Unfortunately, space on the CPU chip is
needed for many other functions; this limits the size of the cache that can be
accommodated. All high-performance processor chips include some form of a cache.
Some manufacturers have chosen to implement two separate caches, one for
instructions and another for data, as in the 68040 and PowerPC 604 processors. Others
have implemented a single cache for both instructions and data, as in the PowerPC 601
processor. A combined cache for instructions and data is likely to have a somewhat
better hit rate, because it offers greater flexibility in mapping new information into the
cache. However, if separate caches are used, it is possible to access both caches at the
same time, which leads to increased parallelism and, hence, better performance. The
disadvantage of separate caches is that the increased parallelism comes at the expense
of more complex circuitry. Since the size of a cache on the CPU chip is limited by space
constraints, a good strategy for designing a high-performance system is to use such a
cache as a primary cache. An external secondary cache, constructed with SRAM chips, is
then added to provide the desired capacity. If both primary and secondary
caches are used, the primary cache should be designed to allow very fast access by the
CPU, because its access time will have a large effect on the clock rate of the CPU. A cache
cannot be accessed at the same speed as a register file, because the cache is much bigger
and hence more complex. A practical way to speed up access in the cache is to access
more than one word simultaneously and then let the CPU use them one at a time. The
secondary cache can be considerably slower, but it should be much larger to ensure a high
hit rate. Its speed is less critical, because it only affects the miss penalty of the primary
cache. A workstation computer may include a primary cache with the capacity of tens of
kilobytes and a secondary cache of several megabytes.

OTHER ENHANCEMENTS
In addition to the main design issues just discussed, several other possibilities
exist for enhancing performance. We discuss three of them in this section.
Write Buffer
When the write-through protocol is used, each write operation results in writing
a new value into the main memory. If the CPU must wait for the memory function to be
completed, as we have assumed until now, then the CPU is slowed down by all write
requests. Yet the CPU typically does not immediately depend on the result of a write
operation, so it is not necessary for the CPU to wait for the write request to be
completed. To improve performance, a write buffer can be included for temporary
storage of write requests. The CPU places each write request into this buffer and
continues execution of the next instruction. The write requests stored in the write buffer
are sent to the main memory whenever the memory is not responding to read requests.
Note that it is important that the read requests be serviced immediately, because the
CPU usually cannot proceed without the data that is to be read from the memory.
Hence, these requests are given priority over write requests. The write buffer may hold a
number of write requests. Thus, it is possible that a subsequent read request may refer
to data that are still in the write buffer. To ensure correct operation, the addresses of
data to be read from the memory are compared with the addresses of the data in the write buffer. In
case of a match, the data in the write buffer are used. This need for address comparison
entails considerable cost. But the cost is justified by improved performance. A different
situation occurs with the write-back protocol. In this case, the write operations are
simply performed on the corresponding word in the cache. But consider what happens
when a new block of data is to be brought into the cache as a result of a read miss, which
replaces an existing block that has some dirty data. The dirty block has to be written into
the main memory. If the required write-back is performed first, then the CPU will have to
wait longer for the new block to be read into the cache. It is more prudent to read the
new block first. This can be arranged by providing a fast write buffer for temporary
storage of the dirty block that is ejected from the cache while the new block is being
read. Afterward, the contents of the buffer are written into the main memory. Thus, the
write buffer also works well for the write-back protocol.
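The address comparison described above can be sketched as follows; the write_buffer_t structure and its 8-entry size are illustrative assumptions, not a description of any particular processor.

#include <stdbool.h>
#include <stdint.h>

#define BUF_SLOTS 8

/* Pending writes wait here while the CPU continues executing. */
typedef struct {
    uint32_t addr[BUF_SLOTS];
    uint32_t data[BUF_SLOTS];
    int      count;
} write_buffer_t;

/* A later read must be checked against the buffered addresses; on a match
   the buffered (newest) data are returned instead of stale memory data. */
bool read_check_buffer(const write_buffer_t *wb, uint32_t addr, uint32_t *out)
{
    for (int i = wb->count - 1; i >= 0; i--)  /* newest entry wins */
        if (wb->addr[i] == addr) {
            *out = wb->data[i];
            return true;                      /* serviced from the buffer */
        }
    return false;                             /* read from main memory    */
}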

Prefetching
In the previous discussion of the cache mechanism, we assumed that new data
are brought into the cache when they are first needed. A read miss occurs, and the
desired data are loaded from the main memory. The CPU has to pause until the new data

arrive, which is the effect of the miss penalty. To avoid stalling the CPU, it is possible to
prefetch the data into the cache before they are needed. The simplest way to do this is
through software. A special prefetch instruction may be provided in the instruction set of
the processor. Executing this instruction causes the addressed data to be loaded into the
cache, as in the case of a read miss. However, the processor does not wait for the
referenced data. A prefetch instruction is inserted in a program to cause the data to be
loaded in the cache by the time they are needed in the program. The hope is that
prefetching will take place while the CPU is busy executing instructions that do not result
in a read miss, thus allowing accesses to the main memory to be overlapped with
computation in the CPU. Prefetch instructions can be inserted into a program either by
the programmer or by the compiler. It is obviously preferable to have the compiler insert
these instructions, which can be done with good success for many applications. Note
that software prefetching entails a certain overhead, because the inclusion of prefetch
instructions increases the length of programs. Moreover, some prefetches may load into
the cache data that will not be used by the processor. This can happen if the prefetched
data are ejected from the cache by a read miss involving other data. However, the
overall effect of software prefetching on performance is positive, and many processors
(including the PowerPC) have machine instructions to support this feature. Prefetching
can also be done through hardware. This involves adding circuitry that attempts to
discover a pattern in memory references, and then prefetches data according to this
pattern.
Lockup-Free Cache
The software prefetching scheme just discussed does not work well if it
interferes significantly with the normal execution of instructions. This is the case if the
action of prefetching stops other accesses to the cache until the prefetch is completed.
A cache of this type is said to be locked while it services a miss. We can solve this
problem by modifying the basic cache structure to allow the CPU to access the cache
while a miss is being serviced. In fact, it is desirable that more than one outstanding miss
can be supported. A cache that can support multiple outstanding misses is called lockup-
free. Such a cache must include circuitry that keeps track of all outstanding misses.
This may be done with special registers that hold the pertinent information about
these misses. Lockup-free caches were first used in the early 1980s in the Cyber series
of computers manufactured by Control Data Corporation.

We have used software prefetching as an obvious motivation for a cache that is not
locked by a read miss. A much more important reason is that, in a processor that uses a
pipelined organization, which overlaps the execution of several instructions, a read miss
caused by one instruction could stall the execution of other instructions. A lockup-free
cache reduces the likelihood of such stalling.

VIRTUAL MEMORY:
Techniques that automatically move program and data blocks into the physical main memory
when they are required for execution is called the VirtualMemory.
The binary address that the processor issues either for instruction or data are called the
virtual / Logical address.
The virtual address is translated into physical address by a combination of hardware and
software components.This kind of address translation is done by MMU(Memory Management
Unit).When the desired data are in the main memory ,these data are fetched /accessed
immediately.If the data are not in the main memory,the MMU causes the Operating system to
bring the data into memory from the disk. Transfer of data between disk and main memory is
performed using DMA scheme.

Fig: Virtual Memory Organisation

Address Translation:
In address translation,all programs and data are composed of fixed length units called Pages.
The Page consists of a block of words that occupy contiguous locations in the main
memory. The pages are commonly range from 2K to 16K bytes in length.

Just as the cache bridges the speed gap between the processor and the main memory,
virtual memory bridges the gap between the main memory and secondary storage, and it
is implemented largely by software techniques. Each virtual address generated by the
processor contains a virtual page number (the high-order bits) and an offset (the
low-order bits); together, the virtual page number and the offset specify the location of
a particular byte (or word) within the page.
Page Table:
It contains the information about the main memory address where the page is stored
& the current status of the page.
Page Frame:
An area in the main memory that holds one page is called the page frame.
Page Table Base Register:
It contains the starting address of the page table.

The virtual page number, added to the contents of the page table base register, gives the
address of the corresponding entry in the page table; that entry gives the starting address
of the page if the page currently resides in the main memory.
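The translation process can be summarized in a small C sketch. The 4K page size and the pte_t layout are illustrative assumptions within the 2K-16K range mentioned above; a real MMU performs this lookup in hardware, aided by the TLB described below.

#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumed page size */

typedef struct {
    uint32_t frame;       /* page frame number in main memory */
    int      valid;       /* control bit: page is resident    */
    int      dirty;       /* control bit: page was modified   */
} pte_t;

/* page_table plays the role of the table whose starting address is held
   in the page table base register; -1 signals a page fault. */
int64_t translate(const pte_t *page_table, uint32_t vaddr)
{
    uint32_t vpn    = vaddr / PAGE_SIZE;  /* virtual page number (high-order bits) */
    uint32_t offset = vaddr % PAGE_SIZE;  /* offset within the page (low-order)    */

    if (!page_table[vpn].valid)
        return -1;                        /* page fault: the OS must load the page */
    return (int64_t)page_table[vpn].frame * PAGE_SIZE + offset;
}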
Control Bits in Page Table:
The control bits specify the status of the page while it is in the main memory.
Function:
The control bits indicate the validity of the page, i.e., whether the page is actually loaded
in the main memory.

They also indicate whether the page has been modified during its residency in the
memory; this information is needed to determine whether the page should be written back to
the disk before it is removed from the main memory to make room for another page.

Fig: Virtual Memory Address Translation

The page table information is used by the MMU for every read and write access. The page
table is placed in the main memory, but a copy of a small portion of the page table is
located within the MMU.

This small portion, or small cache, is called the Translation Lookaside Buffer (TLB). It
consists of the page table entries that correspond to the most recently accessed pages,
together with the virtual address of each entry.

Fig: Use of Associative Mapped TLB

When the operating system changes the contents of the page table, the control bit in the TLB
invalidates the corresponding entry in the TLB. Given a virtual address, the MMU looks
in the TLB for the referenced page. If the page table entry for this page is found in the TLB,
the physical address is obtained immediately. If there is a miss in the TLB, then the required
entry is obtained from the page table in the main memory and the TLB is updated. When a
program generates an access request to a page that is not in the main memory, a page fault
occurs. The whole page must be brought from the disk into memory before the access can
proceed.
When it detects a page fault, the MMU asks the operating system to generate an interrupt.
The operating system suspends the execution of the task that caused the page fault and
begins execution of another task whose pages are in the main memory, because a long delay
occurs while the page transfer takes place. When the task resumes, either the interrupted
instruction must continue from the point of interruption or the instruction must be
restarted.
If a new page is brought from the disk when the main memory is full, it must replace one of
the resident pages. In that case, the LRU algorithm, which removes the least recently
referenced page, is typically used. A modified page has to be written back to the disk before
it is removed from the main memory; that is, a write-back policy is used, since writing every
memory update through to the disk would be far too slow.

MEMORY MANAGEMENT REQUIREMENTS:
Management routines are part of the operating system. The virtual address space in which the
OS routines reside is called the 'system space'. The virtual space in which the user application
programs reside is called the 'user space'. Each user space has a separate page table, and the
MMU uses a page table base register to determine the address of the table to be used in the
translation process. Hence, by changing the contents of this register, the OS can switch from
one space to another.
The processor operates in two states. They are,

User state

Supervisor state.
User State:
In this state, the processor executes user programs.
Supervisor State:
When the processor executes operating system routines, it is in the
supervisor state.
Privileged Instructions:
In the user state, privileged machine instructions cannot be executed. Hence a user program is
prevented from accessing the page tables of other user spaces or of the system space.
The control bits in each page table entry can be set to control the access privileges granted to
each program. One program may be allowed to read/write a given page, while other
programs may be given only read access.
SECONDARY STORAGE:
Semiconductor memories do not provide all the storage capability needed.
Secondary storage devices provide for the larger storage requirements.
Some of the secondary storage devices are,

Magnetic disk

Optical disk

Magnetic tapes.

Magnetic Disk:
A magnetic disk system consists of one or more disks mounted on a common spindle.
A thin magnetic film is deposited on each disk, usually on both sides.
The disks are placed in a rotary drive so that the magnetized surfaces move in close proximity
to the read/write heads. Each head consists of a magnetic yoke and a magnetizing coil.
Digital information can be stored on the magnetic film by applying current pulses of
suitable polarity to the magnetizing coil. Only changes in the magnetic field under the
head can be sensed during the read operation. Therefore, if the binary states 0 and 1 are
represented by two opposite states of magnetization, a voltage is induced in the head
only at 0-1 and 1-0 transitions in the bit stream. A long string of consecutive 0's or
1's can be recognized by using a clock, which is mainly used for synchronization. Phase
encoding, or Manchester encoding, is the technique used to combine the clocking
information with the data. The read/write heads must be maintained at a very small
distance from the moving disk surfaces in order to achieve high bit densities.
When the disks are moving at their steady rate, air pressure develops between the disk
surfaces and the head, and it forces the head away from the surface. The flexible spring
connection between the head and its arm mounting permits the head to fly at the desired
distance away from the surface.
Winchester Technology:
In Winchester technology, the read/write heads are placed in a sealed, air-filtered
enclosure.
In such units, the read/write heads can operate closer to the magnetic track surfaces, because
the dust particles that are a problem in unsealed assemblies are absent.

Merits:
It has a larger capacity for a given physical size. The data integrity is high because the storage
medium is not exposed to contaminating elements. The read/write heads of a disk system are
movable.

The disk system has 3 parts. They are,

Disk platter (usually called the disk)

Disk drive (spins the disk and moves the read/write heads)

Disk controller (controls the operation of the system)

Fig: Organizing & Accessing the data on disk

Each surface is divided into concentric tracks. Each track is divided into sectors.
The set of corresponding tracks on all surfaces of a stack of disks forms a logical cylinder.

The data are accessed by specifying the surface number, the track number, and the
sector number.
Read/Write operations start at sector boundaries. Data bits are stored serially
on each track. Each sector usually contains 512 bytes.
Sector header -> contains identification information. It helps to find the desired
sector on the selected track.
ECC (error checking code) -> used to detect and correct errors. An unformatted disk has
no information on its tracks.
The formatting process divides the disk physically into tracks and sectors, and this
process may discover some defective sectors on all tracks. The disk controller keeps
a record of such defects.
The disk is divided into logical partitions. They are,

Primary partition

Secondary partition

In the diagram, each track has the same number of sectors, so all tracks have the same
storage capacity. Thus the stored information is packed more densely on inner tracks
than on outer tracks.
Access time
There are 2 components involved in the time delay between receiving an address
and the beginning of the actual data transfer. They are,

Seek time

Rotational delay / Latency
Seek time – time required to move the read/write head to the proper track.
Latency – the amount of time that elapses after the head is positioned over the
correct track until the starting position of the addressed sector passes under the
read/write head.

Seek time + Latency = Disk access time
Typical disk
A one-inch disk weighs about 1 ounce, is comparable in size to a matchbook, and has a
capacity of about 1 GB.
A high-capacity disk has the following parameters:
Recording surfaces = 20
Tracks = 15,000 tracks/surface
Sectors = 400 sectors/track
Each sector stores 512 bytes of data
Capacity of formatted disk = 20 x 15000 x 400 x 512 = 61.4 x 10^9 bytes ≈ 60 GB
Seek time = 3 ms
Platter rotation = 10,000 rev/min
Latency = 3 ms
Internal transfer rate = 34 MB/s
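The capacity and access-time arithmetic above can be checked with a few lines of C:

#include <stdio.h>

int main(void)
{
    long long surfaces = 20, tracks = 15000, sectors = 400, bytes = 512;
    long long capacity = surfaces * tracks * sectors * bytes;

    double seek_ms    = 3.0;
    double latency_ms = 0.5 * 60000.0 / 10000.0; /* half a revolution at
                                                    10,000 rev/min = 3 ms */
    printf("formatted capacity = %lld bytes (about %.1f GB)\n",
           capacity, capacity / 1e9);
    printf("disk access time = seek + latency = %.1f ms\n",
           seek_ms + latency_ms);
    return 0;
}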
Data Buffer / Cache
A disk drive that incorporates the required SCSI circuitry is referred to as a SCSI drive.
The SCSI bus can transfer data at a higher rate than the rate at which data can be read
from the disk tracks. An efficient method to deal with this difference in transfer rate
between the disk and the SCSI bus is to include a data buffer. This buffer is a
semiconductor memory.
The data buffer can also provide a cache mechanism for the disk: when a read request
arrives at the disk, the controller first checks whether the data are available in the
cache (buffer). If the data are available in the cache, they can be accessed and placed on
the SCSI bus immediately. If not, the data are retrieved from the disk itself.
Disk Controller
The disk controller acts as interface between disk drive and system bus.
The disk controller uses DMA scheme to transfer data between disk and main memory.
When the OS initiates a transfer by issuing a Read/Write request, the
controller's registers are loaded with the following information:

Main memory address (the address of the first main memory location of the block of
words involved in the transfer)

Disk address (the location of the sector containing the beginning of the desired block
of words)

Word count (the number of words in the block to be transferred)

Part-A
2 MARKS QUESTIONS WITH ANSWERS
1. What is General purpose Register?
Ans: Fifteen general-purpose registers are visible at any one time, depending on the current
processor mode. These are R0-R12, SP, and LR. The PC (R15) is not considered a general-
purpose register. SP (or R13) is the stack pointer.
2. What is Accumulator?
Ans: In a computer's central processing unit (CPU), an accumulator is a register in which
intermediate arithmetic and logic results are stored.
3. What is Micro operation?
Ans: Usually, micro-operations perform basic operations on data stored in one or more
registers, including transferring data between registers or between registers and external
buses of the central processing unit (CPU), and performing arithmetic or logical
operations on registers.
4. What is Register Transfer language?
Ans: In computer science, register transfer language (RTL) is a kind of intermediate
representation (IR) that is very close to assembly language, such as that which is used in a
compiler. Academic papers and textbooks also often use a form of RTL as an architecture-
neutral assembly language.
5. List some examples for Register Transfer Operations?
Ans: Example Description
R3 ← R1 + R2 Addition
R3 ← R1 - R2 (R1 + R2' + 1) Subtraction
R2 ← R2' Complement (really a logic operation)
R2 ← -R2 (R2' + 1) Negation
R1 ← R1 + 1 Increment
R1 ← R1 - 1 Decrement
6. What is Control Memory?
Ans: Control memory is a random access memory (RAM) consisting of addressable storage
registers. It is primarily used in mini and mainframe computers. It is used as a temporary
storage for data. Access to control memory data requires less time than to main memory; this
speeds up CPU operation by reducing the number of memory references for data storage and
retrieval. Access is performed as part of a control section sequence while the master clock
oscillator is running.
7. What are Logical Micro operations?
Ans: Logic micro-operations specify binary operations on strings of bits stored in
registers. Each bit of a register is treated as a separate binary variable; in
other words, each individual bit of a register operates on the corresponding bit of
another register. Various logical operations such as OR, AND, and NAND can be performed.
Special symbols are used for the OR and AND operators: v and ^, respectively.
8. What is Control unit?
Ans: The control unit (CU) is a component of a computer's central processing unit (CPU)
that directs operation of the processor. It tells the computer's memory,
arithmetic/logic unit and input and output devices how to respond to a program's
instructions.
9. What is Control Word?
Ans: It is a word stored in a register (control register) used to control the operation of a
programmable digital device.
10. List some examples for Arithmetic Micro operations?
Ans: Arithmetic Micro Operations
The basic arithmetic micro operations are
1. Addition
2. Subtraction
3. Increment
4. Decrement
The additional arithmetic micro operations are
1. Add with carry
2. Subtract with borrow
3. Transfer/Load

11. What are the types of shift Micro operations?


Ans: Shift micro-operations
Shift micro-operations are operations in which the contents of a register are shifted
to the left or right. Shift micro-operations are used for serial transfer of data. They can also
be used in conjunction with arithmetic, logic, and other data-processing operations. There
are three types of shift operations:
1. Logical shift
2. Circular shift
3. Arithmetic shift
12. What is the need of Control Signals?
Ans: A control signal is a pulse or frequency of electricity or light that represents
a control command as it travels over a network, a computer channel, or a wireless link.

IMPORTANT QUESTIONS
2 MARKS
1. Give a note on control address register? [Jun 2015]
2. Write micro operations for ADD R1, R2? [Dec 2015]
3. What is the function of Control Memory? [Dec 2015]
4. How does high impedance state behaves in bus system?[Dec 2015]
5. Explain the need of bootstrap loader? [III-Dec 2015]
Part-B
10 MARKS
1. (a) Define register transfer language? Explain the instructions with example?[Dec 2015]
(b) With a neat diagram explain 4-bit binary incrementer?
2. Draw and explain the micro program sequencer for the Control memory?[Jun 2015]
3. Write control sequence for fetching a word from memory with neat diagram? [Dec 2015]
4. (a) Draw and explain how to construct a bus line using three state buffers? [Jun 2015,
Dec 2015]
(b) How to multiply and divide a number by 2 using shift operations with example?
5. (a) Explain address sequencing capabilities required for the Control memory?[Dec 2015]
(b) Explain how to construct a common bus for 4 registers of n-bits each using 3 state
buffers?
6. Describe micro instruction sequencing with neat block diagram? [Dec 2015]

UNIT-4
INPUT/OUTPUT ORGANIZATION
ACCESSING I/O DEVICES
A simple arrangement to connect I/O devices to a computer is to use a single
bus arrangement, as shown in Figure 4.1. The bus enables all the devices connected to
it to exchange information. Typically, it consists of three sets of lines used to carry
address, data, and control signals. Each I/O device is assigned a unique set of
addresses. When the processor places a particular address on the address lines, the
device that recognizes this address responds to the commands issued on the control
lines. The processor requests either a read or a write operation, and the requested
data are transferred over the data lines. When I/O devices and the memory share
the same address space, the arrangement is called memory-mapped I/O.

Figure 4.1: A single-bus structure

With memory-mapped I/O, any machine instruction that accesses memory can
be used to transfer data to or from an I/O device. For example, if DATAIN is the
address of the input buffer associated with the keyboard, the instruction
Move DATAIN, R0
reads the data from DATAIN and stores them into processor register R0. Similarly,
the instruction
Move R0, DATAOUT

sends the contents of register R0 to location DATAOUT, which may be the output
data buffer of a display unit or a printer.
Most computer systems use memory-mapped I/O. Some processors have
special In and Out instructions to perform I/O transfers. For example, processors in the
Intel family have special I/O instructions and a separate 16-bit address space for I/O
devices. When building a computer system based on these processors, the designer
has the option of connecting I/O devices to use the special I/O address space or
simply incorporating them as part of the memory address space.

The latter approach is by far the most common as it leads to simpler software.
One advantage of a separate I/O address space is that I/O devices deal with fewer
address lines. Note that a separate I/O address space does not necessarily mean that
the I/O address lines are physically separate from the memory address lines. A
special signal on the bus indicates that the requested read or write transfer is an
I/O operation. When this signal is asserted, the memory unit ignores the requested
transfer. The I/O devices examine the low-order bits of the address bus to determine
whether they should respond.

Figure 4.2 illustrates the hardware required to connect an I/O device to the
bus. The address decoder enables the device to recognize its address when this
address appears on the address lines. The data register holds the data being
transferred to or from the processor. The status register contains information relevant
to the operation of the I/O device. Both the data and status registers are connected to
the data bus and assigned unique addresses. The address decoder, the data and status
registers, and the control circuitry required to coordinate I/O transfers constitute the
device's interface circuit.

I/O devices operate at speeds that are vastly different from that of the
processor. When a human operator is entering characters at a keyboard, the
processor is capable of executing millions of instructions between successive
character entries. An instruction that reads a character from the keyboard
should be executed only when a character is available in the input buffer of
the keyboard interface. Also, we must make sure that an input character is read
only once.

Figure 4.2: I/O interface for an input device

This example illustrates program-controlled I/O, in which the
processor repeatedly checks a status flag to achieve the required
synchronization between the processor and an input or output device. We say
that the processor polls the device. There are two other commonly used
mechanisms for implementing I/O operations: interrupts and direct memory
access. In the case of interrupts, synchronization is achieved by having the I/O
device send a special signal over the bus whenever it is ready for a data
transfer operation. Direct memory access is a technique used for high-speed I/O
devices. It involves having the device interface transfer data directly to or from
the memory, without continuous involvement by the processor.
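In a high-level language, program-controlled I/O over memory-mapped registers is usually written with volatile pointers. The register addresses and the SIN bit position below are hypothetical stand-ins chosen for illustration; real values come from the machine's memory map.

#include <stdint.h>

#define DATAIN  ((volatile uint8_t *)0x4000) /* keyboard input buffer (assumed)  */
#define STATUS  ((volatile uint8_t *)0x4004) /* device status register (assumed) */
#define SIN     0x01                         /* "input ready" status flag        */

uint8_t read_char(void)
{
    while ((*STATUS & SIN) == 0)
        ;                       /* poll: wait until a character is available */
    return *DATAIN;             /* read the character exactly once           */
}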

INTERRUPTS

When a program enters a wait loop, it repeatedly checks the device status, and during
this period the processor performs no useful work.
Instead, the device can alert the processor when it is ready: the interrupt-request line
carries a hardware signal, called the interrupt signal, to the processor.
On receiving this signal, the processor can perform a useful function during what would
otherwise be a waiting period.
The routine executed in response to an interrupt request is called the
interrupt-service routine, which is the PRINT routine in our example. Interrupts bear
considerable resemblance to subroutine calls. Assume that an interrupt
request arrives during execution of instruction i, as shown in figure 1.


Figure 1. Transfer of control through the use of interrupts
The processor first completes execution of instruction i. Then, it
loads the program counter with the address of the first instruction of the interrupt-
service routine. For the time being, let us assume that this address is hardwired in the
processor. After execution of the interrupt-service routine, the processor has to come
back to instruction i + 1. Therefore, when an interrupt occurs, the current contents of
the PC, which point to instruction i + 1, must be put in temporary storage in a known
location. A Return-from-interrupt instruction at the end of the interrupt-service routine
reloads the PC from the temporary storage location, causing execution to resume at
instruction i + 1. In many processors, the return address is saved on the processor
stack.
We should note that as part of handling interrupts, the processor must inform
the device that its request has been recognized so that it may remove its interrupt-
request signal. This may be accomplished by means of a special control signal on the bus,
called an interrupt-acknowledge signal. Alternatively, the execution of an instruction in
the interrupt-service routine that accesses a status or data register in the device
interface implicitly informs that device that its interrupt request has been recognized.

So far, the treatment of an interrupt-service routine is very similar to that of
a subroutine. An important departure from this similarity should be noted. A
subroutine performs a function required by the program from which it is called.
However, the interrupt-service routine may not have anything in common with
the program being executed at the time the interrupt request is received. In fact,
the two programs often belong to different users. Therefore, before starting
execution of the interrupt-service routine, any information that may be altered during
the execution of that routine must be saved. This information must be restored
before execution of the interrupt program is resumed. In this way, the original
program can continue execution without being affected in any way by the
interruption, except for the time delay. The information that needs to be saved and
restored typically includes the condition code flags and the contents of any registers
used by both the interrupted program and the interrupt-service routine.
The task of saving and restoring information can be done automatically by
the processor or by program instructions. Most modern processors save only the
minimum amount of information needed to maintain the integrity of program
execution, because saving and restoring registers involves memory transfers that
increase the total execution time, and hence represent execution overhead. Saving
registers also increases the delay between the time an interrupt request is received
and the start of execution of the interrupt-service routine. This delay is called
interrupt latency.

INTERRUPT HARDWARE:
We pointed out that an I/O device requests an interrupt by activating a
bus line called interrupt-request. Most computers are likely to have several I/O
devices that can request an interrupt. A single interrupt-request line may be used
to serve n devices.

All devices are connected to the line via switches to ground. To request
an interrupt, a device closes its associated switch. Thus, if all interrupt-request
signals INTR1 to INTRn are inactive, that is, if all switches are open, the voltage on
the interrupt-request line will be equal to Vdd. This is the inactive state of the line.
Since the closing of one or more switches will cause the line voltage to drop to 0,
the value of INTR is the logical OR of the requests from individual devices, that is,

INTR = INTR1 + … + INTRn

It is customary to use the complemented form, INTR, to name the interrupt-request signal
on the common line, because this signal is active when in the low-voltage state.

ENABLING AND DISABLING INTERRUPTS:


The facilities provided in a computer must give the programmer complete
control over the events that take place during program execution. The arrival of
an interrupt request from an external device causes the processor to suspend the
execution of one program and start the execution of another. Because interrupts
can arrive at any time, they may alter the sequence of events from that envisaged by
the programmer. Hence, the interruption of program execution must be carefully
controlled.

Let us consider in detail the specific case of a single interrupt request from
one device. When a device activates the interrupt-request signal, it keeps this signal
activated until it learns that the processor has accepted its request. This means
that the interrupt- request signal will be active during execution of the interrupt-
service routine, perhaps until an instruction is reached that accesses the device in
question.
The first possibility is to have the processor hardware ignore the interrupt-
request line until the execution of the first instruction of the interrupt-service
routine has been completed. Then, by using an Interrupt-disable instruction as the
first instruction in the interrupt-service routine, the programmer can ensure that
no further interruptions will occur until an Interrupt-enable instruction is executed.
Typically, the Interrupt-enable instruction will be the last instruction in the interrupt-
service routine before the Return-from-interrupt instruction. The processor must
guarantee that execution of the Return-from-interrupt instruction is completed
before further interruption can occur.
The second option, which is suitable for a simple processor with only one
interrupt-request line, is to have the processor automatically disable interrupts
before starting the execution of the interrupt-service routine. After saving the
contents of the PC
and the processor status register (PS) on the stack, the processor performs the
equivalent of executing an Interrupt-disable instruction. It is often the case that
one bit in the PS register, called Interrupt-enable, indicates whether interrupts are
enabled.
In the third option, the processor has a special interrupt-request line for which
the interrupt-handling circuit responds only to the leading edge of the signal. Such
a line is said to be edge-triggered.
Before proceeding to study more complex aspects of interrupts, let us
summarize the sequence of events involved in handling an interrupt request from a
single device.

Assuming that interrupts are enabled, the following is a typical scenario.


1. The device raises an interrupt request.

2. The processor interrupts the program currently being executed.

3. Interrupts are disabled by changing the control bits in the PS (except in the case
of edge-triggered interrupts).

4. The device is informed that its request has been recognized, and in
response, it deactivates the interrupt-request signal.

5. The action requested by the interrupt is performed by the interrupt-service


routine.

6. Interrupts are enabled and execution of the interrupted program is resumed.
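These six steps can be summarized in a schematic C interrupt-service routine. The helper functions are hypothetical, and on a real processor several of the steps (saving the PC and PS, re-enabling interrupts on Return-from-interrupt) are performed by hardware rather than by code.

extern void save_state(void);     /* save registers shared with the program  */
extern void restore_state(void);  /* restore them before resuming            */
extern void device_ack(void);     /* step 4: device drops its request signal */
extern void service_device(void); /* step 5: the action the device asked for */

void interrupt_service_routine(void)
{
    save_state();
    device_ack();
    service_device();
    restore_state();   /* the program resumes unaffected, except for delay */
}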

HANDLING MULTIPLE DEVICES:


Let us now consider the situation where a number of devices capable of
initiating interrupts are connected to the processor. Because these
devices are operationally independent, there is no definite order in which they
will generate interrupts. For example, device X may request an interrupt while an
interrupt caused by device Y is being serviced, or several devices may request
interrupts at exactly the same time. This gives rise to a number of questions:

1. How can the processor recognize the device requesting an interrupt?


2. Given that different devices are likely to require different interrupt-
service routines, how can the processor obtain the starting address of the
appropriate routine in each case?
3. Should a device be allowed to interrupt the processor while another
interrupt is being serviced?
4. How should two or more simultaneous interrupt requests be handled?

The means by which these problems are resolved vary from one computer to
another, and the approach taken is an important consideration in determining the
computer’s suitability for a given application.
When a request is received over the common interrupt-request line, additional
information is needed to identify the particular device that activated the line.
The information needed to determine whether a device is requesting an
interrupt is available in its status register. When a device raises an interrupt request,
it sets to 1 one of the bits in its status register, which we will call the IRQ bit. For
example, bits KIRQ and DIRQ are the interrupt request bits for the keyboard and
the display, respectively. The simplest way to identify the interrupting device is to
have the interrupt-service routine poll all the I/O devices connected to the bus. The
first device encountered with its IRQ bit set is the device that should be serviced.
An appropriate subroutine is called to provide the requested service.
The polling scheme is easy to implement. Its main disadvantage is the time
spent interrogating the IRQ bits of all the devices that may not be requesting any
service. An alternative approach is to use vectored interrupts, which we describe
next.

Vectored Interrupts:-
To reduce the time involved in the polling process, a device requesting an
interrupt may identify itself directly to the processor. Then, the processor
can immediately start executing the corresponding interrupt-service routine.
The term vectored interrupts refers to all interrupt-handling schemes based on this
approach.
A device requesting an interrupt can identify itself by sending a special
code to the processor over the bus. This enables the processor to identify individual
devices even if they share a single interrupt-request line. The code supplied by
the device may represent the starting address of the interrupt-service routine for
that device. The code length is typically in the range of 4 to 8 bits. The remainder of
the address is supplied by the processor based on the area in its memory where the
addresses for interrupt-service routines are located.

This arrangement implies that the interrupt-service routine for a given
device must always start at the same location. The programmer can gain some
flexibility by storing in this location an instruction that causes a branch to the
appropriate routine.

Interrupt Nesting: -

Interrupts should be disabled during the execution of an interrupt-service
routine, to ensure that a request from one device will not cause more than one
interruption. The same arrangement is often used when several devices are
involved, in which case execution of a given interrupt-service routine, once
started, always continues to completion before the processor accepts an
interrupt request from a second device. Interrupt-service routines are typically
short, and the delay they may cause is acceptable for most simple devices.

For some devices, however, a long delay in responding to an interrupt
request may lead to erroneous operation. Consider, for example, a computer that
keeps track of the time of day using a real-time clock. This is a device that sends
interrupt requests to the processor at regular intervals. For each of these requests,
the processor executes a short interrupt-service routine to increment a set of
counters in the memory that keep track of time in seconds, minutes, and so on.

Proper operation requires that the delay in responding to an interrupt
request from the real-time clock be small in comparison with the interval between
two successive requests. To ensure that this requirement is satisfied in the presence
of other interrupting devices, it may be necessary to accept an interrupt request
from the clock during the execution of an interrupt-service routine for another
device.
This example suggests that I/O devices should be organized in a priority
structure. An interrupt request from a high-priority device should be accepted while the
processor is servicing another request from a lower-priority device.

A multiple-level priority organization means that during execution of
an interrupt-service routine, interrupt requests will be accepted from some devices
but not from others, depending upon the device’s priority. To implement this
scheme, we can assign a priority level to the processor that can be changed under
program control. The priority level of the processor is the priority of the program
that is currently being executed. The processor accepts interrupts only from
devices that have priorities higher than its own.

The processor’s priority is usually encoded in a few bits of the processor
status word. It can be changed by program instructions that write into the PS.
These are privileged instructions, which can be executed only while the processor is
running in the supervisor mode. The processor is in the supervisor mode only when
executing operating system routines. It switches to the user mode before
beginning to execute application programs. Thus, a user program cannot
accidentally, or intentionally, change the priority of the processor and disrupt the
system’s operation. An attempt to execute a privileged instruction while in the user
mode leads to a special type of interrupt called a privilege exception.

A multiple-priority scheme can be implemented easily by using separate interrupt-
request and interrupt-acknowledge lines for each device, as shown in figure 2. Each of
the interrupt-request lines is assigned a different priority level. Interrupt requests
received over these lines are sent to a priority arbitration circuit in the processor. A
request is accepted only if it has a higher priority level than that currently assigned to
the processor.

Figure 2: Implementation of interrupt priority using individual interrupt-request
and interrupt-acknowledge lines.

Simultaneous Requests:-

Let us now consider the problem of simultaneous arrival of interrupt
requests from two or more devices. The processor must have some means of
deciding which request to service first. Using a priority scheme such as that of
figure 2, the solution is straightforward: the processor simply accepts the request
having the highest priority.

Polling the status registers of the I/O devices is the simplest such
mechanism. In this case, priority is determined by the order in which the devices
are polled. When vectored interrupts are used, we must ensure that only one
device is selected to send its
interrupt vector code. A widely used scheme is to connect the devices to form a
daisy chain, as shown in figure 3a. The interrupt-request line INTR is common to all
devices. The interrupt-acknowledge line, INTA, is connected in a daisy-chain fashion,
such that the INTA signal propagates serially through the devices.

When several devices raise an interrupt request and the INTR line is activated, the
processor responds by setting the INTA line to 1. This signal is received by device 1.
Device 1 passes the signal on to device 2 only if it does not require any service.
If device 1 has a pending request for interrupt, it blocks the INTA signal and proceeds
to put its identifying code on the data lines. Therefore, in the daisy-chain
arrangement, the device that is electrically closest to the processor has the
highest priority. The second device along the chain has second highest priority,
and so on.
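
The serial propagation of INTA can be sketched as a simple scan from the device closest to the processor: the first device with a pending request absorbs the acknowledge and places its vector code on the data lines. The array and function names below are illustrative.

/* Sketch of daisy-chain interrupt acknowledgement.
   Device 0 is electrically closest to the processor. */
#include <stdio.h>

#define NDEV 4
int pending[NDEV] = {0, 1, 0, 1};   /* devices 1 and 3 request service */

int propagate_inta(void) {          /* returns the device that wins    */
    for (int i = 0; i < NDEV; i++) {
        if (pending[i])
            return i;   /* blocks INTA and drives its vector code      */
        /* otherwise the device passes INTA to the next in the chain   */
    }
    return -1;                      /* no device requested service     */
}

int main(void) {
    printf("device %d is acknowledged\n", propagate_inta());  /* device 1 */
    return 0;
}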

The scheme in figure 3.a requires considerably fewer wires than the
individual connections in figure 2. The main advantage of the scheme in figure 2 is
that it allows the processor to accept interrupt requests from some devices but not
from others, depending upon their priorities. The two schemes may be
combined to produce the more general structure in figure 3b. Devices are
organized in groups, and each group is connected at a different priority level.
Within a group, devices are connected in a daisy chain. This organization is used
in many computer systems.

CONTROLLING DEVICE REQUESTS:
Until now, we have assumed that an I/O device interface generates an
interrupt request whenever it is ready for an I/O transfer, for example whenever
the SIN flag is 1. It is important to ensure that interrupt requests are generated
only by those I/O devices that are being used by a given program. Idle devices
must not be allowed to generate interrupt requests, even though they may
be ready to participate in I/O transfer operations. Hence, we need a
mechanism in the interface circuits of individual devices to control whether a device
is allowed to generate an interrupt request.

The control needed is usually provided in the form of an interrupt-enable bit in the device's interface circuit. The keyboard interrupt-enable, KEN, and display
interrupt- enable, DEN, flags in register CONTROL perform this function. If either
of these flags is set, the interface circuit generates an interrupt request
whenever the corresponding status flag in register STATUS is set. At the same time,
the interface circuit sets bit KIRQ or DIRQ to indicate that the keyboard or
display unit, respectively, is requesting an interrupt. If an interrupt-enable bit is
equal to 0, the interface circuit will not generate an interrupt request, regardless of
the state of the status flag.

To summarize, there are two independent mechanisms for controlling interrupt requests. At the device end, an interrupt-enable bit in a control
register determines whether the device is allowed to generate an interrupt
request. At the processor end, either an interrupt enable bit in the PS register or a
priority structure determines whether a given interrupt request will be accepted.
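
The combination of the two mechanisms reduces to a simple AND of an enable bit and a status bit. Below is a sketch using the KEN, DEN, SIN, and SOUT flags named above; the bit positions chosen for them here are assumptions for illustration only.

/* Sketch of interrupt-request generation in a device interface. */
#include <stdio.h>

#define KEN  (1u << 0)   /* keyboard interrupt enable, in CONTROL */
#define DEN  (1u << 1)   /* display interrupt enable, in CONTROL  */
#define SIN  (1u << 0)   /* keyboard character ready, in STATUS   */
#define SOUT (1u << 1)   /* display ready for data, in STATUS     */

int main(void) {
    unsigned control = KEN;          /* only keyboard interrupts enabled */
    unsigned status  = SIN | SOUT;   /* both devices happen to be ready  */
    int kirq = (control & KEN) && (status & SIN);   /* request raised    */
    int dirq = (control & DEN) && (status & SOUT);  /* suppressed: DEN=0 */
    printf("KIRQ=%d DIRQ=%d\n", kirq, dirq);  /* prints KIRQ=1 DIRQ=0 */
    return 0;
}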

EXCEPTIONS:
An interrupt is an event that causes the execution of one program to be
suspended and the execution of another program to begin. So far, we have dealt only
with interrupts caused by requests received during I/O data transfers. However,
the interrupt mechanism is used in a number of other situations.

The term exception is often used to refer to any event that causes an
interruption. Hence, I/O interrupts are one example of an exception. We now
describe a few other kinds of exceptions.

Recovery from Errors:
Computers use a variety of techniques to ensure that all hardware
components are operating properly. For example, many computers include an error-
checking code in the main memory, which allows detection of errors in the stored
data. If an error occurs, the control hardware detects it and informs the processor by raising an interrupt.

The processor may also interrupt a program if it detects an error or an unusual condition while executing the instructions of this program. For example, the OP-
code field of an instruction may not correspond to any legal instruction, or an
arithmetic instruction may attempt a division by zero.

When exception processing is initiated as a result of such errors, the processor proceeds in exactly the same manner as in the case of an I/O interrupt
request. It suspends the program being executed and starts an exception-service
routine. This routine takes appropriate action to recover from the error, if
possible, or to inform the user about it. Recall that in the case of an I/O
interrupt, the processor completes execution of the instruction in progress before
accepting the interrupt. However, when an interrupt is caused by an error,
execution of the interrupted instruction cannot usually be completed, and the
processor begins exception processing immediately.

Debugging:
Another important type of exception is used as an aid in debugging
programs. System software usually includes a program called a debugger,
which helps the programmer find errors in a program. The debugger uses
exceptions to provide two important facilities called trace and breakpoints.

When a processor is operating in the trace mode, an exception occurs after execution of every instruction, using the debugging program as the exception-
service routine. The debugging program enables the user to examine the contents
of registers, memory locations, and so on. On return from the debugging program,
the next instruction in the program being debugged is executed, then the debugging
program is activated again. The trace exception is disabled during the execution of the
debugging program.

Breakpoint provides a similar facility, except that the program being
debugged is interrupted only at specific points selected by the user. An
instruction called Trap or Software-interrupt is usually provided for this
purpose. Execution of this instruction results in exactly the same actions as
when a hardware interrupt request is received. While debugging a program, the
user may wish to interrupt program execution after instruction i. The debugging
routine saves instruction i+1 and replaces it with a software interrupt instruction.
When the program is executed and reaches that point, it is interrupted and
the debugging routine is activated. This gives the user a chance to examine
memory and register contents. When the user is ready to continue executing the
program being debugged, the debugging routine restores the saved instruction
at location i+1 and executes a Return-from-interrupt instruction.

Privilege Exception:
To protect the operating system of a computer from being corrupted by user
programs, certain instructions can be executed only while the processor is in
supervisor mode. These are called privileged instructions. For example, when
the processor is running in the user mode, it will not execute an instruction that
changes the priority level of the processor or that enables a user program to
access areas in the computer memory that have been allocated to other users.
An attempt to execute such an instruction will produce a privilege exception,
causing the processor to switch to the supervisor mode and begin executing an
appropriate routine in the operating system.

DIRECT MEMORY ACCESS:

A special control unit may be provided to allow the transfer of a large block of data at high speed directly between an external device and the main memory, without continuous intervention by the processor. This approach is called DMA (Direct Memory Access).
DMA transfers are performed by a control circuit called the DMA controller. To initiate the transfer of a block of words, the processor sends the controller:

• the starting address,
• the number of words in the block, and
• the direction of transfer.

As the block of data is transferred, the DMA controller increments the memory address for successive words and keeps track of the number of words remaining; when the entire block has been transferred, it informs the processor by raising an interrupt signal.
While the DMA transfer is taking place, the program that requested the transfer cannot continue, but the processor can be used to execute another program.
After the DMA transfer is completed, the processor returns to the program that requested the transfer.

R/W determines the direction of transfer. When R/W = 1, the DMA controller reads data from the memory to the I/O device; when R/W = 0, it performs a Write operation into the memory.

When Done = 1, the controller has completed transferring a block of data and is ready to receive another command.

IE = 1 (Interrupt Enable) causes the controller to raise an interrupt after it has completed transferring the block of data.

IRQ = 1 indicates that the controller has requested an interrupt.

Fig: Registers in a DMA interface: a 32-bit status and control register (IRQ in bit 31, Done in bit 30, IE in bit 1, R/W in bit 0), a starting-address register, and a word-count register.

Fig: Use of DMA controllers in a computer system

A DMA controller connects a high-speed network to the computer bus. The disk controller, which controls two disks, also has DMA capability; it provides two DMA channels.
To start a DMA transfer of a block of data from the main memory to one of the disks, the program writes the address and word-count information into the registers of the corresponding channel of the disk controller.
When the DMA transfer is completed, this is recorded in the status and control register of the DMA channel, i.e., the Done bit is set, and with interrupts enabled (IE = 1) the IRQ bit is set to 1 as well.
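
The register layout above suggests how a program sets up one channel. The following sketch mirrors the figure's flag positions (IRQ in bit 31, Done in bit 30, IE in bit 1, R/W in bit 0); the struct and function names are illustrative, not a real device driver.

/* Sketch of programming a DMA channel per the register figure above. */
#include <stdint.h>
#include <stdio.h>

struct dma_channel {
    uint32_t status_control;     /* IRQ, Done, IE and R/W flags */
    uint32_t starting_address;
    uint32_t word_count;
};

#define DMA_IRQ  (1u << 31)
#define DMA_DONE (1u << 30)
#define DMA_IE   (1u << 1)
#define DMA_RW   (1u << 0)       /* 1 = read from memory to the device */

void start_transfer(struct dma_channel *ch, uint32_t addr, uint32_t nwords) {
    ch->starting_address = addr;
    ch->word_count = nwords;
    ch->status_control = DMA_IE | DMA_RW;   /* interrupt when finished */
}

int main(void) {
    struct dma_channel ch = {0};
    start_transfer(&ch, 0x1000, 128);
    ch.status_control |= DMA_DONE | DMA_IRQ; /* set by the controller later */
    printf("done=%d\n", (ch.status_control & DMA_DONE) != 0);
    return 0;
}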

Cycle Stealing:
Requests by DMA devices for using the bus have higher priority than processor requests. Among DMA requests, top priority is given to high-speed peripherals such as a disk, a high-speed network interface, or a graphics display device.
Since the processor originates most memory access cycles, the DMA controller can be said to steal memory cycles from the processor. This interleaving technique is called cycle stealing.
Burst Mode:
The DMA controller may be given exclusive access to the main memory to transfer a block of data without interruption. This is known as burst (or block) mode.

Bus Master:
The device that is allowed to initiate data transfers on the bus at any given time is
called the bus master.

Bus Arbitration:
It is the process by which the next device to become the bus master is selected and the
bus mastership is transferred to it.
Types:
There are 2 approaches to bus arbitration. They are,

Centralized arbitration ( A single bus arbiter performs arbitration)


Distributed arbitration (all devices participate in the selection of next bus master).

Centralized Arbitration:

Here the processor is the bus master, and it may grant bus mastership to one of the DMA controllers.
A DMA controller indicates that it needs to become the bus master by activating the Bus Request line (BR), which is an open-drain line. The signal on BR is the logical OR of the bus requests from all devices connected to it.
When BR is activated, the processor activates the Bus Grant signal (BG1), indicating to the DMA controllers that they may use the bus when it becomes free. This signal is connected to all devices using a daisy-chain arrangement. If a DMA controller requests the bus, it blocks the propagation of the grant signal to other devices, and it indicates to all devices that it is using the bus by activating an open-collector line, Bus Busy (BBSY).

Fig:A simple arrangement for bus arbitration using a daisy chain

Fig: Sequence of signals during transfer of bus mastership for the devices

The timing diagram shows the sequence of events for the devices connected to the processor. DMA controller 2 requests and acquires bus mastership and later releases the bus. During its tenure as bus master, it may perform one or more data transfers. After it releases the bus, the processor resumes bus mastership.
Distributed Arbitration:
It means that all devices waiting to use the bus have equal responsibility in carrying out
the arbitration process.

Fig: A distributed arbitration scheme


Each device on the bus is assigned a 4-bit ID.
When one or more devices request the bus, they assert the Start-Arbitration signal and place their 4-bit ID numbers on four open-collector lines, ARB0 through ARB3.
A winner is selected as a result of the interaction among the signals transmitted over these lines. The net outcome is that the code on the four lines represents the request that has the highest ID number.
The drivers are of the open-collector type. Hence, if the input to one driver is equal to 1, the bus line is in the low-voltage state, even if the input to another driver connected to the same line is equal to 0.
Example: Assume two devices A and B have IDs 5 (0101) and 6 (0110); the combined code on the arbitration lines is then 0111. Each device compares the pattern on the arbitration lines to its own ID, starting from the MSB. If it detects a difference at any bit position, it disables its drivers at that bit position and all lower-order positions, by placing 0 at the inputs of these drivers. In our example, A detects a difference on line ARB1; hence, it disables its drivers on lines ARB1 and ARB0. This causes the pattern on the arbitration lines to change to 0110, which means that B has won the contention.
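
The bit-by-bit contest can be modeled in a few lines. In the sketch below, the open-collector wiring is approximated by ORing the ID patterns (active bits shown as 1 for readability) and letting each device drop out at the first bit position where the bus pattern exceeds its own ID; the function names are illustrative.

/* Sketch of distributed arbitration over lines ARB3..ARB0.
   Returns the winning 4-bit ID among the requesters. */
#include <stdio.h>

int arbitrate(const int ids[], int n) {
    int bus = 0, active[16];
    for (int i = 0; i < n; i++) active[i] = 1;
    for (int bit = 3; bit >= 0; bit--) {       /* resolve from the MSB down */
        int line = 0;
        for (int i = 0; i < n; i++)
            if (active[i] && (ids[i] >> bit & 1)) line = 1;
        bus = (bus << 1) | line;
        for (int i = 0; i < n; i++)            /* losers disable drivers */
            if (active[i] && line && !(ids[i] >> bit & 1)) active[i] = 0;
    }
    return bus;
}

int main(void) {
    int ids[] = {5, 6};                         /* A = 0101, B = 0110 */
    printf("winner: %d\n", arbitrate(ids, 2));  /* prints 6 */
    return 0;
}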
Buses

A bus protocol is the set of rules that govern the behavior of the various devices connected to the bus, i.e., when to place information on the bus, when to assert control signals, and so on.
The bus lines used for transferring data are grouped into three types:
• Address lines
• Data lines
• Control lines
Control signals specify whether a read or a write operation is to be performed. They also carry timing information, i.e., they specify the times at which the processor and the I/O devices may place data on, or receive data from, the bus.

During a data transfer operation, one device plays the role of a master. The master initiates the data transfer by issuing read or write commands on the bus; hence, it is also called an initiator. The device addressed by the master is called the slave or target.

Types of Buses:
There are 2 types of buses. They are,
• Synchronous Bus
• Asynchronous Bus.

Synchronous Bus:-
In a synchronous bus, all devices derive timing information from a common clock line. Equally spaced pulses on this line define equal time intervals; each interval constitutes a bus cycle, during which one data transfer can take place.
In the timing diagrams, the crossing points indicate the times at which the signal patterns change. A signal line in an indeterminate or high-impedance state is represented by an intermediate level halfway between the low and high signal levels.

Fig:Timing of an input transfer of a Synchronous bus

At time t0, the master places the address on the address lines and sends an appropriate command on the control lines. In this case, the command will indicate an input operation and specify the length of the operand to be read.
The clock pulse width t1 – t0 must be longer than the maximum propagation delay between devices connected to the bus. It must also be long enough to allow the devices to decode the address and control signals, so that the addressed device can respond at time t1. The slaves take no action and place no data on the bus before t1.
 At the end of the clock cycle, at time t2, the master strobes the data on the data lines into its input buffer if it is a Read operation.
 "Strobe" means to capture the values of the data at a given instant and store them into a buffer.
 When data are to be loaded into a storage buffer register, the data should be available for a period longer than the setup time of the device.
 The width of the pulse t2 – t1 should therefore be longer than:
 the maximum propagation time of the bus, plus
 the setup time of the input buffer register of the master.

Fig:A detailed timing diagram for the input transfer

The picture shows two views of each signal, except the clock: one view shows the signal as seen by the master, and the other as seen by the slave. The master sends the address and command signals on the rising edge at the beginning of the clock period (t0). These signals do not actually appear on the bus until tAM.
Signals do not appear on the bus as soon as they are placed on it, because of the propagation delay in the interface circuits. Signals reach the devices after a propagation delay that depends on the characteristics of the bus. Data must remain on the bus for some time after t2, equal to the hold time of the buffer.

Data transfer has to be completed within one clock cycle: the clock period t2 – t0 must be chosen such that the longest propagation delay on the bus and the slowest device interface are accommodated. This forces all devices to operate at the speed of the slowest device.
The processor simply assumes that the data are available at t2 in the case of a Read operation, or have been read by the device in the case of a Write operation. What if a device has actually failed, and never really responded?
Most buses have control signals to represent a response from the slave.
 Control signals serve two purposes:
 Inform the master that the slave has recognized the address, and is ready to
participate in a data transfer operation.
 They enable the master to adjust the duration of the data transfer operation based on the speed of the participating slaves.
 If a high-frequency bus clock is used, a data transfer spans several clock cycles instead of just one clock cycle as in the earlier case.
Multiple Cycle Transfer:-
During clock cycle 1, the master sends address and command information on the bus, requesting a read operation. The slave receives this information and decodes it. At the next active edge of the clock, i.e., the beginning of clock cycle 2, it decides to respond and begins to access the requested data.
The data become ready and are placed on the bus during clock cycle 3. At the same time, the slave asserts a control signal called Slave-ready. The master, which has been waiting for this signal, strobes the data into its input buffer at the end of clock cycle 3. The bus transfer operation is now complete, and the master may send a new address to start a new transfer in clock cycle 4.
The Slave-ready signal is an acknowledgement from the slave to the master, confirming that valid data have been sent.

Fig:An input transfer using multiple clock cycles

Asynchronous Bus:-

An alternative scheme for controlling data transfers on the bus is based on the use of a handshake between the master and the slave. The common clock is replaced by two timing control lines:
 Master-ready
 Slave-ready

 The Master-ready signal is asserted by the master to indicate to the slave that it is ready to participate in a data transfer.
 The Slave-ready signal is asserted by the slave in response to Master-ready, and it indicates to the master that the slave is ready to participate in a data transfer.
Data transfer using the handshake protocol:
 Master places the address and command information on the bus.
 Asserts the Master-ready signal to indicate to the slaves that the address
and command information has been placed on the bus.
 All devices on the bus decode the address.
 The addressed slave performs the required operation, and informs the processor it has done so by asserting the Slave-ready signal.
 Master removes all the signals from the bus, once Slave-ready is asserted.
 If the operation is a Read operation, Master also strobes the data into its
input buffer.

Fig:Handshake control of data transfer during an input operation

The handshake protocol proceed as follows :
 At t0 The master places the address and command information on the bus
and all devices on the bus begin to decode the information
 At t1 The master sets the Master ready line to 1 to inform the I/O devices
that the address and command information is ready.

The delay t1 – t0 is intended to allow for any skew that may occur on the bus. Skew occurs when two signals transmitted simultaneously from one source arrive at the destination at different times.
Thus, to guarantee that the Master-ready signal does not arrive at any device ahead of the address and command information, the delay t1 – t0 should be larger than the maximum possible bus skew.

 At t2, the selected slave, having decoded the address and command information, performs the required input operation by placing the data from its data register on the data lines. At the same time, it sets the Slave-ready signal to 1.
 At t3, the Slave-ready signal arrives at the master, indicating that the input data are available on the bus.
 At t4, the master removes the address and command information from the bus. The delay between t3 and t4 is again intended to allow for bus skew. Erroneous addressing may take place if the address, as seen by some device on the bus, starts to change while the Master-ready signal is still equal to 1.
 At t5, when the device interface receives the 1-to-0 transition of the Master-ready signal, it removes the data and the Slave-ready signal from the bus. This completes the input transfer.

In the corresponding output-transfer diagram, the master places the output data on the data lines at the same time that it transmits the address and command information. The selected slave strobes the data into its output buffer when it receives the Master-ready signal, and it indicates this by setting the Slave-ready signal to 1. During the intervals t0 to t1 and t3 to t4, the master compensates for bus skew.
A change of state in one signal is followed by a change in the other signal. Hence, this scheme is called a full handshake. It provides a high degree of flexibility and reliability.
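
The sequence of events t0 through t5 can be condensed into a short sketch; the variables standing in for the bus lines are assumptions made for illustration.

/* Sketch of the full-handshake input transfer (events t0..t5). */
#include <stdio.h>

int main(void) {
    int master_ready = 0, slave_ready = 0, data_bus = 0;
    int address_on_bus = 1;      /* t0: master places address and command */
    master_ready = 1;            /* t1: asserted after allowing for skew  */
    if (address_on_bus && master_ready) {
        data_bus = 0x42;         /* t2: selected slave drives its data    */
        slave_ready = 1;
    }
    if (slave_ready) {           /* t3: master sees Slave-ready           */
        printf("master strobes 0x%x\n", data_bus);
        address_on_bus = 0;      /* t4: address and command removed       */
        master_ready = 0;
    }
    if (!master_ready) {         /* t5: slave sees the 1-to-0 transition  */
        data_bus = 0;
        slave_ready = 0;         /* transfer complete                     */
    }
    return 0;
}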

INTERFACE CIRCUITS:

 An I/O interface consists of the circuitry required to connect an I/O device to a computer bus.
 Side of the interface which connects to the computer has bus signals for:
 Address,
 Data
 Control
 Side of the interface which connects to the I/O device has:
 Datapath and associated controls to transfer data between the interface
and the I/O device.
 This side is called a "port".
 Ports can be classified into two:
 Parallel port,
 Serial port.

Parallel Port:
 Parallel port transfers data in the form of a number of bits, normally 8 or 16 to or
from the device.
 Serial port transfers and receives data one bit at a time.
 Processor communicates with the bus in the same way, whether it is a parallel port
or a serial port.
 Conversion from the parallel to serial and vice versa takes place inside the
interface circuit.

Fig:Keyboard is connected to a processor using a parallel port.

Processor is 32-bits and uses memory-mapped I/O and the asynchronous bus protocol.
On the processor side of the interface we have:
- Data lines.
- Address lines
- Control or R/W line.
- Master-ready signal and
- Slave-ready signal.
The output of the encoder consists of the bits that represent the encoded character and one signal called Valid, which indicates that a key has been pressed.
This information is sent to the interface circuit, which contains a data register, DATAIN, and a status flag, SIN.
When a key is pressed, the Valid signal changes from 0 to 1, causing the ASCII code of the character to be loaded into DATAIN and SIN to be set to 1.
The status flag SIN is cleared to 0 when the processor reads the contents of the DATAIN register.
The interface circuit is connected to an asynchronous bus on which transfers are controlled using the handshake signals Master-ready and Slave-ready.
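
From the program's point of view, reading this interface is just a polled load from two memory-mapped locations. In the sketch below the registers are modeled as ordinary variables so that it runs on a host machine; on real hardware they would be fixed bus addresses.

/* Sketch of a program-controlled (polled) keyboard read. */
#include <stdint.h>
#include <stdio.h>

volatile uint8_t STATUS = 0;      /* bit 0 models the SIN flag */
volatile uint8_t DATAIN = 0;      /* character register        */
#define SIN_BIT 0x01

uint8_t read_char(void) {
    while ((STATUS & SIN_BIT) == 0)
        ;                         /* busy-wait until a key is pressed */
    STATUS &= (uint8_t)~SIN_BIT;  /* reading DATAIN clears SIN to 0   */
    return DATAIN;
}

int main(void) {
    DATAIN = 'K';                 /* pretend the interface latched a key */
    STATUS |= SIN_BIT;
    printf("got '%c'\n", read_char());
    return 0;
}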

Fig: Input Interface Circuit


The output lines of DATAIN are connected to the data lines of the bus by means of three-state drivers, which are turned on when the processor issues a read signal and the address selects this register.

The SIN signal is generated using a status flag circuit. It is connected to line D0 of the processor bus using a three-state driver. The address decoder selects the input interface based on bits A1 through A31. Bit A0 determines whether the status register or the data register is to be read when Master-ready is active. In response, the interface activates the Slave-ready signal when either Read-status or Read-data is equal to 1, as selected by line A0.

Fig:Printer is connected to a processor using a parallel port.

Processor is 32 bits, uses memory-mapped I/O and asynchronous bus protocol.


On the processor side:
- Data lines.
- Address lines
- Control or R/W line.
- Master-ready signal and
- Slave-ready signal.

Fig: Output Interface Circuit

The data lines of the processor bus are connected to the DATAOUT register of the interface. The status flag SOUT is connected to data line D1 using a three-state driver, which is turned on when the control line Read-status is 1. The address decoder selects the output interface using address lines A1 through A31. Address line A0 determines whether the data are to be loaded into the DATAOUT register or the status flag is to be read. If the Load-data line is 1, the Valid line is set to 1. If the Idle line is 1, the status flag SOUT is set to 1.

Fig: General 8-bit parallel interface

The data lines to the I/O device are bidirectional: data lines P7 through P0 can be used for both input and output. In fact, some lines can be used for input and some for output, depending on the pattern in the Data Direction Register (DDR). The processor places an 8-bit pattern into the DDR. If a given bit position in the DDR is 1, the corresponding data line acts as an output line; otherwise, it acts as an input line. C1 and C2 control the interaction between the interface circuit and the I/O device. The Ready and Accept lines are the handshake control lines on the processor-bus side and are connected to Master-ready and Slave-ready. The input signal My-address is connected to the output of an address decoder. Three register-select lines allow up to 8 registers in the interface to be selected.
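
The DDR mechanism is per-bit direction multiplexing: each line's value comes either from the output latch or from the external device. A sketch, with assumed names:

/* Sketch of Data Direction Register behavior on an 8-bit port:
   DDR bit = 1 -> line driven from the output latch (output),
   DDR bit = 0 -> line value supplied by the device (input). */
#include <stdint.h>
#include <stdio.h>

uint8_t port_lines(uint8_t ddr, uint8_t out_latch, uint8_t pins_in) {
    return (out_latch & ddr) | (pins_in & (uint8_t)~ddr);
}

int main(void) {
    uint8_t ddr = 0xF0;     /* P7..P4 outputs, P3..P0 inputs     */
    uint8_t out = 0xA5;     /* value written by the processor    */
    uint8_t in  = 0x0C;     /* value presented by the I/O device */
    printf("lines = 0x%02X\n", port_lines(ddr, out, in));  /* 0xAC */
    return 0;
}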

Serial Port:
Serial port is used to connect the processor to I/O devices that require transmission of data
one bit at a time.
Serial port communicates in a bit-serial fashion on the device side and bit parallel fashion
on the bus side.
Transformation between the parallel and serial formats is achieved with shift registers that
have parallel access capability.

The input shift register accepts input one bit at a time from the I/O device. Once all 8 bits are received, the contents of the input shift register are loaded in parallel into the DATAIN register. Similarly, output data in the DATAOUT register are loaded into the output shift register, from which the bits are shifted out and sent to the I/O device one bit at a time. As soon as the 8 bits from the input shift register are loaded into DATAIN, the shift register can start accepting the next 8 bits from the input device. Because the input shift register and DATAIN are separate registers, the shift register can begin receiving a new set of 8 bits before the processor has read the previous contents of DATAIN. This is called double-buffering.
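
The receive path can be sketched in a few lines: bits shift in one at a time, and after eight of them the assembled byte is moved to DATAIN so the shift register is immediately free again. The names below are illustrative.

/* Sketch of a double-buffered serial receive path. */
#include <stdint.h>
#include <stdio.h>

uint8_t shift_reg = 0, datain = 0;
int bit_count = 0, sin_flag = 0;

void receive_bit(int bit) {
    shift_reg = (uint8_t)((shift_reg << 1) | (bit & 1));
    if (++bit_count == 8) {
        datain = shift_reg;   /* parallel load; shifting can resume at once */
        sin_flag = 1;         /* a whole byte is available to the processor */
        bit_count = 0;
    }
}

int main(void) {
    uint8_t ch = 'A';                                   /* 0x41, MSB first */
    for (int i = 7; i >= 0; i--) receive_bit(ch >> i & 1);
    if (sin_flag) printf("received 0x%02X\n", datain);  /* 0x41 */
    return 0;
}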

Serial interfaces require fewer wires, and hence serial transmission is convenient for
connecting devices that are physically distant from the computer.
The speed of transmission of data over a serial interface is known as the "bit rate". The bit rate depends on the nature of the devices connected. In order to accommodate devices with a range of speeds, a serial interface must be able to use a range of clock speeds.
Several standard serial interfaces have been developed:
Universal Asynchronous Receiver Transmitter (UART) for low-speed serial devices.
RS-232-C for connection to communication links.

Standard I/O INTERFACES:


The processor bus is the bus defined by the signals on the processor chip itself.
Devices that require a very high speed connection to the processor, such as the main
memory, may be connected directly to this bus. For electrical reasons, only a few devices
can be connected in this manner. The motherboard usually provides another bus that can
support more devices. The two buses are interconnected by a circuit, which we will call a
bridge that translates the signals and protocols of one bus into those of the other. Devices
connected to the expansion bus appear to the processor as if they were connected
directly to the processor's own bus. The only difference is that the bridge circuit
introduces a small delay in data transfers between the processor and those devices.
It is not possible to define a uniform standard for the processor bus. The structure
of this bus is closely tied to the architecture of the processor. It is also dependent on the
electrical characteristics of the processor chip, such as its clock speed. The expansion bus
is not subject to these limitations, and therefore it can use a standardized signaling
scheme. A number of standards have been developed. Some have evolved by default,

when a particular design became commercially successful. For example, IBM developed a
standard called ISA (Industry Standard Architecture) for their personal computer
known at the time as the PC AT. The popularity of that computer led other manufacturers
to produce ISA-compatible interfaces for their I/O devices, thus making ISA a de facto
standard.
Some standards have been developed through industrial cooperative efforts, even
among competing companies driven by their common self-interest in having compatible

products. In some cases, organizations such as the IEEE (Institute of Electrical and
Electronics Engineers), ANSI (American National Standards Institute), or international
bodies such as ISO (International Standards Organization) have blessed these standards
and given them an official status.

Figure 12.3: An example of a computer system using different interface standards

In this section, we present three widely used bus standards, PCI (Peripheral
Component Interconnect), SCSI (Small Computer System Interface), and USB (Universal
Serial Bus). The way these standards are used in a typical computer system is illustrated

in Figure 12.3. The PCI standard defines an expansion bus on the motherboard; SCSI and
USB are used for connecting additional devices, both inside and outside the computer
box. The SCSI bus is a high-speed parallel bus intended for devices such as disks and video
displays. The USB bus uses serial transmission to suit the needs of equipment ranging from
keyboards to game controls to internet connections. The figure shows an interface
circuit that enables devices compatible with the earlier ISA standard, such as the popular
IDE (Integrated Device Electronics) disk, to be connected. It also shows a connection to an
Ethernet. The Ethernet is a widely used local area network, providing a high-speed

connection among computers in a building or a university campus.
A given computer may use more than one bus standard. A typical Pentium
computer has both a PCI bus and an ISA bus, thus providing the user with a wide range
of devices to choose from.

PCI BUS

The PCI follows a sequence of bus standards that were used primarily in IBM
PCs. Early PCs used the 8-bit XT bus, whose signals closely mimicked those of Intel's
80x86 processors. Later, the 16-bit bus used on the PC AT computers became known as
the ISA bus. Its extended 32-bit version is known as the EISA bus. Other buses developed in
the eighties with similar capabilities are the Microchannel used in IBM PS/2 computers and the NuBus
used in Macintosh computers.
The PCI was developed as a low-cost bus that is truly processor independent. Its
design anticipated a rapidly growing demand for bus bandwidth to support high-speed
disks and graphic and video devices, as well as the specialized needs of multiprocessor
systems. As a result, the PCI is still popular as an industry standard almost a decade after
it was first introduced in 1992.
An important feature that the PCI pioneered is a plug-and-play capability for
connecting I/O devices. To connect a new device, the user simply connects the device
interface board to the bus.

Data Transfer
In today's computers, most memory transfers involve a burst of data rather than
just one word. The reason is that modern processors include a cache memory. Data are
transferred between the cache and the main memory in bursts of several words each.
The words involved in such a transfer are stored at successive memory locations. When
the processor (actually the cache controller) specifies an address and requests a
read operation from the main memory, the memory responds by sending a sequence of
data words starting at that address. Similarly, during a write operation, the processor
sends a memory address followed by a sequence of data words, to be written in
successive memory locations starting at that address. The PCI is designed primarily to

support this mode of operation. A read or a write operation involving a single word is
simply treated as a burst of length one.
The bus supports three independent address spaces: memory, I/O, and
configuration. The first two are self-explanatory. The I/O address space is intended for
use with processors, such as Pentium, that have a separate I/O address space.
However, the system designer may choose to use memory-mapped I/O even when a
separate I/O address space is available. In fact, this is the approach recommended by the
PCI standard for wider compatibility. The configuration space is intended to give the PCI
its plug-and- play capability. A 4-bit command that accompanies the address identifies
which of the three spaces is being used in a given data transfer operation.

Figure 12.3 shows the main memory of the computer connected directly to the processor
bus. An alternative arrangement that is used often with the PCI bus is shown in Figure 12.4.

The PCI Bridge provides a separate physical connection for the main memory. For electrical
reasons, the bus may be further divided into segments connected via bridges. However,
regardless of which bus segment a device is connected to, it may still be mapped into the
processor's memory address space.

Figure 12.4: Use of a PCI bus in a computer system

Data Transfer Signals on the PCI Bus:

Name           Function
CLK            A 33-MHz or 66-MHz clock
FRAME#         Sent by the initiator to indicate the duration of a transaction
AD             32 address/data lines
C/BE#          4 command/byte-enable lines
IRDY#, TRDY#   Initiator-ready and Target-ready signals
DEVSEL#        A response from the device indicating that it has recognized
               its address and is ready for a data transfer transaction
IDSEL#         Initialization Device Select

Individual word transfers are called "phases".

Fig: Read operation on the PCI Bus

In clock cycle 1, the initiator (here, the processor) asserts FRAME# to indicate the beginning of a transaction; it sends the address on the AD lines and a command on the C/BE# lines.
Clock cycle 2 is used to turn the AD bus lines around: the initiator removes the address and disconnects its drivers from the AD lines. The selected target enables its drivers on the AD lines and fetches the requested data to be placed on the bus. It asserts DEVSEL# and maintains it in the asserted state until the end of the transaction.
C/BE# is used to send a bus command in clock cycle 1, and it is used for a different purpose (byte enables) during the rest of the transaction.

During clock cycle 3, the initiator asserts IRDY# to indicate that it is ready to receive data. If the target has data ready to send, it asserts TRDY#. In our example, the target sends 3 more words of data in clock cycles 4 to 6.
The initiator uses FRAME# to indicate the duration of the burst; since it reads 4 words, the initiator negates FRAME# during clock cycle 5. After sending the fourth word, the target disconnects its drivers and negates DEVSEL# during clock cycle 7.
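
The burst just described can be summarized cycle by cycle. The sketch below simply prints that trace as a table; it is a simplified model of the transaction above, not an implementation of the PCI protocol ("yes" means the active-low signal is asserted).

/* Simplified cycle-by-cycle trace of a 4-word PCI burst read. */
#include <stdio.h>

int main(void) {
    const struct { int cyc; const char *ad, *frame, *irdy, *trdy; } t[] = {
        {1, "address",    "yes", "no",  "no" },
        {2, "turnaround", "yes", "no",  "no" },
        {3, "word 1",     "yes", "yes", "yes"},
        {4, "word 2",     "yes", "yes", "yes"},
        {5, "word 3",     "no",  "yes", "yes"},  /* FRAME# negated here */
        {6, "word 4",     "no",  "yes", "yes"},
        {7, "idle",       "no",  "no",  "no" },  /* DEVSEL# released    */
    };
    printf("cyc  AD          FRAME# IRDY# TRDY#\n");
    for (int i = 0; i < 7; i++)
        printf("%-4d %-11s %-6s %-5s %-5s\n",
               t[i].cyc, t[i].ad, t[i].frame, t[i].irdy, t[i].trdy);
    return 0;
}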

Fig: A read operation showing the role of IRDY# / TRDY#

This figure illustrates a pause in the middle of a transaction. The first two words are transferred as before, and the target sends the third word in cycle 5. However, the initiator is not yet able to receive it; hence, it negates IRDY#. In response, the target maintains the third data word on the AD lines until IRDY# is asserted again. In cycle 6, the initiator asserts IRDY#, but the target is not ready to transfer the fourth word immediately; hence, it negates TRDY# in cycle 7. It then sends the fourth word and asserts TRDY# in cycle 8.

Device Configuration

When an I/O device is connected to a computer, several actions are needed to configure both the device and the software that communicates with it. A typical device
interface card for an ISA bus, for example, has a number of jumpers or switches that have
to be set by the user to select certain options. Once the device is connected, the
software needs to know the address of the device. It may also need to know relevant
device characteristics, such as the speed of the transmission link, whether parity bits are used, and so on.
The PCI simplifies this process by incorporating in each I/O device interface a
small configuration ROM memory that stores information about that device. The
configuration ROMs of all devices are accessible in the configuration address space. The
PCI initialization software reads these ROMs whenever the system is powered up or
reset. In each case, it determines whether the device is a printer, a keyboard, an Ethernet
interface, or a disk controller. It can further learn about various device options and
characteristics.

Devices are assigned addresses during the initialization process. Each device has an input signal called IDSEL# (Initialization Device Select), which is connected to one of the 21 upper address/data lines, AD11 to AD31. During a configuration operation, the line to which a device's IDSEL# input is connected is set to 1, and all other lines are set to 0.

AD11 - AD31: upper address lines; one of these drives each device's IDSEL# input.
AD00 - AD10: lower address lines; they specify the type of operation and are used to access the contents of the device's configuration ROM.

The configuration software scans all 21 locations. The PCI bus has interrupt-request lines, and during initialization each device may request an address in the I/O space or in the memory space.
Electrical Characteristics:

The connectors can be plugged only into compatible motherboards. The PCI bus can operate with either a 5-V or a 3.3-V power supply, and a given motherboard supports one signaling system or the other.

SCSI Bus (Small Computer System Interface):

SCSI refers to a standard bus defined by ANSI (American National Standards Institute).
The SCSI bus has several options. It may be:
• Narrow bus: has 8 data lines and transfers 1 byte at a time.
• Wide bus: has 16 data lines and transfers 2 bytes at a time.
• Single-Ended (SE) transmission: each signal uses a separate wire.
• HVD (High-Voltage Differential): uses 5-V (TTL-level) signals.
• LVD (Low-Voltage Differential): uses 3.3-V signals.

Because of these various options, a SCSI connector may have 50, 68, or 80 pins. Data transfer rates range from 5 MB/s to 160 MB/s, with later versions reaching 320 MB/s and 640 MB/s.
The transfer rate depends on:
the length of the cable, and
the number of devices connected.
To achieve a high transfer rate, the bus length should be limited to 1.6 m for SE signaling and 12 m for LVD signaling.
The SCSI bus is connected to the processor bus through a SCSI controller. The data are stored on a disk in blocks called sectors, where each sector contains several hundred bytes. These data may not be stored in contiguous sectors.
The SCSI protocol is designed to retrieve the data in the first sector or any other selected sectors, and to transfer bursts of data at high speed. The controllers connected to the SCSI bus are of two types:
• Initiator
• Target
Initiator:

It has the ability to select a particular target & to send commands specifying the operation
to be performed.
They are the controllers on the processor side.
Target:
The disk controller operates as a target. It carries out the commands it receives from the initiator. The initiator establishes a logical connection with the intended target.
Steps:

Consider a disk read operation; it involves the following sequence of events.
The SCSI controller, acting as the initiator, contends for control of the bus. After winning arbitration, it selects the target controller and hands over control of the bus to it.
The target starts an output operation (from initiator to target); in response, the initiator sends a command specifying the required read operation.
The target, realizing that it first needs to perform a disk seek operation, sends a message to the initiator indicating that it will temporarily suspend the connection between them. Then it releases the bus.
The target controller sends a command to the disk drive to move the read head to the first sector involved in the requested read operation, and reads the data from that sector into a data buffer. When it is ready to begin transferring data to the initiator, the target requests control of the bus. After it wins arbitration, it reselects the initiator controller, thus restoring the suspended connection.
The target transfers the contents of the data buffer to the initiator and then suspends the connection again. Data are transferred either 8 or 16 bits in parallel, depending on the width of the bus.
The target controller sends a command to the disk drive to perform another seek operation. Then it transfers the contents of the second disk sector to the initiator. At the end of this transfer, the logical connection between the two controllers is terminated.
As the initiator controller receives the data, it stores them into the main memory using the DMA approach.
The SCSI controller sends an interrupt to the processor to inform it that the requested operation has been completed.

Bus Signals:-
The bus has no address lines. Instead, the data lines are used to identify the bus controllers involved in the selection, reselection, and arbitration processes. For a narrow bus, there are 8 possible controllers, numbered 0 to 7; for a wide bus, there are 16. Once a connection is established between two controllers, there is no further need for addressing, and the data lines are used to carry data.

SCSI bus signals:

Category                Name               Function
Data                    -DB(0) to -DB(7)   Data lines
                        -DB(P)             Parity bit for the data bus
Phases                  -BSY               Busy
                        -SEL               Selection
Information type        -C/D               Control/Data
                        -MSG               Message
Handshake               -REQ               Request
                        -ACK               Acknowledge
Direction of transfer   -I/O               Input/Output
Other                   -ATN               Attention
                        -RST               Reset
All signal names are preceded by a minus sign. This indicates that the signals are active, or that a data line is equal to 1, when they are in the low-voltage state.

Phases in SCSI Bus:-


The phases in SCSI bus operation are,
• Arbitration
• Selection
• Information transfer
• Reselection
Arbitration:-
When the -BSY signal is in the inactive state, the bus is free, and any controller can request its use. Since several controllers may generate requests at the same time, SCSI uses a distributed arbitration scheme.
Each controller on the bus is assigned a fixed priority, with controller 7 having the highest priority. When -BSY becomes active, all controllers that are requesting the bus examine the data lines and determine whether a higher-priority device is requesting the bus at the same time.
The controller using the highest-numbered line realizes that it has won the arbitration process. All other controllers disconnect from the bus and wait for -BSY to become inactive again.
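
Since each requesting controller drives the data line matching its own number, a controller can decide locally whether it has won by checking for any higher-numbered asserted line. A sketch (names assumed):

/* Sketch of SCSI distributed arbitration: fixed priorities, with
   controller 7 highest; 'bus' holds one bit per -DB data line. */
#include <stdio.h>

int wins_arbitration(unsigned bus, int my_id) {
    if (!(bus >> my_id & 1)) return 0;     /* not requesting the bus     */
    return (bus >> (my_id + 1)) == 0;      /* no higher line is asserted */
}

int main(void) {
    unsigned bus = (1u << 6) | (1u << 2);  /* controllers 6 and 2 request */
    printf("2 wins? %d\n", wins_arbitration(bus, 2));  /* 0 */
    printf("6 wins? %d\n", wins_arbitration(bus, 6));  /* 1 */
    return 0;
}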

Fig: Arbitration and selection on the SCSI bus. Device 6 wins arbitration and selects device 2

Selection:

Here, device 6 has won arbitration, having asserted -BSY and -DB6; it then asserts -SEL (and -DB2) to select device 2 as the target. The selected target controller responds by asserting -BSY. This informs the initiator that the connection it requested has been established.

Reselection:

The connection between the two controllers can later be reestablished, with the target in control of the bus, as required for the data transfer to proceed.

USB –Universal Serial Bus

USB supports three speeds of operation:

• Low speed (1.5 Mb/s)
• Full speed (12 Mb/s)
• High speed (480 Mb/s)
The USB has been designed to meet several key objectives:

• It provides a simple, low-cost, and easy-to-use interconnection system that overcomes the difficulties caused by the limited number of I/O ports available on a computer.
• It accommodates a wide range of data transfer characteristics for I/O devices, including telephone and Internet connections.
• It enhances user convenience through a "plug-and-play" mode of operation.

Port Limitation:-
Normally, a system has only a few ports.
To add new ports, the user must open the computer box to gain access to the internal expansion bus and install a new interface card. The user may also need to know how to configure the device and its software.

Merits of USB:-

USB helps to add many devices to a computer system at any time without opening the
computer box.

Device Characteristics:-
The kinds of devices that may be connected to a computer cover a wide range of functionality. The speed, volume, and timing constraints associated with data transfers to and from these devices vary significantly.

Eg 1: Keyboard. Since the event of pressing a key is not synchronized to any other event in a computer system, the data generated by the keyboard are called asynchronous. The rate of data generated by a keyboard depends on the typing speed of the human operator, and is at most about 100 bytes per second.

Eg 2: Microphone attached to a computer system, internally or externally.

The sound picked up by the microphone produces an analog electrical signal, which must be converted into digital form before it can be handled by the computer. This is accomplished by sampling the analog signal periodically.
The sampling process yields a continuous stream of digitized samples that arrive at regular intervals, synchronized with the sampling clock. Such a stream is called isochronous, i.e., successive events are separated by equal periods of time.
If the sampling rate is s samples per second, then the maximum frequency that can be captured by the sampling process is s/2. A standard rate for digital sound is 44.1 kHz.
Requirements for sampled voice:-
It is important to maintain precise timing (delay) in the sampling and replay processes. A high degree of jitter (variability in sample timing) is unacceptable.
Eg 3: Data transfer for image and video:-
The transfer of images and video requires a much higher bandwidth. Bandwidth is the total data transfer capacity of a communication channel. To maintain high picture quality, an image may be represented by about 160 kilobytes and transmitted 30 times per second, for a total bandwidth of 44 MB/s.
Plug & Play:-
A main objective of the USB is to provide a plug-and-play capability. The plug-and-play feature allows a new device to be connected at any time, while the system is in operation.
The system should:
• detect the existence of the new device automatically,
• identify the appropriate device-driver software,
• establish the appropriate addresses, and
• establish the logical connections for communication.

USB Architecture:-

USB uses a serial bus format, which satisfies the low-cost and flexibility requirements.
Clock and data information are encoded together and transmitted as a single signal. Hence, there are no limitations on clock frequency or distance arising from data skew, and it is possible to provide a high data transfer bandwidth by using a high clock frequency.
To accommodate a large number of devices that can be added or removed at any time, the USB has a tree structure.

USB Tree Structure:

Fig:Universal Serial Bus tree structure

 To accommodate a large number of devices that can be added or removed at any time, the USB has the tree structure as shown in the figure.

 Each node of the tree has a device called a hub, which acts as an intermediate
control point between the host and the I/O devices. At the root of the tree, a root
hub connects the entire tree to the host computer. The leaves of the tree are the
I/O devices being served (for example, keyboard, Internet connection, speaker, or
digital TV)

 In normal operation, a hub copies a message that it receives from its upstream
connection to all its downstream ports. As a result, a message sent by the host
computer is broadcast to all I/O devices, but only the addressed device will respond
to that message. However, a message from an I/O device is sent only upstream
towards the root of the tree and is not seen by other devices. Hence, the USB enables the host to communicate with the I/O devices, but it does not enable these devices to communicate with each other.

Addressing:
 When a USB is connected to a host computer, its root hub is attached to the
processor bus, where it appears as a single device. The host software communicates
with individual devices attached to the USB by sending packets of information,
which the root hub forwards to the appropriate device in the USB tree.

 Each device on the USB, whether it is a hub or an I/O device, is assigned a 7-bit
address. This address is local to the USB tree and is not related in any way to the
addresses used on the processor bus.

 A hub may have any number of devices or other hubs connected to it, and addresses
are assigned arbitrarily. When a device is first connected to a hub, or when it is
powered on, it has the address 0. The hardware of the hub to which this device is
connected is capable of detecting that the device has been connected, and it
records this fact as part of its own status information. Periodically, the host polls
each hub to collect status information and learn about new devices that may have
been added or disconnected.

 When the host is informed that a new device has been connected, it uses a
sequence of commands to send a reset signal on the corresponding hub port, read
information from the device about its capabilities, send configuration information to
the device, and assign the device a unique USB address. Once this sequence is
completed the device begins normal operation and responds only to the new
address.
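
The enumeration sequence just described can be summarized as a sketch; every function name below is an illustrative placeholder, not an actual USB host-stack call.

/* Sketch of USB device enumeration as described above. */
#include <stdio.h>

static int next_address = 1;

/* Called when polling reveals a new device on a hub port. */
int enumerate(int hub, int port) {
    int addr = next_address++;
    printf("reset hub %d port %d\n", hub, port);  /* device answers at 0 */
    printf("read capability information at address 0\n");
    printf("send configuration information\n");
    printf("assign USB address %d\n", addr);      /* device now answers  */
    return addr;                                  /* only at this address */
}

int main(void) {
    int a = enumerate(1, 3);
    printf("device ready at address %d\n", a);
    return 0;
}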

USB protocols:
 All information transferred over the USB is organized in packets, where a packet
consists of one or more bytes of information. There are many types of packets that
perform a variety of control functions.

 The information transferred on the USB can be divided into two broad categories:
control and data.

 Control packets perform such tasks as addressing a device to initiate data transfer, acknowledging that data have been received correctly, or indicating an error.

 Data packets carry information that is delivered to a device.

 A packet consists of one or more fields containing different kinds of information.
The first field of any packet is called the packet identifier, PID, which identifies the
type of that packet.

 The PID bits are transmitted twice: the first time with their true values, and the second time with each bit complemented.

 The four PID bits identify one of 16 different packet types. Some control packets,
such as ACK (Acknowledge), consist only of the PID byte.

Control packets used for controlling data transfer operations are called token packets.

Figure: An output transfer
Isochronous Traffic on USB:

 One of the key objectives of the USB is to support the transfer of isochronous data.

 Devices that generate or receive isochronous data require a time reference to control the sampling process.

 To provide this reference. Transmission over the USB is divided into frames of equal length.

 A frame is 1ms long for low-and full-speed data.

 The root hub generates a Start of Frame control packet (SOF) precisely once every 1 ms to mark the
beginning of a new frame.

 The arrival of an SOF packet at any device constitutes a regular clock signal that the device can use for its
own purposes.
 To assist devices that may need longer periods of time, the SOF packet carries an 11- bit frame number.

 Following each SOF packet, the host carries out input and output transfers for isochronous devices.

 This means that each device will have an opportunity for an input or output transfer once every 1 ms.

Electrical Characteristics:

 The cables used for USB connections consist of four wires.

 Two are used to carry power, +5V and Ground.

 Thus, a hub or an I/O device may be powered directly from the bus, or it may have its own external
power connection.

 The other two wires are used to carry data.

 Different signaling schemes are used for different speeds of transmission.

 At low speed, 1s and 0s are transmitted by sending a high-voltage state (5 V) on one or the other of the two signal wires. For high-speed links, differential transmission is used.

Part-A

1. Define Main Memory ?


2. Define Cache Memory?
3. What is Associative Memory?
4. What is Auxiliary Memory?
5. What is Virtual Memory?
6. What is Priority Interrupt?
7. What is DMA?

Part B
1. Explain briefly about INPUT/OUTPUT organization?
2. What are the different Interrupt units in I/O organization?
3. What is Auxiliary Memory and explain its types with examples?
4. Explain briefly about Interface circuits?
5. What is Asynchronous data transfer and explain its types?
6. Explain briefly about DMA Controller?
7. What is Main Memory and explain its types?

UNIT-5
PIPELINING

BASIC CONCEPTS

The speed of execution of programs is influenced by many factors. One way to improve performance is to use faster circuit technology to build the processor and the
main memory. Another possibility is to arrange the hardware so that more than one
operation can be performed at the same time. In this way, the number of operations
performed per second is increased even though the elapsed time needed to perform
any one operation is not changed.

We have encountered concurrent activities several times before. Chapter 1 introduced the concept of multiprogramming and explained how it is possible for I/O
transfers and computational activities to proceed simultaneously. DMA devices make
this possible because they can perform I/O transfers independently once these transfers
are initiated by the processor.

Pipelining is a particularly effective way of organizing concurrent activity in a computer system. The basic idea is very simple. It is frequently encountered in manu-
facturing plants, where pipelining is commonly known as an assembly-line operation.
Readers are undoubtedly familiar with the assembly line used in car manufacturing.
The first station in an assembly line may prepare the chassis of a car, the next station
adds the body, the next one installs the engine, and so on. While one group of workers
is installing the engine on one car, another group is fitting a car body on the chassis of
another car, and yet another group is preparing a new chassis for a third car. It may
take days to complete work on a given car, but it is possible to have a new car rolling
off the end of the assembly line every few minutes.

Consider how the idea of pipelining can be used in a computer. The processor
executes a program by fetching and executing instructions, one after the other. Let Fi
and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program
consists of a sequence of fetch and execute steps, as shown in Figure 5.1a.

Now consider a computer that has two separate hardware units, one for fetching
instructions and another for executing them, as shown in Figure 5.1b. The instruction
fetched by the fetch unit is deposited in an intermediate storage buffer, B1. This buffer
is needed to enable the execution unit to execute the instruction while the fetch unit is
fetching the next instruction. The results of execution are deposited in the destination
location specified by the instruction. For the purposes of this discussion, we assume
that both the source and the destination of the data operated on by the instructions

are inside the block labeled “Execution unit.”

Figure 5.1 Basic idea of instruction pipelining: (a) sequential execution, in which each instruction Ii completes its fetch step Fi and execute step Ei before the next instruction begins; (b) hardware organization, with an interstage buffer B1 between the instruction fetch unit and the execution unit; (c) pipelined execution, in which fetch Fi+1 proceeds in parallel with execute Ei, so one instruction completes per clock cycle after the first.

The computer is controlled by a clock whose period is such that the fetch and execute steps of any instruction can each be completed in one clock cycle. Operation of the computer proceeds as in Figure 5.1c. In the first clock cycle, the fetch unit fetches an instruction I1 (step F1) and stores it in buffer B1 at the end of the clock cycle. In the second clock cycle, the instruction fetch unit proceeds with the fetch operation for instruction I2 (step F2). Meanwhile, the execution unit performs the operation specified by instruction I1, which is available to it in buffer B1 (step E1).

By the end of the second clock cycle, the execution of instruction I1 is completed and instruction I2 is available. Instruction I2 is stored in B1, replacing I1, which is no longer needed. Step E2 is performed by the execution unit during the third clock cycle, while instruction I3 is being fetched by the fetch unit. In this manner, both the fetch and execute units are kept busy all the time. If the pattern in Figure 5.1c can be sustained for a long time, the completion rate of instruction execution will be twice that achievable by the sequential operation depicted in Figure 5.1a.
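
The factor-of-two claim is easy to check numerically: n instructions take 2n cycles sequentially, but only n + 1 cycles in the two-stage pipeline, since after the first result one instruction completes per cycle. A sketch:

/* Completion times for n instructions: sequential vs. 2-stage pipeline. */
#include <stdio.h>

int main(void) {
    for (long n = 1; n <= 1000000; n *= 100) {
        long seq  = 2 * n;    /* one fetch cycle + one execute cycle each  */
        long pipe = n + 1;    /* first result after 2 cycles, then 1/cycle */
        printf("n=%-8ld sequential=%-8ld pipelined=%-8ld speedup=%.3f\n",
               n, seq, pipe, (double)seq / pipe);
    }
    return 0;    /* the speedup approaches 2 as n grows */
}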
In summary, the fetch and execute units in Figure 5.1b constitute a two-stage
pipeline in which each stage performs one step in processing an instruction. An inter-
stage storage buffer, B1, is needed to hold the information being passed from one stage
to the next. New information is loaded into this buffer at the end of each clock cycle.
The processing of an instruction need not be divided into only two steps. For
example, a pipelined processor may process each instruction in four steps, as follows:

F  Fetch: read the instruction from the memory.
D  Decode: decode the instruction and fetch the source operand(s).
E  Execute: perform the operation specified by the instruction.
W  Write: store the result in the destination location.

The sequence of events for this case is shown in Figure 5.2a. Four instructions are
in progress at any given time. This means that four distinct hardware units are needed,
as shown in Figure 5.2b. These units must be capable of performing their tasks
simultaneously and without interfering with one another. Information is passed from
one unit to the next through a storage buffer. As an instruction progresses through the
pipeline, all the information needed by the stages downstream must be passed along.
For example, during clock cycle 4, the information in the buffers is as follows (a short
simulation of this schedule appears after the list):
• Buffer B1 holds instruction I3 , which was fetched in cycle 3 and is
being decoded by the instruction-decoding unit.
• Buffer B2 holds both the source operands for instruction I2 and the
specification of the operation to be performed. This is the information produced by the
decoding hardware in cycle 3. The buffer also holds the information needed for the write
step of instruction I2 (step W2 ). Even though it is not needed by stage E, this information
must be passed on to stage W in the following clock cycle to enable that stage to
perform the required Write operation.
• Buffer B3 holds the results produced by the execution unit and the
destination information for instruction I1 .
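The following short Python sketch (illustrative only, assuming one-cycle stages and no stalls) prints the schedule of Figure 5.2a, showing which step of which instruction is in progress during each clock cycle:

STAGES = ["F", "D", "E", "W"]

def print_schedule(n_instructions):
    n_cycles = len(STAGES) + n_instructions - 1
    for cycle in range(1, n_cycles + 1):
        steps = []
        for i in range(1, n_instructions + 1):
            stage = cycle - i            # instruction i enters F in cycle i
            if 0 <= stage < len(STAGES):
                steps.append(STAGES[stage] + str(i))
        print("cycle", cycle, ":", " ".join(steps))

print_schedule(4)
# cycle 4 prints: W1 E2 D3 F4 -- four instructions in progress at once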

ROLE OF CACHE MEMORY

Each stage in a pipeline is expected to complete its operation in one clock cycle.
Hence, the clock period should be sufficiently long to complete the task being
performed in any stage. If different units require different amounts of time, the
clock period must allow the longest task to be completed. A unit that completes its
task early is idle for the remainder of the clock period. Hence, pipelining is most
effective in improving

Figure 5.2 A 4-stage pipeline: (a) instruction execution divided into four steps; (b) hardware organization, with interstage buffers B1, B2, and B3 between the Fetch, Decode, Execute, and Write stages.

performance if the tasks being performed in different stages require about the same
amount of time.
This consideration is particularly important for the instruction fetch step, which is
assigned one clock period in Figure 5.2a. The clock cycle has to be equal to or greater
than the time needed to complete a fetch operation. However, the access time of the
main memory may be as much as ten times greater than the time needed to perform
basic pipeline stage operations inside the processor, such as adding two numbers. Thus,
if each instruction fetch required access to the main memory, pipelining would be of
little value.
The use of cache memories solves the memory access problem. In particular, when a
cache is included on the same chip as the processor, access time to the cache is usually
the same as the time needed to perform other basic operations inside the processor. This
makes it possible to divide instruction fetching and processing into steps that
are more or less equal in duration. Each of these steps is performed by a different
pipeline stage, and the clock period is chosen to correspond to the longest one.

PIPELINE PERFORMANCE

The pipelined processor in Figure 5.2 completes the processing of one instruction
in each clock cycle, which means that the rate of instruction processing is four times that
of sequential operation. The potential increase in performance resulting from
pipelining is proportional to the number of pipeline stages. However, this increase
would be achieved only if pipelined operation as depicted in Figure 5.2a could be
sustained without interruption throughout program execution. Unfortunately, this is not
the case.
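The usual timing estimate can be written down directly. Under the assumption of one-cycle stages, n instructions on a k-stage pipeline need k + (n - 1) cycles when there are no stalls, and every stall cycle adds one to that total. A quick Python check (a sketch, not a formula from the text):

def pipeline_cycles(n, k, stalls=0):
    # k cycles to fill the pipeline, then one completion per cycle,
    # plus one extra cycle for every stall
    return k + (n - 1) + stalls

n, k = 10000, 4
print(n / pipeline_cycles(n, k))        # ~1.0 instruction per cycle, the upper limit
print(n * k / pipeline_cycles(n, k))    # speedup over sequential operation, close to k

This shows why the k-fold improvement is only a potential: stalls push the cycle count above k + (n - 1) and the achieved speedup below k.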

For a variety of reasons, one of the pipeline stages may not be able to complete its
processing task for a given instruction in the time allotted. For example, stage E in the
four-stage pipeline of Figure 5.2b is responsible for arithmetic and logic operations,
and one clock cycle is assigned for this task. Although this may be sufficient for
most operations, some operations, such as divide, may require more time to complete.
Figure 5.3 shows an example in which the operation specified in instruction I2 requires
three cycles to complete, from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the
Write stage must be told to do nothing, because it has no data to work with. Meanwhile,
the information in buffer B2 must remain intact until the Execute stage has completed
its operation. This means that stage 2 and, in turn, stage 1 are blocked from accepting
new instructions because the information in B1 cannot be overwritten. Thus, steps D4
and F5 must be postponed as shown.

Figure 5.3 Effect of an execution operation taking more than one clock cycle: step E2 occupies cycles 4 through 6, so steps D4 and F5 are postponed.

Pipelined operation in Figure 5.3 is said to have been stalled for two clock cycles.
Normal pipelined operation resumes in cycle 7. Any condition that causes the pipeline to
stall is called a hazard. We have just seen an example of a data hazard. A data hazard is any
condition in which either the source or the destination operands of an instruction are not
available at the time expected in the pipeline. As a result some operation has to be delayed,
and the pipeline stalls.
The pipeline may also be stalled because of a delay in the availability of an
instruction. For example, this may be a result of a miss in the cache, requiring the
instruction to be fetched from the main memory. Such hazards are often called control
hazards or instruction hazards. The effect of a cache miss on pipelined operation is
illustrated in Figure 5.4. Instruction I1 is fetched from the cache in cycle 1, and its
execution proceeds normally. However, the fetch operation for instruction I2, which is
started in cycle 2, results in a cache miss. The instruction fetch unit must now suspend
any further fetch requests and wait for I2 to arrive. We assume that instruction I2 is
received and loaded into buffer B1 at the end of cycle 5. The pipeline resumes its
normal operation at that point.

Figure 5.4 Pipeline stall caused by a cache miss in F2: (a) instruction execution steps in successive clock cycles; (b) function performed by each pipeline stage in successive clock cycles.

An alternative representation of the operation of a pipeline in the case of a
cache miss is shown in Figure 5.4b. This figure gives the function performed by each
pipeline stage in each clock cycle. Note that the Decode unit is idle in cycles 3
through 5, the Execute unit is idle in cycles 4 through 6, and the Write unit is idle in
cycles 5 through 7. Such idle periods are called stalls. They are also often referred to as
bubbles in the pipeline. Once created as a result of a delay in one of the pipeline stages, a
bubble moves downstream until it reaches the last unit.

A third type of hazard that may be encountered in pipelined operation is known as
a structural hazard. This is the situation when two instructions require the use of a
given hardware resource at the same time. The most common case in which this hazard
may arise is in access to memory. One instruction may need to access memory as part

of the Execute or Write stage while another instruction is being fetched. If instructions
and data reside in the same cache unit, only one instruction can proceed and the other
instruction is delayed. Many processors use separate instruction and data caches to
avoid this delay.

An example of a structural hazard is shown in Figure 5.5. This figure shows how the
load instruction

Load X(R1),R2
can be accommodated in our example 4-stage pipeline. The memory address, X+[R1], is
computed in step E2 in cycle 4, then memory access takes place in cycle 5. The operand
read from memory is written into register R2 in cycle 6. This means that the execution
step of this instruction takes two clock cycles (cycles 4 and 5). It causes the pipeline to
stall for one cycle, because both instructions I2 and I3 require access to the register file
in cycle 6, even though the instructions and their data are all available.

Figure 5.5 Effect of a Load instruction on pipeline timing.

The pipeline is stalled because one hardware resource, the register file,
cannot handle two operations at once. If the register file had two input ports, that
is, if it allowed two simultaneous write operations, the pipeline would not be
stalled. In general, structural hazards are avoided by providing sufficient hardware
resources on the processor chip.
It is important to understand that pipelining does not result in individual
instructions being executed faster; rather, it is the throughput that increases, where
throughput is measured by the rate at which instruction execution is completed.
Any time one of the stages in the pipeline cannot complete its operation in one clock
cycle, the pipeline stalls, and some degradation in performance occurs. Thus, the
performance level of one instruction completion in each clock cycle is actually the
upper limit for the throughput achievable in a pipelined processor organized as in
Figure 5.2b.
An important goal in designing processors is to identify all hazards that may cause
the pipeline to stall and to find ways to minimize their impact. In the following sections
we discuss various hazards, starting with data hazards, followed by control hazards. In
each case we present some of the techniques used to mitigate their negative effect
on performance.
DATA HAZARDS
A data hazard is a situation in which the pipeline is stalled because the data to
be operated on are delayed for some reason, as illustrated in Figure 5.3. We will
now examine the issue of availability of data in some detail.
Consider a program that contains two instructions, I1 followed by I2 . When
this program is executed in a pipeline, the execution of I2 can begin before the
execution of I1 is completed. This means that the results generated by I1 may not
be available for use by I2 . We must ensure that the results obtained when
instructions are executed in a pipelined processor are identical to those obtained
when the same instructions are executed sequentially. The potential for obtaining
incorrect results when operations are performed concurrently can be demonstrated
by a simple example. Assume that A = 5, and consider the following two operations:
A ← 3 + A
B ← 4 × A
When these operations are performed in the order given, the result is B = 32.
But if they are performed concurrently, the value of A used in computing B would
be the original value, 5, leading to an incorrect result. If these two operations are
performed by instructions in a program, then the instructions must be executed
one after the other, because the data used in the second instruction depend on the
result of the first instruction. On the other hand, the two operations
A←5×C
B ← 20 + C

can be performed concurrently, because these operations are independent.
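The incorrect concurrent result can be reproduced with a few lines of Python (a sketch of the example above, not part of the text); reading A before the first operation has written it back models the premature read:

A = 5
A_new = 3 + A            # A <- 3 + A, performed first: A becomes 8
B_correct = 4 * A_new    # B <- 4 x A, performed second: B = 32

B_wrong = 4 * A          # "concurrent" version reads the stale A = 5: B = 20
print(B_correct, B_wrong)   # 32 20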

Figure 5.6 Pipeline stalled by data dependency between D2 and W1.

This example illustrates a basic constraint that must be enforced to guarantee
correct results. When two operations depend on each other, they must be performed
sequentially in the correct order. This rather obvious condition has far-reaching
consequences. Understanding its implications is the key to understanding the variety
of design alternatives and trade-offs encountered in pipelined computers.
Consider the pipeline in Figure 5.2. The data dependency just described arises
when the destination of one instruction is used as a source in the next instruction.
For example, the two instructions

Mul R2,R3,R4
Add R5,R4,R6

give rise to a data dependency. The result of the multiply instruction is placed
into register R4, which in turn is one of the two source operands of the Add
instruction. Assuming that the multiply operation takes one clock cycle to complete,
execution would proceed as shown in Figure 5.6.

As the Decode unit decodes the Add instruction in cycle 3, it realizes that R4 is
used as a source operand. Hence, the D step of that instruction cannot be
completed until the W step of the multiply instruction has been completed.
Completion of step D2 must be delayed to clock cycle 5, and is shown as step D2A in
the figure. Instruction I3 is fetched in cycle 3, but its decoding must be delayed
because step D3 cannot precede D2. Hence, pipelined execution is stalled for two
cycles.
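The dependency the Decode unit detects can be expressed as a simple check. The sketch below (using a tuple encoding of instructions invented here for illustration) reports a read-after-write hazard whenever the destination of one instruction appears among the sources of the next:

def raw_hazard(first, second):
    # each instruction is encoded as (opcode, sources, destination)
    _, _, dest = first
    _, sources, _ = second
    return dest in sources

mul = ("Mul", ("R2", "R3"), "R4")
add = ("Add", ("R5", "R4"), "R6")
print(raw_hazard(mul, add))   # True -> step D2 must be delayed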

5.2.1 OPERAND FORWARDING

The data hazard just described arises because one instruction, instruction I2 in Figure
5.6, is waiting for data to be written in the register file. However, these data are
available at the output of the ALU once the Execute stage completes step E1 . Hence,
the delay can
be reduced, or possibly eliminated, if we arrange for the result of instruction I1 to be
forwarded directly for use in step E2 .
Figure 5.7a shows a part of the processor datapath involving the ALU and the
register file. This arrangement is similar to the three-bus structure except that
registers SRC1, SRC2, and RSLT have been added. These registers constitute the
interstage buffers needed for pipelined operation, as illustrated in Figure 8.7b. With
reference to Figure 8.2b, registers SRC1 and SRC2 are part of buffer B2 and RSLT is
part of B3. The data forwarding mechanism is provided by the blue connection lines. The
two multiplexers connected at the inputs to the ALU allow the data on the
destination bus to be selected instead of the contents of either the SRC1 or SRC2
register.

When the instructions in Figure 5.6 are executed in the datapath of Figure 5.7,
the operations performed in each clock cycle are as follows. After decoding
instruction I2 and detecting the data dependency, a decision is made to use data
forwarding. The operand not involved in the dependency, register R2, is read and
loaded in register SRC1 in clock cycle 3. In the next clock cycle, the product
produced by instruction I1 is available in register RSLT, and because of the forwarding
connection, it can be used in step E2 . Hence, execution of I2 proceeds without
interruption.
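The multiplexer decision at the ALU inputs can be summarized in a few lines. This is a hedged sketch (the function and its arguments are invented for illustration): if the register a source operand names is the destination of the instruction just ahead in the pipeline, the value is taken from the result bus instead of from SRC1 or SRC2:

def alu_input(src_reg, src_value, prev_dest, rslt_value):
    if src_reg == prev_dest:     # forwarding condition detected at decode
        return rslt_value        # select the destination (result) bus
    return src_value             # select the interstage register

# Add R5,R4,R6 needs R4, which the Mul ahead of it has just produced.
# Suppose the stale register-file copy is 99 and RSLT holds 40 (made-up values):
print(alu_input("R4", 99, "R4", 40))   # prints 40; the stale 99 is bypassed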

Figure 5.7 Operand forwarding in a pipelined processor.

5.2.2 HANDLING DATA HAZARDS IN SOFTWARE


In Figure 5.6, we assumed the data dependency is discovered by the hardware
while the instruction is being decoded. The control hardware delays reading register
R4 until cycle 5, thus introducing a 2-cycle stall unless operand forwarding is used.
An alternative approach is to leave the task of detecting data dependencies and
dealing with them to the software. In this case, the compiler can introduce the two-
cycle delay needed between instructions I1 and I2 by inserting NOP (No-operation)
instructions, as follows:

I1: Mul R2,R3,R4
    NOP
    NOP
I2: Add R5,R4,R6

If the responsibility for detecting such dependencies is left entirely to the
software, the compiler must insert the NOP instructions to obtain a correct result. This
possibility illustrates the close link between the compiler and the hardware. A
particular feature can be either implemented in hardware or left to the compiler.
Leaving tasks such as inserting NOP instructions to the compiler leads to simpler
hardware. Being aware of the need for a delay, the compiler can attempt to reorder
instructions to perform useful tasks in the NOP slots, and thus achieve better
performance. On the other hand, the insertion of NOP instructions leads to larger code
size. Also, it is often the case that a given processor architecture has several hardware
implementations, offering different features. NOP instructions inserted to satisfy the
requirements of one implementation may not be needed and, hence, would lead to
reduced performance on a different implementation.
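A compiler pass that performs this NOP insertion is easy to picture. The sketch below (using the same invented tuple encoding as earlier, with an assumed fixed two-cycle delay) appends NOPs after any instruction whose destination is read by its immediate successor:

def insert_nops(program, delay=2):
    out = []
    for i, instr in enumerate(program):
        out.append(instr)
        if i + 1 < len(program):
            _, _, dest = instr
            _, sources, _ = program[i + 1]
            if dest in sources:                  # RAW dependency found
                out.extend([("NOP", (), None)] * delay)
    return out

prog = [("Mul", ("R2", "R3"), "R4"), ("Add", ("R5", "R4"), "R6")]
print([op for op, _, _ in insert_nops(prog)])   # ['Mul', 'NOP', 'NOP', 'Add']

A smarter pass would try to fill those slots with independent instructions instead of NOPs, as the paragraph above notes.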

5.2.3 SIDE EFFECTS

The data dependencies encountered in the preceding examples are explicit and easily
detected because the register involved is named as the destination in instruction I1 and
as a source in I2 . Sometimes an instruction changes the contents of a register other
than the one named as the destination. An instruction that uses an auto increment or
auto decrement addressing mode is an example. In addition to storing new data in its
destination location, the instruction changes the contents of a source register used to
access one of its operands. All the precautions needed to handle data dependencies
involving the destination location must also be applied to the registers affected by an
auto increment or auto decrement operation. When a location other than one explicitly
named in an instruction as a destination operand is affected, the instruction is said to
have a side effect. For example, stack instructions, such as push and pop, produce
similar side effects because they implicitly use the auto increment and auto decrement
addressing modes.
Another possible side effect involves the condition code flags, which are used by
instructions such as conditional branches and add-with-carry. Suppose that registers
R1 and R2 hold a double-precision integer number that we wish to add to another
double- precision number in registers R3 and R4. This may be accomplished as
follows:
Add R1,R3
Add With Carry R2,R4

An implicit dependency exists between these two instructions through the carry
flag. This flag is set by the first instruction and used in the second instruction, which
performs the operation

R4 ← [R2] + [R4] + carry

Instructions that have side effects give rise to multiple data dependencies, which
lead to a substantial increase in the complexity of the hardware or software needed
to resolve them. For this reason, instructions designed for execution on pipelined
hardware should have few side effects. Ideally, only the contents of the destination
location, either a register or a memory location, should be affected by any given
instruction. Side effects, such as setting the condition code flags or updating the
contents of an address pointer, should be kept to a minimum. Note, however, that the
autoincrement and autodecrement addressing modes are potentially useful. Condition
code flags are also needed for recording such information as the generation of a
carry or the occurrence of overflow in an arithmetic operation. Later in this unit we
show how such functions can be provided by other means that are consistent with a
pipelined organization and with the requirements of optimizing compilers.

INSTRUCTION HAZARDS

The purpose of the instruction fetch unit is to supply the execution units with a
steady stream of instructions. Whenever this stream is interrupted, the pipeline
stalls, as Figure 5.4 illustrates for the case of a cache miss. A branch instruction
may also cause the pipeline to stall. We will now examine the effect of branch
instructions and the techniques that can be used for mitigating their impact. We start
with unconditional branches.

UNCONDITIONAL BRANCHES

Figure 5.8 shows a sequence of instructions being executed in a two-stage pipeline.


Instructions I1 to I3 are stored at successive memory addresses, and I2 is a
branch instruction. Let the branch target be instruction Ik . In clock cycle 3, the fetch
operation for instruction I3 is in progress at the same time that the branch
instruction is being decoded and the target address computed. In clock cycle 4, the
processor must discard I3 , which has been incorrectly fetched, and fetch
instruction Ik . In the meantime, the hardware unit responsible for the Execute (E)
step must be told to do nothing during that clock period. Thus, the pipeline is
stalled for one clock cycle.
The time lost as a result of a branch instruction is often referred to as the
branch penalty. In Figure 5.8, the branch penalty is one clock cycle. For a longer
pipeline, the branch penalty may be higher. For example, Figure 5.9a shows the
effect of a branch instruction on a four-stage pipeline. We have assumed that
the branch address is computed in step E2. Instructions I3 and I4 must be
discarded, and the target instruction, Ik, is fetched in clock cycle 5. Thus, the
branch penalty is two clock cycles.

Reducing the branch penalty requires the branch address to be computed earlier
in the pipeline. Typically, the instruction fetch unit has dedicated hardware to
identify a branch instruction and compute the branch target address as quickly as
possible after an instruction is fetched. With this additional hardware, both of
these tasks can be performed in step D2, leading to the sequence of events shown
in Figure 5.9b. In this case, the branch penalty is only one clock cycle.
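The cost of branches can be estimated with a simple average. Assuming a base rate of one cycle per instruction and treating the branch penalty as extra cycles on every branch, the effective cycles per instruction (CPI) grow by branch_fraction x penalty. A back-of-the-envelope sketch, not a formula from the text:

def effective_cpi(branch_fraction, penalty, base_cpi=1.0):
    # every branch adds `penalty` extra cycles on top of the base rate
    return base_cpi + branch_fraction * penalty

print(effective_cpi(0.20, 2))   # 1.4 CPI when the target is known only in E
print(effective_cpi(0.20, 1))   # 1.2 CPI when the target is computed in D

With branches making up roughly 20 percent of the dynamic instruction count, reducing the penalty from two cycles to one recovers a substantial part of the pipeline's throughput.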

Figure 5.8 An idle cycle caused by a branch instruction.

Fig:5.9 Branch timing

Instruction Queue and Prefetching


Either a cache miss or a branch instruction stalls the pipeline for one or more clock
cycles.

Fig:5.10 Use of an instruction queue in the hardware organization of Figure 5.2b.

To reduce the effect of these interruptions, many processors employ
sophisticated fetch units that can fetch instructions before they are needed and put
them in a queue.
Typically, the instruction queue can store several instructions. A separate unit,
which we call the dispatch unit, takes instructions from the front of the queue and
sends them to the execution unit. This leads to the organization shown in Figure
5.10. The dispatch unit also performs the decoding function.
To be effective, the fetch unit must have sufficient decoding and processing
capability to recognize and execute branch instructions. It attempts to keep the
instruction queue filled at all times to reduce the impact of occasional delays when
fetching instructions. When the pipeline stalls because of a data hazard, for
example, the dispatch unit is not able to issue instructions from the instruction
queue. However, the fetch unit continues to fetch instructions and add them to the
queue. Conversely, if there is a delay in fetching instructions because of a branch or a
cache miss, the dispatch unit continues to issue instructions from the instruction
queue.
Suppose that initially the queue contains one instruction. Every fetch
operation adds one instruction to the queue and every dispatch operation reduces
the queue length by one. Hence, the queue length remains the same for the first
four clock cycles. (There is both an F and a D step in each of these cycles.) Suppose
that instruction I1 introduces a 2-cycle stall. Since space is available in the queue,
the fetch unit continues to fetch instructions and the queue length rises to 3 in
clock cycle 6.
Instruction I5 is a branch instruction. Its target instruction, Ik , is fetched in cycle
7, and instruction I6 is discarded. The branch instruction would normally cause a
stall in cycle 7 as a result of discarding instruction I6 . Instead, instruction I4 is
dispatched from the queue to the decoding stage. After discarding I6 , the queue
length drops to 1 in cycle 8. The queue length will remain at this value until another stall
is encountered.
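The give-and-take between the fetch and dispatch units can be mimicked with a toy simulation (the cycle numbers here are illustrative assumptions, not taken from the discussion above): the fetch unit adds one instruction per cycle, while a two-cycle stall blocks dispatch in cycles 3 and 4:

queue_len = 1                     # assume the queue starts with one instruction
for cycle in range(1, 9):
    fetched = 1                                  # fetch unit keeps fetching
    dispatched = 0 if cycle in (3, 4) else 1     # dispatch blocked by the stall
    queue_len += fetched - dispatched
    print("cycle", cycle, ": queue length =", queue_len)

The queue grows during the stall and can later be drained to hide a fetch delay caused by a branch or a cache miss, which is how the stall and the branch delay above end up overlapping instead of adding.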
CONDITIONAL BRANCHES AND BRANCH PREDICTION

A conditional branch instruction introduces the added hazard caused by the
dependency of the branch condition on the result of a preceding instruction. The
decision to branch cannot be made until the execution of that instruction has been
completed.
Branch instructions occur frequently. In fact, they represent about 20 percent
of the dynamic instruction count of most programs. (The dynamic count is the
number of instruction executions, taking into account the fact that some program
instructions are executed many times because of loops.) Because of the branch
penalty, this large percentage would reduce the gain in performance expected from
pipelining. Fortunately, branch instructions can be handled in several ways to reduce
their negative impact on the rate of execution of instructions.

Delayed Branch
In Figure 5.8, the processor fetches instruction I3 before it determines whether
the current instruction, I2 , is a branch instruction. When execution of I2 is
completed and a branch is to be made, the processor must discard I3 and fetch
the instruction at the branch target. The location following a branch instruction is
called a branch delay slot. There may be more than one branch delay slot, depending
on the time it takes to execute a branch instruction. For example, there are two
branch delay slots in Figure 5.9a and one delay slot in Figure 5.9b. The instructions
in the delay slots are always fetched and at least partially executed before the
branch decision is made and the branch target address is computed.

A technique called delayed branching can minimize the penalty incurred as a
result of conditional branch instructions. The idea is simple. The instructions in the
delay slots are always fetched. Therefore, we would like to arrange for them to be
fully executed whether or not the branch is taken. The objective is to be able to place
useful instructions in these slots. If no useful instructions can be placed in the delay
slots, these slots must be filled with NOP instructions. This situation is exactly the
same as in the case of data dependency.

Consider the instruction sequence given in Figure 5.12a. Register R2 is used as
a counter to determine the number of times the contents of register R1 are
shifted left. For a processor with one delay slot, the instructions can be reordered
as shown in Figure 5.12b. The shift instruction is fetched while the branch
instruction is being executed. After evaluating the branch condition, the processor
fetches the instruction at LOOP or at NEXT, depending on whether the branch
condition is true or false, respectively.

In either case, it completes execution of the shift instruction. The sequence of
events during the last two passes in the loop is illustrated in Figure 5.13. Pipelined
operation is not interrupted at any time, and there are no idle cycles. Logically, the
program is executed as if the branch instruction were placed after the shift
instruction. That is, branching takes place one instruction later than where the
branch instruction appears in the instruction sequence in the memory, hence the
name “delayed branch.”

Fig:5.12 Reordering of instructions for a delayed branch

Fig:5.13 Execution timing showing the delay slot being filled during the last two passes through the loop in Figure 5.12

The effectiveness of the delayed branch approach depends on how often it is
possible to reorder instructions as in Figure 5.12. Experimental data collected from many
programs indicate that sophisticated compilation techniques can use one branch delay
slot in as many as 85 percent of the cases. For a processor with two branch delay slots,
the compiler attempts to find two instructions preceding the branch instruction that it
can move into the delay slots without introducing a logical error. The chances of finding
two such instructions are considerably less than the chances of finding one. Thus, if
increasing the number of pipeline stages involves an increase in the number of branch
delay slots, the potential gain in performance may not be fully realized.

Branch Prediction
Another technique for reducing the branch penalty associated with conditional
branches is to attempt to predict whether or not a particular branch will be taken. The
simplest form of branch prediction is to assume that the branch will not take place and to
continue to fetch instructions in sequential address order. Until the branch condition is
evaluated, instruction execution along the predicted path must be done on a speculative
basis. Speculative execution means that instructions are executed before the processor
is certain that they are in the correct execution sequence. Hence, care must be taken that
no processor registers or memory locations are updated until it is confirmed that these
instructions should indeed be executed. If the branch decision indicates otherwise, the
instructions and all their associated data in the execution units must be purged, and the
correct instructions fetched and executed.

An incorrectly predicted branch is illustrated in Figure 5.14 for a four-stage pipeline.
The figure shows a Compare instruction followed by a Branch>0 instruction. Branch
prediction takes place in cycle 3, while instruction I3 is being fetched. The fetch unit
predicts that the branch will not be taken, and it continues to fetch instruction I4 as I3
enters the Decode stage. The results of the compare operation are available at the end
of cycle 3. Assuming that they are forwarded immediately to the instruction fetch unit,
the branch condition is evaluated in cycle 4. At this point, the instruction fetch unit
realizes that the prediction was incorrect, and the two instructions in the execution pipe are
purged. A new instruction, Ik , is fetched from the branch target address in clock cycle 5.

If branch outcomes were random, then half the branches would be taken. Then
the simple approach of assuming that branches will not be taken would save the time
lost to conditional branches 50 percent of the time. However, better performance can
be achieved if we arrange for some branch instructions to be predicted as taken and
others as not taken, depending on the expected program behavior. For example, a branch
instruction at the end of a loop causes a branch to the start of the loop for every pass
through the loop except the last one. Hence, it is advantageous to assume that this
branch will be taken and to have the instruction fetch unit start to fetch instructions at
the branch target address. On the other hand, for a branch instruction at the beginning
of a program loop, it is advantageous to assume that the branch will not be taken.
A decision on which way to predict the result of the branch may be made in
hardware by observing whether the target address of the branch is lower than or higher
than the address of the branch instruction. A more flexible approach is to have the
compiler decide whether a given branch instruction should be predicted taken or not
taken. The branch instructions of some processors, such as SPARC, include a branch
prediction bit, which is set to 0 or 1 by the compiler to indicate the desired behavior.
The instruction fetch unit checks this bit to predict whether the branch will be taken or
not taken.

With either of these schemes, the branch prediction decision is always the same
every time a given instruction is executed. Any approach that has this characteristic is
called static branch prediction. Another approach in which the prediction decision may
change depending on execution history is called dynamic branch prediction.
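As a concrete illustration of dynamic prediction, here is a sketch of the classic 2-bit saturating-counter scheme (a standard technique, not described in detail in this text): the prediction flips only after two consecutive mispredictions, which suits loop-closing branches well:

class TwoBitPredictor:
    def __init__(self):
        self.counter = 2                 # 0..3; 2 means weakly "taken"

    def predict(self):
        return self.counter >= 2         # True means predict taken

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]          # a loop branch: taken 9 times, then not
hits = 0
for actual in outcomes:
    hits += (p.predict() == actual)
    p.update(actual)
print(hits, "correct out of", len(outcomes))   # 9 out of 10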

INFLUENCES ON INSTRUCTION SETS

We have seen that some instructions are much better suited to pipelined execution than
others. For example, instruction side effects can lead to undesirable data dependencies.
In this section, we examine the relationship between pipelined execution and machine
instruction features. We discuss two key aspects of machine instructions — addressing
modes and condition code flags.

ADDRESSING MODES

Addressing modes should provide the means for accessing a variety of data structures
simply and efficiently. Useful addressing modes include index, indirect, autoincrement,
and autodecrement. Many processors provide various combinations of these modes to
increase the flexibility of their instruction sets. Complex addressing modes, such as those
involving double indexing, are often encountered.
In choosing the addressing modes to be implemented in a pipelined processor, we must
consider the effect of each addressing mode on instruction flow in the pipeline. Two
important considerations in this regard are the side effects of modes such as
autoincrement and autodecrement and the extent to which complex addressing modes
cause the pipeline to stall. Another important factor is whether a given mode is likely to
be used by compilers.
To compare various approaches, we assume a simple model for accessing operands in the
memory. The load instruction Load X(R1),R2 takes five cycles to complete execution, as
indicated in Figure 5.5. However, the instruction

Load (R1),R2
can be organized to fit a four-stage pipeline because no address computation is required.
Access to memory can take place in stage E. A more complex addressing mode may
require several accesses to the memory to reach the named operand. For example, the
instruction
Load (X(R1)),R2
may be executed as shown in Figure 5.16a, assuming that the index offset, X, is given in
the instruction word. After computing the address in cycle 3, the processor needs to
access memory twice — first to read location X+[R1] in clock cycle 4 and then to read
location [X+[R1]] in cycle 5. If R2 is a source operand in the next instruction, that
instruction would be stalled for three cycles, which can be reduced to two cycles with
operand forwarding, as shown.

Figure 5.16 Equivalent operations using complex and simple addressing modes.
To implement the same Load operation using only simple addressing modes
requires several instructions. For example, on a computer that allows three
operand addresses, we can use

Add #X,R1,R2
Load (R2),R2
Load (R2),R2

The Add instruction performs the operation R2 ← X+ [R1]. The two Load instructions
fetch the address and then the operand from the memory. This sequence of
instructions takes exactly the same number of clock cycles as the original, single
Load instruction, as shown in Figure 5.16b.

This example indicates that, in a pipelined processor, complex addressing
modes that involve several accesses to the memory do not necessarily lead to faster
execution. The main advantage of such modes is that they reduce the number of
instructions needed to perform a given task and thereby reduce the program space
needed in the main memory. Their main disadvantage is that their long execution
times cause the pipeline to stall, thus reducing its effectiveness. They require more
complex hardware to decode and execute them. Also, they are not
convenient for compilers to work with.
The instruction sets of modern processors are designed to take maximum
advantage of pipelined hardware. Because complex addressing modes are not suitable
for pipelined execution, they should be avoided. The addressing modes used in
modern processors often have the following features:

• Access to an operand does not require more than one access to the memory.
• Only load and store instructions access memory operands.
• The addressing modes used do not have side effects.

Three basic addressing modes that have these features are register, register
indirect, and index. The first two require no address computation. In the index
mode, the address can be computed in one cycle, whether the index value is given
in the instruction or in a register. Memory is accessed in the following cycle. None of
these modes has any side effects, with one possible exception. Some architectures,
such as ARM, allow the address computed in the index mode to be written back
into the index register. This is a side effect that would not be allowed under the
guidelines above. Note also that relative addressing can be used; this is a special
case of indexed addressing in which the program counter is used as the index
register.

CONDITION CODES

In many processors, the condition code flags are stored in the processor status
register. They are either set or cleared by many instructions, so that they can be
tested by subsequent conditional branch instructions to change the flow of
program execution. An optimizing compiler for a pipelined processor attempts to
reorder instructions to avoid stalling the pipeline when branches or data
dependencies between successive instructions occur. In doing so, the compiler
must ensure that reordering does not cause a change in the outcome of a
computation. The dependency introduced by the condition-code flags reduces the
flexibility available for the compiler to reorder instructions.
Consider the sequence of instructions in Figure 5.17a, and assume that the
execution of the Compare and Branch=0 instructions proceeds as in Figure 5.14. The
branch decision takes place in step E2 rather than D2 because it must await the
result of the Compare instruction. The execution time of the Branch instruction can
be reduced by interchanging the Add and Compare instructions, as shown in Figure
5.17b.

Figure 5.17 Instruction reordering.

This will delay the branch instruction by one cycle relative to the Compare instruction. As a
result, at the time the Branch instruction is being decoded the result of the Compare
instruction will be available and a correct branch decision will be made. There would be no
need for branch prediction. However, interchanging the Add and Compare instructions can
be done only if the Add instruction does not affect the condition codes.

These observations lead to two important conclusions about the way condition codes should be
handled. First, to provide flexibility in reordering instructions, the condition-code flags should be
affected by as few instructions as possible. Second, the compiler should be able to specify in
which instructions of a program the condition codes are affected and in which they are not. An
instruction set designed with pipelining in mind usually provides the desired flexibility. Figure
5.17b shows the instructions reordered assuming that the condition code flags are affected only
when this is explicitly stated as part of the instruction OP code. The SPARC and ARM
architectures provide this flexibility.

Large Computer Systems

Parallel Processing

Parallel processing is a term used to denote a large class of techniques that are used to
provide simultaneous data-processing tasks for the purpose of increasing the computational
speed of a computer system.
The purpose of parallel processing is to speed up the computer processing capability and
increase its throughput, that is, the amount of processing that can be accomplished during a
given interval of time. The amount of hardware increases with parallel processing, and with it,
the cost of the system increases.
Parallel processing can be viewed from various levels of complexity. At the lowest level, we
distinguish between parallel and serial operations by the type of registers used, e.g., shift
registers and registers with parallel load.
There are a variety of ways that parallel processing can be classified:
• Internal organization of the processors
• Interconnection structure between processors
• The flow of information through the system

M. J. Flynn considers the organization of a computer system by the number of instructions
and data items that are manipulated simultaneously.

• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data stream (SIMD)
• Multiple instruction stream, single data stream (MISD)
• Multiple instruction stream, multiple data stream (MIMD)

SISD
Represents the organization of a single computer containing a control unit, a processor unit, and a
memory unit. Instructions are executed sequentially, and the system may or may not have internal
parallel processing capabilities. Parallel processing may be achieved by means of multiple
functional units or by pipeline processing.
SIMD
Represents an organization that includes many processing units under the supervision of common
control unit. All processors receive the same instruction from the control unit but operate on
different items of data. The shared memory unit must contain multiple modules so that it can
communicate with all the processors simultaneously.
MISD & MIMD
MISD structure is only of theoretical interest since no practical system has been constructed using
this organization.
MIMD organization refers to a computer system capable of processing several programs at the same
time. e.g. multiprocessor and multicomputer system

Flynn’s classification depends on the distinction between the performance of the control unit and
the data-processing unit. It emphasizes the behavioral characteristics of the computer system
rather than its operational and structural interconnections. One type of parallel processing that
does not fit Flynn’s classification is pipelining.
Array Processing
An array processor is a processor that performs computations on large arrays of data. The term is
used to refer to two different types of processors:
• Attached array processor: an auxiliary processor intended to improve the performance of the
host computer in specific numerical computation tasks.
• SIMD array processor: a processor with a single-instruction, multiple-data organization that
manipulates vector instructions by means of multiple functional units responding to a common
instruction.
Attached Array Processor: Its purpose is to enhance the performance of the computer by
providing vector processing for complex scientific applications. The figure below shows the
interconnection of an attached array processor to a host computer. For example, when attached
to a VAX-11 computer, the FPS-164/MAX from Floating-Point Systems increases the computing
power of the VAX to 100 megaflops. The objective of the attached array processor is to provide
vector manipulation capabilities to a conventional computer at a fraction of the cost of a
supercomputer.

Fig : Attached array processor with host computer


SIMD Array Processor: An SIMD array processor is a computer with multiple processing units
operating in parallel. A general block diagram of an array processor is shown in the figure below. It
contains a set of identical processing elements (PEs), each having a local memory M. Each PE
includes an ALU, a floating-point arithmetic unit, and working registers. Vector instructions are
broadcast to all PEs simultaneously. Masking schemes are used to control the status of each PE
during the execution of vector instructions. Each PE has a flag that is set when the PE is active
and reset when the PE is inactive.

Fig: SIMD array processor organization

Characteristics of multiprocessors

A multiprocessor system is an interconnection of two or more CPUs with memory and input-output
equipment. The term “processor” in multiprocessor can mean either a central processing unit
(CPU) or an input-output processor (IOP). Multiprocessors are classified as multiple instruction
stream, multiple data stream (MIMD) systems.
The similarity between a multiprocessor and a computer network is that both support concurrent
operations. The distinction is that a network consists of several autonomous computers that may
or may not communicate with each other, whereas a multiprocessor system is controlled by one
operating system that provides interaction between processors, and all the components of the
system cooperate in the solution of a problem.
Multiprocessing improves the reliability of the system.
The benefit derived from a multiprocessor organization is an improved system performance.
Multiple independent jobs can be made to operate in parallel.

A single job can be partitioned into multiple parallel tasks.

Multiprocessing can improve performance by decomposing a program into parallel executable tasks.
One way is for the user to explicitly declare that certain tasks of the program be executed in
parallel; this must be done prior to loading the program by specifying the parallel executable
segments. The other is to provide a compiler with multiprocessor software that can automatically
detect parallelism in a user’s program. Multiprocessors are classified by the way their memory is
organized.

A multiprocessor system with common shared memory is classified as a shared-memory or tightly
coupled multiprocessor; such systems can tolerate a higher degree of interaction between tasks.
A system in which each processor element has its own private local memory is classified as a
distributed-memory or loosely coupled system; loosely coupled systems are most efficient when
the interaction between tasks is minimal.

THE STRUCTURE OF GENERAL-PURPOSE MULTIPROCESSORS

5.5.3.1 UMA (Uniform Memory Access) Multiprocessor


An interconnection network permits n processors to access k memories; thus, any of the
processors can access any of the memories. The interconnection network may introduce a
network delay between processor and memory.
A system which has the same network latency for all accesses from the processors to the
memory modules is called a UMA multiprocessor.
Although the latency is uniform, it may be large for a network that connects many processors and
many memory modules. For better performance, it is desirable to place a memory module close
to each processor.
Disadvantage:
Interconnection networks with very short delays are costly and complex to implement.

5.5.3.2 NUMA (Non-Uniform Memory Access) Multiprocessors

Memory modules are attached directly to the processors.


The network latency is avoided when a processor makes a request to access its local memory.
However, a request to access a remote memory module must pass through the network.
Because of the difference in latencies for accessing local and remote portions of the shared
memory, systems of this type are called NUMA multiprocessors.
Advantage:
A high computation rate is achieved in all processors
Disadvantage:
The remote accesses take considerably longer than accesses to the local memory.

5.5.3.3 Distributed Memory Systems


All memory modules serve as private memories for the processors that are directly connected to
them. A processor cannot access a remote memory without the cooperation of the remote
processor. This cooperation takes place in the form of messages exchanged by the processors.
Such systems are often called distributed-memory systems.

5.5.4 Interconnection Structures:
The components that form a multiprocessor system are CPUs, IOPs connected to input-output
devices, and a memory unit. The interconnection between the components can have different
physical configurations, depending on the number of transfer paths that are available between
the processors and memory in a shared-memory system, or among the processing elements in a
loosely coupled system. There are several physical forms available for establishing an
interconnection network:
• Time-shared common bus
• Multiport memory
• Crossbar switch
• Multistage switching network
• Hypercube system

Time Shared Common Bus:


A common-bus multiprocessor system consists of a number of processors connected
through a common path to a memory unit.
Disadvantage: Only one processor can communicate with the memory or another
processor at any given time. As a consequence, the total overall transfer rate within the
system is limited by the speed of the single path. A more economical implementation of a
dual-bus structure is depicted in the figure below. Part of the local memory may be designed
as a cache memory attached to the CPU.

Fig: Time shared common bus organization

Fig: System bus structure for multiprocessors

Multiport Memory:
A multiport memory system employs separate buses between each memory module and
each CPU. The module must have internal control logic to determine which port will have
access to memory at any given time. Memory access conflicts are resolved by assigning
fixed priorities to each memory port.
Advantage:
A high transfer rate can be achieved because of the multiple paths.
Disadvantage: It requires expensive memory control logic and a large number of cables
and connections.

Fig: Multiport memory organization
Crossbar Switch:
Consists of a number of crosspoints that are placed at intersections between processor
buses and memory module paths. The small square in each crosspoint is a switch that
determines the path from a processor to a memory module.
Advantage:
Supports simultaneous transfers from all memory modules.
Disadvantage: The hardware required to implement the switch can become quite large
and complex. The figure below shows the functional design of a crossbar switch connected to
one memory module.

Fig: Crossbar switch

Multistage Switching Network: The basic component of a multistage network is a two-
input, two-output interchange switch, as shown in the figure below.

Using the 2x2 switch as a building block, it is possible to build a multistage network to
control the communication between a number of sources and destinations. To see how this
is done, consider the binary tree shown in the figure below. Certain request patterns cannot be
satisfied simultaneously; for example, if P1 is connected to one of the destinations 000 through
011, then P2 can only be connected to one of the destinations 100 through 111. One such
topology is the omega switching network shown in the figure below.

Fig: 8 x 8 Omega Switching Network
Some request patterns cannot be connected simultaneously. For example, two
sources cannot be connected simultaneously to destinations 000 and 001. In a tightly
coupled multiprocessor system, the source is a processor and the destination is a
memory module. In a loosely coupled multiprocessor system, both the source and
destination are processing elements.
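Routing through the omega network can be described by a destination-tag rule (a common formulation, assumed here rather than quoted from this text): the switch at stage i examines bit i of the destination address, starting with the most significant bit; a 0 selects the upper switch output and a 1 selects the lower one. A minimal Python sketch:

def omega_route(dest, n_stages=3):
    # returns the output ("upper"/"lower") chosen at each of the n stages
    path = []
    for stage in range(n_stages):
        bit = (dest >> (n_stages - 1 - stage)) & 1
        path.append("lower" if bit else "upper")
    return path

print(omega_route(0b101))   # ['lower', 'upper', 'lower'] for destination 101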
Hypercube System:
The hypercube or binary n-cube multiprocessor structure is a loosely coupled
system composed of N = 2^n processors interconnected in an n-dimensional binary
cube. Each processor forms a node of the cube; in effect, it contains not only a CPU
but also local memory and an I/O interface. Each processor address differs from that
of each of its n neighbors by exactly one bit position. The figure below shows the
hypercube structure for n = 1, 2, and 3. Routing messages through an n-cube
structure may take from one to n links from a source node to a destination node. A
routing procedure can be developed by computing the exclusive-OR of the source
node address with the destination node address; the resulting binary value has 1 bits
in the positions corresponding to the axes on which the two nodes differ, and the
message is then sent along any one of those axes. A representative of the hypercube
architecture is the Intel iPSC computer complex. It consists of 128 (n = 7)
microcomputers; each node consists of a CPU, a floating-point processor, local
memory, and serial communication interface units.
Fig: Hypercube structures for n=1,2,3
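The exclusive-OR routing rule described above is easy to state in code. A minimal sketch (node addresses as integers; the axis numbering is an assumption of this sketch):

def hypercube_axes(src, dst, n=3):
    # the XOR of the two addresses has 1 bits exactly on the axes
    # along which the message must travel
    diff = src ^ dst
    return [axis for axis in range(n) if (diff >> axis) & 1]

# From node 010 to node 111 in a 3-cube: the addresses differ on axes 0 and 2,
# so the message needs two hops, one along each of those axes (in either order).
print(hypercube_axes(0b010, 0b111))   # [0, 2]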
Part A
1. What is Parallel Processing?
2. What is pipelining?
3. What is Arithmetic pipeline?
4. What is Instruction pipeline?
5. What is multiprocessing?
6. What are the Characteristics of multiprocessors?

Part B
1. What are Data Hazards and explain with examples?
2. Explain about different forms of parallel processing?
3. What are Instruction Hazards? Explain with examples?
4. What is array processing? Explain its characteristics.
5. Explain interconnection networks with neat diagrams.
