CS6303 Computer Architecture ACT Notes
series of processing steps, during which patterns of chemicals are placed on each wafer, creating the
transistors, conductors, and insulators.
The simplest way to cope with imperfection is to place many independent components on a single wafer.
The patterned wafer is then chopped up, or diced, into these components, called dies and more
informally known as chips. To reduce cost, the next-generation process shrinks a large die, as it
uses smaller sizes for both transistors and wires. This improves both the yield and the die count per wafer.
Once you've found good dies, they are connected to the input/output pins of a package, using a
process called bonding. These packaged parts are tested a final time, since mistakes can occur in packaging,
and then they are shipped to customers.
=============================================================================
5. Write short notes on performance?
Running a program on two different desktop computers, you'd say that the faster one is the desktop
computer that gets the job done first. If you were running a datacenter that had several servers running jobs
submitted by many users, you'd say that the faster computer was the one that completed the most jobs
during a day. As an individual computer user, you are interested in reducing response time (the time
between the start and completion of a task), also referred to as execution time.
Datacenter managers are often interested in increasing throughput or bandwidth, the total amount
of work done in a given time.
Measuring Performance:
The computer that performs the same amount of work in the least time is the fastest. Program
execution time is measured in seconds per program, and elapsed time includes everything: CPU work, I/O
waits, and time spent running other programs. CPU execution time, or simply CPU time, recognizes this
distinction: it is the time the CPU spends computing for this task and does not include time spent waiting
for I/O or running other programs. CPU time can be further divided into the CPU time spent in the
program, called user CPU time, and the CPU time spent in the operating system performing tasks on
behalf of the program, called system CPU time.
We use the term system performance to refer to elapsed time on an unloaded system, and CPU
performance to refer to user CPU time.
CPU Performance and Its Factors:
CPU execution time for a program = CPU clock cycles for a program X Clock cycle time
Alternatively, because clock rate and clock cycle time are inverses,
CPU execution time for a program = CPU clock cycles for a program / Clock rate
This formula makes it clear that the hardware designer can improve performance by reducing the number
of clock cycles required for a program or the length of the clock cycle.
Instruction Performance:
The performance equations above did not include any reference to the number of instructions needed
for the program. However, execution time does depend on the number of instructions in a program: execution
time equals the number of instructions executed multiplied by the average time per instruction. The clock
cycles required for a program can therefore be written as
CPU clock cycles = Instructions for a program X Average clock cycles per instruction
The term clock cycles per instruction, which is the average number of clock cycles each instruction takes to
execute, is often abbreviated as CPI. CPI provides one way of comparing two different implementations
of the same instruction set architecture, since the number of instructions executed for a program will be the
same.
The Classic CPU Performance Equation:
The basic performance equation in terms of instruction count (the number of instructions executed by the
program), CPI, and clock cycle time:
CPU time = Instruction count X CPI X Clock cycle time
or, since the clock rate is the inverse of clock cycle time:
CPU time = Instruction count X CPI / Clock rate
These formulas are particularly useful because they separate the three key factors that affect performance.
Components of performance          | Units of measure
CPU execution time for a program   | Seconds for the program
Instruction count                  | Instructions executed for the program
Clock cycles per instruction (CPI) | Average number of clock cycles per instruction
Clock cycle time                   | Seconds per clock cycle
We can measure the CPU execution time by running the program, and the clock cycle time is
usually published as part of the documentation for a computer. The instruction count and CPI can be more
difficult to obtain. Of course, if we know the clock rate and CPU execution time, we need only one of the
instruction count or the CPI to determine the other.
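As a worked illustration of the equation (the numbers below are hypothetical, chosen only to keep the arithmetic simple), the classic formula can be evaluated directly, here in C:

#include <stdio.h>

int main(void) {
    /* Hypothetical values, for illustration only */
    double instruction_count = 10e9;  /* 10 billion instructions executed */
    double cpi = 2.0;                 /* average clock cycles per instruction */
    double clock_rate = 4e9;          /* 4 GHz */
    /* Classic CPU performance equation */
    double cpu_time = instruction_count * cpi / clock_rate;
    printf("CPU time = %.2f seconds\n", cpu_time);   /* prints 5.00 */
    return 0;
}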
=========================================================================================
Energy, and thus power, can be reduced by lowering the voltage, which occurred with each new
generation of technology, since power is a function of the voltage squared. In 20 years, voltages have gone
from 5 V to 1 V, which is why the increase in power is only 30 times. To try to address the power problem,
designers have already attached large devices to increase cooling, and they turn off parts of the chip that
are not used in a given clock cycle.
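The quadratic dependence mentioned here comes from the standard dynamic power relation for CMOS logic (stated for completeness):
Power (dynamic) = Capacitive load x Voltage^2 x Frequency switched
Dropping the supply voltage from 5 V to 1 V therefore cuts the Voltage^2 factor alone by a factor of 25.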
=============================================================================
7. Write short notes on operations and operands in computer hardware?
Operations in MIPS:
Every computer must be able to perform arithmetic. The MIPS assembly language notation
add a, b, c
instructs a computer to add the two variables b and c and to put their sum in a.
This notation is rigid in that each MIPS arithmetic instruction performs only one operation and must always
have exactly three variables.
EXAMPLE: To add four variables b, c, d, and e, and store the sum in a:
add a, b, c # The sum of b and c is placed in a
add a, a, d # The sum of b, c, and d is now in a
add a, a, e # The sum of b, c, d, and e is now in a
Thus, it takes three instructions to sum the four variables.
Design Principle 1: Simplicity favors regularity.
EXAMPLE: Compiling Two C Assignment Statements into MIPS
This segment of a C program contains the five variables a, b, c, d, and e. Since Java evolved from C, this
example and the next few work for either high-level programming language:
a = b + c;
d = a - e;
The translation from C to MIPS assembly language instructions is performed by the compiler; the MIPS
code produced by a compiler is shown below. A MIPS instruction operates on two source operands and places the
result in one destination operand. Hence, the two simple statements above compile directly into these two
MIPS assembly language instructions:
add a, b, c
sub d, a, e
Operands in MIPS:
The operands of arithmetic instructions are restricted; they must be from a limited number of special
locations built directly in hardware called registers. The size of a register in the MIPS architecture is 32
bits; groups of 32 bits occur so frequently that they are given the name word in the MIPS architecture.
Design Principle 2: Smaller is faster.
A very large number of registers may increase the clock cycle time simply because electronic signals
take longer when they must travel farther. So, 32 registers are used in the MIPS architecture. The MIPS
convention is to use two-character names following a dollar sign to represent a register, e.g., $s0, $s1.
Example: f = (g + h) - (i + j); implemented using registers:
add $t0,$s1,$s2   # register $t0 contains g + h
add $t1,$s3,$s4   # register $t1 contains i + j
sub $s0,$t0,$t1   # f gets $t0 - $t1, which is (g + h) - (i + j)
Memory Operands: Programming languages have simple variables that contain single data
elements, as in these examples, but they also have more complex data structures, such as arrays and structures.
These complex data structures can contain many more data elements than there are registers in a computer.
The processor can keep only a small amount of data in registers, but computer memory contains billions
of data elements. So, MIPS must include instructions that transfer data between memory and registers.
Such instructions are called data transfer instructions. To access a word in memory, the instruction
must supply the memory address.
The data transfer instruction that copies data from memory to a register is traditionally called load.
The format of the load instruction is the name of the operation followed by the register to be loaded, then
a constant and register used to access memory. The sum of the constant portion of the instruction and the
contents of the second register forms the memory address.
The actual MIPS name for this instruction is lw, standing for load word.
EXAMPLE:
g = h + A[8];
To get A[8] from memory, use lw:
lw $t0,8($s3)      # Temporary reg $t0 gets A[8]
Then use the result A[8], stored in $t0:
add $s1,$s2,$t0    # g = h + A[8]
The constant in a data transfer instruction (8) is called the offset, and the register added to form the
address ($s3) is called the base register.
In MIPS, words must start at addresses that are multiples of 4. This requirement is called an
alignment restriction, and many architectures have it. (Since in MIPS each word in memory is 32 bits,
the address of one word to the next jumps in multiples of 4.)
Byte addressing also affects the array index. To get the proper byte address in the code above, the offset to
be added to the base register $s3 must be 4 x 8, or 32 (revising the previous example).
Revisiting the above EXAMPLE:
g = h + A[8]; (implemented using byte addresses)
To get A[8] from memory, use lw and calculate (8 x 4) = 32, which is the actual byte offset:
lw $t0,32($s3)     # Temporary reg $t0 gets A[8]
Then use the result A[8], stored in $t0:
add $s1,$s2,$t0    # g = h + A[8]
The instruction complementary to load is traditionally called store; it copies data from a register to
memory. The format of a store is similar to that of a load: the name of the operation, followed by the
register to be stored, then the offset to select the array element, and finally the base register. Once again, the
MIPS address is specified in part by a constant and in part by the contents of a register. The actual
MIPS name is sw, standing for store word.
EXAMPLE:
A[12] = h + A[8];
lw $t0,32($s3)     # Temporary reg $t0 gets A[8]; note (8 x 4) is used.
add $t0,$s2,$t0    # Temporary reg $t0 gets h + A[8]
sw $t0,48($s3)     # Stores h + A[8] back into A[12]; note (12 x 4) is used.
Constant or Immediate Operands:
For example, to add the constant 4 to register $s3, we could load the constant from memory:
lw $t0, AddrConstant4($s1)   # $t0 = constant 4
add $s3,$s3,$t0              # $s3 = $s3 + $t0 ($t0 == 4)
An alternative that avoids the load instruction is to offer versions of the arithmetic instructions in which one
operand is a constant, for example the add immediate (addi) instruction:
addi $s3,$s3,4               # $s3 = $s3 + 4
Constant operands occur frequently, and by including constants inside arithmetic instructions, operations
are much faster and use less energy than if constants were loaded from memory.
=============================================================================
8. Write short notes on Instructions and its types that are used in MIPS?
Since registers are referred to in instructions, there must be a convention to map register names into
numbers. In MIPS assembly language, registers $s0 to $s7 map onto registers 16 to 23, and registers $t0
to $t7 map onto registers 8 to 15. Hence, $s0 means register 16, $s1 means register 17, $s2 means register
18, . . . , $t0 means register 8, $t1 means register 9, and so on.
MIPS Fields for instruction
Multiple formats complicate the hardware; we can reduce the complexity by keeping the formats
similar. For example, the first three fields of the R-type and I-type formats are the same size and have the
same names; the length of the fourth field in I-type is equal to the sum of the lengths of the last three
fields of R-type. Note that the meaning of the rt field changes for the lw instruction: the rt field specifies
the destination register.
(Figure: instruction encodings for lw, add, and sw.)
The lw instruction is identified by 35 (op field), the add instruction that follows is specified with 0 (op
field), and the sw instruction is identified with 43 (op field).
Also, shifting left by i bits gives the same result as multiplying by 2^i (refer to the above representation for
9 and 144: 9 x 2^4 = 9 x 16 = 144, where i = 4, since the left shift is done 4 times).
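A minimal C check of the same identity (illustrative only):

#include <stdio.h>
#include <assert.h>

int main(void) {
    int x = 9;
    /* Shifting left by 4 bits multiplies by 2^4 = 16 */
    assert((x << 4) == x * 16);
    printf("%d\n", x << 4);   /* prints 144 */
    return 0;
}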
LOGICAL AND (and)
AND is a bit-by-bit operation that leaves a 1 in the result only if both bits of the operands are 1.
For example, if register $t2 contains
0000 0000 0000 0000 0000 1101 1100 0000two
and
register $t1 contains
0000 0000 0000 0000 0011 1100 0000 0000two
then, after executing the MIPS instruction
and $t0,$t1,$t2   # reg $t0 = reg $t1 & reg $t2
the value of register $t0 would be
0000 0000 0000 0000 0000 1100 0000 0000two
(Bitwise example, note: do not add:
   00101
 & 10111
 = 00101   (apply the AND truth table to each bit position))
AND is traditionally called a mask, since the mask conceals some bits.
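The same masking idea written as a short C sketch (the constants are the 32-bit patterns from the example above, written in hex):

#include <stdio.h>

int main(void) {
    unsigned int t2 = 0x00000DC0;  /* 0000 ... 1101 1100 0000 two */
    unsigned int t1 = 0x00003C00;  /* 0000 ... 0011 1100 0000 0000 two */
    unsigned int t0 = t1 & t2;     /* bit-by-bit AND */
    printf("0x%08X\n", t0);        /* prints 0x00000C00 */
    return 0;
}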
LOGICAL OR (or)
It is a bit-by-bit operation that places a 1 in the result if either operand bit is a 1. To elaborate, if the registers
$t1 and $t2 are unchanged from the preceding i.e.,
register $t2 contains
0000 0000 0000 0000 0000 1101 1100 0000two
and
register $t1 contains
0000 0000 0000 0000 0011 1100 0000 0000two
then, after executing the MIPS instruction
or $t0,$t1,$t2   # reg $t0 = reg $t1 | reg $t2
the value in register $t0 would be
0000 0000 0000 0000 0011 1101 1100 0000two
(Bitwise example:
   00101
 | 10111
 = 10111   (apply the OR truth table to each bit position))
LOGICAL NOT (nor)
The final logical operation is a contrarian. NOT takes one operand and places a 1 in the result if the operand
bit is a 0, and vice versa. Since MIPS needs a three-operand format, the designers of MIPS decided to
include the instruction NOR (NOT OR) instead of NOT.
Step 1: Perform a bitwise OR with a second operand of all zeros:
   00101
 | 00000   (dummy operand register filled with zeros)
 = 00101
Step 2: Invert the above result, which gives 11010.
Instruction: nor $t0,$t1,$t3   # reg $t0 = ~(reg $t1 | reg $t3)
Constants are useful in AND and OR logical operations as well as in arithmetic operations, so MIPS
also provides the instructions and immediate (andi) and or immediate (ori).
=============================================================================
10. Write short notes on Decision making and branching instructions? ( control operations )
Branch and Conditional branches: Decision making is commonly represented in programming languages
using the if statement, sometimes combined with go to statements and labels. MIPS assembly language
includes two decision-making instructions, similar to an if statement with a go to. The first instruction is
beq register1, register2, L1
This instruction means go to the statement labeled L1 if the value in register1 equals the value in register2.
The mnemonic beq stands for branch if equal. The second instruction is
bne register1, register2, L1
It means go to the statement labeled L1 if the value in register1 does not equal the value in register2.
The mnemonic bne stands for branch if not equal. These two instructions are traditionally called
conditional branches.
EXAMPLE: if (i == j) f = g + h; else f = g - h; the MIPS version of the given statements is
bne $s3,$s4, Else      # go to Else if i != j
add $s0,$s1,$s2        # f = g + h (skipped if i != j)
j Exit                 # go to Exit
Else: sub $s0,$s1,$s2  # f = g - h (skipped if i == j)
Exit:
Loops:
Decisions are important both for choosing between two alternatives (found in if statements) and for
iterating a computation (found in loops). The same assembly instructions are the basic building blocks for
both cases (if and loop).
EXAMPLE: while (save[i] == k)
i += 1;
The MIPS version of the given statements follows. Assume that i and k correspond to registers $s3 and $s5
and the base of the array save is in $s6.
Loop: sll $t1,$s3,2    # Temp reg $t1 = i * 4
To get the address of save[i], we need to add $t1 and the base of save in $s6:
add $t1,$t1,$s6        # $t1 = address of save[i]
Now we can use that address to load save[i] into a temporary register:
lw $t0,0($t1)          # Temp reg $t0 = save[i]
The next instruction performs the loop test, exiting if save[i] != k:
bne $t0,$s5, Exit      # go to Exit if save[i] != k
The next instruction adds 1 to i:
addi $s3,$s3,1         # i = i + 1
j Loop                 # go to Loop
Exit:
UNIT 2
ARITHMETIC OPERATIONS
PART B
1. Write short notes on Addition and Subtraction?
Digits are added bit by bit from right to left, with carries passed to the next digit to the left.
Example: 7 + (-6):
  0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
+ 1111 1111 1111 1111 1111 1111 1111 1010two = -6ten (2's complement of 6)
_______________________________________________________
= 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
Overflow cannot occur in the following cases:
When adding operands with different signs, e.g., -10 + 4 = -6: the sum is not larger in magnitude than
one of the operands, so there is no overflow when adding a +ve and a -ve operand.
When the signs of the operands are the same in subtraction, overflow cannot occur, e.g., c - a = c + (-a),
since subtraction is done via the add operation by negating the second operand.
Overflow occurrence detection:
Adding or subtracting two 32-bit numbers can yield a result that needs 33 bits to be fully expressed.
The lack of a 33rd bit means that when overflow occurs, the sign bit is set with a value of the result
instead of the proper sign of the result. Overflow occurs when we:
add two positive numbers and the sum is negative (or add two negative numbers and the sum is positive);
subtract a negative number from a positive number and get a negative result;
subtract a positive number from a negative number and get a positive result.
MIPS has 2 choices of instructions:
Add (add), add immediate (addi), and subtract (sub) cause exceptions on overflow. E.g., Fortran
compilers use this type of instruction.
Add unsigned (addu), add immediate unsigned (addiu), and subtract unsigned (subu) do not cause
exceptions on overflow. E.g., C compilers use this type of instruction.
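A C sketch of the detection rules listed above (illustrative; since signed overflow is undefined behaviour in C, the sum is formed on unsigned values, which wrap around the way the hardware adder does):

#include <stdbool.h>
#include <stdio.h>

/* Overflow occurs on x + y exactly when both operands have the same
   sign and the sum's sign differs (the rules listed above). */
bool add_overflows(int x, int y) {
    int sum = (int)((unsigned)x + (unsigned)y);   /* wrap-around add */
    return (x >= 0) == (y >= 0) && (sum >= 0) != (x >= 0);
}

int main(void) {
    printf("%d\n", add_overflows(2000000000, 2000000000)); /* 1: two positives, negative sum */
    printf("%d\n", add_overflows(-10, 4));                 /* 0: signs differ, never overflows */
    return 0;
}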
Based on the paper-and-pencil technique shown initially, the multiplicand is moved left each step and
the multiplier is moved right after each of its bits has been processed. The number of iterations needed to
find the product equals the number of bits in the multiplier; in MIPS this is 32 iterations.
Example: Multiply 2ten x 3ten, or 0010two x 0011two. (4 bits are used to save space.)
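A C sketch of the shift-and-add algorithm just described (illustrative; the multiplicand shifts left, the multiplier shifts right, one iteration per multiplier bit):

#include <stdint.h>
#include <stdio.h>

/* Shift-and-add multiplication, one iteration per multiplier bit */
uint64_t multiply(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t mcand = multiplicand;     /* shifts left each step */
    for (int i = 0; i < 32; i++) {     /* 32 iterations, as in MIPS */
        if (multiplier & 1)            /* if the low bit is 1, add */
            product += mcand;
        mcand <<= 1;                   /* multiplicand moves left */
        multiplier >>= 1;              /* multiplier moves right */
    }
    return product;
}

int main(void) {
    printf("%llu\n", (unsigned long long)multiply(2, 3)); /* prints 6 */
    return 0;
}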
Division follows the grammar-school long-division approach. Example: 1001010ten divided by 1000ten:

                  1001ten       Quotient
Divisor 1000ten ) 1001010ten    Dividend
                 -1000
                 -----
                    10
                    101
                    1010
                   -1000
                   -----
                      10ten     Remainder
Each iteration of the algorithm needs to move the divisor to the right one digit, so we start with the
divisor placed in the left half of the 64-bit Divisor register and shift it right 1 bit each step.
The Remainder register is initialized with the dividend.
Example: Using a 4-bit version of the algorithm to save space, divide 7ten by 2ten, or 0000 0111two by 0010two.
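A C sketch of this first-version restoring algorithm (illustrative, assuming unsigned 32-bit operands; the divisor starts in the left half of a 64-bit register and shifts right each step, the remainder register starts with the dividend, and there are 33 steps):

#include <stdint.h>
#include <stdio.h>

void divide(uint32_t dividend, uint32_t divisor32,
            uint32_t *quotient, uint32_t *remainder) {
    uint64_t rem = dividend;                       /* Remainder register = dividend */
    uint64_t divisor = (uint64_t)divisor32 << 32;  /* divisor in left half of 64 bits */
    uint32_t quot = 0;
    for (int i = 0; i <= 32; i++) {                /* 33 iterations */
        rem -= divisor;                            /* trial subtraction */
        if ((int64_t)rem < 0) {                    /* went negative: restore */
            rem += divisor;
            quot = quot << 1;                      /* shift 0 into quotient */
        } else {
            quot = (quot << 1) | 1;                /* shift 1 into quotient */
        }
        divisor >>= 1;                             /* divisor moves right */
    }
    *quotient = quot;
    *remainder = (uint32_t)rem;
}

int main(void) {
    uint32_t q, r;
    divide(7, 2, &q, &r);
    printf("q=%u r=%u\n", q, r);   /* prints q=3 r=1 */
    return 0;
}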
4. Explain floating point number representation and the IEEE 754 format.
In mathematics, such numbers are called real numbers; floating point is the computer representation of them.
Floating-Point Representation:
The representation has 2 parts: a fraction part and an exponent part. The size of the fraction part
is used to increase or decrease precision; the size of the exponent part is used to increase or decrease the
range of the number.
The representation also has 1 sign bit, where bit 0 represents a +ve value and bit 1 represents a negative value:
(-1)^S x F x 2^E
Overflow here means that the exponent is too large to be represented in the exponent field.
When a nonzero value is smaller than the smallest nonzero fraction that can be represented, i.e., it is
too small to store in the fraction field, we call it underflow.
(fig (a), fig (b): for single precision, the representable range spans roughly 10^-38 to 10^+38.)
The exponent -1 (as in 2^-1), if represented in 2's complement format, would look in fig (b) as if a large
exponent value were sitting in the exponent field; to avoid this confusion, a bias value is used.
IEEE 754 uses a bias of 127 for single precision, so an exponent of -1 is represented by the bit
pattern of the value -1 + 127ten, or 126ten, and +1 is represented by 1 + 127ten, or 128ten,
so the largest magnitude represented by a single-precision floating-point number is
+/-1.11111111111111111111111two x 2^+127
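A small C check of the bias rule (a sketch; it assumes the platform's float is IEEE 754 single precision, which is true on essentially all modern hardware):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 0.5f;                    /* 0.5 = 1.0 x 2^-1 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* view the raw bit pattern */
    uint32_t exp_field = (bits >> 23) & 0xFF;           /* 8-bit exponent field */
    printf("stored exponent = %u\n", exp_field);        /* prints 126 = -1 + 127 */
    printf("true exponent   = %d\n", (int)exp_field - 127);  /* prints -1 */
    return 0;
}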
5. State and explain the arithmetic addition operations on floating point numbers.
Add numbers in scientific notation: e.g., 9.999ten x 10^1 + 1.610ten x 10^-1.
Assume that we can store only four decimal digits of the significand and two decimal digits of the
exponent.
Step 1. Align the decimal point of the number that has the smaller exponent. Here 1.610ten x 10^-1 is
the number with the smaller exponent of the two, so moving its decimal point left, we achieve the same
exponent as the larger number:
1.610ten x 10^-1 = 0.1610ten x 10^0 = 0.01610ten x 10^1
Step 2. Add the significands: 9.999ten + 0.016ten = 10.015ten, so the sum is
10.015ten x 10^1.
Step 3. This sum is not in normalized scientific notation, so we need to adjust it:
10.015ten x 10^1 = 1.0015ten x 10^2
Whenever the exponent is increased or decreased, we must check for overflow or underflow; that is, we
must make sure that the exponent still fits in its field.
Step 4. Since we assumed that the significand can be only four digits long (excluding the sign), we must
round the number. The number 1.0015ten x 10^2 is rounded to four digits in the significand to
1.002ten x 10^2
Example based on the above algorithm, using binary numbers (Binary Floating-Point Addition):
Add the numbers 0.5ten and -0.4375ten in binary. Converting the given numbers to binary, we get
1.000two x 2^-1 and -1.110two x 2^-2.
Step 1: The number with the smaller exponent is shifted right: -1.110two x 2^-2 = -0.111two x 2^-1
Step 2: Add the significands: 1.000two x 2^-1 + (-0.111two x 2^-1) = 0.001two x 2^-1
Step 3: Normalize the sum, checking for underflow and overflow:
0.001two x 2^-1 = 0.010two x 2^-2 = 0.100two x 2^-3 = 1.000two x 2^-4
Step 4: Round the sum: 1.000two x 2^-4. The sum already fits in the 4-bit significand, so no change.
The decimal value is 1.000two x 2^-4 = 0.0001two; on conversion we get 1 x 2^-4 = 1/16ten = 0.0625ten.
Adding normally, the sum of 0.5ten and -0.4375ten is 0.0625ten. Thus verified.
6. State and explain the arithmetic multiplication operations on floating point numbers.
Multiplying decimal numbers in scientific notation by hand:
1.110ten x 10^10 x 9.200ten x 10^-5
Assume that we can store only four digits of the significand and two digits of the exponent.
Step 1. Unlike addition, we calculate the exponent of the product by simply adding the exponents of the
operands together:
New exponent = 10 + (-5) = 5
Step 2. Next comes the multiplication of the significands:
     1.110ten
   x 9.200ten
   ----------
        0000
       0000
      2220
     9990
   ----------
   10212000ten
There are three digits to the right of the decimal point for each operand, so the decimal point is placed six
digits from the right in the product significand:
10.212000ten
Assuming that we can keep only three digits to the right of the decimal point, the product is 10.212ten x 10^5.
Step 3. This product is unnormalized, so we need to normalize it:
10.212ten x 10^5 = 1.0212ten x 10^6
Check for overflow and underflow.
Step 4. We assumed that the significand is only four digits long (excluding the sign), so we must round the
number. The number 1.0212ten x 10^6 is rounded to four digits in the significand to 1.021ten x 10^6.
Step 5. The sign of the product depends on the signs of the original operands. If they are both the same, the
sign is positive; otherwise, it's negative. Hence, the product is
+1.021ten x 10^6
Example based on the above algorithm, using binary numbers (Binary Floating-Point Multiplication):
Multiply the numbers 0.5ten and -0.4375ten in binary. Converting the numbers to binary gives
1.000two x 2^-1 and -1.110two x 2^-2.
Step 1. Add the exponents: -1 + (-2) = -3.
Step 2. Multiply the significands: 1.000two x 1.110two = 1.110two, so the product is 1.110two x 2^-3.
Step 3. The product is already normalized, and the exponent -3 causes neither overflow nor underflow.
Step 4. Round: 1.110two already fits in the 4-bit significand.
Step 5. The signs differ, so the product is negative: -1.110two x 2^-3 = -0.21875ten, which matches
0.5ten x -0.4375ten. Thus verified.
The pipelined approach overlaps the stages of execution without disturbing the other stages.
Now each person's waiting time is reduced and each stage's idle time is reduced.
The same principles apply to processors, where pipelined instruction execution is applied. MIPS
instructions classically take five steps, or stages:
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
All the pipeline stages take a single clock cycle, so the clock cycle must be long
enough to accommodate the slowest operation. Here the lw instruction takes 800 ps (the worst case), so the
clock cycle must be at least 800 ps.
If the stages are perfectly balanced, then the time between instructions on the pipelined processor,
assuming ideal conditions, is equal to:
Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipe stages
Pipelined approach: in the above EXAMPLE's sequence of load instructions, the stages are overlapped
during execution as given below.
The total time taken for execution is 1400 ps, while in the non-pipelined approach it is 2400 ps.
But the formula suggests:
Time taken (pipelined approach) = time taken (non-pipelined approach) / number of stages
= 2400 ps / 5
= 480 ps
The practical result, however, is 1400 ps.
So only when the number of instructions in a pipelined execution is high enough can the theoretical
execution speed be achieved or nearly achieved.
Pipelining improves performance by increasing instruction throughput.
----------------------------------
2. Explain types of pipeline Hazards?
There are situations in pipelining when the next instruction cannot execute in the following clock cycle.
These events are called hazards. There are 3 types of Hazards:
1. structural hazard: When a planned instruction cannot execute in the proper clock cycle
because the hardware does not support the combination of instructions that are set to execute.
2. data hazard / pipeline data hazard: When a planned instruction cannot execute in the
proper clock cycle because data that is needed to execute the instruction is not yet available.
3. control hazard / branch hazard: When the proper instruction cannot execute in the
proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the
flow of instruction addresses is not what the pipeline expected.
Structural hazard:
The hardware cannot support the combination of instructions that we want to execute in the same
clock cycle. Assume that we had a single memory instead of two memories. If the pipeline in the figure of
Part B Q.No 1 had a fourth instruction, we would see that in the same clock cycle the first instruction
is accessing data from memory while the fourth instruction is fetching an instruction from that same
memory. Without two memories, our pipeline could have a structural hazard.
Data Hazards:
Data hazards occur when the pipeline must be stalled because one step must wait for another to
complete. In a computer pipeline, data hazards arise from the dependence of one
instruction on an earlier one that is still in the pipeline.
For example, suppose we have an add instruction followed immediately by a subtract instruction that
uses the sum (here shown with illustrative operands, the add writing $s0 and the sub reading it):
add $s0, $t0, $t1
sub $t2, $s0, $t3
The add instruction doesn't write its result until the fifth stage, meaning that we would have to waste
three clock cycles in the pipeline.
To resolve the data hazard in the code sequence above, as soon as the ALU creates the sum for the
add, we can supply it as an input for the subtract. Adding extra hardware to retrieve
the missing item early from the internal resources is called forwarding or bypassing.
Figure below shows the connection to forward the value in $s0 after the execution stage of the add
instruction as input to the execution stage of the sub instruction.
forwarding paths are valid only if the destination stage is later in time than the source stage.
For example, there cannot be a valid forwarding path from the output of the memory access stage in
the first instruction to the input of the execution stage of the following, since that would mean going
backward in time.
Forwarding cannot prevent all pipeline stalls,
Suppose the first instruction were a load of $s0 instead of an add:
lw $s0, 20($t1)
The desired data would be available only after the fourth stage of the first instruction in the dependence,
which is too late for the input of the third stage of the sub instruction. So even with forwarding, we would
have to stall one stage for a load-use data hazard. The figure below shows an important pipeline concept,
officially called a pipeline stall, but often given the nickname bubble.
Control Hazards:
A control hazard arises from the need to make a decision based on the results of one instruction while
others are executing. The equivalent decision task in a computer is the branch instruction.
Having only just received the branch instruction from memory, the pipeline cannot possibly know what the
next instruction to load should be. One possible solution is to stall immediately after we fetch a branch,
waiting until the pipeline determines the outcome of the branch.
The cost of this option is too high for most computers to use and so a second solution to the control hazard
is to predict. This option does not slow down the pipeline when you are correct.
One simple approach is to always predict that branches will be untaken.
Only when branches are taken does the pipeline stall (the prediction goes wrong).
In the case of loops, branches are predicted to be taken, because they jump back to the top of the loop.
In the case of conditional statements, branches are predicted to be untaken, because they jump in the
forward direction with respect to the program flow.
Applying the PREDICTION METHODOLOGIES
Static branch prediction
o Based on typical branch behavior of the branching statements.
o Example: loop and if-statement branches
Refer to the above diagram: instructions and data move generally from left to right through the five stages
as they complete execution.
There are, however, two exceptions to this left-to-right flow of instructions:
o The write-back stage, which places the result back into the register file in the middle of the
data path.
o The selection of the next value of the PC, choosing between the incremented PC and the
branch address from the MEM stage.
Data flowing from right to left does not affect the current instruction; these reverse data movements
influence only later instructions in the pipeline.
o The first right-to-left flow of data can lead to data hazards
o The second right-to-left leads to control hazards.
To retain the value of an individual instruction for its other four stages, the value read from instruction
memory must be saved in a register. Similar arguments apply to every pipeline stage, so we must
place registers wherever there are dividing lines between stages.
The registers are named for the two stages separated by that register. For example, the
pipeline register between the IF and ID stages is called IF/ID. Notice that there is no pipeline register at the
end of the write-back stage.
All instructions must update some state in the processor: the register file, memory, or the PC.
When an exception occurs, the contents needed to resume execution must be saved in the pipeline
registers, while the remaining pipeline register contents can be discarded. The registers can be seen in the
diagram below.
Consider the load instruction, abbreviated lw, and how the operations of this instruction are carried out
in a pipeline, as follows: (lw $s0, 32($t0))
1. Instruction fetch: The top portion of Figure shows the instruction being read from memory using the
address in the PC and then being placed in the IF/ID pipeline register. The PC address is incremented by 4
and then written back into the PC to be ready for the next clock cycle. This incremented
address is also saved in the IF/ID pipeline register in case it is needed later for a branch instruction, such as
beq.
2. Instruction decode and register file read: The bottom portion of Figure shows the instruction portion
of the IF/ID pipeline register supplying the 16-bit immediate field of the load instruction( here in our
example is 32), which is sign-extended to 32 bits, and the register numbers to read the two registers ( $s0
and $t0). All three values are stored in the ID/EX pipeline register, along with the incremented PC address.
3. Execute or address calculation:
Figure shows that the load instruction reads the contents of register1
($t0) and the sign-extended immediate (value 32) from the ID/EX pipeline register and adds them using the
ALU. That sum is placed in the EX/MEM pipeline register.
4. Memory access: The top portion of Figure shows the load instruction reading the data memory using the
address from the EX/MEM pipeline register and loading the data into the MEM/WB pipeline register.
5. Write-back: The bottom portion of Figure shows the final step: reading the data from the MEM/WB
pipeline register and writing it into the register file ($s0) in the middle of the figure.
The following figure shows the reading and writing operations done during a stage (here, the instruction
decode stage). It shows that the read operation is performed from the IF/ID register to get input for the
stage, and after that stage completes execution the results are written into the ID/EX register.
Note: The read and write operations are shaded differently.
The other stages follow similarly, with the same read/write operations with respect to the execution of
each stage and the pipeline buffers.
Control signals are derived from the instruction, as in the single-cycle implementation; the signals of
different instructions that are active during their respective stages are shown in table 4.49. Implementing control
means setting the nine control lines to these values in each stage for each instruction. The simplest way to do
this is to extend the pipeline registers to include control information. Since the control lines start with the
EX stage, we can create the control information during instruction decode. The figure 4.50 shows that these
control signals are then used in the appropriate pipeline stage as the instruction moves down the pipeline,
just as the destination register number for loads moves down the pipeline.
--------------------------------------
5. Explain data hazard and the ways to detect and handle the data hazards?
Consider this sequence:
sub $2, $1,$3
and $12,$2,$5
or  $13,$6,$2
add $14,$2,$2
sw  $15,100($2)
The last four instructions are all dependent on the result in register $2 of the first instruction. Suppose
register $2 had the value 10 before the subtract instruction and -20 after the subtraction.
The intention of the logic is that the rest of the four instructions use the value -20 from register $2,
i.e., the value after the subtraction operation.
The above diagram shows the dependence of each instruction with respect to the first instruction (sub) and
the result stored in register $2.
As shown above, the and & or instructions would get the incorrect value 10 (the value of $2
before execution of the sub instruction).
The instructions that would get the correct value of -20 are add & sw, since both instructions
need the value from / after CC5 (clock cycle 5).
TO DETECT AND FORWARD WITHOUT STALL:
To avoid a stall, the result can be forwarded to the and & or instructions from CC3, where the result is
available at the end of the EX stage (as shown in the above diagram).
To detect the data hazard, the dependence of a source register on any destination register of a previous
instruction can be found by checking the following conditions.
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
When the result is available in CC3 and one of the above conditions is TRUE, the hazard is detected and
forwarding must be initiated to the next instruction at the end of the EX stage (CC3) of the current
instruction. Here, in this example, condition 1a is TRUE:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 is TRUE.
Similarly, when the result is available in CC4 and one of the conditions below is TRUE, the hazard is
detected and forwarding must be initiated to the next instruction at the end of the MEM stage (CC4) of the
current instruction:
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
Here, in the given example, condition 2b is TRUE:
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt = $2 is TRUE.
Since $2 is the second operand of the or instruction, 2b is TRUE. All four conditions above apply only
when the forwarding instruction will actually write a register (its RegWrite signal is active) and its
destination register rd is not register $0, since $0 always holds the value zero.
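A C-style sketch of these checks (illustrative; the struct and function names are hypothetical, while the field names mirror the ID/EX, EX/MEM, and MEM/WB pipeline-register notation above):

#include <stdbool.h>
#include <stdio.h>

struct PipelineState {
    bool ex_mem_reg_write, mem_wb_reg_write;  /* RegWrite gating */
    int  ex_mem_rd, mem_wb_rd;                /* destination registers */
    int  id_ex_rs, id_ex_rt;                  /* source registers in EX */
};

/* Conditions 1a/1b: forward from the EX/MEM pipeline register */
bool forward_from_ex_mem(const struct PipelineState *p, int src) {
    return p->ex_mem_reg_write && p->ex_mem_rd != 0 && p->ex_mem_rd == src;
}

/* Conditions 2a/2b: forward from the MEM/WB pipeline register */
bool forward_from_mem_wb(const struct PipelineState *p, int src) {
    return p->mem_wb_reg_write && p->mem_wb_rd != 0 && p->mem_wb_rd == src;
}

int main(void) {
    /* sub $2,$1,$3 in EX/MEM; and $12,$2,$5 in ID/EX */
    struct PipelineState p = { true, false, 2, 0, 2, 5 };
    printf("forward rs from EX/MEM? %d\n", forward_from_ex_mem(&p, p.id_ex_rs)); /* 1 */
    return 0;
}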
PART A
1. How can memory access be made faster, and which hazard can be reduced when memory is accessed faster?
Ans: Pipelining is an implementation technique in which multiple instructions are
overlapped in execution. Today, pipelining is key to making processors fast; hence memory access is faster.
The control hazard can be reduced.
UNIT 4 - PARALLELISM
PART B
1. Explain instruction level parallelism and the difficulties in implementing it?
Ans: The goal of ILP is to achieve not only instruction overlap, but the actual execution of more than one
instruction at a time through dynamic scheduling, and to maximize the throughput of a processor. For
typical RISC processors, instructions usually depend on each other, and as a result the amount of overlap is limited.
Instruction level dependencies and hazards during ILP
If two instructions are not dependent, then they can execute simultaneously, assuming sufficient
resources (that is, no structural hazards). If one instruction depends on another, they must execute in order,
though they may still partially overlap.
1. Data dependencies: Data dependence means that one instruction is dependent on another if there exists a
chain of dependencies between them. Compilers can be of great help in detecting and scheduling around
these sorts of hazards; hardware can only resolve these dependencies with severe limitations.
2. Name Dependencies: A name dependency occurs when two instructions use the same register or memory
location, called a name, but there is no flow of data between them.
An anti-dependence occurs when j writes a register/memory location that i reads.
An output dependence occurs when i and j write to the same register/memory location.
The name used in the instruction is changed so that they do not conflict. This technique is known as register
renaming (uses temp registers).
3. Data Hazards:
A data hazard arises when there is a data dependency between instructions and they are close enough to
cause the pipeline to stall. There are three types of data hazards:
read after write (RAW): j tries to read a source before i writes it; this is the most common type and is a
true data dependence.
Example sequence: 1. i=i+1; 2. j=i+1
write after write (WAW): j tries to write an operand before it is written by i; this corresponds to the
output dependence.
Example sequence: 1. i=i+1; 2. print i; 3. i=j+1
write after read (WAR): j tries to write a destination before i has read it; this corresponds to an
anti-dependence.
Example sequence: 1. read i; 2. i=j+1
4. Control Dependencies:
A control dependency determines the order of an instruction i with respect to a branch,
if (p1)
S1
if (p2)
S2
S1 is control dependent on p1 and S2 is control dependent on p2
Speculatively executed instructions do not affect the program state until the branch result is determined. This
implies that the instructions executed speculatively must not raise an exception or otherwise cause side
effects.
1. Scoreboarding.
Instructions pass through two in-order front phases:
issue: decode the instruction and check for structural hazards;
read operands: wait until no data hazards obtain, then read the operands and start executing.
Scoreboarding dynamically schedules the pipeline. Instructions must pass through the issue phase in order,
but they can stall or bypass each other in the read-operands phase and so enter, or even complete,
execution out of order.
e.g., the CDC 6600 used a scoreboard. The goal of a scoreboard is to maintain a processor throughput of one
instruction per clock cycle (with no structural hazard). If the next instruction would stall, it is stored in a
queue and a later instruction is started; the scoreboard takes full responsibility for instruction issue and execution.
It uses as many as 16 separate functional units.
2. Tomasulo's solution for dynamic scheduling.
Instructions execute only when their operands are available; a waiting instruction is stored in a
reservation station. Reservation stations keep track of pending instructions (resolving RAW hazards).
WAW hazards can be avoided using register renaming (80 registers).
The Tomasulo architecture executes instructions in three phases; each phase may take more than one clock cycle:
Phase 1 (issue): gets the next instruction from the fetch unit. If there is a reservation station available, the
instruction is inserted; if not (a structural hazard is detected), it waits for a reservation station.
Phase 2 (execute): each reservation station monitors the output of the functional units. When all operands
are available, the appropriate functional unit can execute the instruction (RAW is eliminated). To preserve
exception behaviour, no instruction is allowed to initiate execution until all preceding branches are
finalised.
Phase 3 (write results): when execution finishes and the results are available, data are routed into both
the register file and any reservation stations that depend on the result. A memory store is usually held in a
separate queue until both address and data are available.
3. Branch prediction.
It uses a simple saturating n-bit counter as the predictor. Each time a particular branch is taken, its entry
is incremented; otherwise it is decremented. If the most significant bit in the counter is set, then predict that
the branch is taken.
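A minimal C sketch of such a saturating counter, using the common 2-bit width (illustrative only):

#include <stdbool.h>
#include <stdio.h>

/* 2-bit saturating counter: values 0..3; MSB set (>= 2) means predict taken */
typedef struct { unsigned counter; } Predictor;

bool predict(const Predictor *p) { return p->counter >= 2; }

void update(Predictor *p, bool taken) {
    if (taken && p->counter < 3) p->counter++;        /* saturate at 3 */
    else if (!taken && p->counter > 0) p->counter--;  /* saturate at 0 */
}

int main(void) {
    Predictor p = { 0 };
    bool outcomes[] = { true, true, false, true };    /* a sample branch history */
    for (int i = 0; i < 4; i++) {
        printf("predict %d, actual %d\n", predict(&p), outcomes[i]);
        update(&p, outcomes[i]);
    }
    return 0;
}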
Limitations of ILP
1. An instruction stream needs to be run on an ideal processor with no significant limitations.
2. The ideal processor always predicts branches correctly and has no structural hazards.
3. This eliminates all control and name dependencies (only data dependencies remain).
4. Theoretically, it is possible for the last dynamically executed instruction in the program to be
scheduled on the first cycle.
-------------------------------------------------------------------------------------------------------------------------------
2. How is parallel processing implemented, and what architectures are used? or Explain Flynn's classification?
Ans: A processor executes programs by executing machine instructions in sequence and one at a time. Each
instruction is executed in a sequence of operations (fetch instruction, fetch operands, perform operation,
store result).
The taxonomy first introduced by Flynn is still the most common way of categorizing systems with parallel
processing capability. Flynn proposed the following categories of computer system:
Single Instruction, Single Data (SISD) system: A single processor executes a single instruction stream on
data stored in a single memory. Uniprocessors fall into this category.
Single Instruction, Multiple Data (SIMD) system: A single machine instruction controls the simultaneous
execution of a number of processing elements on a lockstep basis. Each processing element has an
associated data memory, so that each instruction is executed on a different set of data by the different
processors. Vector and array processors fall into this category.
Multiple Instruction, Single Data (MISD) system: A sequence of data is transmitted to a set of processors,
each of which executes a different instruction sequence. This structure has never been implemented.
Multiple Instruction, Multiple Data (MIMD) system: A set of processors simultaneously execute
different instruction sequences on different data sets. SMPs, clusters, and NUMA systems fit into this
category.
MIMD can be subdivided into two main categories:
Symmetric multiprocessor (SMP): In an SMP, multiple processors share a single memory or a pool
of memory by means of a shared bus or other interconnection mechanism. A distinguishing feature is
that the memory access time to any region of memory is approximately the same for each processor.
Nonuniform memory access (NUMA): The memory access time to different regions of memory
may differ for a NUMA processor.
Organization:
The organization of a multiprocessor system is shown in Figure 10.1.
There are multiple independent nodes; each node contains multiple processors, each with its
own L1 and L2 caches, plus main memory. The node is the basic building block of the overall CC-NUMA
organization.
The nodes are interconnected by means of some communication facility, which could be a switching
mechanism, a ring, or some other networking facility. Each node in the CC-NUMA system includes some
main memory.
Memory: From the point of view of the processors, there is only a single addressable memory, with each
location having a unique system-wide address.
When a processor initiates a memory access, if the requested memory location is not in the
processor's cache, then the L2 cache initiates a fetch operation. If the desired line is in the local portion of
the main memory, the line is fetched across the local bus.
If the desired line is in a remote portion of the main memory, then an automatic request is sent out to
fetch that line across the interconnection network, deliver it to the local bus, and then deliver it to the
requesting cache on that bus.
All of this activity is atomic and transparent to the processor and its cache.
In this configuration, cache coherence is a central concern. For that, each node must maintain some sort of
directory that gives it an indication of the location of various portions of memory and also cache status
information.
============================================================================
Addressing: It must be possible to distinguish modules on the bus to determine the source and destination
of data
Arbitration: Any I/O module can temporarily function as master. A mechanism is provided to arbitrate
competing requests for bus control, using some sort of priority scheme.
Time sharing: when one module is controlling the bus, other modules are locked out and must, if
necessary, suspend operation until bus access is achieved.
The bus organization has several advantages compared with other approaches:
Simplicity: This is the simplest approach to multiprocessor organization. The physical interface and
the addressing, arbitration and time sharing logic of each processor remain the same as in a single
processor system.
Flexibility: It is generally easy to expand the system by attaching more processors to the bus.
Reliability: The bus is essentially a passive medium and the failure of any attached device should
not cause failure of the whole system.
The main drawback of the bus organization is performance: the speed of the system is limited by the
bus cycle time. To improve performance, each processor can be equipped with a local cache memory.
Multiport Memory:
The multiport memory approach allows the direct, independent access of main memory modules by
each processor and I/O module. The multiport memory system is shown in Figure 10.3 (Multiport memory).
The multiport memory approach is more complex than the bus approach, requiring a fair amount of
logic to be added to the memory system. Logic associated with memory is required for resolving conflicts.
The method often used to resolve conflicts is to assign permanently designated priorities to each memory
port.
============================================================================
4. Explain multi core organization and its processors? or Explain the implementation of multicore
organisation?
Ans:
At the top level of description, the main variables in a multicore organization are as follows:
The number of core processors on the chip.
The number of levels of cache memory.
The amount of cache memory that is shared.
Dedicated L1 cache:
Figure 18.8a shows an organization found in some of the earlier multicore
computer chips and still seen in embedded chips. In this organization, the only on-chip cache is L1 cache,
with each core having its own dedicated L1 cache. Almost invariably, the L1 cache is divided into
instruction and data caches. An example of this organization is the ARM11 MPCore.
Dedicated L2 cache:
Figure 18.8b shows an organization in which there is also no on-chip cache
sharing. In this case, there is enough area available on the chip to allow for L2 cache. An example of this
organization is the AMD Opteron.
Shared L2 cache:
Figure 18.8c shows a similar allocation of chip space to memory, but with the
use of a shared L2 cache. The Intel Core Duo has this organization.
Shared L3 cache:
Figure 18.8d: as the amount of cache memory available on the chip continues
to grow, performance considerations dictate splitting off a separate, shared L3 cache, with dedicated L1 and
L2 caches for each core processor.
The Intel Core i7 is an example of this organization (shared L3 cache).
The advantages of a shared on-chip cache include the following:
1. Constructive interference can reduce overall miss rates. That is, if a thread on one core accesses a main
memory location, this brings the frame containing the referenced location into the shared cache. If a thread
on another core soon thereafter accesses the same memory block, the memory locations will already be
available in the shared on-chip cache.
2. A related advantage is that data shared by multiple cores is not replicated at the shared cache level.
3. With proper frame replacement algorithms, the amount of shared cache allocated to each core is dynamic
4. Interprocessor communication is easy to implement, via shared memory locations.
A potential advantage to having only dedicated L2 caches on the chip is that each core enjoys more
rapid access to its private L2 cache. This is advantageous for threads that exhibit strong locality.
Command Decoding: The I/O module accepts commands from the processor, typically sent as
signals on the control bus.
Data: Data are exchanged between the processor and the I/O module over the data bus.
Status Reporting: Because peripherals are so slow, it is important to know the status of the I/O
module. For example, if an I/O module is asked to send data to the processor (read), it may not be
ready to do so because it is still working on the previous I/O command. This fact can be reported
with a status signal. Common status signals are BUSY and READY.
Address Recognition: Just as each word of memory has an address, so does each I/O
device. Thus an I/O module must recognize one unique address for each peripheral it controls.
On the other hand, the I/O module must be able to perform device communication. This communication
involves commands, status information, and data.
Data Buffering: An essential task of an I/O module is data buffering. The data buffering is required due to
the mismatch of the speed of CPU, memory and other peripheral devices. In general, the speed of CPU
is higher than the speed of the other peripheral devices. So, the I/O modules store the data in a data buffer
and regulate the transfer of data as per the speed of the devices.
In the opposite direction, data are buffered so as not to tie up the memory in a slow transfer operation.
Thus the I/O module must be able to operate at both device and memory speed.
Error Detection: Another task of the I/O module is error detection and subsequently reporting errors to the
processor. One class of errors includes mechanical and electrical malfunctions reported by the device (e.g.
paper jam). Another class consists of unintentional changes to the bit pattern as it is transmitted from the
device to the I/O module.
Block diagram of I/O Module is shown in the Figure 6.1.
Since I/O devices are included in the same memory address space, the status and address registers of
I/O modules are treated as memory locations by the processor. Therefore, the same machine instructions
are used to access both memory and I/O devices.
Isolated or I/O -mapped I/O:
In this scheme, the full range of addresses may be available for both. The address refers to a memory
location or an I/O device is specified with the help of a command line.
In general:
if IO/M = 1, it indicates that the address present on the address bus is the address of an I/O device
(i.e., IO=1 and M=0);
if IO/M = 0, it indicates that the address present on the address bus is the address of a memory location
(i.e., IO=0 and M=1).
Since the full range of addresses is available for both memory and I/O devices, with 16 address lines the
system may now support both 2^16 memory locations and 2^16 I/O addresses.
(Note: If I/O module is asked for 16 marks try to combine both 1st and 2nd question content)
====================================================================
3. Brief about I/O subsystems or different I/O systems ?
There are three basic forms of input and output systems
Programmed I/O
Interrupt driven I/O
Direct Memory Access(DMA)
With programmed I/O, the processor executes a program that gives it direct control of the I/O operation,
including sensing device status, sending a read or write command, and transferring the data.
With interrupt driven I/O, the processor issues an I/O command, continues to execute other instructions,
and is interrupted by the I/O module when the I/O module completes its work.
In Direct Memory Access (DMA), the I/O module and main memory exchange data directly without
processor involvement.
(Figure: the memory address space and the special I/O device address space.)
With both programmed I/O and interrupt-driven I/O, the processor is responsible for extracting data from
main memory for an output operation and storing data in main memory for an input operation. To send
data to an output device, the CPU simply moves that data to:
a special memory location in the I/O address space, if I/O-mapped input/output is used, or
an address in the memory address space, if memory-mapped I/O is used.
To read data from an input device, the CPU simply moves data from the address (I/O or memory) of that
device into the CPU.
Input/Output Operation: An input or output operation looks very similar to a memory read or write
operation, except that it usually takes more time, since peripheral devices are slower than main memory
modules. The working principle of the three methods for input of a block of data is shown in Figure 6.2.
Fig 6.2 Working of three techniques for input of block of data.
================================================================
4. Write short notes on Programmed control I/O? (use flowchart from Q. No 3 fig a)
With programmed I/O, the processor executes a program that gives it direct control of the I/O operation,
including sensing device status, sending a read or write command, and transferring the data.
Input/Output Port: An I/O port is a device that looks like a memory cell to the computer but contains a
connection to the outside world. An I/O port typically uses a latch. When the CPU writes to the address
associated with the latch, the latch device captures the data and makes it available on a set of wires external
to the CPU and memory system. I/O ports can be read-only, write-only, or read/write. The write-only
port is shown in Figure 6.3.
For memory-mapped I/O, any instruction that accesses memory can access a memory-mapped I/O port.
I/O-mapped input/output uses special instructions to access I/O ports.
Generally, a given peripheral device will use more than a single I/O port.
For example, a printer: the input port returns control signals from the printer. These signals indicate
whether the printer is ready to accept another character, is off-line, is out of paper, etc. The output port
transmits control information to the printer, such as whether data is available to print.
In programmed I/O, the data transfer between the CPU and an I/O device is carried out with the help of a
software routine. When a processor is executing a program and encounters an instruction relating to I/O,
it executes that I/O instruction by issuing a command to the appropriate I/O module. The following steps
are followed while executing the I/O instruction/operation.
The I/O module will perform the requested action and then set the appropriate bits in the I/O status
register.
It is the responsibility of the processor to check periodically the status of the I/O module until it finds
that the operation is complete.
When the processor issues a command to an I/O module, it must wait until the I/O operation is
complete.
The I/O devices are slower than the processor, so in this scheme CPU time is wasted: the CPU keeps
checking the status of the I/O module periodically without doing any other work.
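A C sketch of the busy-wait polling this describes (illustrative; the device registers are simulated here as plain variables, whereas on real hardware they would be volatile memory-mapped addresses):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical device registers, simulated for this sketch */
static volatile uint8_t io_status;   /* bit 0 = READY */
static volatile uint8_t io_data;
#define READY 0x01

uint8_t programmed_io_read(void) {
    while ((io_status & READY) == 0)
        ;                            /* busy-wait: CPU time is wasted here */
    return io_data;                  /* device is ready; transfer the data */
}

int main(void) {
    io_data = 42;
    io_status = READY;               /* pretend the device became ready */
    printf("%u\n", programmed_io_read());
    return 0;
}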
I/O Commands
To execute an I/O-related instruction, the processor issues an address, specifying the particular I/O
module and external device, and an I/O command. There are four types of I/O commands that an I/O module
may receive when it is addressed by a processor:
Control: Used to activate a peripheral device and instruct it what to do. For example, a magnetic
tape unit may be instructed to rewind or to move forward one record. These commands are specific
to a particular type of peripheral device.
Test: Used to test various status conditions associated with an I/O module and its peripherals. The
processor will want to know whether the most recent I/O operation is complete and whether any error
has occurred.
Read: Causes the I/O module to obtain an item of data from the peripheral and place it in an
internal buffer.
Write: Causes the I/O module to take an item of data (byte or word) from the data bus and
subsequently transmit the data item to the peripheral.
================================================================
5. Write short notes on Interrupt driven I/O? (use flowchart from Q. No 3 fig b)
The problem with programmed I/O is that the processor has to wait a long time for the I/O module of concern to be ready for either reception or transmission of data; while waiting, it must repeatedly interrogate the status of the I/O module.
The solution to this problem is to provide an interrupt mechanism. In this approach the processor issues an I/O command to a module and then goes on to do some other useful work. The I/O module then interrupts the processor to request service when it is ready to exchange data. The processor executes the data transfer and, once the transfer is over, resumes its former processing.
I/O module point of view
For input, the I/O module receives a READ command from the processor.
The I/O module then proceeds to read data from an associated peripheral device.
Once the data are in the module's data register, the module issues an interrupt to the processor over a control line.
The module then waits until its data are requested by the processor.
When the request is made, the module places its data on the data bus and is then ready for another
I/O operation.
Processor point of view
The processor issues a READ command to the I/O module.
It then does something else (e.g. the processor may be working on several different programs at the same time).
At the end of each instruction cycle, the processor checks for interrupts.
When the interrupt from an I/O module occurs, the processor saves the context (e.g. program counter
& processor registers) of the current program and processes the interrupt.
In this case, the processor reads the word of data from the I/O module and stores it in memory.
It then restores the context of the program it was working on and resumes execution.
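A hedged C sketch of this sequence follows; the interrupt is modelled as a direct call to the handler, and the names (on_interrupt, device_reg) are illustrative, not a real API. Context save and restore, done by hardware and the interrupt mechanism on a real machine, is left as a comment.

    /* Sketch of interrupt-driven input: the processor issues READ,
     * does other work, and the handler runs only when data is ready. */
    #include <stdio.h>

    static char device_reg;      /* stands in for the module's data register */
    static char buffer[16];      /* memory area the processor fills          */
    static int  count;

    static void on_interrupt(void) {    /* invoked when the module is ready  */
        buffer[count++] = device_reg;   /* processor moves the word to memory */
    }

    int main(void) {
        /* processor issues READ, then continues with other useful work */
        for (int i = 0; i < 3; i++) {
            device_reg = 'a' + i;  /* device finishes and raises an interrupt */
            on_interrupt();        /* context save/restore elided (hardware)  */
        }
        printf("received %d bytes: %.3s\n", count, buffer);
        return 0;
    }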
Interrupt Processing
The occurrence of an interrupt triggers a number of events, both in the processor hardware and in
software. When an I/O device completes an I/O operation, the following sequences of hardware events
occurs:
1. The device issues an interrupt signal to the processor.
2. The processor finishes execution of the current instruction before responding to the interrupt.
3. The processor tests for the interrupt; if there is an interrupt pending, the processor sends an acknowledgement signal to the device that issued the interrupt. After getting the acknowledgement, the device removes its interrupt signal.
4. The processor now needs to prepare to transfer control to the interrupt routine. It needs to save the information required to resume the current program at the point of interrupt. The minimum information to be saved is the processor status word (PSW) and the location of the next instruction to be executed, which is nothing but the contents of the program counter. These can be pushed onto the system control stack.
5. The processor now loads the program counter with the entry location of the interrupt handling
program that will respond to the interrupt.
================================================================
6. Explain Interrupt processing in detail?
Interrupt Processing (instruction level)
1. An interrupt occurs while the processor is executing the instruction at location N (the current instruction).
2. At that point, the value of the program counter is N+1 (the next instruction).
3. The processor services the interrupt after completing execution of the current instruction (N).
4. First, it moves the contents of the general registers to the system stack.
5. Then it moves the program counter value (N+1) to the system stack.
6. The top of the system stack is maintained by the stack pointer.
7. The value of the stack pointer is modified to point to the new top of the stack.
8. If M elements are moved to the system stack, the value of the stack pointer changes from T to T-M.
9. Next, the program counter is loaded with the starting address of the interrupt service routine.
10. The processor starts executing the interrupt service routine.
On return from the interrupt:
1. The interrupt service routine starts at location X and the return instruction is at location X + L.
2. After fetching the return instruction, the value of the program counter becomes X + L + 1.
3. While returning to the user's program, the processor must restore the earlier values.
4. From the control stack, it restores the value of the program counter and the general registers.
5. Accordingly, it resets the top of the stack and the stack pointer is updated.
6. Now the processor resumes execution of the user's program (the interrupted program) from memory location N + 1. A toy model of this stack arithmetic follows.
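The following C toy model assumes a stack that grows downward from an arbitrary top-of-stack index T = 100; pushing M words moves the stack pointer from T to T - M, and popping restores it, exactly as in steps 4-8 and the return sequence above.

    /* Toy model of context save/restore during interrupt entry and exit. */
    #include <stdio.h>

    #define T 100                 /* assumed initial top-of-stack index */
    int stack_mem[128];
    int sp = T;

    static void push(int v) { stack_mem[--sp] = v; }
    static int  pop(void)   { return stack_mem[sp++]; }

    int main(void) {
        int pc = 43;              /* N+1: return address in the user program */
        int regs[3] = {7, 8, 9};  /* general registers to be saved           */

        for (int i = 0; i < 3; i++) push(regs[i]);   /* save registers  */
        push(pc);                                    /* save PC (N+1)   */
        printf("sp moved from %d to %d (M = %d)\n", T, sp, T - sp);

        /* ... ISR starting at X runs; its return instruction is at X+L ... */

        pc = pop();                                  /* restore PC        */
        for (int i = 2; i >= 0; i--) regs[i] = pop();/* restore registers */
        printf("resume user program at location %d\n", pc);
        return 0;
    }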
When there are multiple I/O modules, the processor needs to identify which device issued the interrupt. If multiple interrupts have occurred, the processor also needs to decide which one to handle first. Common techniques include multiple interrupt lines, software poll, and daisy chain.
Multiple Interrupt Lines
The most straightforward approach is to provide multiple interrupt lines between the processor and the I/O modules. However, it is impractical to dedicate more than a few bus lines or processor pins to interrupt lines. Thus, even though multiple interrupt lines are used, it is most likely that each line will have multiple I/O modules attached to it, and one of the other techniques must be used on each line.
Software Poll
When the processor detects an interrupt, it branches to an interrupt service routine whose job is
to poll each I/O module to determine which module caused the interrupt.
The poll could be implemented with the help of a separate command line (e.g. TEST I/O). In this case, the processor raises TEST I/O and places the address of a particular I/O module on the address lines. The I/O module responds positively if it set the interrupt.
Alternatively, each I/O module could contain an addressable status register. The processor then reads
the status register of each I/O module to identify the interrupting module.
Once the correct module is identified, the processor branches to a device service routine specific to
that device.
The main disadvantage of software poll is that it is time consuming: the processor has to check the status of each I/O module, and in the worst case the number of checks equals the number of I/O modules.
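A minimal C sketch of the poll, with each module's status register simulated as an array element; in the worst case the loop inspects every module before finding the requester.

    /* Software poll: scan each module's status register in turn. */
    #include <stdio.h>

    #define N_MODULES 4
    int status_reg[N_MODULES] = {0, 0, 1, 0};  /* module 2 raised the interrupt */

    static void service_device(int id) { printf("servicing module %d\n", id); }

    int main(void) {
        for (int m = 0; m < N_MODULES; m++) {  /* time-consuming linear scan   */
            if (status_reg[m]) {               /* this module set the interrupt */
                service_device(m);             /* branch to its service routine */
                status_reg[m] = 0;
                break;
            }
        }
        return 0;
    }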
Daisy Chain:
In this method all I/O modules share a common interrupt request line; however, the interrupt acknowledge line is connected in a daisy-chain fashion. When the processor senses an interrupt, it sends out an interrupt acknowledgement.
The interrupt acknowledge signal propagates through the series of I/O modules until it reaches a requesting module.
The requesting module typically responds by placing a word on the data lines. This word is
referred to as a vector and is either the address of the I/O module or some other unique
identification.
In either case, the processor uses the vector as a pointer to the appropriate device service routine.
This avoids the need to execute a general interrupt service routine first. This technique is referred
to as a vectored interrupt. The daisy chain arrangement is shown in the Figure 6.7.
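A small C sketch of vectored dispatch: the vector indexes a table of device service routines, so no general polling routine runs first. The handler names and table are made up for illustration.

    /* Vectored interrupt dispatch via a table of service routines. */
    #include <stdio.h>

    static void disk_isr(void)    { printf("disk service routine\n"); }
    static void printer_isr(void) { printf("printer service routine\n"); }

    static void (*vector_table[])(void) = { disk_isr, printer_isr };

    int main(void) {
        int vector = 1;           /* word the module placed on the data lines */
        vector_table[vector]();   /* direct jump to the right handler         */
        return 0;
    }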
==========================================================================
8. Write short notes on DMA (Direct Memory Access)?
With programmed and interrupt-driven I/O, the I/O transfer rate is limited by the speed with which the processor can test and service a device.
To transfer large blocks of data at high speed, a special control unit may be provided to allow transfer of a block of data directly between an external device and the main memory, without continuous intervention by the processor. This approach is called direct memory access, or DMA. DMA transfers are performed by a control circuit associated with the I/O device, referred to as the DMA controller. The DMA controller allows direct data transfer between the device and the main memory without involving the processor.
To transfer data between memory and I/O devices, the DMA controller takes over control of the system from the processor, and the transfer of data takes place over the system bus. For this purpose, the DMA controller must use the bus only when the processor does not need it, or it must force the processor to suspend operation temporarily. The latter technique is more common and is referred to as cycle stealing, because the DMA module in effect steals a bus cycle.
The processor then continues with other work, having delegated this I/O operation to the DMA module. The DMA module checks the status of the I/O device whose address was communicated to the DMA controller by the processor. If the specified I/O device is ready for data transfer, the DMA module generates a DMA request to the processor. The processor then indicates the release of the system bus through a DMA acknowledge.
The DMA module transfers the entire block of data, one word at a time, directly to or from memory, without going through the processor. When the transfer is complete, the DMA module sends an interrupt signal to the processor, which then takes back the system bus. Thus the processor is involved only at the beginning and end of the transfer; during the transfer itself, the processor is suspended from using the bus.
It is not necessary to complete the current instruction before suspending the processor; the processor may be suspended just after the completion of the current bus cycle. In other words, the processor can be suspended just before it needs the system bus, because it is the DMA controller, not the processor, that is going to use the bus. The points in the instruction cycle where the processor may be suspended are shown in Figure 6.10.
The net effect is that the processor executes more slowly during the transfer. Overall, however, performance is enhanced, because for a multiple-word I/O transfer DMA is far more efficient than interrupt-driven or programmed I/O.
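A hedged C sketch of delegating a block transfer to a DMA controller. The register layout (addr, count, done) is hypothetical, and memcpy stands in for the word-at-a-time transfer the DMA module performs using stolen bus cycles.

    /* Processor sets up the transfer, goes on with other work, and is
     * told by an interrupt (the done flag here) when the block is in. */
    #include <stdio.h>
    #include <string.h>

    struct dma_ctrl {
        void  *addr;     /* memory address register        */
        size_t count;    /* word/byte count register       */
        int    done;     /* set when the block is finished */
    };

    static void dma_start(struct dma_ctrl *c, const char *device_data) {
        memcpy(c->addr, device_data, c->count);  /* transfer without the CPU */
        c->done = 1;                             /* then interrupt processor */
    }

    int main(void) {
        char buf[8] = {0};
        struct dma_ctrl c = { buf, 7, 0 };  /* processor programs the DMA   */
        dma_start(&c, "blkdata");           /* and continues with other work */
        if (c.done)                         /* interrupt: transfer complete  */
            printf("DMA wrote: %s\n", buf);
        return 0;
    }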
==========================================================================
9. Write short notes on different DMA mechanism and its configuration?
(Note: if DMA is asked for 16 marks combine both Q.No 8 and Q.No 9 put points
according to the question)
The DMA mechanism can be configured in different ways. The most common among them are: single-bus detached DMA, single-bus integrated DMA-I/O, and DMA with a separate I/O bus. In the I/O-bus configuration, the DMA module is connected to the I/O devices through this I/O bus. During the device-to-DMA transfer the system bus is not in use, so the processor need not be suspended. There is then a second transfer phase between the DMA module and memory; for this phase the system bus is needed and the processor is suspended for one bus cycle.
The configuration is shown in Figure 6.13.
UNIT V (MEMORY)
1. Write short notes on Memories?
Digital computers work on the stored-program concept introduced by Von Neumann. We use memory to store information, which includes both program and data. For several reasons we have different kinds of memory, and we use them at different levels. The memory of a computer is broadly divided into two categories:
1. Internal and
2. External
Internal memory is used by the CPU to perform its tasks, while external memory is used to store bulk information, including large software and data. Memory stores information in digital form. The memory hierarchy is:
Register
Cache Memory
Main Memory
Magnetic Disk
Removable media (Magnetic tape)
Register: Registers are part of the central processing unit, so they reside inside the CPU. Information from main memory is brought to the CPU and kept in registers. Due to space and cost constraints, a CPU has only a limited number of registers. These are the fastest storage devices.
Cache Memory: Cache memory is a storage device placed between the CPU and main memory. Caches are semiconductor memories and are faster than main memory. We cannot have a large volume of cache memory due to its higher cost and constraints of the CPU; for the same reason we cannot replace the whole main memory with faster memory. Generally, the most recently used information is kept in the cache memory: it is brought from the main memory and placed in the cache. Nowadays, CPUs come with internal (on-chip) caches.
Main Memory: Like cache memory, main memory is also semiconductor memory, but it is relatively slower. We first have to bring the information (whether data or program) into main memory; the CPU can work only with the information available in main memory.
Magnetic Disk: This is a bulk storage device. Many applications deal with huge amounts of data, and we do not have enough semiconductor memory to keep all this information in the computer. Moreover, semiconductor memories are volatile: they lose their contents once the computer is switched off. For permanent storage we use magnetic disks, whose storage capacity is very high.
Removable media: Different applications use different data, and it may not be possible to keep all the information on magnetic disk. Whichever data we are not currently using can be kept on removable media. Magnetic tape is one kind of removable medium; a CD is another, and it is an optical device.
Registers, cache memory, and main memory are internal memory; magnetic disks and removable media are external memory. Internal memories are semiconductor memories, which are categorized as volatile and non-volatile.
RAM: Random access memories are volatile in nature. As soon as the computer is switched off, the contents of memory are lost.
ROM: Read-only memories are non-volatile in nature. The storage is permanent, but it is read-only.
The CPU communicates with memory through two registers: the memory address register (MAR) and the memory data register (MDR). If the MAR is k bits long, then the total number of addressable memory locations will be 2^k. If the MDR is n bits long, then n bits of data are transferred in one memory cycle. The transfer of data takes place through the memory bus, which consists of an address bus and a data bus; here the size of the data bus is n bits and the size of the address bus is k bits.
The memory bus also includes control lines like Read, Write, and Memory Function Complete (MFC) for coordinating data transfer. In the case of a byte-addressable computer, another control line has to be added to indicate a byte transfer instead of a whole-word transfer.
The CPU initiates a memory operation by loading the appropriate address into the MAR.
If it is a memory read operation, the CPU sets the read control line to 1. The contents of the memory location are then brought to the MDR, and the memory control circuitry indicates this to the CPU by setting MFC to 1.
If it is a memory write operation, the CPU places the data into the MDR and sets the write control line to 1. Once the contents of the MDR are stored in the specified memory location, the memory control circuitry indicates the end of the operation by setting MFC to 1.
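A quick numeric check of the addressability arithmetic above in C, assuming k = 16 and n = 32 (the values are arbitrary examples):

    /* A k-bit MAR addresses 2^k locations; an n-bit MDR moves n bits
     * per memory cycle. */
    #include <stdio.h>

    int main(void) {
        int k = 16, n = 32;
        unsigned long locations = 1UL << k;    /* 2^k addressable locations */
        printf("%d-bit MAR -> %lu locations\n", k, locations);  /* 65536    */
        printf("%d-bit MDR -> %d bits per memory cycle\n", n, n);
        return 0;
    }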
==========================================================================
2. Brief about the principle of locality? (6 marks, or add in any memory-related answer)
Principle of locality states that programs access a relatively small portion of their address space at
any instant of time.
Temporal locality (locality in time): If an item is referenced, it will tend to be referenced again
soon.
Example: If you recently brought a book to your desk to look at, you will probably
need to look at it again soon.
Spatial locality (locality in space): If an item is referenced, items whose addresses are close by
will tend to be referenced soon. Example: Libraries put books on the same topic together on
the same shelves to increase spatial locality.
Most programs contain loops, so instructions and data are likely to be accessed repeatedly, showing high
amounts of temporal locality.
Sequential accesses to elements of an array or a record will naturally have high degrees of spatial
locality.
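Both kinds of locality can be seen in an ordinary loop; a short C illustration:

    /* The scalar sum and the loop instructions are reused every
     * iteration (temporal locality), while a[i] walks through
     * consecutive addresses (spatial locality). */
    #include <stdio.h>

    int main(void) {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        int sum = 0;
        for (int i = 0; i < 8; i++)   /* sequential accesses: spatial   */
            sum += a[i];              /* sum reused each time: temporal */
        printf("sum = %d\n", sum);
        return 0;
    }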
==========================================================================
3. Write short notes on Memory Hierarchy Levels?
The goal is to present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest memory. As we move away from the processor, the levels take progressively longer to access. Data is copied between only two adjacent levels at a time: the upper level (the one closer to the processor) is smaller and faster. The figure shows that the minimum unit of information that can be either present or not present in a two-level hierarchy is called a block or a line.
Block (line): the unit of copying; it may be multiple words, a single word, or a byte.
If the accessed data is present in the upper level, it is called a HIT. The hit rate, or hit ratio, is the fraction of memory accesses found in the upper level.
Hit ratio = hits/accesses
If the accessed data is absent from the upper level, it is called a MISS. The lower level in the hierarchy is then accessed to retrieve the block containing the requested data, after which the accessed data is supplied from the upper level. The miss rate (= 1 - hit rate) is the fraction of memory accesses not found in the upper level.
Miss ratio = misses/accesses
Hit time is the time to access the upper level of the memory hierarchy, which includes the time needed to
determine whether the access is a hit or a miss.
The miss penalty is the time to replace a block in the upper level with the corresponding block from the
lower level, plus the time to deliver this block to the processor.
The hit time will be much smaller than the time to access the next level in the hierarchy, which is the major component of the miss penalty.
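A worked example in C using these definitions, together with the standard average-memory-access-time formula (hit time + miss rate x miss penalty), which follows from them but is not stated explicitly above; the numbers are made up.

    /* Hit ratio, miss ratio, and average access time from raw counts. */
    #include <stdio.h>

    int main(void) {
        double hits = 950, accesses = 1000;
        double hit_rate  = hits / accesses;        /* hits/accesses   */
        double miss_rate = 1.0 - hit_rate;         /* misses/accesses */
        double hit_time = 1, miss_penalty = 100;   /* in clock cycles */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("hit rate %.2f, miss rate %.2f, avg access %.1f cycles\n",
               hit_rate, miss_rate, amat);
        return 0;
    }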
============================================================================
4. Discuss about the memory technologies that are implemented in computer storage systems.
Memory Technology
SRAM Technology
SRAMs are integrated circuits that are memory arrays with (usually) a single access port that can provide either a read or a write.
SRAMs have a fixed access time, though read and write access times may differ.
SRAMs need no refresh, so the access time is very close to the cycle time.
SRAMs typically use six to eight transistors per bit to prevent the information from being disturbed when read.
Only minimal power is needed to retain the contents in standby mode.
DRAM Technology
In a DRAM, the value kept in a cell is stored as a charge in a capacitor, with a single transistor used per bit; DRAM is therefore much denser and cheaper per bit than SRAM. Because the stored charge leaks away, the contents must be periodically refreshed, which is why this memory is called dynamic.
Disk Memory
A magnetic hard disk consists of a collection of platters, which rotate on a spindle at 5400 to 15,000 revolutions per minute. Each platter is a metal disk covered with a magnetic recording coating. A movable arm containing a small electromagnetic coil called a read-write head is located just above each surface.
TRACKS: each disk surface is divided into concentric circles (more than 10,000 per surface).
SECTORS: each track is divided into sectors (more than 1,000 sectors per track); sector sizes range from 512 to 4096 bytes.
The sequence recorded in a sector is: sector number, information about the sector (including an error correction code), data, then the next sector number.
The disk heads are connected together and move in conjunction, so every head is over the same track of every surface. The term cylinder refers to all the tracks under the heads at a given point on all surfaces.
Disk access time has several components:
1. Seek time: the time to move the read-write head to the desired track.
2. Rotational latency: the time for the desired sector to rotate under the head.
3. Transfer time: the time to transfer a block of bits. The transfer time is a function of the sector size, the rotation speed, and the recording density of the track (between 100 and 200 MB/sec as of 2012).
ADV: Magnetic disks are nonvolatile like flash, but unlike flash there is no write wear-out problem.
============================================================================
5. Explain cache memory? (for 16 marks, write Q. no 5 and Q.no 6)
Cache was the name chosen to represent the level of the memory hierarchy between the processor and main memory in the first commercial computer to have this extra level. The memories in the datapath (the IF/ID, ID/EX, EX/MEM, and MEM/WB buffers) are simply replaced by caches.
When the processor requests a word from the cache, assume for simplicity that each block in the cache is equal to the size of the word defined by the processor.
Suppose the processor requests a word Xn that is not in the cache. This request results in a miss, and the word Xn is brought from memory into the cache. Each word in memory is assigned a cache location based on the address of the word in memory. This cache structure is called direct mapped: each memory location is mapped directly to exactly one location in the cache.
DIRECT MAPPING IN CACHE
The location in the cache for a given data item is determined by its address in memory. The number of blocks in the cache is 2^i, where i is the number of bits used to index the cache; an address is matched to a cache location using the lower-order address bits. The mapping is given by the formula:
cache index = (block address in memory) modulo (number of blocks in the cache)
To identify the data's original location in memory, a set of tags is added to the cache. The tags contain the address information required to identify whether a word in the cache corresponds to the requested word. The tag contains only the upper portion of the address; it does not include the bits that are used as an index into the cache.
In the previous example, the address has 5 bits, of which the higher-order 2 bits are used for the tag and the lower-order 3 bits are used to index the cache. The index value of the cache is always used or referred to as the block number of the cache.
To check whether the data present in a cache block is valid (in use), another bit called the valid bit is added. If this bit is 1, the data is valid; if it is 0, the data is invalid (the block is free). To confirm that the data present in the cache is the data mapped from a particular memory location, the lower-order address bits of the memory address must match the index of the cache block and the higher-order address bits must match the tag field of that cache block:
memory block address = (tag field) combined with (cache index)
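A C sketch of this direct-mapped lookup, using the same split as the example above (3 index bits, so 8 blocks, with the remaining upper address bits as the tag); the memory contents are faked for illustration.

    /* Direct-mapped lookup: index = addr mod blocks, tag = upper bits. */
    #include <stdio.h>

    #define CACHE_BLOCKS 8            /* 2^3 blocks -> 3 index bits */

    struct line { int valid; unsigned tag; int data; };
    struct line cache[CACHE_BLOCKS];

    static int lookup(unsigned addr, int *data) {
        unsigned index = addr % CACHE_BLOCKS;  /* (block addr) mod (blocks) */
        unsigned tag   = addr / CACHE_BLOCKS;  /* upper address bits        */
        if (cache[index].valid && cache[index].tag == tag) {
            *data = cache[index].data;         /* hit                       */
            return 1;
        }
        cache[index].valid = 1;                /* miss: fill from memory    */
        cache[index].tag   = tag;
        cache[index].data  = addr * 10;        /* stand-in for memory word  */
        *data = cache[index].data;
        return 0;
    }

    int main(void) {
        int d;
        printf("addr 21: %s\n", lookup(21, &d) ? "hit" : "miss"); /* miss */
        printf("addr 21: %s\n", lookup(21, &d) ? "hit" : "miss"); /* hit  */
        /* 29 maps to the same index (5) with a different tag: a conflict */
        printf("addr 29: %s\n", lookup(29, &d) ? "hit" : "miss"); /* miss */
        return 0;
    }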
============================================================================
6. Write short notes on handling the cache Misses and data inconsistency?
The steps to be taken on an instruction cache miss:
1. Send the original PC value (current PC - 4) to the memory.
2. Instruct main memory to perform a read and wait for the memory to complete its access.
3. Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper
bits of the address (from the ALU) into the tag field, and turning the valid bit on.
4. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it
in the cache.
The control of the cache on a data access is essentially identical: on a miss, we simply stall the processor until the memory responds with the data.
Scheme -- write-through.
Suppose a store instruction wrote the data into only the data cache (without changing main memory); then, after the write into the cache, memory would have a different value from that in the cache. The cache and memory are then said to be inconsistent.
To keep main memory and the cache consistent, every data write is performed in both the memory and the cache. This does not give good performance, as each write to main memory can take on the order of 100 clock cycles.
One solution to this problem is to use a write buffer: the memory write is done by writing the content into the write buffer, and when the buffer's data have been written to memory, that part of the buffer is freed.
Alternative scheme -- write-back.
The new value is written only to the block in the cache. The modified block is written to the lower level of the hierarchy only when it is replaced. Write-back schemes can improve performance, but they are more complex to implement than write-through.
==========================================================================
7. Explain Virtual Memory? (if asked for 16 marks, write Q. No 7 and Q. No 8)
The main memory can act as a cache for the secondary storage; this technique is called virtual memory. The need for virtual memory is to allow efficient and safe sharing of memory among multiple programs. Virtual memory's role is protection: ensuring that a program can only read and write the portions of main memory that have been assigned to it. Each program is compiled into its own address space, a separate range of memory locations accessible only to this program.
Virtual memory implements the translation of a program's address space to physical addresses. This translation process enforces protection of a program's address space.
Physical address: An address in main memory.
Protection: A set of mechanisms for ensuring that multiple processes sharing the processor, memory, or I/O devices cannot interfere, intentionally or unintentionally, with one another by reading or writing each other's data. These mechanisms also isolate the operating system from user processes.
User programs: virtual memory allows a single user program to exceed the size of primary memory. Before virtual memory, programmers divided programs into pieces and identified the pieces that were mutually exclusive. These overlays were loaded or unloaded (to and from main memory) under user program control during execution. The programmer had to ensure that the program never accessed an overlay that was not loaded and that the overlays never exceeded the allocated size of memory. Overlays were organized as modules, each containing both code and data.
In virtual memory, a block is called a page, and a virtual memory miss is called a page fault. The processor produces a virtual address, which is translated by a combination of hardware and software to a physical address, which in turn can be used to access main memory. This process is called address mapping or address translation. Pages are of fixed size (e.g., 4 KiB).
Address Translation:
Virtual memory also simplifies loading the program for execution by providing relocation, which allows us to load the program anywhere in main memory. All virtual memory systems in use provide relocation: because the program is divided into a set of fixed-size blocks (pages), there is no need to find a contiguous block of memory for a program; instead, the operating system need only find a sufficient number of pages in main memory. In virtual memory, the address is broken into a virtual page number and a page offset (refer to the previous diagram).
The physical page number constitutes the upper portion of the physical address, while the page offset, which is not changed, constitutes the lower portion. The number of bits in the page offset field determines the page size. Having a larger number of virtual pages than physical pages is essentially what provides the illusion of an unbounded amount of virtual memory.
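A small C illustration of the address split, assuming 4 KiB pages and therefore a 12-bit page offset (the address value is arbitrary):

    /* Break a virtual address into virtual page number and page offset. */
    #include <stdio.h>

    #define PAGE_BITS 12                  /* 4 KiB page -> 12-bit offset */

    int main(void) {
        unsigned addr = 0x12345;
        unsigned vpn    = addr >> PAGE_BITS;              /* page number */
        unsigned offset = addr & ((1u << PAGE_BITS) - 1); /* unchanged   */
        printf("addr 0x%x -> vpn 0x%x, offset 0x%x\n", addr, vpn, offset);
        return 0;
    }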
Page fault
A page fault leads to an enormous miss penalty (millions of clock cycles, since the missing page must be fetched from disk). To design a virtual memory system, the following issues are considered:
1. Pages should be large enough to amortize the high access time. Sizes from 4 KiB to 16 KiB are typical today (depending on the type of computer).
2. The primary technique used to reduce the page fault rate is to allow fully associative placement of pages in memory.
3. Page faults can be handled in software, because the overhead is small compared to the disk access time, and software can use clever page-placement algorithms that reduce the miss rate.
4. Write-through will not work for virtual memory, since writes take too long. Instead, virtual memory systems use write-back.
==========================================================================
8. Explain page management in virtual memory and translation using page table?
Placing a Page and Finding It Again:
Pages are located by using a table that indexes the memory; this structure is called a page table, and it resides in memory. A page table is indexed with the page number from the virtual address to discover the corresponding physical page number. Each program has its own page table, which maps the virtual address space of that program to main memory. To indicate the location of the page table in memory, the hardware includes a register that points to the start of the page table; we call this the page table register. (Refer to the figure.) The page table register, the virtual address, and the indicated page table entry show how the hardware can form a physical address. A valid bit is used in each page table entry, as in a cache.
If the bit is off, the page is not present in main memory and a page fault occurs. If the bit is on, the page is in memory and the entry contains the physical page number. Because the page table contains a mapping for every possible virtual page, no tags are required.
A dirty bit is set when any word in a page is written. The dirty bit indicates whether the page needs to be written out to disk when it is replaced, so a modified page is often called a dirty page.
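A hedged C sketch of this page-table walk, with illustrative sizes (16 virtual pages, 4 KiB pages); the valid bit selects between returning a physical address and signalling a page fault.

    /* Index the page table with the VPN; valid bit decides hit vs fault. */
    #include <stdio.h>

    struct pte { int valid; int dirty; unsigned ppn; };
    struct pte page_table[16];              /* one entry per virtual page */

    static int translate(unsigned vpn, unsigned offset, unsigned *paddr) {
        if (!page_table[vpn].valid)
            return -1;                      /* page fault: not in memory  */
        *paddr = (page_table[vpn].ppn << 12) | offset;  /* 4 KiB pages    */
        return 0;
    }

    int main(void) {
        page_table[3] = (struct pte){1, 0, 7};      /* vpn 3 -> ppn 7 */
        unsigned pa;
        if (translate(3, 0x45, &pa) == 0)
            printf("physical address 0x%x\n", pa);  /* 0x7045         */
        if (translate(4, 0x00, &pa) != 0)
            printf("page fault on vpn 4\n");
        return 0;
    }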
=======================================================================
9. Explain Fast translation using a TLB?
Since page tables are stored in main memory, a memory access by a program can take at least twice as long:
1. One memory access to obtain the physical address.
2. A second access to get the data.
To improve performance, we rely on locality of reference: when a virtual page is translated, it will probably be referenced again soon (due to temporal and spatial locality). So a special address translation cache is used, traditionally referred to as a translation-lookaside buffer (TLB) or translation cache. The TLB is accessed instead of the page table on every reference, so the TLB needs to include other status bits, such as the dirty and reference bits.
When a miss in the TLB occurs, we must determine whether it is a page fault or merely a TLB miss. If the page exists in memory, then the TLB miss indicates only that the translation is missing, so the processor can handle the TLB miss by loading the translation from the page table into the TLB. If the page is not present in memory, then the TLB miss indicates a true page fault.
Typical details for a TLB might be:
1. TLB size: 16-512 entries
2. Block size: 1-2 page table entries (typically 4-8 bytes each)
3. Hit time: 0.5-1 clock cycle
4. Miss penalty: 10-100 clock cycles
5. Miss rate: 0.01%-1%
TLBs use a small, fully associative mapping (for a lower miss rate). Handling a TLB miss or a page fault requires using the exception mechanism to interrupt the active process, transferring control to the operating system, and later resuming execution of the interrupted process.
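A toy C model of the TLB-first path described above; the sizes and the trivial replacement policy (always refill entry 0) are illustrative only.

    /* Check the fully associative TLB first; on a TLB miss consult the
     * page table; only an invalid page-table entry is a true page fault. */
    #include <stdio.h>

    struct tlbe { int valid; unsigned vpn, ppn; };
    struct tlbe tlb[4];                        /* small, fully associative */
    int pt_valid[16]; unsigned pt_ppn[16];     /* page table in memory     */

    static int translate(unsigned vpn, unsigned *ppn) {
        for (int i = 0; i < 4; i++)            /* search every TLB entry   */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *ppn = tlb[i].ppn;             /* TLB hit                  */
                return 0;
            }
        if (!pt_valid[vpn]) return -1;         /* true page fault          */
        tlb[0] = (struct tlbe){1, vpn, pt_ppn[vpn]};  /* refill the TLB    */
        *ppn = pt_ppn[vpn];
        return 0;                              /* TLB miss, but no fault   */
    }

    int main(void) {
        pt_valid[5] = 1; pt_ppn[5] = 9;
        unsigned ppn;
        printf("vpn 5: %s\n", translate(5, &ppn) ? "page fault" : "ok"); /* refill */
        printf("vpn 5: %s\n", translate(5, &ppn) ? "page fault" : "ok"); /* hit    */
        printf("vpn 6: %s\n", translate(6, &ppn) ? "page fault" : "ok"); /* fault  */
        return 0;
    }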