Computer Architecture CT 2 Paper Solution
Q1. The UltraSPARC IV+ is a 4-way superscalar, 14-stage pipeline processor. Find the speedup. Obtain the
relation used. Comment on the specific requirement for cache. [1+2+1]
A1.
Calculating the speedup - [1 mark]
Given m=4, k= 14
S (m, 1) = S (4, 1) = T (1, 1) / T (4, 1) = (4 * (N + 14 - 1)) / (N + 4 * (14 - 1)) = (4 * (N + 13)) / (N + 52)
S (4, 1) → 4 as N → ∞
Obtaining the relation used to calculate the speedup - [2 marks]
For a k-stage linear pipeline,
No. of stages in the pipeline = k
No. of tasks to be processed = N
No. of cycles used to fill up the pipeline = k,
(also, this is the time required to complete execution of the first task), therefore,
No. of remaining tasks = N-1, and
No. of cycles needed to complete N-1 tasks = N-1
Therefore, the time required by the scalar base machine,
T (1, 1) = Tk = k + (N-1) clock periods
Similarly, the time required by an m-issue superscalar machine,
T (m, 1) = k + (N – m)/m
= N/m + (k – 1)
where,
k = time required to execute the first ‘m’ instructions through the m-pipelines simultaneously
(N – m)/m = time required to execute the remaining (N – m) instructions, ‘m’ per cycle, through the
‘m’ pipelines
Speedup of the superscalar machine over the base machine,
S (m, 1) = T (1, 1) / T (m, 1) = (N + k − 1) / (N/m + k − 1) = m (N + k − 1) / (N + m (k − 1))
S (m, 1) → m as N → ∞
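As a numerical check of the relations derived above, the two timing formulas can be evaluated directly (a minimal sketch; the function names are mine):

```python
# Sketch of the timing relations derived above (function names are mine).

def t_base(k, n):
    """T(1,1) = k + (N - 1): k-stage scalar base machine."""
    return k + (n - 1)

def t_superscalar(m, k, n):
    """T(m,1) = N/m + (k - 1): m-issue superscalar machine."""
    return n / m + (k - 1)

def speedup(m, k, n):
    """S(m,1) = T(1,1) / T(m,1)."""
    return t_base(k, n) / t_superscalar(m, k, n)

# Q1 figures: m = 4 issue slots, k = 14 stages.
for n in (100, 10_000, 1_000_000):
    print(n, round(speedup(4, 14, n), 3))
# As N grows, the speedup approaches m = 4.
```

For small N the pipeline fill time dominates and the speedup stays well below m, which is the usual argument for long instruction streams on deep pipelines.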
A2.
Calculating the block size - [0.25 marks]
Given, Size of Cache = 64 KB
Index Bit = 12
Size of index ≡ f (cache size, block size, set associativity)
2^index = cache size / (block size × set associativity)
2^12 = 64 KB / (block size × 1), (set associativity = 1 for direct mapping)
Block size = 16 Bytes
Similarly, for index bit 16,
2^16 = 64 KB / (block size × 1)
Block size = 1 Byte
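The same relation can be checked in a few lines of code (a sketch; the helper name is mine):

```python
# Sketch: 2^index = cache_size / (block_size * associativity),
# rearranged to solve for the block size (helper name is mine).

def block_size(cache_bytes, index_bits, associativity=1):
    return cache_bytes // ((1 << index_bits) * associativity)

print(block_size(64 * 1024, 12))  # 16-byte blocks for a 12-bit index
print(block_size(64 * 1024, 16))  # 1-byte blocks for a 16-bit index
```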
A3.
Reasons for “VLIW is not a main stream computer” - [1 mark for any 2 points]
• Due to lack of compatibility with conventional hardware and software.
• Due to the dependence on Trace-scheduling compiling and code compaction.
• VLIW instructions are longer than those of mainstream processors, since each specifies several
independent operations.
• VLIW uses random parallelism among scalar operations instead of regular synchronous
parallelism.
Salient features of a VLIW machine - [2 marks for any 4 points & diagram]
• Instruction word length: hundreds of bits.
• Use of (i) multifunctional units concurrently, and
(ii) Common Large Register File shared by all functional Units.
• In VLIW run-time scheduling and synchronization are eliminated, because Instruction
parallelism and data movement in VLIW architecture are completely specified at compile
time.
• VLIW processor is an extreme of a superscalar processor in which all independent or
unrelated operations are already synchronously compacted.
• The instruction parallelism embedded in the compacted code may require a different
latency to be executed by different FUs even though the instructions are issued at the
same time.
• It uses random parallelism among scalar operations instead of regular synchronous
parallelism (as used in vectorized supercomputer or as in an SIMD computer).
[Figure: VLIW processor organization and instruction format]
Q4. State various conditions to be satisfied for code compaction. Why is the trace scheduling
done in VLIW? [2]
A4.
Conditions for code compaction - [1 mark]
Code compaction is an attempt to tune the code automatically in order to make it occupy less
space while maintaining all of its original functionality. Thus, the conditions are -
• It must reduce the space used.
• It must retain the original functionality (i.e. the dataflow of the program must not
change and the exception behavior must be preserved).
Trace scheduling done in VLIW - [1 mark]
Many compilers for first-generation ILP processors used a three phase method to generate code.
The phases were,
• Generate a sequential program. Analyze each basic block in the sequential program for
independent operations.
• Schedule independent operations within the same block in parallel if sufficient hardware
resources are available.
• Move operations between blocks when possible.
This three phase approach fails to exploit much of the ILP available in the program for two
reasons,
• Often, operations in a basic block are dependent on each other, so sufficient
ILP may not be available within a basic block.
• Arbitrary choices made while scheduling basic blocks make it difficult to move
operations between blocks.
Trace scheduling is a form of global code motion. In trace scheduling, a set of commonly
executed sequence of blocks is gathered together into a trace and the whole trace is scheduled
together. It works by converting a loop to long straight-line code sequence using loop unrolling
and static branch prediction. This process separates out "unlikely" code and adds handlers for
exits from trace. The goal is to have the most common case executed as a sequential set of
instructions without branches.
Q5. Explain set associative memory mapping. [3]
A5.
Explanation – [0.5 marks]
It is a compromise between the two extreme cache designs based on direct mapping and full
associativity mapping. In a k-way associative search, the tag needs to be compared only with the
k tags within the identified set. Since k is small in practice, the k-way associative search is much
more economical than full associativity.
In general, a block Bj can be mapped into any one of the available frames in a set Si, Bj Є Si, if
j (modulo v) = i. A matched tag identifies the current block which resides in the frame.
Design Trade-offs – [0.25 marks]
• The set size (associativity) k and the number of sets v are inversely related; m = v*k
• For a fixed cache size there exists a tradeoff between the set size and the number of sets.
• Set-associative cache is used in most high-performance computer systems.
Advantages – [0.25 marks]
1. The block replacement algorithm needs to consider only a few blocks in the same set.
The replacement policy can be more economically implemented with limited choices, as
compared with the fully associative cache.
2. The k-way associative search is easier to implement.
3. Many design tradeoffs can be considered (m = v*k) to yield a higher hit ratio in the cache.
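A minimal sketch of the mapping rule above (toy parameters and names are mine): block Bj goes to set Si with i = j mod v, and only the k tags in that set are compared.

```python
# Toy sketch of set-associative lookup (parameters and names are mine).

def locate(block_addr, v, sets):
    """sets: list of v lists, each holding up to k stored tags."""
    set_index = block_addr % v    # Bj -> Si, where i = j mod v
    tag = block_addr // v         # remaining high-order address bits
    return set_index, tag, tag in sets[set_index]  # compare only k tags

# Toy cache: v = 4 sets, k = 2 ways per set.
sets = [[], [], [], []]
sets[11 % 4].append(11 // 4)      # install block B11 into set S3
print(locate(11, 4, sets))        # (3, 2, True)  -> hit
print(locate(7, 4, sets))         # (3, 1, False) -> miss, same set
```

Direct mapping is the special case k = 1 (v = m), and full associativity the case v = 1 (k = m), which is why this design sits between the two extremes.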
Figure with explanation – [1 mark]
Example with figure – [1 mark]
A6.
Formal statement – [1 mark]
Given,
I. a vector task system V
II. a vector computer, with
III. m identical pipelines, and
IV. a deadline D,
Does there exist a parallel schedule f for V with Finish Time ω such that ω ≤ D?
It is a feasibility problem.
Desirable algorithm: Heuristic scheduling algorithm.
Q7. A sum of the first k components of a vector (k = 0 to n - 1), when the number (N) of
processors and the length (n) of the vector are both 8, is obtained in three steps. Will the number
of steps increase if both N and n are raised to 16? Justify your answer. [3]
A7.
Formulation – [0.25 marks]
We have to find the sum, S (k) of the first k components in vector A for each k varying from 0 to
15.
S(k) = Σ (i = 0 to k) Ai, for k = 0, 1, 2, ..., 15
Step 1 (shift by 1): PE not involved in routing: PE15; PE inactive in addition: PE0; Mask-off: PE0 & PE15
Step 2 (shift by 2): PEs not involved in routing: PE14 & PE15; PEs inactive in addition: PE0 & PE1; Mask-off: PE0, PE1, PE14 & PE15
Step 3 (shift by 4): PEs not involved in routing: PE12 to PE15; PEs inactive in addition: PE0 to PE3; Mask-off: PE0 to PE3, PE12 to PE15
Step 4 (shift by 8): PEs not involved in routing: PE8 to PE15; PEs inactive in addition: PE0 to PE7; Mask-off: PE0 to PE15
So we can see the number of steps required is 4.
Calculation of the number of steps using the formula – [0.25 marks]
For recursive doubling with N processors (N a power of 2), the number of steps is r = log2 N.
So,
r = log2 16
r = 4
This is also the result we obtained diagrammatically (and it explains the three steps for N = 8, since log2 8 = 3).
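The recursive-doubling procedure above can be sketched in a few lines (my own sketch; inactive PEs simply keep their value):

```python
# Sketch of recursive doubling for prefix sums on n PEs (my own).
# Step r shifts values by 2^(r-1); PEs with index < shift stay inactive.

def prefix_sums(a):
    s = list(a)
    steps, shift = 0, 1
    while shift < len(s):
        s = [s[i] + (s[i - shift] if i >= shift else 0) for i in range(len(s))]
        shift *= 2
        steps += 1
    return s, steps

print(prefix_sums([1] * 8)[1])    # 3 steps for n = 8
print(prefix_sums([1] * 16)[1])   # 4 steps for n = 16 = log2(16)
```

Each doubling of n adds exactly one shift-and-add step, which is the justification asked for in Q7.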