Assignment Questions
3. Multiply operations account for 30% of a benchmark program’s execution time. Uber cool
hardware speeds up these operations 12 times! Suppose the program took 20 seconds to
execute without the enhanced hardware. What is the overall speedup achieved? With the
enhanced hardware, what is the new execution time, and what percentage of that time do
multiply operations take?
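One way to check the arithmetic for Question 3 is to apply Amdahl's Law directly. The sketch below is illustrative only: the function name `amdahl` and its parameter names are made up for this example, and it assumes the 30% figure refers to the original (unenhanced) execution time.

```python
def amdahl(fraction_enhanced, speedup_enhanced, old_time):
    """Return (overall speedup, new execution time, enhanced share of new time).

    Assumes fraction_enhanced is measured against the original execution time.
    """
    # Fraction of the original time that remains after the enhancement.
    new_fraction = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    overall_speedup = 1 / new_fraction
    new_time = old_time * new_fraction
    # Time still spent in the enhanced (multiply) operations.
    enhanced_time = old_time * fraction_enhanced / speedup_enhanced
    share = enhanced_time / new_time
    return overall_speedup, new_time, share


speedup, new_time, share = amdahl(0.30, 12, 20.0)
print(f"overall speedup: {speedup:.2f}")       # overall speedup: 1.38
print(f"new execution time: {new_time:.1f} s")  # new execution time: 14.5 s
print(f"multiply share: {share:.2%}")           # multiply share: 3.45%
```

With these inputs the multiplies shrink from 6 seconds to 0.5 seconds, so most of the remaining 14.5 seconds is unenhanced work, which is why the overall speedup stays far below 12.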
5. The largest configuration of a Cray T90 (Cray T932) has 32 processors, each
capable of generating 4 loads and 2 stores per clock cycle. The processor clock
cycle is 2.167 ns, while the cycle time of the SRAMs used in the memory system is 15 ns.
Calculate the minimum number of memory banks required to allow all processors to run at
full memory bandwidth.
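The bank count for Question 5 can be sketched as a short calculation. The function name `min_banks` and its parameters are hypothetical, and the sketch assumes that each memory reference keeps one bank busy for a full SRAM cycle and that the busy time is rounded up to a whole number of processor clock cycles.

```python
import math


def min_banks(processors, loads_per_cycle, stores_per_cycle, cpu_cycle_ns, sram_cycle_ns):
    """Minimum banks so every reference issued each clock finds a free bank."""
    # Peak memory references issued by the whole machine per processor clock.
    refs_per_cycle = processors * (loads_per_cycle + stores_per_cycle)
    # Number of processor clocks a bank stays busy serving one reference
    # (assumption: rounded up to a whole clock).
    busy_cycles = math.ceil(sram_cycle_ns / cpu_cycle_ns)
    return refs_per_cycle * busy_cycles


print(min_banks(32, 4, 2, 2.167, 15.0))  # 1344
```

Here 32 processors issue 192 references per clock, and each bank is busy for 7 clocks (15 / 2.167 is about 6.92), giving 192 × 7 banks.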
7. Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction
has a width of 32 and each SIMD processor contains 8 lanes for single-precision
arithmetic and load/store instructions, meaning that each non-diverged SIMD instruction
can produce 32 results every 4 cycles. Assume a kernel whose divergent branches
cause on average 80% of threads to be active. Assume that 70% of all SIMD instructions
executed are single-precision arithmetic and 20% are load/store. Since not all memory
latencies are covered, assume an average SIMD instruction issue rate of 0.85. Assume
that the GPU has a clock speed of 1.5 GHz. Compute the throughput, in GFLOP/sec, for
this kernel on this GPU.
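A minimal sketch of the throughput calculation for Question 7 follows. The function name `gpu_gflops` and its parameter names are invented for illustration, and the sketch assumes that each single-precision arithmetic result counts as one FLOP and that the utilization factors simply multiply.

```python
def gpu_gflops(clock_ghz, simd_processors, simd_width, cycles_per_instruction,
               active_fraction, flop_fraction, issue_rate):
    """Estimated sustained GFLOP/s under the stated assumptions."""
    # Results per cycle per SIMD processor for a non-diverged instruction
    # (32 results every 4 cycles = 8, matching the 8 lanes).
    results_per_cycle = simd_width / cycles_per_instruction
    return (clock_ghz * simd_processors * results_per_cycle
            * active_fraction * flop_fraction * issue_rate)


throughput = gpu_gflops(clock_ghz=1.5, simd_processors=10, simd_width=32,
                        cycles_per_instruction=4, active_fraction=0.80,
                        flop_fraction=0.70, issue_rate=0.85)
print(f"{throughput:.2f} GFLOP/s")  # 57.12 GFLOP/s
```

The 20% load/store share does not appear in the product because those instructions produce no floating-point results; only the 70% arithmetic share contributes.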
8. You have been asked to investigate the relative performance of a banked versus pipelined
L1 data cache for a new microprocessor. Assume a 64 KB two-way set associative cache
with 64 byte blocks. The pipelined cache would consist of three pipe stages, similar in
capacity to the Alpha 21264 data cache. A banked implementation would consist of two
32 KB two-way set associative banks. Use CACTI and assume a 65 nm (0.065 μm)
technology to answer the following questions. The cycle time output in the Web version
shows at what frequency a cache can operate without any bubbles in the pipeline. What
is the cycle time of the cache in comparison to its access time, and how many pipe stages
will the cache take up (to two decimal places)?
9. Show how the following code sequence lays out in convoys, assuming a single
copy of each vector functional unit:
How many chimes will this vector sequence take? How many cycles per FLOP
(floating-point operation) are needed, ignoring vector instruction issue overhead?