sample the number of L3 cache hits for the CPU and compare the current sample with the last one. If the difference is larger than a threshold, which is set to a fixed value of 10 (our algorithm is not sensitive to this threshold setting, as shown in Section 5.4), we need to update the skip_factor. The implementation of this adaptive approach for the code in Figure 3 is shown in Figure 5. The update code is inside the second-level loop with iterator ‘i’ in the code shown in Figure 3 and is executed after we process a batch of thread ids. In other words, the variable ‘batch_size’ determines how often we update the ‘skip_factor’. In our implementation, the batch size is set to 16, meaning that we update the ‘skip_factor’ once we have processed (16 x 8 x skip_factor) thread ids, i.e., 20,480 thread ids for a skip_factor of 160.

// Accessing an L3 cache performance counter
ptlcall_cpu_read_value(PTLCALL_CPU_NAME_CPU_HIT_L3, &hitnumber);
if (hitnumber - last_hit > threshold) skip_factor += 32;
else if (back_dis != skip_factor - 32) {
    // preventing skip_factor from bouncing between two values
    skip_factor -= 32;
    back_dis = skip_factor;
}
last_hit = hitnumber;
Figure 5. Adaptive update of the variable ‘skip_factor’.

Since our adaptive approach needs the L3 cache hit statistics for the CPU, we introduce a new instruction to access this L3 cache performance counter; this new instruction is implemented through a new ‘ptlcall_cpu_read_value’ function in our simulator. As shown in Figure 5, if the CPU has too many L3 cache hits, we increase the ‘skip_factor’ by 32. If the CPU has too few L3 cache hits, the variable is decreased by 32. The check ‘if (back_dis != skip_factor - 32)’ prevents ‘skip_factor’ from bouncing between its current value and (the current value – 32). The value of the variable ‘skip_factor’ over time using this adaptive update is shown in Figure 6.

Figure 6. The value of the variable ‘skip_factor’ over time using our adaptive update approach shown in Figure 5 (skipped thread number plotted against cycles).
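To make the placement of this update concrete, the sketch below packages the Figure 5 snippet as a helper that is invoked once per batch inside the prefetch loop. It is a minimal sketch, not the exact code emitted by our compiler: read_cpu_l3_hits() is a hypothetical stand-in for the ptlcall_cpu_read_value() instruction, and the prefetch body itself is elided.

/* Minimal sketch of the adaptive skip_factor update (Section 4.2).
 * read_cpu_l3_hits() is an illustrative stub; in our simulator the counter
 * is read through the new ptlcall_cpu_read_value() instruction. */
#include <stdint.h>

static int skip_factor = 32;          /* how far the CPU runs ahead              */
static int back_dis = 0;              /* last value we stepped back to           */
static uint64_t last_hit = 0;         /* CPU L3 hit count at the previous check  */

static uint64_t read_cpu_l3_hits(void)
{
    return 0;  /* placeholder for ptlcall_cpu_read_value(PTLCALL_CPU_NAME_CPU_HIT_L3, ...) */
}

static void update_skip_factor(uint64_t threshold)
{
    uint64_t hitnumber = read_cpu_l3_hits();
    if (hitnumber - last_hit > threshold) {
        skip_factor += 32;            /* too many CPU L3 hits: run further ahead      */
    } else if (back_dis != skip_factor - 32) {
        skip_factor -= 32;            /* too few hits: step back, but avoid bouncing  */
        back_dis = skip_factor;
    }
    last_hit = hitnumber;
}

void prefetch_with_adaptive_skip(int concurrent_threads, int batch_size, int unroll_factor)
{
    for (int i = 0; i < concurrent_threads;
         i += skip_factor * batch_size * unroll_factor) {
        /* ... issue the unrolled memory_fetch_for_thread() calls for this batch ... */
        update_skip_factor(10);       /* threshold of 10, as in Section 4.2 */
    }
}

With batch_size = 16 and unroll_factor = 8, the update therefore runs once every 16 x 8 x skip_factor thread ids, matching the update frequency described above.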
Since many GPU workloads have regular memory access patterns, as shown in Figure 6, we also propose to use profiling to simplify the update of the variable ‘skip_factor’. In the profiling process, the compiler sets the ‘skip_factor’ to a fixed value from the set {32, 64, 96, 128, 160, 192, 224} and selects the one with the highest performance during test runs. This way, the periodic update of ‘skip_factor’ can be removed from the CPU pre-execution code, and there is also no need for a new instruction to access the L3 cache hit count for the CPU. The CPU pre-execution code with a fixed ‘skip_factor’ is shown in Figure 7, from which we can see that the code related to ‘batch_size’ is also removed.
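As a rough sketch of this profiling step (not our actual tool flow), the selection loop below times one test run per candidate value and keeps the fastest; the run_test callback that launches and times a run is an assumed, caller-provided helper.

/* Sketch of offline skip_factor selection by profiling. The run_test
 * callback (assumed to launch one test run with the given skip_factor and
 * return its execution time in seconds) is supplied by the caller. */
int select_skip_factor(double (*run_test)(int skip_factor))
{
    static const int candidates[] = { 32, 64, 96, 128, 160, 192, 224 };
    int best = candidates[0];
    double best_time = run_test(best);

    for (int i = 1; i < (int)(sizeof(candidates) / sizeof(candidates[0])); i++) {
        double t = run_test(candidates[i]);
        if (t < best_time) {          /* lower execution time = higher performance */
            best_time = t;
            best = candidates[i];
        }
    }
    return best;   /* compiled into the pre-execution code as a fixed constant */
}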
4.3. Generating the CPU Pre-Execution Code from HWST GPU Kernels

HWST GPU kernels contain one or more loops, which contain global memory access instructions. We refer to such a loop as a kernel loop. Among our benchmarks, MM, Con, TMV, MV, and MC are of the HWST type. To generate the CPU pre-execution program for an HWST GPU kernel, we process one kernel loop at a time. For each kernel loop, a CPU function is generated, which contains the global
memory access instructions and the address computation operations in the loop body. Both the thread id and the kernel loop iterator are replaced with function input parameters. If the kernel loop is a nested one, the iterators from all loop levels are replaced with function parameters. Our proposed compiler algorithm is shown in Figure 8.
float cpu_prefetching( … ) {
  unroll_factor = 8; skip_factor = 160;
  // added loop to traverse thread blocks
  for (j = 0; j < N_tb; j += concurrent_tb) {
    // added loop to traverse concurrent threads
    for (i = 0; i < concurrent_tb*tb_size; i += skip_factor*unroll_factor) {
      int thread_id = i + skip_factor*unroll_factor + j*tb_size;
      // unrolled loop
      float a0 = memory_fetch_for_thread(thread_id + skip_factor*0);
      float a1 = memory_fetch_for_thread(thread_id + skip_factor*1);
      ……
      sum += a0+a1+a2+a3+a4+a5+a6+a7; /* operation inserted to overcome dead code elimination */
}}}
Figure 7. The CPU pre-execution code for the vector-add GPU kernel with a fixed skip factor of 160.
1. For each kernel loop, extract memory operations and the associated address computations from the loop body and put them in a CPU function; replace the thread id computation with an input parameter, and replace the kernel loop iterators with input parameters.
2. Add a nested loop structure into the CPU code to prefetch data for concurrent threads.
   a. The outer loop traverses through all TBs. The iterator starts from 0 and the loop bound is the number of thread blocks in the GPU program. The iterator update is the number of concurrent TBs, meaning the number of TBs that can run concurrently on the GPU.
   b. The second-level loop corresponds to the kernel loop and we use the same loop bound. The loop update is increased so as to unroll the next-level loop. If the kernel loop is nested, this second-level loop is also nested.
   c. The third-level loop traverses through concurrent threads. The loop iterator starts from 0 and the loop bound is the number of concurrent threads (which is the product of the TB size and the number of concurrent TBs). The iterator update is set as the product of three parameters: unroll_factor, batch_size, and skip_factor.
Figure 8. The compiler algorithm to generate the CPU pre-execution program from HWST GPU kernels.
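To make the structure produced by steps 2a-2c easier to follow, the skeleton below sketches the generated loop nest for a single, non-nested kernel loop. Identifier names follow Figures 7 and 9; the body of the innermost loop, which would call the step-1 function, is left as a comment, so this is a shape-only sketch rather than actual compiler output.

/* Shape-only sketch of the loop nest built by the Figure 8 algorithm
 * (the concrete generated code for TMV is shown in Figure 9). */
void cpu_prefetching_skeleton(int N_tb, int concurrent_tb, int tb_size,
                              int loop_counter_in_thread,
                              int skip_factor, int batch_size, int unroll_factor)
{
    /* Step 2a: all TBs, advancing by the number of concurrently running TBs. */
    for (int j = 0; j < N_tb; j += concurrent_tb) {
        /* Step 2b: mirrors the kernel loop; the update is unroll_factor
         * instead of 1 so that the body below can be unrolled. */
        for (int m = 0; m < loop_counter_in_thread; m += unroll_factor) {
            /* Step 2c: all concurrent threads, advancing by
             * unroll_factor * batch_size * skip_factor. */
            for (int i = 0; i < concurrent_tb * tb_size;
                 i += skip_factor * batch_size * unroll_factor) {
                for (int k = 0; k < batch_size; k++) {
                    int thread_id = i + k * skip_factor * unroll_factor + j * tb_size;
                    (void)thread_id;
                    /* unrolled calls to the step-1 function go here, e.g.
                     * memory_fetch_for_thread_loop_1(thread_id + skip_factor * 0, m); */
                }
            }
            /* adaptive skip_factor update, once per batch (Section 4.2) */
        }
    }
}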
As shown in Figure 8, after generating the function to load data for one loop iteration of a kernel loop (i.e., step 1), we insert loops to prefetch data for many concurrent threads. Similar to LWST kernels, the outer loop is used to prefetch data for concurrent TBs. Before going through concurrent threads, however, we insert the second-level loop to account for the kernel loop structure. The third-level loop traverses through all concurrent threads, similar to step 2b in Figure 2 for LWST kernels. We illustrate our algorithm using a simplified version of transpose-matrix-vector multiplication. The GPU kernel and the generated CPU pre-execution program are shown in Figure 9.

As shown in Figure 9, the CPU function ‘memory_fetch_for_thread_loop_1’ is generated from the loop body of the GPU kernel, and the loop iterator and thread id are replaced with function parameters. In the CPU function ‘cpu_prefetching’, the second-level loop (with iterator ‘m’) corresponds to the kernel loop. The iterator update is a fixed value of 8 rather than 1 so as to unroll the loop body 8 times. The third-level loop (with iterator ‘i’) traverses through concurrent threads for prefetching. The reason for such a loop organization is that the GPU executes many threads in parallel. Therefore, instead of prefetching data of multiple iterations for one thread, we prefetch data of one iteration for many threads before moving on to the next iteration. The ‘skip_factor’ update part in Figure 9 is the same as discussed in Section 4.2, and both the adaptive and profiling approaches can be applied.

From our algorithms shown in Figures 2 and 8, we can see that the granularity of our CPU prefetch function is one loop iteration, with LWST as a special case of HWST. One may suggest a finer granularity, such as prefetching one memory access at a time. In other words, the CPU fetches one datum (e.g., A[n] in Figure 3) for many threads before moving on to the next (B[n] in Figure 3) rather than fetching all the data in one iteration for a thread (A[n], B[n], and C[n] in Figure 3). We do not choose this approach since it requires the CPU and GPU to follow exactly the same access order, and the GPU compiler is more likely to re-order the accesses within a loop body than to reorder accesses across different loops. Furthermore, using one CPU function call to issue one access incurs too much control overhead for CPU execution.

Note that the algorithms in Figures 2 and 8 assume that the TBs are dispatched to SMs in order, which is the case based on our experiments on current discrete GPUs. If out-of-order TB dispatch is used, our scheme would require the GPU to send the active TB ids to the CPU, and the prefetching would be done accordingly for those active blocks. In other words, we would replace the implicit block ids in the loop “for (j = 0; j < N_tb; j += concurrent_tb)” (step 2a in Figure 2 and Figure 8) with explicit ones forwarded from the GPU. Our sequential dispatch assumption eliminates such GPU-to-CPU communication.
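For illustration only, the sketch below shows how the outer loop could consume explicit TB ids forwarded from the GPU in such an out-of-order scenario; get_active_tb_ids() is a purely hypothetical interface, and our scheme does not need it under the sequential-dispatch assumption.

/* Hypothetical variant for out-of-order TB dispatch (not used in our scheme):
 * the GPU forwards the ids of the TBs it is currently running, and the CPU
 * prefetches for exactly those blocks instead of assuming ids j, j+1, ... */
#define MAX_ACTIVE_TBS 64

extern int get_active_tb_ids(int *tb_ids, int max_ids);   /* assumed GPU-to-CPU query */

void cpu_prefetching_out_of_order(int tb_size)
{
    int tb_ids[MAX_ACTIVE_TBS];

    for (;;) {
        int n = get_active_tb_ids(tb_ids, MAX_ACTIVE_TBS);
        if (n <= 0)
            break;                       /* kernel has finished */
        for (int b = 0; b < n; b++) {
            int j = tb_ids[b];           /* explicit block id from the GPU */
            (void)j; (void)tb_size;
            /* run the second- and third-level loops of Figure 8 for the
             * threads j*tb_size ... (j+1)*tb_size - 1 */
        }
    }
}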
__global__ void tmv_naive(float* A, float* B, float* C, int width, int height) {
  int x = blockIdx.x*blockDim.x + threadIdx.x;
  float sum = 0;
  for (int i = 0; i < height; i++) {
    sum += A[i*width+x]*B[i];
  }
  C[x] = sum;
}                                                                    (a)

float memory_fetch_for_thread_loop_1(int n, int m)
{ // n is the thread id and m is the loop iterator
  return (A[m*width+n] + B[m]); /* A, B are the CPU pointers mapped to the GPU memory */
}

float cpu_prefetching( … ) {
  unroll_factor = 8;
  // added loop to traverse thread blocks
  for (j = 0; j < N_tb; j += concurrent_tb) {
    // the loop corresponding to the kernel loop
    for (m = 0; m < loop_counter_in_thread; m += 8) {
      // added loop to traverse concurrent threads
      for (i = 0; i < concurrent_tb*tb_size; i += skip_factor*batch_size*unroll_factor)
        for (k = 0; k < batch_size; k++) {
          int thread_id = i + k*skip_factor*unroll_factor + j*tb_size;
          // unrolled loop
          float a0 = memory_fetch_for_thread_loop_1(thread_id + skip_factor*0, m+0);
          float a1 = memory_fetch_for_thread_loop_1(thread_id + skip_factor*1, m+0);
          ……
          sum += a0+a1+a2+a3+a4+a5+a6+a7; /* operation inserted to overcome dead code elimination */
        }
      for (i = 0; i < concurrent_tb*tb_size; i += skip_factor*batch_size*unroll_factor)
        for (k = 0; k < batch_size; k++) {
          int thread_id = i + k*skip_factor*unroll_factor + j*concurrent_tb*tb_size;
          float a0 = memory_fetch_for_thread_loop_1(thread_id + skip_factor*0, m+1);
          float a1 = memory_fetch_for_thread_loop_1(thread_id + skip_factor*1, m+1);
          ……
          sum += a0+a1+a2+a3+a4+a5+a6+a7;
        }
      …
      // Updating skip factor (See Section 4.2)
}}}                                                                  (b)
Figure 9. A code example for HWST. (a) A transpose-matrix-vector multiplication GPU kernel; (b) the (partial) CPU pre-execution program.

5. Experimental Results

5.1 Performance of CPU-Assisted GPGPU
After the CPU pre-execution program is generated, we let the CPU execute this program right after the GPU kernel is launched. In our first experiment, we examine the performance improvements achieved with our proposed CPU-assisted execution. In Figure 10, we report the GPU performance in instructions per cycle (IPC) for each benchmark for three configurations: no CPU pre-execution (labeled ‘no-preex’), CPU pre-execution with adaptive update of the skip factor (labeled ‘adaptive’), and CPU pre-execution with a fixed skip factor determined from profiling (labeled ‘profiling’). Since our GPU has 4 SMs and each SM has 32 scalar SPs, the peak IPC is 128. We also include the GPU performance results for a perfect L3 cache (labeled ‘perfect L3’) in Figure 10 for reference.
Figure 10. GPU performance comparison among no pre-execution (no-preex), CPU pre-execution with adaptive update of ‘skip_factor’ (adaptive), and CPU pre-execution with a fixed ‘skip_factor’ determined from profiling (profiling).

From Figure 10, we can see that our proposed CPU pre-execution improves performance significantly: up to 113% (MC) and 21.4% on average with adaptive update of ‘skip_factor’, and up to 126% and 23.1% on average using a fixed ‘skip_factor’ determined from profiling. Among these benchmarks, BSc, VD and TMV are memory intensive and we achieve about 30% speedups. The high performance gains for MC are due to the fact that the GPU kernel (without CPU pre-execution) suffers from partition conflicts [9], while our pre-execution exploits the partition-level parallelism of off-chip memory when it prefetches data across multiple TBs. As the GPU requests hit in the L3 cache, they do not go to off-chip memory, thereby avoiding the partition conflicts. The speedups for BS, MV, and FFT are from 4% to 11% due to their irregular address patterns and cache conflicts. There are no performance benefits for MM and Con because they are highly optimized and have good locality and data reuse in the L1 and L2 caches of the GPU, which makes the L3 cache not critical. Even with a perfect L3 cache, the performance gains are negligible for these two benchmarks.

Another observation from Figure 10 is that both the adaptive update of ‘skip_factor’ and a fixed ‘skip_factor’ selected from profiling are effective in improving GPGPU performance. The profiling approach is only slightly better, since the adaptive approach
usually converges quickly to the optimal value, as shown in Figure 6.

5.2 The Efficacy of Data Prefetching using CPU Pre-execution
In this experiment, we examine the efficacy of data prefetching using CPU pre-execution. First, we evaluate the coverage of this prefetching scheme by examining the L3 cache hit rate for GPU accesses with and without CPU prefetching. The results are shown in Figure 11. The hit rate for GPU execution without CPU pre-execution is labeled ‘no-preex’ and GPU execution with CPU pre-execution using the adaptive update of ‘skip_factor’ is labeled ‘adaptive’. The results for CPU pre-execution using a fixed ‘skip_factor’ are very close to those of the adaptive update. From Figure 11, we can see that the L3 cache hit rates are greatly improved by CPU pre-execution. On average, the hit rate improves from 12.9% to 39.2%. The L3 cache hit rate improvements for MM and Con do not translate into performance gains, as shown in Figure 10. The reason is that TLP and the higher-level caches provide sufficient latency hiding for these benchmarks.

Figure 11. L3 hit rate for GPU execution without CPU pre-execution (no-preex) and with CPU pre-execution (adaptive).

Another metric for data prefetching is accuracy, and we evaluate it by computing the ratio of how many L3 misses generated from CPU pre-execution are actually accessed by GPU threads; the results are shown in Figure 12. It can be seen from the figure that our proposed CPU pre-execution has very high accuracy. On average, 98.6% of the data blocks loaded from memory by the CPU are accessed by the GPU threads.

Figure 12. Prefetch accuracy of CPU pre-execution.
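For reference, the two metrics reduce to simple ratios over counters collected in the simulator; the sketch below uses illustrative field names rather than our actual instrumentation.

/* Sketch of the coverage and accuracy metrics (Figures 11 and 12);
 * the counter fields are illustrative, not the simulator's real names. */
struct prefetch_stats {
    unsigned long gpu_l3_hits;        /* GPU L3 accesses that hit                   */
    unsigned long gpu_l3_accesses;    /* all GPU L3 accesses                        */
    unsigned long cpu_blocks_fetched; /* L3 misses generated by CPU pre-execution   */
    unsigned long cpu_blocks_used;    /* of those blocks, ones later touched by GPU */
};

/* Coverage: the L3 hit rate seen by GPU accesses (Figure 11). */
double prefetch_coverage(const struct prefetch_stats *s)
{
    return (double)s->gpu_l3_hits / (double)s->gpu_l3_accesses;
}

/* Accuracy: the fraction of CPU-prefetched blocks actually consumed by
 * GPU threads (Figure 12); 98.6% on average in our experiments. */
double prefetch_accuracy(const struct prefetch_stats *s)
{
    return (double)s->cpu_blocks_used / (double)s->cpu_blocks_fetched;
}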
Since our proposed CPU pre-execution needs to execute instructions to generate prefetching requests for GPU threads, one way to evaluate the overhead of our approach is to examine how many instructions the CPU needs to execute in order to achieve the performance gains. In Figure 13, we report the ratio of the number of instructions executed by the CPU over the number of instructions executed by the GPU for both the adaptive update of ‘skip_factor’ (labeled ‘adaptive’) and a fixed ‘skip_factor’ selected using profiling (labeled ‘profiling’). From Figure 13, we can see that the performance gains shown in Figure 10 are achieved with little instruction overhead. On average, our CPU-assisted GPGPU using the adaptive update of ‘skip_factor’ (a fixed value of ‘skip_factor’) only executes 0.74% (0.69%) extra instructions to deliver the performance gains.

Figure 13. The ratio of (number of instructions executed by the CPU / number of instructions executed by the GPU).

5.3 Understanding the Impact of GPU Architectures
In this experiment, we vary the following GPU architecture parameters to understand their impact on our CPU-assisted GPGPU: the GPU SP frequency, the off-chip memory frequency, and the number of SPs in an SM. First, we vary the GPU SP frequency from the default 480 MHz to 267 MHz and 800 MHz. The CPU frequency remains at 2.4 GHz and the DRAM bandwidth remains at 19.2 GB/s. In Figure 14, we report the speedups that are achieved from CPU-assisted execution for each SP frequency compared to no CPU pre-execution (labeled ‘sp267_speedup’, ‘sp480_speedup’, and ‘sp800_speedup’). All CPU pre-execution uses the adaptive update of ‘skip_factor’ in the experiments in this section; the fixed ‘skip_factor’ has slightly better performance gains. The results labeled ‘sp480_speedup’ are those reported in Figure 10.

Figure 14. The speedups from CPU pre-execution for GPUs running at different frequencies and the normalized execution time without pre-execution.
From Figure 14, we can see that when the SP frequency is reduced, the relative memory latency is also reduced. Therefore, CPU pre-execution provides smaller performance gains. On the other hand, when we increase the SP frequency, these benchmarks show different trends. First, the benchmarks VD, FFT, and BSc have higher performance gains as the memory latency becomes more significant. Secondly, for the benchmarks TMV, MV, and MC, the impact is the opposite; the reason is that the GPU SPs nearly double the rate of their memory requests, which forces the CPU to skip more threads and to prefetch less data. For the benchmark BS, its baseline IPC is very high (close to 90) when the SP frequency is 480 MHz, thereby limiting the pre-execution impact, as shown in Figure 10. Overall, increasing the speed of the SPs has less impact than decreasing the speed, and CPU pre-execution is still effective for all these different SP speeds.

Next, we vary the off-chip memory frequency from the default 600 MHz to 300 MHz and 1200 MHz; the results are shown in Figure 15. From Figure 15, we can see that when the memory frequency is increased, the memory latency is reduced. Therefore, CPU pre-execution provides smaller performance gains. When we reduce the memory frequency, the trends of these benchmarks are similar to those for increasing the SP frequency. For example, for BSc, VD and FFT, CPU pre-execution shows better speedups when the memory frequency is reduced from 600 MHz to 300 MHz, because memory latency dominates these three benchmarks. For TMV, MV, and MC, memory latency does not dominate the execution time. Therefore, the speedups of CPU pre-execution are reduced when the memory frequency is reduced from 600 MHz to 300 MHz. Overall, reducing the memory speed has much less impact than increasing it, and our proposed CPU pre-execution is effective for all these different memory speeds.

Figure 15. The speedups from CPU pre-execution for off-chip memory running at different frequencies and the normalized execution time without pre-execution.

Then, we vary the number of SPs in an SM and keep the same number (4) of SMs in our GPU model. In Figure 16, we report the speedups that are achieved from CPU-assisted execution for each SM configuration compared to no CPU pre-execution (labeled ‘w16_speedup’, ‘w32_speedup’, and ‘w64_speedup’). In our baseline GPU configuration, each SM has 32 SPs. From the figure, we can see that when the number of SPs is reduced in an SM (while keeping the same number of SMs in the GPU), the GPU becomes more latency tolerant, as each instruction in a warp takes more cycles to finish. Given the same application, reducing the number of SPs is equivalent to increasing TLP, thereby reducing the performance gains achieved from CPU pre-execution. On the other hand, increasing the number of SPs also increases the rate of their memory requests, similar to increasing the SP speed, which can also reduce the effectiveness of CPU pre-execution. Nevertheless, on average, for these three SM configurations, CPU pre-execution achieves 14.7%, 21.4%, and 12.4% performance improvement, respectively.

Figure 16. The speedups from CPU pre-execution for different numbers of SPs in an SM.

5.4 Sensitivity of the Parameters in the CPU Pre-Execution Program
In this experiment, we study the sensitivity of the two parameters used in our CPU pre-execution program to update the variable ‘skip_factor’ (see Section 4.2). The first is the ‘batch_size’, which determines how often the skip_factor is updated. We vary this variable among 8, 16, and 32, and the GPU performance results are shown in Figure 17. As seen from the figure, although the batch size of 16 achieves the best performance, the performance difference for different batch sizes is limited, except for BSc, which prefers a large batch size.

Figure 17. The GPU performance for CPU pre-execution using different batch_sizes.
In another experiment, we also change the threshold used to determine whether there are too many or too few L3 cache hits. We vary the threshold from 10 to 50, and the results are nearly identical, showing that our algorithm is not sensitive to this parameter.

5.5 Using CPU to Execute GPU Threads
In this experiment, we consider the option of using the CPU to directly execute some GPU threads to reduce the GPU workload. On the GPU side, the thread blocks are distributed to SMs based on the order of the thread block id, from small to large. On the CPU side, the CPU executes the thread blocks from the opposite direction, starting from the one with the largest thread block id. In our simulator, we implemented a special instruction for the CPU to get the largest thread block id issued in the GPU. This way, we ensure that there is no overlapping workload between the CPU and the GPU. The speedups of such workload distribution between the CPU and GPU over GPU-only execution are shown in Figure 18. From the figure, we can see that the performance gains of most of the benchmarks are less than 2%. The main reason is the limited floating-point throughput of the CPU and the high overhead for the CPU to access the GPU memory partition. Among the workloads, the benchmark VD shows the highest speedup (about 5%) since it does not have many ALU operations to expose the ALU bottleneck of the CPU.

Figure 18. The speedups of workload distribution between GPU and CPU over GPU-only execution.
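The sketch below illustrates this partitioning from the CPU side; query_largest_issued_tb() stands in for the special instruction we added to the simulator, and execute_tb_on_cpu() is an assumed scalar CPU version of the kernel, so this is an illustration of the policy rather than our implementation.

/* Sketch of splitting TBs between GPU and CPU (Section 5.5): the GPU is
 * handed TBs in increasing id order, while the CPU claims TBs from the
 * largest id downward until it reaches a TB the GPU has already issued. */
extern int  query_largest_issued_tb(void);   /* assumed wrapper for the new instruction */
extern void execute_tb_on_cpu(int tb_id);    /* assumed scalar CPU version of the kernel */

void cpu_execute_tail_tbs(int N_tb)
{
    for (int tb = N_tb - 1; tb >= 0; tb--) {
        if (tb <= query_largest_issued_tb())
            break;                           /* the GPU has reached this TB: stop here */
        execute_tb_on_cpu(tb);               /* no overlap with the GPU's work */
    }
}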
6. Related work

Although a key design philosophy of GPUs is to use TLP to hide long memory access latency, the importance of the GPU memory hierarchy has been widely recognized for achieving high-performance GPGPU. In [23], software prefetching is used to overlap memory access latency with computations. However, prefetching data into registers or shared memory increases register pressure and may hurt performance due to reduced TLP [27]. In [12], Lee et al. proposed many-thread-aware GPU prefetching approaches for the L1 cache. Besides leveraging the well-known stride access pattern, they revealed the interesting insight that when a workload is parallelized into many threads, each thread may be short, and inter-thread/inter-warp prefetching is more effective than intra-thread/intra-warp prefetching. Compared to this work, our proposed CPU pre-execution does not rely on stride access patterns and provides both intra- and inter-warp prefetching. More importantly, all the previous works on GPU prefetching do not fit well with fused architectures, as both demand cache misses and prefetches compete for critical resources, such as L2 cache miss handling status registers (L2 MHSRs), on the GPU side while leaving the CPU-side resources idle. Our proposed approach, in contrast, leverages such critical resources on the CPU side for prefetches and keeps those on the GPU side for demand misses, thereby achieving better resource utilization. We also implemented the per-PC stride prefetcher with enhanced warp id indexing [12] in our simulator, which shows a 5.24% speedup on average.

To take advantage of fused architectures, it is proposed in [7] that the GPU runs prefetching algorithms to prefetch data for CPU programs. In comparison, our goal is to accelerate GPU programs, and we believe this is a better fit for fused architectures since both the GPU and the CPU are used to do what they are good at: the GPU for ALU/floating-point computations and the CPU for flexible and accurate data prefetching.

Our proposed CPU-assisted GPGPU is also inspired by many works on CPU-based pre-execution [2][4][6][8][10][11][17][18][20][24], in which a pre-execution thread is used to provide data prefetching and/or accurate control flow to the main thread. The novelty of our work is to use a single CPU thread to prefetch data for many concurrent GPU threads and a simple yet effective way to control how far the pre-execution thread runs ahead.

7. Conclusion

In this paper, we propose to collaboratively utilize CPU and GPU resources for GPGPU applications. In our scheme, the CPU runs ahead of the GPU threads to prefetch data into the shared L3 cache for the GPU. Novel compiler algorithms are developed to automatically generate CPU pre-execution programs from GPU kernels. We also provide flexible mechanisms to control how far the CPU runs ahead of the GPU threads. Our experimental results show that our proposed CPU pre-execution has very high prefetching accuracy and achieves significant performance gains at the cost of minor instruction overhead on the CPU side. Furthermore, our results show that the proposed scheme remains effective for different GPU configurations.
Acknowledgements

We thank the anonymous reviewers for their insightful comments to improve our paper. This work is supported by an NSF CAREER award CCF-0968667 and a gift fund from AMD Inc.

References

[1] A. Bakhoda, G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, Analyzing CUDA workloads using a detailed GPU simulator. IEEE International Symposium on Performance Analysis of Systems and Software, April 2009.
[2] A. Roth and G. Sohi, Speculative data driven multithreading. IEEE International Symposium on High Performance Computer Architecture, 2001.
[3] AMD Accelerated Parallel Processing (APP) SDK V2.3, http://developer.amd.com/gpu/atistreamsdk/pages/default.aspx, 2011.
[4] C. K. Luk, Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. International Symposium on Computer Architecture, 2001.
[5] C. Luk, S. Hong, and H. Kim, Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. IEEE/ACM International Symposium on Microarchitecture, 2009.
[6] C. Zilles and G. Sohi, Execution-based prediction using speculative slices. International Symposium on Computer Architecture, 2001.
[7] D. H. Woo and H. S. Lee, COMPASS: a programmable data prefetcher using idle GPU shaders. International Conference on Architectural Support for Programming Languages and Operating Systems, March 2010.
[8] D. Kim and D. Yeung, Design and evaluation of compiler algorithms for pre-execution. International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[9] G. Ruetsch and P. Micikevicius, Optimize matrix transpose in CUDA. NVIDIA, 2009.
[10] H. Zhou, Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window. International Conference on Parallel Architectures and Compilation Techniques, Sept. 2005.
[11] J. D. Collins, H. Wang, D. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen, Speculative precomputation: long range prefetching of delinquent loads. International Symposium on Computer Architecture, 2001.
[12] J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, Many-thread aware prefetching mechanisms for GPGPU applications. IEEE/ACM International Symposium on Microarchitecture, 2010.
[13] MARSSx86, http://marss86.org/~marss86/index.php/Home.
[14] N. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli, High performance discrete Fourier transforms on graphics processors. Proceedings of Supercomputing, 2008.
[15] NVIDIA CUDA C Programming Guide 3.1, 2010.
[16] NVIDIA GPU Computing SDK 3.1, http://developer.nvidia.com/gpu-computing-sdk, 2011.
[17] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. IEEE International Symposium on High Performance Computer Architecture, February 2003.
[18] R. Balasubramonian, S. Dwarkadas, and D. Albonesi, Dynamically allocating processor resources between nearby and distant ILP. International Symposium on Computer Architecture, 2001.
[19] P. Boudier, Memory System on Fusion APUs - The Benefits of Zero Copy. AMD Fusion Developer Summit, 2011.
[20] P. H. Wang, H. Wang, J. D. Collins, E. Grochowski, R. M. Kling, and J. P. Shen, Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation. IEEE International Symposium on High Performance Computer Architecture, 2002.
[21] Sandy Bridge, http://en.wikipedia.org/wiki/Sandy_Bridge.
[22] S. I. Lee, T. Johnson, and R. Eigenmann, Cetus - an extensible compiler infrastructure for source-to-source transformation. Workshops on Languages and Compilers for Parallel Computing, 2003.
[23] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S. Ueng, J. A. Stratton, and W. W. Hwu, Optimization space pruning for a multi-threaded GPU. International Symposium on Code Generation and Optimization, 2008.
[24] Y. Solihin, J. Lee, and J. Torrellas, Using a user-level memory thread for correlation prefetching. International Symposium on Computer Architecture, 2002.
[25] Streaming SIMD Extensions, http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions.
[26] The AMD Fusion Family of APUs, http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx.
[27] Y. Yang, P. Xiang, J. Kong, and H. Zhou, A GPGPU Compiler for Memory Optimization and Parallelism Management. ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010.