CPU-Assisted GPGPU on Fused CPU-GPU Architectures

Yi Yang, Ping Xiang, Mike Mantor*, Huiyang Zhou


Department of Electrical and Computer Engineering, North Carolina State University
{yyang14, pxiang, hzhou}@ncsu.edu
*Graphics Products Group, Advanced Micro Devices
[email protected]

Abstract

This paper presents a novel approach to utilize the CPU resource to facilitate the execution of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures, the GPU and the CPU are integrated on the same die and share the on-chip L3 cache and off-chip memory, similar to the latest Intel Sandy Bridge and AMD accelerated processing unit (APU) platforms. In our proposed CPU-assisted GPGPU, after the CPU launches a GPU program, it executes a pre-execution program, which is generated automatically from the GPU kernel using our proposed compiler algorithms and contains the memory access instructions of the GPU kernel for multiple thread blocks. The CPU pre-execution program runs ahead of GPU threads because (1) the CPU pre-execution thread only contains memory fetch instructions from GPU kernels and not floating-point computations, and (2) the CPU runs at higher frequencies and exploits higher degrees of instruction-level parallelism than GPU scalar cores. We also leverage the prefetcher at the L2 cache on the CPU side to increase the memory traffic from the CPU. As a result, the memory accesses of GPU threads hit in the L3 cache and their latency can be drastically reduced. Since our pre-execution is directly controlled by user-level applications, it enjoys both high accuracy and flexibility. Our experiments on a set of benchmarks show that our proposed pre-execution improves performance by up to 113% and by 21.4% on average.

1. Introduction

The integration trend of CMOS devices has led to fused architectures, in which central processing units (CPUs) and graphics processing units (GPUs) are integrated onto the same chip. Recent examples include Intel's Sandy Bridge [21], on which CPUs and GPUs are on one chip with a shared on-chip L3 cache, and the AMD accelerated processing unit (APU) [26], on which the on-chip CPUs and GPUs share the same off-chip memory [19]. Such heterogeneous architectures provide the opportunity to leverage both the high computational power of GPUs for regular applications and the flexible execution of CPUs for irregular workloads. In this paper, we assume that the fused CPU-GPU architecture has a shared L3 cache and shared off-chip memory between CPUs and GPUs, and we propose a novel approach to collaboratively utilize the CPU and GPU resources. In our proposed approach, called CPU-assisted GPGPU (general-purpose computation on graphics processor units), after the CPU launches a GPU program, it starts a pre-execution program to prefetch the off-chip memory data into the shared L3 cache for GPU threads.

Our proposed CPU-assisted GPGPU works as follows. First, we develop a compiler algorithm to generate the CPU pre-execution program from GPU kernels. It extracts memory access instructions and the associated address computations from GPU kernels and then adds loops to prefetch data for concurrent threads with different thread identifiers (ids). The update of the loop iterators provides a flexible way to select/skip thread ids for prefetching (see Section 4). Second, when the GPU program is launched, the CPU runs the pre-execution program. Such pre-execution is expected to warm up the shared L3 cache for GPU threads since (1) the pre-execution program only contains the memory operations (and address calculations) but not floating-point/ALU computations, and (2) the CPU runs at a higher frequency and is more aggressive in exploiting instruction-level parallelism (ILP). To make the proposed pre-execution effective, it is critical to control the timing of the prefetches: they need to be issued early enough to hide memory access latencies, but not so early that the prefetched data might be replaced before being utilized. We propose two mechanisms to achieve this through the loop iterator update code in the pre-execution program. The first is an adaptive approach, which requires a new CPU instruction to inquire the performance counter of L3 cache hits periodically so as to adjust the loop iterator update. The insight is that if there are too many L3 hits experienced by the CPU pre-execution program, the CPU is not effective in fetching new data into the L3 cache. As a result, the CPU needs to run further ahead by increasing the amount of the loop iterator update. On the
other hand, if there are too few L3 cache hits, meaning that the CPU pre-execution program may skip too many threads, it reduces the loop iterator update so as to select more thread ids for prefetching. The second is a static approach, which determines the best loop iterator update value based on profiling and does not require any new hardware support.

In summary, this paper makes the following contributions: (1) we propose to utilize the otherwise idle CPU to assist GPGPU through pre-execution; (2) we propose compiler algorithms to generate the CPU pre-execution program from different types of GPU kernels; (3) we propose simple yet effective approaches to control how far the CPU code runs ahead of GPU threads; and (4) we implemented our proposed schemes by integrating and modifying the MARSSx86 [13] and GPGPU-Sim [1] timing simulators, and our results show that our proposed CPU pre-execution can improve the performance of GPU programs by up to 113% (126%) and by 21.4% (23.1%) on average using adaptive iterator update (fixed iterator update). The cost of achieving such performance gains is nominal: the average instruction overhead of the CPU pre-execution program is 0.74% (0.69%) of the number of instructions executed by the GPU using adaptive iterator update (fixed iterator update).

The remainder of the paper is organized as follows. In Section 2, we present a brief background on GPGPU and fused CPU-GPU architectures. In Section 3, we present our modeling of the fused CPU-GPU architecture and our experimental methodology. Section 4 discusses our proposed CPU-assisted GPGPU in detail. The experimental results are presented in Section 5. Related work is discussed in Section 6. Finally, Section 7 concludes the paper.

2. Background

2.1. General Purpose Computation on Graphics Processor Units (GPGPUs)

State-of-the-art GPUs use a many-core architecture to deliver very high computational throughput. The processor cores in a GPU are organized in a hierarchy [15]. In NVIDIA/AMD GPUs, a GPU has a number of streaming multiprocessors (SMs) / SIMD engines, and each SM/SIMD engine in turn contains multiple streaming processors (SPs) / thread processors (TPs). The on-chip memory resources include register files, shared memory, and relatively small caches for different memory regions. In GPUs, a large number of threads are supported to run concurrently in order to hide long off-chip memory access latencies. The threads follow the single-program multiple-data (SPMD) program execution model and are also organized in a hierarchy. A number of threads are grouped in a warp (32 threads for NVIDIA GPUs) or a wavefront (64 threads for AMD GPUs), which is executed in the single-instruction multiple-data (SIMD) manner. Multiple warps/wavefronts are assembled together into a thread block/workgroup with the capability to communicate data through on-chip shared memory. Each SM can host one or more thread blocks depending on the resource usage of each thread.

2.2. Fused CPU-GPU architectures

Advances in CMOS technology make it possible to integrate multi-core CPUs and many-core GPUs on the same chip, as exemplified by the latest AMD accelerated processing units (APUs) and Intel Sandy Bridge processors. Figure 1 shows such a fused architecture with a shared L3 cache and shared off-chip memory.

[Figure 1 omitted: block diagram of CPU cores (with L1I/L1D and L2 caches) and GPU SMs (with per-SM L1/shared memory and a GPU L2) connected to an on-chip shared L3 cache and off-chip memory.]
Figure 1: A fused CPU-GPU architecture with a shared on-chip L3 cache and off-chip memory.

Compared to high-end discrete GPUs, the GPUs on current fused architectures have less computation throughput and lower memory access bandwidth. However, with the shared off-chip memory, fused architectures eliminate the costly CPU-GPU data transfers [19], and the existence of the shared on-chip L3 cache opens more opportunities for close collaboration between GPUs and CPUs. Compared to CPUs, the on-chip GPUs still deliver high computational power. For example, on the AMD E2-3200 APU, the GPU module (HD6370D) has a throughput of 141.8 GFLOPS while the CPU has a throughput of 38.4 GFLOPS with SSE [25]. In this paper, we propose a new approach to collaboratively utilize both CPU and GPU resources on fused architectures efficiently to achieve high-performance GPGPU.

3. Architectural Modeling and Experimental Methodology

In this paper, we model the fused architecture as shown in Figure 1. To build our simulator
infrastructure, we use the full-system x86 timing simulator MARSSx86 [13] for the CPU part and GPGPU-Sim [1] for the GPU part. To merge the two simulators to model the fused architecture, we consider the CPU simulator as the host, meaning that the GPU simulator is invoked from the CPU simulator. Every few CPU cycles, determined by the frequency ratio of the CPU over the GPU, the CPU simulator invokes a GPU cycle. We also ported the DRAM model in GPGPU-Sim into MARSSx86.

To enable the full-system simulator to run CPU code and GPU code collaboratively, we partition the memory space of the fused CPU-GPU architecture into two parts: the lower address memory space that is used solely by the CPU and the upper address memory space that can be accessed by both the GPU and the CPU. For example, if we allocate 256MB memory for the whole system, we reserve the upper 128MB memory by passing "-mem=128M" as a parameter to boot the Linux operating system, so that the operating system only uses the lower 128MB memory and reserves the upper 128MB memory for the GPU. The GPU accesses the DRAM directly with physical addresses starting from 128MB. If a CPU application needs to access the upper memory space, we first use the Linux function 'ioremap' to map the upper 128MB memory space into the operating system as a memory device. Then, applications in user space can use the 'mmap' system call to map the memory device into the user virtual memory space. This way, applications in user space and GPU programs can access the same DRAM memory. As a side effect of our simulator infrastructure, the CPU memory accesses have higher overheads than GPU memory accesses, since applications in CPU user space need to first perform address mapping to get physical addresses. However, as shown in Section 5, our proposed CPU pre-execution can achieve significant performance improvement even with our pessimistic handling of CPU memory accesses compared to GPU memory accesses.
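To make the user-space side of this mapping concrete, the sketch below shows how an application could map such a memory device; it is a minimal illustration that assumes a kernel module has already ioremap'ed the reserved region and exposed it as a character device. The device path "/dev/gpu_shared_mem" is a hypothetical name, not the one used in our infrastructure.

/* Minimal sketch: map the reserved upper 128MB into user space so the CPU
 * pre-execution program can touch the same DRAM the GPU uses.
 * "/dev/gpu_shared_mem" is a hypothetical device name. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_REGION_SIZE (128UL * 1024 * 1024)  /* reserved upper 128MB */

int main(void) {
  int fd = open("/dev/gpu_shared_mem", O_RDWR);
  if (fd < 0) { perror("open"); return 1; }
  /* Map the memory device into the user virtual address space. */
  float *gpu_mem = (float *)mmap(NULL, SHARED_REGION_SIZE,
                                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (gpu_mem == MAP_FAILED) { perror("mmap"); close(fd); return 1; }
  volatile float probe = gpu_mem[0];  /* CPU and GPU now share this DRAM */
  (void)probe;
  munmap((void *)gpu_mem, SHARED_REGION_SIZE);
  close(fd);
  return 0;
}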
The parameters in our simulator are set up to model the AMD APU E2-3200 [26], except for the shared L3 cache. For the CPU part, we model a 4-way-issue out-of-order CPU running at 2.4 GHz with a 128KB L1 cache and a 512KB L2 cache. For the GPU part, we model a GPU with 4 SMs, and each SM contains 32 SPs running at 480 MHz. Each SP can deliver up to 2 flops (a fused multiply and add) every cycle. As a result, the overall GPU computation power in our model is about 122.8 GFLOPS. The register file and the shared memory in each SM are 32KB (8K registers) and 16KB, respectively. Each SM can support up to 768 active threads. In our experiments, we also vary the key parameters to explore the design space of the on-chip GPU. Our DRAM memory model has 8 memory modules and every module supports a bus width of 32 bits. So, the bandwidth of the DRAM memory in the simulator is 600 MHz * 32 bits * 8 = 19.2 GB/s. Because the frequency of the GPU is 1/5 of the CPU frequency, the simulator executes one GPU cycle every 5 CPU cycles and a DRAM cycle every 4 CPU cycles. We also add the L3 cache to the simulator, which is 4MB and is shared by the CPU and the GPU. The L3 cache hit latency for the CPU is 40 CPU cycles, while the L3 hit latency for the GPU is 20 GPU cycles (i.e., 80 CPU cycles). The difference accounts for the fact that the CPU core is located closer to the L3 cache than the GPU, and the GPU is less latency sensitive than the CPU.
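As a rough illustration of this clock-ratio scheduling, the loop below advances the GPU once every 5 CPU cycles and the DRAM once every 4 CPU cycles; cpu_step(), gpu_step(), and dram_step() are hypothetical single-cycle step functions, not the actual MARSSx86/GPGPU-Sim interfaces.

/* Sketch of the co-simulation driver: the CPU simulator is the host and
 * invokes the GPU and DRAM models at the modeled frequency ratios
 * (2.4 GHz CPU : 480 MHz GPU : 600 MHz DRAM). */
void cpu_step(void);   /* provided elsewhere (hypothetical) */
void gpu_step(void);   /* provided elsewhere (hypothetical) */
void dram_step(void);  /* provided elsewhere (hypothetical) */

void run_fused_simulation(unsigned long long total_cpu_cycles) {
  for (unsigned long long c = 1; c <= total_cpu_cycles; c++) {
    cpu_step();                   /* one CPU cycle                     */
    if (c % 5 == 0) gpu_step();   /* GPU runs at 1/5 of the CPU clock  */
    if (c % 4 == 0) dram_step();  /* DRAM runs at 1/4 of the CPU clock */
  }
}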
The benchmarks used in our experiments are listed in Table 1. Among the benchmarks, Blackscholes, Vecadd, and Montecarlo are from the NVIDIA SDK [16]. BitonicSort and Convolution are from the AMD SDK [3]. Matrix multiplication is from [1]. We implemented and optimized transpose-matrix-vector multiplication and matrix-vector multiplication, which have performance similar to the CUBLAS library. The Fast Fourier Transform implementation is based on the work by Govindaraju et al. [14].

Table 1. Benchmarks used in experiments
Benchmarks                                    | Input sizes  | Number of threads | Threads per thread block
Blackscholes (BSc)                            | 1M           | 1M                | 128
Vector-Add (VD)                               | 1M           | 1M                | 128
Fast Fourier Transform (FFT)                  | 1M           | 512K              | 128
Matrix Multiplication (MM)                    | 512x512      | 256K              | 16x16
Convolution (Con)                             | 1kx1k & 3x3  | 1M                | 16x16
Transpose matrix vector multiplication (TMV)  | 512x2048     | 512               | 256
BitonicSort (BS)                              | 2M           | 1M                | 256
MonteCarlo (MC)                               | 256K         | 128K              | 256
Matrix-vector multiplication (MV)             | 16x65k       | 65K               | 256

Our proposed compiler algorithms to generate CPU code from GPU kernels (see Section 4) are implemented using a source-to-source compiler infrastructure, Cetus [22].

4. CPU-Assisted GPGPU

In our proposed CPU-assisted GPGPU, after the CPU launches the GPU kernel, it starts running a pre-execution program to prefetch data into the shared L3 cache for GPU threads. Although we use the same compiler algorithm to generate CPU pre-execution programs from different GPU kernels, for the purpose
of clarity, we classify GPU kernels into two types, (a) lightweight workload in a single GPU thread (LWST), and (b) heavyweight workload in a single GPU thread (HWST), and discuss the generation of CPU pre-execution programs separately. The difference between the two types is that HWST GPU kernels have one or more loops which contain global memory accesses, while LWST kernels do not.

In this section, we first present our compiler algorithm to generate the CPU pre-execution program from LWST GPU kernels. Next, we discuss the mechanisms to control how far the CPU pre-execution can run ahead of GPU threads. Then, we present the generic compiler algorithm to generate the CPU pre-execution program from LWST/HWST GPU kernels.

4.1. Generating the CPU Pre-Execution Code from LWST GPU Kernels

For LWST GPU kernels, our proposed compiler algorithm to generate the pre-execution program is shown in Figure 2.

1. For a GPU kernel, extract its memory operations and the associated address computations and put them in a CPU function; replace the thread id computation with an input parameter.
2. Add a nested loop structure into the CPU code to prefetch data for concurrent threads.
   a. The outer loop traverses through all TBs. The iterator starts from 0 and the loop bound is the number of thread blocks of the GPU program. The iterator update is the number of concurrent TBs, meaning the number of TBs that can run concurrently on the GPU.
   b. The second-level loop traverses through concurrent threads. The loop iterator starts from 0 and the loop bound is the number of concurrent threads (which is the product of the TB size and the number of concurrent TBs). The iterator update is set as a product of three parameters, unroll_factor, batch_size, and skip_factor.
Figure 2. The compiler algorithm to generate the CPU pre-execution program from LWST GPU kernels.

As shown in Figure 2, the algorithm contains two parts. First, it extracts all the memory access operations and the associated address computations from a GPU kernel and puts them in a CPU function. The store operations are converted to load operations. The thread id is converted to a function parameter. Second, the compiler adds loops so as to prefetch data for concurrent threads. Since GPU threads are organized in thread blocks (TBs), a nested loop is inserted. The outer loop is used to traverse through all TBs. The iterator starts from 0 and is updated with the number of concurrent TBs, which is determined by the resource usage (registers, shared memory, and the TB size) of each TB and the GPU hardware configuration. The resource usage information is available from the GPU compiler, such as nvcc, and the hardware information is dependent on the target platform.
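To make step 2a concrete, the number of concurrent TBs can be estimated from these limits as sketched below; the per-SM limits are those of our simulated GPU and the helper itself is an illustrative assumption, not code emitted by our compiler.

#include <limits.h>

/* Sketch: estimate 'concurrent_tb', the number of thread blocks that can be
 * resident on the GPU at the same time, from the per-TB resource usage
 * reported by the GPU compiler. Limits follow our simulated GPU: 8K registers
 * and 16KB shared memory per SM, 768 threads per SM, and 4 SMs. */
int concurrent_tbs(int regs_per_thread, int smem_bytes_per_tb, int tb_size) {
  const int regs_per_sm    = 8 * 1024;
  const int smem_per_sm    = 16 * 1024;
  const int threads_per_sm = 768;
  const int num_sms        = 4;

  int by_regs    = regs_per_sm / (regs_per_thread * tb_size);
  int by_smem    = smem_bytes_per_tb ? smem_per_sm / smem_bytes_per_tb : INT_MAX;
  int by_threads = threads_per_sm / tb_size;

  int per_sm = by_regs;
  if (by_smem    < per_sm) per_sm = by_smem;
  if (by_threads < per_sm) per_sm = by_threads;
  return per_sm * num_sms;  /* GPU-wide number of concurrent TBs */
}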
The second-level loop is used to traverse through all concurrent threads in these concurrent TBs, and the iterator is used to compute the thread ids for which the data will be prefetched. As shown in Figure 2, the iterator update is set as a product of three parameters (unroll_factor * batch_size * skip_factor). The last two are used in our proposed mechanisms to control how far the pre-execution program should run ahead of GPU threads (see Section 4.2). The unroll factor is used to boost the memory requests from the CPU. Before dissecting this parameter, we first illustrate our compiler algorithm using a vector-add GPU kernel, which is an LWST kernel as it does not have any loops containing global memory accesses. Both the GPU code and the CPU pre-execution code are shown in Figure 3.

__global__ void VecAdd(const float* A, const float* B, float* C, int N) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  C[i] = A[i] + B[i];
}                                                                  (a)

float memory_fetch_for_thread (int n) {
  return (A[n] + B[n] + C[n]); /* A, B, C are the CPU pointers mapped to the GPU memory */
}
float cpu_prefetching( … ) {
  unroll_factor = 8;
  //added loop to traverse thread blocks
  for (j = 0; j < N_tb; j += concurrent_tb) {
    //added loop to traverse concurrent threads
    for (i = 0; i < concurrent_tb*tb_size; i += skip_factor*batch_size*unroll_factor) {
      for (k = 0; k < batch_size; k++) {
        int thread_id = i + skip_factor*k*unroll_factor + j*tb_size;
        // unrolled loop
        float a0 = memory_fetch_for_thread (thread_id + skip_factor*0);
        float a1 = memory_fetch_for_thread (thread_id + skip_factor*1);
        ……
        sum += a0+a1+a2+a3+a4+a5+a6+a7; /* operation inserted to overcome dead code elimination */
      }
      //Updating skip factor (See Section 4.2)
      …
}}}                                                                (b)
Figure 3. A code example for LWST. (a) A vector-add GPU kernel; (b) the (partial) CPU pre-execution program.

As shown in Figure 3, the function 'memory_fetch_for_thread' is the result of extracting the memory accesses from the GPU kernel and can be used to prefetch data for the thread with thread id 'n'. The
memory update operation on 'C[n]' is converted to a load operation. The loaded values are summed together so as to avoid the compiler eliminating this function as dead code when the CPU program is compiled. The outer loop in the function 'cpu_prefetching' is the one that traverses all TBs. The loop bound 'N_tb' is computed as (N / TB size), i.e., the number of TBs. The iterator update 'concurrent_tb' is the number of TBs that can run concurrently on the GPU. The second loop with the iterator 'i' is introduced to prefetch data for concurrent threads. The loop iterator 'i' is used to compute the thread ids for which the data will be prefetched. Since only the thread ids (thread_id + skip_factor * 0), (thread_id + skip_factor * 1), …, (thread_id + skip_factor * (unroll_factor – 1)) will be used for prefetching, the variable 'skip_factor' determines the number of threads to be skipped before a thread id is used for prefetching. We initialize this variable 'skip_factor' to be 'L3 cache line size / the size of float' (16 according to our model) so that we do not waste CPU instructions prefetching data from the same cache line. Similarly, if next-line prefetching is enabled, 'skip_factor' is initialized to '2 x L3 cache line size / the size of float'.
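For reference, this initialization can be written as the small helper below; it assumes the 64-byte L3 cache line implied by the value of 16 above, and it is an illustration rather than the exact code emitted by our compiler.

/* Sketch: pick the initial skip_factor so that consecutively selected thread
 * ids touch different L3 cache lines (two lines apart when the CPU L2
 * next-line prefetcher is enabled). With 64-byte lines and 4-byte floats,
 * this yields 16 and 32, respectively. */
int initial_skip_factor(int l3_line_size_bytes, int next_line_prefetch) {
  int skip = l3_line_size_bytes / (int)sizeof(float);
  return next_line_prefetch ? 2 * skip : skip;
}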
In order to make the CPU pre-execution program effective, we need to consider the memory traffic carefully, as the GPU in fused CPU-GPU architectures can easily dominate the memory traffic, in which case the impact of CPU prefetching will be very limited. The reasons are: (1) a GPU has many SPs (128 in our model) and every SP can issue one memory fetch in one GPU cycle. If these fetches miss the GPU caches and cannot be merged, the GPU will have a high rate of off-chip memory accesses; (2) the GPU is designed to support high degrees of thread-level parallelism (TLP) and the independent threads can issue memory requests as long as there are no structural hazards on the resources that support outstanding requests, such as miss status handling registers (MSHRs) or memory request queues. As a result, although the GPU frequency is lower than the CPU frequency, the number of memory fetches from the GPU can be much larger than that from the CPU. We analyze this effect using the example code in Figure 3 and the results are shown in Figure 4.

In Figure 4, we compare the memory requests generated from the CPU running the pre-execution code with different unroll factors (labeled 'un_N'). We also enable the L2 cache next-line prefetcher on the CPU side to boost the number of memory requests from the CPU (labeled 'un_N_pref'). The rate of memory requests generated by the GPU is also included for comparison. From the figure, we can see that (1) increasing the unroll factor increases the number of memory requests significantly. With an unroll factor of 16, the memory requests are about 1.5X of those with an unroll factor of 1. (2) Prefetching is also important to maximize the memory requests from the CPU side. An unroll factor of 8 combined with next-line prefetching achieves good utilization of the request queue at the CPU L2 cache, which is the reason why we set the value of the unroll factor to 8 in Figure 3. (3) Given the high number of GPU SPs, GPU memory requests are about 3X compared to the CPU memory requests using our default 'un_8_pref' configuration. We also tried increasing the number of cache lines that are prefetched in the L2 to further increase the number of memory requests from the CPU; our results with prefetching 2 or 4 cache lines only show less than 0.2% speedup compared to the next-line prefetcher. Therefore, we use the next-line prefetcher in our experiments.

[Figure 4 omitted: bar chart of the number of memory fetches per 100,000 cycles.]
Figure 4. Comparing memory requests generated from different versions of CPU code and from the GPU for every 100,000 cycles. "un_N" means the CPU pre-execution program with the unroll factor of N; "un_N_pref" means the CPU pre-execution program with the unroll factor of N and with CPU L2 cache next-line prefetch enabled.

4.2. Mechanisms to Control How Far the CPU Code Can Run Ahead of GPU Threads

As shown in Figure 2, the CPU pre-execution code primarily contains the memory operations from the GPU kernel. Considering the fact that the CPU runs at a higher frequency and employs out-of-order execution to exploit instruction-level parallelism (ILP), we expect that the CPU pre-execution code runs ahead of the GPU kernels, meaning that when the CPU code selects a thread id for prefetching, the corresponding GPU thread has not yet reached the corresponding memory access instruction. As discussed in Section 4.1, the parameter 'skip_factor' determines how many thread ids are skipped before one is used for prefetching. Adjusting this parameter provides a flexible way to control the timeliness of the CPU prefetches. Here, we propose two mechanisms to adjust
this parameter. The first one is an adaptive approach and the second is a fixed value based on profiling.

In our adaptive approach, we design the update algorithm for the variable 'skip_factor' based on the following observations. If the CPU pre-execution program has experienced too many L3 cache hits, it means that the memory requests generated from the CPU are not useful because no new data are brought into the L3 cache, and it can be that the GPU threads are running ahead and have already brought in the data. Therefore, we need to increase the 'skip_factor' to make the CPU run further ahead. On the other hand, if there are too few L3 cache hits for CPU pre-execution, it means that the CPU side is running too far ahead and we can reduce 'skip_factor' to skip fewer thread ids and generate more prefetches. To determine whether there are too many or too few L3 cache hits, we periodically sample the number of L3 cache hits for the CPU and compare the current sample with the last one. If the difference is larger than a threshold, which is set as a fixed value of 10 (our algorithm is not sensitive to this threshold setting, as shown in Section 5.4), we need to update the skip_factor. The implementation of this adaptive approach for the code in Figure 3 is shown in Figure 5. Such update code is inside the second-level loop with iterator 'i' in the code shown in Figure 3 and is executed after we process a batch of thread ids. In other words, the variable 'batch_size' determines how often we update the 'skip_factor'. In our implementation, the batch size is set to 16, meaning that we update the 'skip_factor' once we process (16 x 8 x skip_factor) thread ids.

//Accessing an L3 cache performance counter
ptlcall_cpu_read_value(PTLCALL_CPU_NAME_CPU_HIT_L3, &hitnumber);
if (hitnumber - last_hit > threshold) skip_factor += 32;
else if (back_dis != skip_factor - 32) {
  //preventing skip_factor bouncing between two values
  skip_factor -= 32;
  back_dis = skip_factor;
}
last_hit = hitnumber;
Figure 5. Adaptive update of the variable 'skip_factor'.

Since our adaptive approach needs the L3 cache hit statistics for the CPU, we introduce a new instruction to access this L3 cache performance counter; this new instruction is implemented through a new 'ptlcall_cpu_read_value' function in our simulator. As shown in Figure 5, if the CPU has too many L3 cache hits, we increase the 'skip_factor' by 32. If the CPU has too few L3 cache hits, the variable is decreased by 32. The check 'if (back_dis != skip_factor - 32)' prevents 'skip_factor' from bouncing between its current value and (the current value – 32). The variable 'back_dis' is initialized as 0. As discussed in Section 4.1, the constant value of 32 is used as '2 x L3 cache line size / the size of float'. In Figure 6, we examine the effectiveness of our proposed adaptive update approach. In this figure, we report the value of the variable 'skip_factor' over time for different benchmarks. We can see that for most benchmarks, the variable quickly converges to a fixed value, implying that the GPU kernel has a stable memory access pattern and the CPU pre-execution code keeps a fixed distance ahead of GPU threads. For the benchmarks BS and MV, the value of this variable changes over time, indicating that their memory access patterns are not stable due to the data-dependent nature of sorting algorithms (BS) and the L1/L2 caching effects (MV).

[Figure 6 omitted: the skipped thread number (0-250) over time (cycles) for BSc, VD, FFT, MM, Con, TMV, BS, MC, and MV.]
Figure 6. The value of the variable 'skip_factor' over time using our adaptive update approach shown in Figure 5.

Since many GPU workloads have regular memory access patterns, as shown in Figure 6, we also propose to use profiling to simplify the update of the variable 'skip_factor'. In the profiling process, the compiler sets the 'skip_factor' to a fixed value from the set {32, 64, 96, 128, 160, 192, 224} and selects the one with the highest performance during test runs. This way, the periodic update of 'skip_factor' can be removed from the CPU pre-execution code and there is also no need for a new instruction to access the L3 cache hits for the GPU. The CPU pre-execution code with a fixed 'skip_factor' is shown in Figure 7, from which we can see that the code related to 'batch_size' is also removed.
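The profiling step itself can be as simple as the sweep sketched below; run_test_kernel() is a hypothetical helper that launches the GPU kernel together with the pre-execution program using the given fixed 'skip_factor' and returns the measured execution time.

/* Sketch of the profiling pass: try each candidate fixed skip_factor and
 * keep the best-performing one, which is then compiled into the
 * pre-execution code as a constant (e.g., 160 in Figure 7). */
double run_test_kernel(int skip_factor);  /* hypothetical test-run helper */

int profile_skip_factor(void) {
  const int candidates[7] = {32, 64, 96, 128, 160, 192, 224};
  int best = candidates[0];
  double best_time = 1e30;
  for (int i = 0; i < 7; i++) {
    double t = run_test_kernel(candidates[i]);
    if (t < best_time) { best_time = t; best = candidates[i]; }
  }
  return best;
}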
4.3. Generating the CPU Pre-Execution Code from HWST GPU Kernels

HWST GPU kernels contain one or more loops which contain global memory access instructions. We refer to such a loop as a kernel loop. Among our benchmarks, MM, Con, TMV, MV, and MC are of the HWST type. To generate the CPU pre-execution program for an HWST GPU kernel, we process one kernel loop at a time. For each kernel loop, a CPU function is generated, which contains the global
memory access instructions and the address computation operations in the loop body. Both the thread id and the kernel loop iterator are replaced with function input parameters. If the kernel loop is a nested one, the iterators from all loop levels are replaced with function parameters. Our proposed compiler algorithm is shown in Figure 8.

float cpu_prefetching( … ) {
  unroll_factor = 8; skip_factor = 160;
  //added loop to traverse thread blocks
  for (j = 0; j < N_tb; j += concurrent_tb) {
    //added loop to traverse concurrent threads
    for (i = 0; i < concurrent_tb*tb_size; i += skip_factor*unroll_factor) {
      int thread_id = i + skip_factor*unroll_factor + j*tb_size;
      // unrolled loop
      float a0 = memory_fetch_for_thread (thread_id + skip_factor*0);
      float a1 = memory_fetch_for_thread (thread_id + skip_factor*1);
      ……
      sum += a0+a1+a2+a3+a4+a5+a6+a7; /* operation inserted to overcome dead code elimination */
}}}
Figure 7. The CPU pre-execution code for the vector-add GPU kernel with a fixed skip factor of 160.

1. For each kernel loop, extract memory operations and the associated address computations from the loop body and put them in a CPU function, replace the thread id computation with an input parameter, and replace the kernel loop iterators with input parameters.
2. Add a nested loop structure into the CPU code to prefetch data for concurrent threads.
   a. The outer loop traverses through all TBs. The iterator starts from 0 and the loop bound is the number of thread blocks in the GPU program. The iterator update is the number of concurrent TBs, meaning the number of TBs that can run concurrently on the GPU.
   b. The second-level loop corresponds to the kernel loop and we use the same loop bound. The loop update is increased to unroll the next-level loop. If the kernel loop is nested, this second-level loop is also nested.
   c. The third-level loop traverses through concurrent threads. The loop iterator starts from 0 and the loop bound is the number of concurrent threads (which is the product of the TB size and the number of concurrent TBs). The iterator update is set as a product of three parameters, unroll_factor, batch_size, and skip_factor.
Figure 8. The compiler algorithm to generate the CPU pre-execution program from HWST GPU kernels.

As shown in Figure 8, after generating the function to load data for one loop iteration of a kernel loop (i.e., step 1), we insert loops to prefetch data for many concurrent threads. Similar to LWST kernels, the outer loop is used to prefetch data for concurrent TBs. Before going through concurrent threads, however, we insert the second-level loop to account for the kernel loop structure. The third-level loop traverses through all concurrent threads, similar to step 2b in Figure 2 for LWST kernels. We illustrate our algorithm using a simplified version of transpose-matrix-vector multiplication. The GPU kernel and the generated CPU pre-execution program are shown in Figure 9.

As shown in Figure 9, the CPU function 'memory_fetch_for_thread_loop_1' is generated from the loop body of the GPU kernel, and the loop iterator and thread id are replaced with function parameters. In the CPU function 'cpu_prefetching', the second-level loop (with iterator 'm') corresponds to the kernel loop. The iterator update is a fixed value of 8 rather than 1 so as to unroll the loop body 8 times. The third-level loop (with iterator 'i') traverses through concurrent threads for prefetching. The reason for such a loop organization is that the GPU executes many threads in parallel. Therefore, instead of prefetching data of multiple iterations for one thread, we prefetch data of one iteration for many threads before moving on to the next iteration. The 'skip_factor' update part in Figure 9 is the same as discussed in Section 4.2 and both the adaptive and profiling approaches can be applied.

From our algorithms shown in Figures 2 and 8, we can see that the granularity of our CPU prefetch function is one loop iteration, with LWST as a special case of HWST. One may suggest a finer granularity, such as prefetching one memory access at a time. In other words, the CPU fetches one datum (e.g., A[n] in Figure 3) for many threads before moving on to the next (B[n] in Figure 3) rather than fetching all the data in one iteration for a thread (A[n], B[n], and C[n] in Figure 3). We do not choose this approach since it requires the CPU and GPU to follow exactly the same access order, and the GPU compiler is more likely to re-order the accesses within a loop body than to reorder accesses across different loops. Furthermore, using one CPU function call to issue one access incurs too much control overhead for CPU execution.

Note that the algorithms in Figures 2 and 8 assume that the TBs are dispatched to SMs in order, which is the case based on our experiments on current discrete GPUs. If out-of-order TB dispatch is used, our scheme would require the GPU to send the active TB ids to the CPU, and the prefetching would be done accordingly for those active blocks. In other words, we would replace the implicit block ids in the loop "for (j = 0; j < N_tb; j += concurrent_tb)" (step 2a in Figure 2 and Figure 8) with explicit ones forwarded from the GPU. Our sequential dispatch assumption eliminates such GPU-to-CPU communication.
__global__ void tmv_naive(float* A, float* B, float* C, int width, int height) {
  int x = blockIdx.x*blockDim.x + threadIdx.x;
  float sum = 0;
  for (int i = 0; i < height; i++) {
    sum += A[i*width+x]*B[i];
  }
  C[x] = sum;
}                                                                  (a)

float memory_fetch_for_thread_loop_1 (int n, int m)
{ // n is the thread id and m is the loop iterator
  return (A[m*width+n] + B[m]); /* A, B are the CPU pointers mapped to the GPU memory */
}
float cpu_prefetching( … ) {
  unroll_factor = 8;
  //added loop to traverse thread blocks
  for (j = 0; j < N_tb; j += concurrent_tb) {
    //the loop corresponding to the kernel loop
    for (m = 0; m < loop_counter_in_thread; m += 8) {
      //added loop to traverse concurrent threads
      for (i = 0; i < concurrent_tb*tb_size; i += skip_factor*batch_size*unroll_factor)
        for (k = 0; k < batch_size; k++) {
          int thread_id = i + k*skip_factor*unroll_factor + j*tb_size;
          // unrolled loop
          float a0 = memory_fetch_for_thread_loop_1 (thread_id + skip_factor*0, m+0);
          float a1 = memory_fetch_for_thread_loop_1 (thread_id + skip_factor*1, m+0);
          ……
          sum += a0+a1+a2+a3+a4+a5+a6+a7; /* operation inserted to overcome dead code elimination */
        }
      for (i = 0; i < concurrent_tb*tb_size; i += skip_factor*batch_size*unroll_factor)
        for (k = 0; k < batch_size; k++) {
          int thread_id = i + k*skip_factor*unroll_factor + j*concurrent_tb*tb_size;
          float a0 = memory_fetch_for_thread_loop_1 (thread_id + skip_factor*0, m+1);
          float a1 = memory_fetch_for_thread_loop_1 (thread_id + skip_factor*1, m+1);
          ……
          sum += a0+a1+a2+a3+a4+a5+a6+a7;
        }
      …
      // Updating skip factor (See Section 4.2)
}}}                                                                (b)
Figure 9. A code example for HWST. (a) A transpose-matrix-vector multiplication GPU kernel; (b) the (partial) CPU pre-execution program.

5. Experimental Results

5.1 Performance of CPU-Assisted GPGPU

After the CPU pre-execution program is generated, we let the CPU execute this program right after the GPU kernel is launched. In our first experiment, we examine the performance improvements achieved with our proposed CPU-assisted execution. In Figure 10, we report the GPU performance using instructions per cycle (IPC) for each benchmark for three configurations: no CPU pre-execution (labeled 'no-preex'), CPU pre-execution with adaptive update of the skip factor (labeled 'adaptive'), and CPU pre-execution with a fixed skip factor determined from profiling (labeled 'profiling'). Since our GPU has 4 SMs and each SM has 32 scalar SPs, the peak IPC is 128. We also include the GPU performance results for a perfect L3 cache (labeled 'perfect L3') in Figure 10 for reference.

[Figure 10 omitted: IPC (0-128) for BSc, VD, FFT, MM, Con, TMV, BS, MC, MV, and HM under no-preex, adaptive, profiling, and perfect L3.]
Figure 10. GPU performance comparison among no pre-execution (no-preex), CPU pre-execution with adaptive update of 'skip_factor' (adaptive), and CPU pre-execution with a fixed 'skip_factor' determined from profiling (profiling).

From Figure 10, we can see that our proposed CPU pre-execution improves performance significantly, up to 113% (MC) and 21.4% on average with adaptive update of 'skip_factor', and up to 126% and 23.1% on average using a fixed 'skip_factor' determined from profiling. Among these benchmarks, BSc, VD, and TMV are memory intensive and we achieve about 30% speedups. The high performance gains for MC are due to the fact that the GPU kernel (without CPU pre-execution) suffers from partition conflicts [9], while our pre-execution exploits the partition-level parallelism of off-chip memory when it prefetches data across multiple TBs. As the GPU requests hit in the L3 cache, they do not go to off-chip memory, thereby avoiding the partition conflicts. The speedups for BS, MV, and FFT are from 4% to 11% due to their irregular address patterns and cache conflicts. There are no performance benefits for MM and Con because they are highly optimized and have good locality and data reuse in the L1 and L2 caches of the GPU, which makes the L3 cache not critical. Even with a perfect L3 cache, the performance gains are negligible for these two benchmarks.

Another observation from Figure 10 is that both adaptive update of 'skip_factor' and a fixed 'skip_factor' selected from profiling are effective in improving the GPGPU performance. The profiling approach is slightly better, even though the adaptive approach
usually converges quickly to the optimal value, as shown in Figure 6.

5.2 The Efficacy of Data Prefetching using CPU Pre-execution

In this experiment, we examine the efficacy of data prefetching using CPU pre-execution. First, we evaluate the coverage of this prefetching scheme by examining the L3 cache hit rate for GPU accesses with and without CPU prefetching. The results are shown in Figure 11. The hit rate for GPU execution without CPU pre-execution is labeled 'no-preex' and GPU execution with CPU pre-execution using the adaptive update of 'skip_factor' is labeled 'adaptive'. The results for CPU pre-execution using a fixed 'skip_factor' are very close to the adaptive update. From Figure 11, we can see that the L3 cache hit rates are highly improved by CPU pre-execution. On average, the hit rate improves from 12.9% to 39.2%. The L3 cache hit rate improvements for MM and Con do not translate to performance gains, as shown in Figure 10. The reason is that TLP and the higher levels of caches provide sufficient latency hiding for these benchmarks.

[Figure 11 omitted: L3 hit rate (0%-80%) for BSc, VD, FFT, MM, Con, TMV, BS, MC, MV, and GM under no-preex and adaptive.]
Figure 11. L3 hit rate for GPU execution without CPU pre-execution (no-preex) and with CPU pre-execution (adaptive).

Another metric for data prefetching is accuracy, and we evaluate it by computing the ratio of how many L3 misses generated from CPU pre-execution are actually accessed by GPU threads; the results are shown in Figure 12. It can be seen from the figure that our proposed CPU pre-execution has very high accuracy. On average, 98.6% of the data blocks loaded from memory by the CPU are accessed by the GPU threads.

[Figure 12 omitted: prefetch accuracy (94%-100%) for BSc, VD, FFT, MM, Con, TMV, BS, MC, MV, and GM.]
Figure 12. Prefetch accuracy of CPU pre-execution.
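In our notation (not taken from the original figures), the two metrics used in this subsection can be summarized as:

\[
\text{coverage (GPU L3 hit rate)} = \frac{\text{GPU L3 hits}}{\text{GPU L3 accesses}}, \qquad
\text{accuracy} = \frac{\text{blocks prefetched by the CPU and later accessed by the GPU}}{\text{L3 misses generated by CPU pre-execution}}
\]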
Since our proposed CPU pre-execution needs to execute instructions to generate prefetching requests for GPU threads, one way to evaluate the overhead of our approach is to examine how many instructions the CPU needs to execute in order to achieve the performance gains. In Figure 13, we report the ratio of the number of instructions executed by the CPU over the number of instructions executed by the GPU for both adaptive update of 'skip_factor' (labeled 'adaptive') and fixed 'skip_factor' selected using profiling (labeled 'profiling'). From Figure 13, we can see that the performance gains shown in Figure 10 are achieved with little instruction overhead. On average, our CPU-assisted GPGPU using adaptive update of 'skip_factor' (fixed value of 'skip_factor') only executes 0.74% (0.69%) extra instructions to deliver the performance gains.

[Figure 13 omitted: instruction ratio (0.0%-3.0%) for BSc, VD, FFT, MM, Con, TMV, BS, MC, MV, and GM under adaptive and profiling.]
Figure 13. The ratio of (number of instructions executed by the CPU / number of instructions executed by the GPU).

5.3 Understanding the Impact of GPU Architectures

In this experiment, we vary the following GPU architecture parameters to understand their impact on our CPU-assisted GPGPU: the GPU SP frequency, the off-chip memory frequency, and the number of SPs in an SM. First, we vary the GPU SP frequency from the default 480 MHz to 267 MHz and 800 MHz. The CPU frequency remains at 2.4 GHz and the DRAM bandwidth remains at 19.2 GB/s. In Figure 14, we report the speedups that are achieved from CPU-assisted execution for each SP frequency compared to no CPU pre-execution (labeled 'sp267_speedup', 'sp480_speedup', and 'sp800_speedup'). All CPU pre-execution uses adaptive update of 'skip_factor' in the experiments in this section; the fixed 'skip_factor' has slightly better performance gains. The results labeled 'sp480_speedup' are those reported in Figure 10.

[Figure 14 omitted: speedup (0.9-2.1) for BSc, VD, FFT, MM, Con, TMV, BS, MC, MV, and GM under sp267_speedup, sp480_speedup, and sp800_speedup.]
Figure 14. The speedups from CPU pre-execution for GPUs running at different frequencies and the normalized execution time without pre-execution.
From Figure 14, we can see that when the SP frequency is reduced, the relative memory latency is also reduced. Therefore, CPU pre-execution provides smaller performance gains. On the other hand, when we increase the SP frequency, these benchmarks show different trends. First, the benchmarks VD, FFT, and BSc have higher performance gains as the memory latency becomes more significant. Secondly, for the benchmarks TMV, MV, and MC, the impact is the opposite, and the reason is that the GPU SPs nearly double the rate of their memory requests, which forces the CPU to skip more threads and to prefetch less data. For the benchmark BS, its baseline IPC is very high (close to 90) when the SP frequency is 480 MHz, thereby limiting the pre-execution impact as shown in Figure 10. Overall, increasing the speed of the SPs has less impact than decreasing the speed, and CPU pre-execution is still effective for all these different SP speeds.

Next, we vary the off-chip memory frequency from the default 600 MHz to 300 MHz and 1200 MHz, and the results are shown in Figure 15. From Figure 15, we can see that when the memory frequency is increased, the memory latency is reduced. Therefore, CPU pre-execution provides smaller performance gains. When we reduce the memory frequency, the trends of these benchmarks are similar to increasing the SP frequency. For example, for BSc, VD, and FFT, the CPU pre-execution shows better speedups when the memory frequency is reduced from 600 MHz to 300 MHz, because the memory latency dominates these three benchmarks. For TMV, MV, and MC, memory latency does not dominate the execution time. Therefore, the speedups of CPU pre-execution are reduced when the memory frequency is reduced from 600 MHz to 300 MHz. Overall, reducing the speed of memory has much less impact than increasing the memory speed, and our proposed CPU pre-execution is effective for all these different memory speeds.

[Figure 15 omitted: speedup (0.9-2.3) for BSc, VD, FFT, MM, Con, TMV, BS, MC, MV, and GM under mem300_speedup, mem600_speedup, and mem1200_speedup.]
Figure 15. The speedups from CPU pre-execution for off-chip memory running at different frequencies and the normalized execution time without pre-execution.

Then, we vary the number of SPs in an SM and keep the same (4) SMs in our GPU model. In Figure 16, we report the speedups that are achieved from CPU-assisted execution for each SM configuration compared to no CPU pre-execution (labeled 'w16_speedup', 'w32_speedup', and 'w64_speedup'). In our baseline GPU configuration, each SM has 32 SPs. From the figure, we can see that when the number of SPs is reduced in an SM (while keeping the same number of SMs in the GPU), the GPU becomes more latency tolerant, as each instruction in a warp will take more cycles to finish. Given the same application, reducing the number of SPs is equivalent to increasing TLP, thereby reducing the performance gains achieved from CPU pre-execution. On the other hand, increasing the number of SPs also increases the rate of their memory requests, similar to increasing the SP speed, which can also reduce the effectiveness of CPU pre-execution. Nevertheless, on average, for these three SM configurations, CPU pre-execution achieves 14.7%, 21.4%, and 12.4% performance improvement, respectively.

[Figure 16 omitted: speedup (0.9-2.1) for BSc, VD, FFT, MM, Con, TMV, BS, MC, MV, and GM under w16_speedup, w32_speedup, and w64_speedup.]
Figure 16. The speedups from CPU pre-execution for different numbers of SPs in an SM.

5.4 Sensitivity of the Parameters in CPU Pre-Execution Program

In this experiment, we study the sensitivity of the two parameters used in our CPU pre-execution program to update the variable 'skip_factor' (see Section 4.2). The first is the 'batch_size', which determines how often the skip_factor is updated. We vary this variable among 8, 16, and 32, and the GPU performance results are shown in Figure 17. As seen from the figure, although the batch size of 16 achieves the best performance, the performance difference for different batch sizes is limited, except for BSc, which prefers a large batch size.

[Figure 17 omitted: IPC (0-140) for BSc, VD, FFT, MM, Con, TMV, BS, MC, MV, and HM under no_preex, batch_8, batch_16, and batch_32.]
Figure 17. The GPU performance for CPU pre-execution using different batch_sizes.
In another experiment, we also change the threshold used to determine whether there are too many or too few L3 cache hits. We vary the threshold from 10 to 50 and the results are nearly identical, showing that our algorithm is not sensitive to this parameter.

5.5 Using CPU to Execute GPU Threads

In this experiment, we consider the option of using the CPU to directly execute some GPU threads to reduce the GPU workload. On the GPU side, the thread blocks are distributed to SMs based on the order of thread block id, from small to large. On the CPU side, the CPU executes the thread blocks from the opposite direction, starting from the one with the largest thread block id. In our simulator, we implemented a special instruction for the CPU to get the largest thread block id issued on the GPU. This way, we ensure that there is no overlapping workload between the CPU and the GPU. The speedups of such workload distribution between the CPU and GPU over GPU-only execution are shown in Figure 18. From the figure, we can see that the performance gains of most of the benchmarks are less than 2%. The main reason is the limited floating-point throughput of the CPU and the high overhead for the CPU to access the GPU memory partition. Among the workloads, the benchmark VD shows the highest (about 5%) speedup since it does not have many ALU operations to expose the ALU bottleneck of the CPU.
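A minimal sketch of this distribution is shown below; gpu_max_issued_tb_id(), which wraps the special instruction mentioned above, and cpu_execute_tb() are hypothetical helpers, not the exact interfaces in our simulator.

/* Sketch: the CPU consumes thread blocks from the largest id downward and
 * stops as soon as it would overlap with blocks already issued to the GPU. */
int  gpu_max_issued_tb_id(void);   /* wraps the special instruction (hypothetical) */
void cpu_execute_tb(int tb_id);    /* runs all threads of one block on the CPU     */

void cpu_execute_tail_blocks(int num_tbs) {
  for (int tb = num_tbs - 1; tb >= 0; tb--) {
    if (tb <= gpu_max_issued_tb_id())  /* GPU has already reached this block */
      break;
    cpu_execute_tb(tb);
  }
}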
[Figure 18 omitted: speedup (0-0.06) for BSc, VD, FFT, MM, Con, TMV, BS, MC, MV, and GM.]
Figure 18. The speedups of workload distribution between GPU and CPU over GPU-only execution.

6. Related work

Although a key design philosophy of GPUs is to use TLP to hide long memory access latency, the importance of the GPU memory hierarchy has been widely recognized for achieving high-performance GPGPU. In [23], software prefetching is used to overlap memory access latency with computations. However, prefetching data into registers or shared memory increases the register pressure and may hurt the performance due to reduced TLP [27]. In [12], Lee et al. proposed many-thread aware GPU prefetching approaches for the L1 cache. Besides leveraging the well-known stride access pattern, they revealed the interesting insight that when a workload is parallelized into many threads, each thread may be short and inter-thread/inter-warp prefetching is more effective than intra-thread/intra-warp prefetching. Compared to this work, our proposed CPU pre-execution does not rely on stride access patterns and provides both intra- and inter-warp prefetching. More importantly, all the previous works on GPU prefetching do not fit well with fused architectures, as both demand cache misses and prefetches compete for critical resources, such as L2 cache miss status handling registers (L2 MSHRs), on the GPU side while leaving the CPU-side resources idle. Our proposed approach, in contrast, leverages such critical resources on the CPU side for prefetches and keeps those on the GPU side for demand misses, thereby achieving better resource utilization. We also implemented the per-PC stride prefetcher with enhanced warp id indexing [12] in our simulator, which shows a 5.24% speedup on average.

To take advantage of fused architectures, it is proposed in [7] that the GPU runs prefetching algorithms to prefetch data for CPU programs. In comparison, our goal is to accelerate GPU programs, and we believe it is a better fit for fused architectures since both the GPU and the CPU are used to do what they are good at: the GPU for ALU/floating-point computations and the CPU for flexible and accurate data prefetching.

Our proposed CPU-assisted GPGPU is also inspired by many works on CPU-based pre-execution [2][4][6][8][10][11][17][18][20][24], in which a pre-execution thread is used to provide data prefetching and/or accurate control flow to the main thread. The novelty of our work is to use a single CPU thread to prefetch data for many concurrent GPU threads and a simple yet effective way to control how far the pre-execution thread runs ahead.

7. Conclusion

In this paper, we propose to collaboratively utilize CPU and GPU resources for GPGPU applications. In our scheme, the CPU runs ahead of GPU threads to prefetch the data into the shared L3 cache for the GPU. Novel compiler algorithms are developed to automatically generate CPU pre-execution programs from GPU kernels. We also provide flexible mechanisms to control how far the CPU runs ahead of GPU threads. Our experimental results show that our proposed CPU pre-execution has very high prefetching accuracy and achieves significant performance gains at the cost of minor instruction overhead on the CPU side. Furthermore, our results show that the proposed scheme remains effective for different GPU configurations.
Acknowledgements

We thank the anonymous reviewers for their insightful comments to improve our paper. This work is supported by an NSF CAREER award CCF-0968667 and a gift fund from AMD Inc.

References

[1] A. Bakhoda, G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, Analyzing CUDA workloads using a detailed GPU simulator. IEEE International Symposium on Performance Analysis of Systems and Software, April 2009.
[2] A. Roth and G. Sohi, Speculative data driven multithreading. IEEE International Symposium on High Performance Computer Architecture, 2001.
[3] AMD Accelerated Parallel Processing (APP) SDK V2.3, http://developer.amd.com/gpu/atistreamsdk/pages/default.aspx, 2011.
[4] C. K. Luk, Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. International Symposium on Computer Architecture, 2001.
[5] C. Luk, S. Hong, and H. Kim, Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. IEEE/ACM International Symposium on Microarchitecture, 2009.
[6] C. Zilles and G. Sohi, Execution-based prediction using speculative slices. International Symposium on Computer Architecture, 2001.
[7] D. H. Woo and H. S. Lee, COMPASS: a programmable data prefetcher using idle GPU shaders. Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, March 13-17, 2010.
[8] D. Kim and D. Yeung, Design and evaluation of compiler algorithms for pre-execution. Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[9] G. Ruetsch and P. Micikevicius, Optimizing matrix transpose in CUDA. NVIDIA, 2009.
[10] H. Zhou, Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Sept. 2005.
[11] J. D. Collins, H. Wang, D. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen, Speculative precomputation: long range prefetching of delinquent loads. International Symposium on Computer Architecture, 2001.
[12] J. Lee, N. B. Lakshminarayana, H. Kim, and R. Vuduc, Many-thread aware prefetching mechanisms for GPGPU applications. IEEE/ACM International Symposium on Microarchitecture, 2010.
[13] MARSSx86, http://marss86.org/~marss86/index.php/Home.
[14] N. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli, High performance discrete Fourier transforms on graphics processors. Proceedings of Supercomputing, 2008.
[15] NVIDIA CUDA C Programming Guide 3.1, 2010.
[16] NVIDIA GPU Computing SDK 3.1, http://developer.nvidia.com/gpu-computing-sdk, 2011.
[17] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. IEEE International Symposium on High Performance Computer Architecture, February 2003.
[18] R. Balasubramonian, S. Dwarkadas, and D. Albonesi, Dynamically allocating processor resources between nearby and distant ILP. International Symposium on Computer Architecture, 2001.
[19] P. Boudier, Memory System on Fusion APUs - The Benefits of Zero Copy. AMD Fusion Developer Summit, 2011.
[20] P. H. Wang, H. Wang, J. D. Collins, E. Grochowski, R. M. Kling, and J. P. Shen, Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation. IEEE International Symposium on High Performance Computer Architecture, 2002.
[21] Sandy Bridge, http://en.wikipedia.org/wiki/Sandy_Bridge.
[22] S. I. Lee, T. Johnson, and R. Eigenmann, Cetus – an extensible compiler infrastructure for source-to-source transformation. In Proc. Workshops on Languages and Compilers for Parallel Computing, 2003.
[23] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S. Ueng, J. A. Stratton, and W. W. Hwu, Optimization space pruning for a multi-threaded GPU. International Symposium on Code Generation and Optimization, 2008.
[24] Y. Solihin, J. Lee, and J. Torrellas, Using a user-level memory thread for correlation prefetching. International Symposium on Computer Architecture, 2002.
[25] Streaming SIMD Extensions, http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions.
[26] The AMD Fusion Family of APUs, http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx.
[27] Y. Yang, P. Xiang, J. Kong, and H. Zhou, A GPGPU Compiler for Memory Optimization and Parallelism Management. ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010.
