
http://dx.doi.org/10.1109/TPDS.2014.2333526

A Performance Model for GPUs with Caches

Thanh Tuan Dao, Jungwon Kim, Sangmin Seo, Member, IEEE, Bernhard Egger, Member, IEEE, and Jaejin Lee, Member, IEEE

All authors are with the School of Computer Science and Engineering, Seoul National University, Seoul, Korea. E-Mail: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—To exploit the abundant computational power of the world's fastest supercomputers, an even workload distribution to the typically heterogeneous compute devices is necessary. While relatively accurate performance models exist for conventional CPUs, accurate performance estimation models for modern GPUs do not exist. This paper presents two accurate models for modern GPUs: a sampling-based linear model, and a model based on Machine-Learning (ML) techniques which improves the accuracy of the linear model and is applicable to modern GPUs with and without caches. We first construct the sampling-based linear model to predict the runtime of an arbitrary OpenCL kernel. Based on an analysis of NVIDIA GPUs' scheduling policies we determine the earliest sampling points that allow an accurate estimation. The linear model cannot capture well the significant effects that memory coalescing or caching as implemented in modern GPUs have on performance. We therefore propose a model based on ML techniques that takes several compiler-generated statistics about the kernel as well as the GPU's hardware performance counters as additional inputs to obtain a more accurate runtime performance estimation for modern GPUs. We demonstrate the effectiveness and broad applicability of the model by applying it to three different NVIDIA GPU architectures and one AMD GPU architecture. On an extensive set of OpenCL benchmarks, on average, the proposed model estimates the runtime performance with less than 7% error for a second-generation GTX 280 with no on-chip caches and less than 5% for the Fermi-based GTX 580 with hardware caches. On the Kepler-based GTX 680, the linear model has an error of less than 10%. On an AMD GPU architecture, the Radeon HD 6970, the model estimates the runtime with an error rate of 8%. The proposed technique outperforms existing models by a factor of 5 to 6 in terms of accuracy.

Index Terms—GPU, performance modeling, caches, scheduling, OpenCL, NVIDIA, AMD

1 INTRODUCTION

Today, the world's second-fastest supercomputer, Titan, and many more on the Top500 list [25] are heterogeneous systems comprising both CPUs and GPUs. In order to achieve optimal performance, the workload needs to be distributed evenly to the different computing nodes, and to do so an accurate performance model is required. Static workload distributions [8], [28] based on throughput and threshold values are often far away from the optimal distribution because the performance difference between CPUs and GPUs heavily depends on the program characteristics. A performance estimation model can be used to dynamically distribute the workload inversely proportional to the estimated execution time.

While performance modeling for general-purpose CPUs has been researched actively and accurate performance models are available, the state-of-the-art models for GPUs still suffer from a relatively large estimation error. Models for older GPU architectures do not consider the GPU's hardware caches [11], [2], [29], and many are not suited for performance estimation at runtime [12], [14], [2], [29]. A performance model suitable for runtime workload distribution thus needs to capture and characterize the intricate interactions of the most important hardware and software features of modern GPUs. This is undoubtedly a challenging problem because of the GPU's complex hardware that enables its massively parallel processing capability. Aside from a large number of processors, each with multiple scalar execution units, the GPU's hardware thread context switching mechanism and the on-board memory subsystem with different levels of cache memories further complicate the task.

Several approaches [2], [3], [11], [13], [15], [29] have been proposed to model GPU performance, including analytical modeling and Machine Learning (ML)-based modeling. The analytical modeling approaches [2], [11], [15], [29] typically rely on micro-architecture information to predict the performance of a program in a hand-crafted manner. As GPU architectures continue to evolve dramatically, this approach is not that attractive: a minor change in the architecture may require extensive work to adapt the model to the new architecture. On the other hand, instead of evaluating a number of predefined formulas and quickly reporting the execution time, ML models [3], [4], [13] rely on training data to learn the mapping between program features and execution time. ML-based approaches seem to be more appropriate for this task as they are more robust to changes in the GPU architecture.

In this paper, we present two models to accurately predict the performance of an OpenCL kernel on GPUs. Both the linear and the ML-based model rely on sampling information to overcome the limitations of analytical methods. Sampling incurs an overhead because a small part of the actual workload has to be executed before the performance of the entire workload can be predicted; thus, it is important to keep the sampling overhead to a minimum. In order to determine the earliest sampling points that allow an accurate estimation, it is essential to have a clear understanding of how GPUs schedule a workload. Unfortunately, GPU manufacturers do not disclose how the work is distributed to the different compute units of a GPU, or how the more than 1000 active threads are scheduled inside these units. We thus reverse-engineer the scheduling policies of GPUs by analyzing the results of a set of hand-crafted OpenCL micro-benchmarks. Based on these observations we formulate and verify the scheduling policies of modern NVIDIA GPUs which then form the basis of the two models.

The analysis of the GPU's scheduling policy allows us to compute the minimal number of work-items at which the GPU attains maximum throughput. The sampled data is used by a linear model to extrapolate the total execution time of the entire kernel. This linear model works reasonably well because the amount of work per work-item in scientific workloads is typically distributed evenly. On modern GPUs with on-chip caches and memory coalescing, however, the execution time shows much more variance. To cope with such architectures we combine the linear model with an ML-based approach that takes the data of the GPU's hardware performance counters as additional inputs. Training the ML-based model with over 300 data sets allows the model to detect the complex correlations between a workload's sampled data and the actual execution time. The experiments show that the ML-based model is able to improve the accuracy of the linear model significantly.
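The introduction notes that such estimates can drive a workload distribution that is inversely proportional to the predicted execution time. The following fragment is only an illustrative sketch of that idea under a hypothetical two-device (CPU + GPU) setup; the function and variable names are ours and do not come from the paper.

```python
def split_workgroups(n_workgroups, t_cpu_est, t_gpu_est):
    """Split an index space between two devices inversely proportional
    to their estimated execution times (illustrative only)."""
    # A device predicted to be k times faster receives k times the work.
    speed_cpu, speed_gpu = 1.0 / t_cpu_est, 1.0 / t_gpu_est
    share_gpu = speed_gpu / (speed_cpu + speed_gpu)
    gpu_wgs = round(n_workgroups * share_gpu)
    return n_workgroups - gpu_wgs, gpu_wgs

# Example with made-up estimates: the GPU is predicted 8x faster,
# so it receives roughly 8/9 of the 4096 work-groups.
cpu_wgs, gpu_wgs = split_workgroups(4096, t_cpu_est=0.8, t_gpu_est=0.1)
```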
We evaluate the accuracy of the proposed model with 70 different OpenCL kernels by comparing the estimated kernel execution time with the actual measured execution time. We evaluate the model on different GPU architectures from two representative GPU vendors: NVIDIA and AMD. On the GTX 580 [16], a representative of NVIDIA’s Fermi GPUarchitecture, we achieve, on average, an error rate of 5.72% for the linear model and 4.76% for the MLbased model. To demonstrate that our model is not tied to a certain GPU architecture, we have applied it to an NVIDIA GTX 280 and an NVIDIA GTX 680. The GTX 280 is a second-generation GPU with no hardware caches. The model achieves an error rate of 7.19% for the linear and 6.42% for the ML-based model. The GTX 680 is based on NVIDIA’s latest Kepler architecture [17]. It is currently impossible to read the GPU’s performance counters from OpenCL, hence we were only able to run the linear model. An error rate of less than 10% shows that the linear model, although simple, can cover a wide range of GPU generations with acceptable error rates. Interestingly, on the AMD Radeon HD 6970, the model achieves an error rate of 8%. This shows that although the model is formed based on an analysis on a specific NVIDIA GPU, it could be used with not only different GPU generations from NVIDIA, but also from AMD. The contributions of this paper are: • • • • An in-depth analysis of scheduling policies of NVIDIA GPUs that allows us to determine the sampling points with the fewest number of workitems that still allow an accurate performance estimation. A simple linear performance model that is very accurate for applications with an evenly distributed workload. An ML-based performance model that can cover a wide range of applications and significantly improve the performance of the linear model. An implementation and evaluation of the models with 70 OpenCL kernels from four different benchmark suites and OpenCL kernel collections on four different generations of GPU architectures from two different vendors. The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 contains a brief introduction to the OpenCL execution model and the Fermi GPU architecture. In Section 4, we provide a comprehensive analysis of the way modern GPUs execute kernels. The performance estimation models described in detail in Section 5 are built upon these observations. Section 6 evaluates and compares the accuracy of our models, and Section 7 concludes this paper. 2 R ELATED W ORK Wong et al. [27] have analyzed the NVIDIA GTX 280 and revealed a number of undisclosed characteristics by running micro-benchmarks. However, their analysis does not consider contention and does thus not provide much information about how warps are scheduled. Liu et al. [15] have developed a performance predictor for a specific application. Although the prediction is precise, this method requires a priori knowledge of the application’s throughput. Bitirgen et al. [4] proposed an interval-based framework to dynamically select resource allocation decisions on chip multiprocessors. Their model takes program behavior as input and estimates the performance of the program at certain intervals. Though interesting, this interval-based approach is not applicable to GPU performance estimation. Zhang and Owens [29] have developed a performance model for CUDA programs executing on NVIDIA GeForce 200-series GPUs. 
Their main goal is to help the programmer identify performance bottlenecks, potential optimizations and architecture improvements. Static program features are collected by the compiler and dynamic features are collected in a 3 simulator [5]. Together with machine-specific characteristics, these features are used to analytically calculate the execution time of CUDA applications. This overhead can easily exceed the running time of the kernel on a GPU itself. Jia et al. [14] proposed a regression-based performance model for GPU design space exploration. Training this model requires executing the program over a large number of design points which makes this approach unsuitable for workload distribution. Kerr et al. [30] have proposed a Machine Learning based model to predict performance of CUDA programs on GPUs and CPUs (based on Ocelot compiler). The static features of the program are used to derive the relationship between the program behavior and the performance on a target architecture. This model uses static features of the program so it is limited to an average error of 30%. Luk et al. [32] have proposed a linear regression-based model to distribute workload to a heterogeneous system of CPUs and GPUs. Unlike the proposed technique, the linear regression-model needs to be trained for each new kernel encountered. Hong and Kim [11], Baghsorkhi [2] and Grewe [31] have presented performance models that can be used for workload distribution. These models are based on static information of CUDA/OpenCL programs and require the programs to be written in a parametrized way so that variables related to the problem size can be analyzed symbolically. Another drawback of these models is that they do not consider the presence of caches and are thus not applicable to modern GPUs with caches. Finally, none of these models take into account the interaction between the application and the hardware as well as the effect of compiler optimizations. The models proposed in this paper remove the restrictions of static analysis by using dynamic information through sampling. Experiments with a large number of OpenCL kernels from a wide range of applications show that the models can be easily applied to GPU from different architectures and produce a more accurate performance estimate with an average error of less than 5% for the GTX 580, 7% for the GTX 280, 10% for the GTX 680 and 8% for the HD 6970. 3 BACKGROUND 3.1 OpenCL Framework In this section, we briefly introduce the OpenCL framework, the NVIDIA’s Fermi GPU architecture [19], and provide a short introduction to the Machine Learning techniques applied in this work. To make the discussion consistent we use OpenCL terms to describe the GPU architecture. In the OpenCL platform model [9] a host processor is connected to one or more compute devices. A compute device contains a number of compute units (CUs), each of which contains one or more processing elements (PEs) (Figure 1). Host Host Processor Main Memory Interconnect Bus Compute Device Compute Device Compute Device CU ... CU Global/Constant Memory Data Cache Global Memory Compute Device Compute Unit Compute Device CU ... Constant Memory PE PE PE ... Local Memory Private Memory Compute Device Memory Fig. 1: OpenCL Architecture. An OpenCL application consists of a host program and one or several kernels. The host program is executed on the host processor, and the kernels are executed on the devices. 
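To make the host/kernel split concrete, the sketch below shows a minimal host program that builds a kernel and launches it over a one-dimensional index space (index spaces and work-groups are described in the next paragraph). It assumes the pyopencl bindings purely for illustration; the paper does not prescribe a host framework, and the kernel, sizes, and profiling flag are our own choices.

```python
import numpy as np
import pyopencl as cl

KERNEL_SRC = """
__kernel void scale(__global float *data, const float factor) {
    int gid = get_global_id(0);   // one work-item per array element
    data[gid] *= factor;
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

host_buf = np.arange(1 << 20, dtype=np.float32)
dev_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                    hostbuf=host_buf)

prg = cl.Program(ctx, KERNEL_SRC).build()
# 1-D index space of 2^20 work-items, organized into work-groups of 256.
evt = prg.scale(queue, (host_buf.size,), (256,), dev_buf, np.float32(2.0))
evt.wait()
print("kernel time: %.3f ms" % ((evt.profile.end - evt.profile.start) * 1e-6))
cl.enqueue_copy(queue, host_buf, dev_buf)
```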
For each kernel, the host defines an N-dimensional abstract index space in which the kernel will be executed (N ∈ {1, 2, 3}). Each point in this space defines one execution instance of the kernel, called a work-item. The work-items are organized into groups of equal size, called work-groups. The work-groups are executed independently, i.e., concurrently and in any order. There are four different types of memory in an OpenCL kernel: global memory, constant memory, local memory, and private memory. Global and constant memory accesses have the highest latency since the accessed data resides in the device global memory. The local memory is shared by all PEs in the same compute unit. The private memory is local to a PE. Accesses to the global memory or the constant memory may be cached in the global/constant memory data cache if there is such a cache in the device.

3.2 GPU Architecture

NVIDIA's Fermi architecture [19] is designed for massively parallel processing. It comprises a fairly large number of streaming multiprocessors (SMs). Each SM contains 32 streaming processors (SPs) for general computations, 16 load/store (LD/ST) units, and four Special Function Units (SFUs) that handle transcendental operations. There are two levels of data caches: an L1 cache for each SM and an L2 cache shared by all the SMs. Figure 2 shows a schematic block diagram of one of the total 16 SMs in the GTX 580 GPU.

Fig. 2: Schematic block diagram of one SM in the GTX 580 GPU.
Fig. 3: Relationship of work-groups, work-items, and warps.

In OpenCL terminology, the GPU represents a compute device, an SM corresponds to a CU, and an SP to a PE. When running a work-group on an SM, each work-item is executed by one thread. Threads are bundled into so-called warps. One warp consists of 32 threads (or work-items; Figure 3). All threads in a warp execute in lock-step. The GPU compiler thus converts conditionally executed code (such as if-else constructs) into predicated code. A divergence describes the situation when, at run-time, not all 32 threads take the same control path. When divergence occurs, both parts of the conditionally executed code must be executed. Divergence inevitably leads to reduced performance.

The work-group is the minimal allocation unit of work for an SM. A work-group running on an SM is called an active work-group [22]. Similarly, active warps denote warps that are currently executing or eligible for execution. If enough hardware resources are available, several work-groups can be active simultaneously in one SM. The number of active work-groups depends on the kernel code, the kernel's index space, and the hardware resources of the SM. The metric to characterize this degree of concurrency is termed occupancy [21]. Occupancy is defined as the ratio of the actual number of active warps to the maximum possible number of active warps per SM. This maximum is hardware specific; in the case of the GTX 580, at most 48 warps (and thus 48 × 32 = 1536 threads or work-items) can be active in one SM at the same time.
The work-groups and warps in a work-group are formed from consecutive work-items. If the size of a work-group is not divisible by the warp size (i.e., 32), every work-group contains one warp that has less than 32 work-items. Thus, to maximize resource utilization, the number of work-items in a work-group should always be a multiple of 32. The active work-groups that execute concurrently in an SM might have different starting and finish- ing times. As soon as one work-group finishes, the SM scheduler activates a new work-group. The 32 work-items of a warp are conceptually executed in one cycle. Recall from Figure 2 that one SM in the Fermi architecture contains 32 SPs for general-purpose instructions but only 16 load/store units. Every time a warp issues a load/store instruction, its execution would thus have to be split up into two cycles. Instead of doing so, every SM contains two warp schedulers that, in every clock cycle, schedule two half-warps from different warps [22]. One scheduler handles warps with even IDs, and the other one handles warps with odd IDs. Both schedulers issue the instruction to one of the four execution units (2 x 16 SPs, 1 x 16 LD/ST units, 1 x 4 SFUs). Such a setup uses the hardware resources more efficiently and does not require extra management of half-way executed warps because, for example, a general-purpose instruction from one half-warp and a memory operation from the other half-warp can be executed in parallel. However, for transcendental operations that are scheduled on the 4 SFUs, a half-warp will require at least four cycles to be scheduled. Each SM contains an instruction cache and 64KB of local data memory that can be dynamically configured as scratchpad memory or L1 cache. A unified L2 data cache of 768KB is shared by all SMs in the GPU. 3.3 Support Vector Regression Here, we briefly introduce Support Vector Regression (SVR), the regression learning algorithm used in the ML-based model. We use a non-linear form of a supervised learning algorithm called ǫ-SVR [26]. In this form, the model takes a feature vector as its input; this is simply a vector of input parameters. In a learning phase the model is trained with a large number of input vectors and the corresponding output values. The model learns the complex interactions between the features of the input vector and the desired objective. We use the leave-one-out-cross-validation technique: we train the model with training data from which all data points of the kernel under test have been removed, and then apply the model to the kernel. This method guarantees that the training data set and the test data set are distinct which is important if we only have a limited number of training data points. In our context, the input data comprises a set of performance features of an OpenCL kernel (provided 5 by the compiler, and obtained through sampling), and the objective is the execution time of kernel. For the interested reader, the following paragraph provides some more details about ǫ-SVR. To keep the discussion reasonably simple, we describe the linear form; the basic idea is identical in the non-linear form. The training phase is performed on a collection of training data. A data instance has the form {x, y} where x is a collection of d representative features for this data instance and y is the objective that is associated with this data instance. The goal is to find a function f (x) that deviates by no more than ǫ from the training target yi for all yi while being as flat as possible. 
In the linear problem statement, the function f(x) has the form f(x) = w · x + b, where w denotes the normal vector to the hyperplane and b is the bias, with w ∈ R^d, b ∈ R. Flatness in this case means that w is small [24]. Denoting the training data as (x1, y1), (x2, y2), ..., (xN, yN), the objective then becomes

    minimize    (1/2) ‖w‖²                              (1)
    subject to  |w · x_i + b − y_i| ≤ ε,  for all i     (2)

This convex optimization problem can be solved using its dual formulation by introducing Lagrange multipliers [26]. Condition (2) assumes that such a function f(x) exists; however, this is often not the case for large datasets. To overcome this limitation, soft-margin SVR [6] was introduced. Details of solving the dual formulation and soft-margin SVR are out of the scope of this paper. We chose ε-SVR because it generalizes well to unseen data. In the training step, rather than only minimizing the observed training error, ε-SVR finds the trade-off between the error and the complexity of the objective function. Once the function f(x) has been determined, it can be used to predict the objective value for yet unseen feature vectors x.

4 PREREQUISITES TO EFFICIENT PROFILING: AN INSIGHT TO WARP SCHEDULING

In order to accurately estimate the runtime performance of an OpenCL kernel, a deep understanding of the inner workings of a GPU is necessary. In addition, a sampling-based model must keep the number of sampled work-groups to a minimum. This section presents our analysis of the NVIDIA GTX 580 scheduler and provides the foundation for the performance model presented in the following section. From pieces of information about GPU scheduling gathered from various sources [16], [18], [19], [21] we first state and verify the following two assumptions:
1) On the device level, the work-group scheduler assigns a new work-group to an SM as soon as an active work-group on that SM finishes executing.
2) On the SM level, the two warp schedulers use a round-robin scheduling policy to schedule the active warps. Whenever a warp blocks or has executed for a certain number of cycles, the scheduler picks a new warp.

To accurately measure the execution time of work-items, we add instrumentation code to the very beginning and the very end of real-world and several hand-crafted kernels. The code reads the GPU's clock cycle performance counter. This instrumentation allows us to record the start time, s_wi(i), and finish time, e_wi(i), for each work-item i. The execution time is then given as t_wi(i) = e_wi(i) − s_wi(i). For a warp w containing N work-items, let

    s_warp(w) = min_{k=1..N} s_wi(k)
    e_warp(w) = max_{k=1..N} e_wi(k)
    t_warp(w) = e_warp(w) − s_warp(w)

The start, finish, and execution time for a work-group j comprising M warps are defined accordingly:

    s_wg(j) = min_{k=1..M} s_warp(k)
    e_wg(j) = max_{k=1..M} e_warp(k)
    t_wg(j) = e_wg(j) − s_wg(j)

Recall that several work-groups can be active within one SM. The compiler computes a kernel's occupancy as a by-product of compiling. Based on the occupancy we can compute how many work-groups can be active simultaneously within one SM. For example, with an occupancy of 1.0 and work-groups containing 256 work-items each, six work-groups can be active at the same time (one work-group contains 256/32 = 8 warps, and to achieve an occupancy of 1.0, we need 6 work-groups with 8 warps to get 48 warps, the maximum number of active warps per SM). Similarly, if the occupancy is 0.833, five work-groups can be active at the same time.
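The occupancy arithmetic above, together with the per-warp timing definitions, can be written down in a few lines. This is a sketch only: the hardware limits are the GTX 580 values quoted in the text, the helper names are ours, and the quantity computed is what the next paragraph calls the saturation point Psat.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 48   # GTX 580: 48 warps = 48 * 32 = 1536 work-items per SM

def active_workgroups_per_sm(occupancy, workgroup_size):
    """Number of simultaneously active work-groups per SM (the saturation point)."""
    warps_per_wg = -(-workgroup_size // WARP_SIZE)        # ceiling division
    active_warps = round(occupancy * MAX_WARPS_PER_SM)
    return active_warps // warps_per_wg

assert active_workgroups_per_sm(1.0, 256) == 6            # example from the text
assert active_workgroups_per_sm(0.833, 256) == 5

def warp_time(start_times, end_times):
    """t_warp = max(e_wi) - min(s_wi) over a warp's instrumented work-items."""
    return max(end_times) - min(start_times)
```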
We call the number of active work-groups per SM for a given kernel the saturation point, denoted Psat . When executing one, then two, then up to Psat work-groups on one SM we expect that the throughput, i.e., the number of issued instructions per second (IPS) increases up to the saturation point. At that point, the running kernel exploits as much of the hardware resources as possible, and the warp schedulers within the SM have the biggest freedom when picking ready-to-run warps. Micro-benchmarks consisting only of ALU operations and no memory accesses indeed show the expected behavior. For more complex kernels, however, the IPS can increase even after the saturation point due to cache warm-up effects. In Figure 4 (a), the saturation point of the microbenchmark is two, and indeed the throughput achieves a stable maximum at multiples of Psat . In between saturation points, maximum throughput is not achieved because not all hardware resources are fully utilized. For nbody in Figure 4 (b), the kernel from the NBody application, the saturation point Psat is 3, yet, the throughput clearly increases after the 6 instr / ms instr / ms 0.7 0.6 0.5 0.4 1 2 3 4 5 6 7 8 9 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8 1 10 2 3 number of work-groups per SM 4 5 6 7 8 9 10 number of work-groups per SM (a) microkernel IPS (b) nbody IPS 30 milliseconds milliseconds Fig. 4: Instructions per microsecond for the first 160 workgroups of a microkernel and nbody. 20 10 0 0 16 32 48 work-group ID 64 80 95 80 60 40 20 0 0 16 (a) microkernel start/finish times 32 48 work-group ID 64 80 95 (b) nbody start/finish times 120 90 60 30 0 0 16 32 48 work-group ID 64 80 95 64 80 95 (a) milliseconds first saturation point. This increase is caused by cache warm-up effects, i.e., the first work-groups suffer from more cold misses and can profit less from memory coalescing than later work-groups. Figures 5 plots the start and finish time for each work-group for the microbenchmark and nbody. In the case of the microbenchmark, Psat = 2 so the the first 32 workgroups are started at the same time (each of the 16 SMs receives two work-groups). We observe that the work-group scheduler distributes the first 16 workgroups one-by-one to each of the 16 SMs. Workgroups 16 to 31 are again evenly distributed to all SMs, and so on. For nbody in Figure 5(b) Psat = 3 so initially, each SM is given three work-groups. These correspond to the first 16 ∗ 3 = 48 work-groups. For both benchmarks, we observe that whenever a workgroup finishes the work-group scheduler immediately assigns a new work-group to the corresponding SM. This confirms assumption 1 about the work-group scheduler. We expect all active work-groups in an SM to start and finish at the same time. Figure 5(a) confirms this assumption for the microbenchmark. In the case of nbody, however, we observe that the third workgroup on each SM takes almost double the time to finish compared to the first work-group (Figure 5(b)). We call such a behavior a staircase. We have run a significant number of hand-crafted micro-benchmarks that stress different types of instruction mixes in loops to explain the staircase. Figure 6(a) shows the behavior of a kernel that contains a single-precision floating point addition in a loop. There are a total of 96 workgroups in the kernel index space of this benchmark, with each work-group containing 256 work-items. The kernel’s occupancy is 1.0, hence six work-groups or 48 warps are active at the same time per SM. 
In total, there are 16 SMs on the GPU, so all 96 work-groups can run in parallel. This explains why all work-groups start at time 0. Since there is no resource contention between warps, warps block very infrequently (e.g., only on changes in the control flow caused by iterat- milliseconds Fig. 5: Start/finish times for the first 96 workgroups of a microkernel and nbody. 200 150 100 50 0 0 16 32 48 work-group ID (b) Fig. 6: Micro-benchmark containing a loop with a (a) singleprecision floating point addition (b) a double-precision floating point addition. ing the loop). According to assumption 2, if a warp does not block, the scheduler will continue to run it until it exceeds a certain number of cycles. After this point the scheduler will schedule another warp that is ready-to-run. Since the kernel does not exhibit a staircase, the selection of the next warp seems to be round-robin. To trigger resource contention, we replace the single-precision with a double-precision floating point addition. Single-precision floating point operations are executed on one SP, but for double-precision operations, the two rows of sixteen SPs are logically linked and operate as 16 double-precision SPs (Figure 2). If one of the half-warps executes a double-precision floating point operation, all 32 SPs are occupied and the other half-warp will block unless it executes a memory or transcendental operation. The execution time of the work-groups is shown in Figure 6(b) and clearly exhibits a staircase. If the warp schedulers schedule half-warps in a round-robin fashion, no staircase should appear since all warps will get an equal share of the SM. To better understand how the warp schedulers se- 20% 40% 60% 80% milliseconds 7 100% 450 400 milliseconds 350 300 250 150 100 50 0 0 200 16 32 150 48 work-group ID 64 80 95 64 80 95 64 80 95 (a) 100 0 0 16 32 48 64 80 95 work-group ID milliseconds 50 Fig. 7: Progress within work-groups in the presence of a staircase. 300 200 100 0 0 16 32 48 work-group ID lect the next warps, we have added additional instrumentation code to the kernel. We not only measure the start and finish time of the kernel, but also record the time after processing 20%, 40%, 60%, and 80% of the workload. The result of this experiment is shown in Figure 7. The figure reveals that, while all workgroups start around time 0, they do not progress at the same speed. In fact, the third work-group on each SM does not even start running past the first couple of loop iterations before the first work-group has finished. The same is true for the fourth workgroup (it effectively starts executing when the second one finishes), and the fifth work-group (when the third work-group finishes). Using this result, we refine assumption 2 on warp-scheduling: 2) On the SM-level, the two warp-schedulers use a round-robin scheduling policy when the active warp has executed for a certain number of cycles. If a warp blocks, the scheduler selects the first warp that is ready-to-run from the workgroup with the lowest ID. With this refined assumption, we are able to explain kernels that do not exhibit staircase behavior because warps seldom block and thus run more or less at the same speed thanks to the round-robin scheduling policy. In high-contention situations, however, the warp schedulers give higher priority to earlier work-groups by picking the first warp that is ready to run from the work-group that has been active for the longest time (i.e., has the lowest ID). 
Clearly, the instruction mix will have an effect on what warp will be selected. Figure 8 shows the start and end times of all 96 simultaneously active work-groups. We run a loop with one, two, and three double-precision floating point additions in the loop body. We anticipate that the staircase is more pronounced the higher the resource contention. This behavior is indeed observed in Figures 8 (a-c): in the case of only one double-precision addition, the mix between operations that cause a warp to block and those that do not is still balanced so that in most cases the scheduler can run a warp until its time quantum expires. As we add more double-precision additions in Figure 8(b-c), the contention becomes more severe. The warp schedulers now clearly favor work-groups milliseconds (b) 400 300 200 100 0 0 16 32 48 work-group ID (c) Fig. 8: Micro-benchmark containing a loop with a (a) one (b) two and (c) three double-precision floating point additions. with a low ID. Indeed, for the case with three doubleprecision additions, the later work-groups will not even start executing before the first work-group has finished executing (Figure 8(c)). A warp can block for several reasons: it blocks when it encounters an instruction whose operands are not available (data dependency), when the required execution units are occupied (hardware contention), or when control flow occurs. We observe that there are three types of operations that can cause hardware contention: double precision instructions, transcendental instructions (such as sine, cosine, exp, etc.), and memory operations. However, whether the warp will actually block or not also depends on the instruction mix of the kernel. If the kernel contains a well-balanced mix of instructions that are handled by different hardware resources, even a kernel with many double-precision floating point operations may not exhibit a staircase. The following section details the construction of the linear model, the ML-based model and, in particular, the points at which the workload is sampled based on the observations on the the scheduling mechanism. 5 P ERFORMANCE E STIMATION Based on our analysis of the execution model of Fermi GPUs in the previous section, we first construct a linear model to estimate the execution time of a workload. This simple linear model, however, cannot capture the non-linear effects of memory accesses on performance. Since recent GPUs such as those based on the Fermi architecture [19] implement several levels of hardware caches, we then propose a model based on machine learning techniques to improve the 8 t3*Psat time staircase workgroups Psat no staircase t2*Psat tPsat 2*Psat 3*Psat Fig. 9: Sampling at saturation points and the linear model. prediction accuracy for modern architectures and nonlinear benchmarks. The ML-based model, guided by the linear model, has more relaxed assumptions than the linear model and can thus make a more accurate execution time estimation for the general case. 5.1 Linear Model The linear model estimates the total execution time of a kernel by extrapolating the execution time obtained from two sampling points. Our goal is to choose two sampling points with as small a number of workgroups as possible while maintaining an acceptable accuracy for the estimation to keep the overhead to a minimum. 
The previous section shows that even though a kernel fully utilizes the hardware resources of a GPU at the saturation point, Psat, the throughput can increase even after Psat due to warming-up effects of the L1 and L2 caches on the GPU (Figure 4(b)). We thus sample each kernel twice, once at Psat1 and once at Psat2, where both Psat1 and Psat2 are multiples of Psat. The execution time of the first sample, t_Psat1, determines the displacement, and the difference in execution time from the second sample to the first, i.e., t_Psat2 − t_Psat1, represents the slope. Sampling at integer multiples of the saturation point has the added advantage that the staircase behavior can be ignored in the linear model. In fact, the staircase as explained in Section 4 only manifests itself in between two saturation points; at (integer multiples of) the saturation point the sampled execution times are identical. Figure 9 visualizes this observation. We have found that Psat1 = 2 × Psat and Psat2 = 3 × Psat are sufficient to avoid the non-linearity of the first few work-groups.

Using two sampling points at multiples of Psat, the execution time of a kernel can be modeled as follows. The first sampling point determines the displacement, and the difference in execution time from the second to the first sampling point yields the increment (Figure 9):

    t_linear(N) = t_Psat1 + (t_Psat2 − t_Psat1) / (Psat2 − Psat1) × (N − Psat1)

Here, t_linear(N) denotes the total execution time of the kernel for N work-groups. t_Psat1 and t_Psat2 stand for the sampled execution times at saturation points Psat1 and Psat2, respectively. This simple linear model accurately estimates the execution time of a kernel when the following two conditions hold: (1) the workload is evenly distributed and (2) the memory access patterns are similar for all work-groups. The accuracy drops for unevenly distributed workloads or workloads that exhibit a variance in performance caused by caching or memory coalescing effects. The ML-based model tries to eliminate these shortcomings.
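As an illustration only, the extrapolation above can be expressed in a few lines of Python; the variable names and the example numbers are ours, not taken from the authors' implementation.

```python
def linear_estimate(t_sat1, t_sat2, p_sat1, p_sat2, n_workgroups):
    """Linear extrapolation of kernel runtime from two sampled points.

    t_sat1/t_sat2: measured execution times of p_sat1/p_sat2 work-groups,
    where p_sat1 and p_sat2 are integer multiples of the saturation point.
    n_workgroups: total number of work-groups of the full kernel launch.
    """
    slope = (t_sat2 - t_sat1) / (p_sat2 - p_sat1)  # time per additional work-group
    return t_sat1 + slope * (n_workgroups - p_sat1)

# Hypothetical numbers: Psat = 3, samples taken at 2*Psat and 3*Psat.
p_sat = 3
t_est = linear_estimate(t_sat1=1.8e-3, t_sat2=2.6e-3,
                        p_sat1=2 * p_sat, p_sat2=3 * p_sat,
                        n_workgroups=4096)
print(f"estimated kernel time: {t_est * 1e3:.1f} ms")
```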
5.2 Model based on Machine Learning

For the ML-based model, we break the task of estimating the execution time into two smaller tasks. First, we estimate the saturated instructions per second (IPS) of a kernel using a machine learning algorithm. By instructions per second we refer to the number of issued instructions per second, not the number of executed instructions. The difference between issued and executed instructions is that issued instructions include instructions that are serialized due to hardware contention, memory conflicts, or divergence. Serialization seriously affects performance, thus using issued instructions improves the accuracy of the model. After approximating the saturated IPS, we compute the total number of instructions for the whole kernel and can easily calculate the total execution time. The ML-based model makes fewer assumptions and covers a wider range of kernels than the linear model since it only requires the kernel to distribute the workload evenly between the work-groups.

Estimation of the saturated IPS. The features of a kernel comprise static features collected by the compiler and dynamic features obtained during sampling runs on the GPU. We use the data from the same two runs as the linear model (at the second and the third saturation point). Table 1 lists the kernel features that were used to construct the IPS estimation model for the GTX 580 and the GTX 280.

TABLE 1: List of features of the ML model for the GTX 280 and the GTX 580

Feature                 | Description                                                 | Source              | GPU Model
active work-groups      | number of work-groups that can be active at the same time  | compiler            | GTX 280/580
registers per work-item | number of registers used by a work-item                    | compiler            | GTX 280/580
execution time          | execution time of the run                                   | performance counter | GTX 280/580
branches                | number of branches                                          | performance counter | GTX 280/580
work-items              | number of sampled work-items                                | performance counter | GTX 280/580
divergent branches      | number of divergent branches                                | performance counter | GTX 280
L1/L2 miss rate         | ratio of L1/L2 miss count to the issued instructions count  | performance counter | GTX 580
total L1 accesses       | L1 miss count plus L1 hit count                             | performance counter | GTX 580
issued instructions     | number of issued instructions                               | performance counter | GTX 280
executed instructions   | number of executed instructions                             | performance counter | GTX 280/580
bank conflicts          | bank conflict count                                         | performance counter | GTX 580
shared loads/stores     | shared load/store count                                     | performance counter | GTX 580
global loads/coalesced  | number of global total/coalesced load requests              | performance counter | GTX 280
warp serializes         | number of warp serializes                                   | performance counter | GTX 280
IPS                     | IPS obtained by the linear model                            | computed            | GTX 280/580

Training and deploying the learning model. An ML model requires a large amount of training data during the training phase. To obtain a sufficient amount of data from various kernels, each kernel is run several times, every time with a different number of work-groups. Since the performance factors of a kernel often fluctuate when executing the first few work-groups, several runs at a varying number of work-groups help cover all behavioral differences of the kernel. Each run produces a set of performance counters. We combine these values with the static features obtained from the compiler and the result produced by the linear model using the second and the third saturation point to form a feature vector. The feature vector is associated with the saturated IPS of the kernel to form the training data. The saturated IPS is obtained by running the kernel with the largest possible input data and recording the IPS.

Estimating the total execution time. Calculating the total execution time for a kernel is straightforward. The features of a kernel include: the static features of the kernel; the performance data obtained from the two sampling runs; and the result given by the linear model. The features are combined to form a feature vector that is fed into the ML model. The model outputs the estimated IPS for one work-group. Since we assume that the workload is evenly distributed between the work-groups, we simply multiply the estimated IPS by the number of work-groups as follows:

    t_ML(N) = (IP_sat2 × N) / (Psat2 × IPS_est)

Here, t_ML(N) denotes the total execution time of the kernel for N work-groups. IP_sat2 is the number of instructions issued on one SM at Psat2 work-groups that is recorded when evaluating the linear model. Using sampling information at Psat2 leads to more accurate results than at Psat1 because of lessened warming-up effects and therefore more precise performance counter data. IPS_est is the estimated IPS computed by the ML model.
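The two-step scheme (train an ε-SVR on the Table 1 features, then convert the predicted saturated IPS into a runtime) can be sketched as follows. This is only an illustration assuming scikit-learn's SVR; the feature layout, hyperparameters, and function names are ours, not the authors' code.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def train_ips_model(X, y):
    """X: one row per sampling run (compiler statistics, performance counters,
    and the linear-model IPS, as in Table 1); y: measured saturated IPS."""
    # epsilon-SVR with an RBF kernel; scaling matters because counter values
    # differ by orders of magnitude. Hyperparameters below are placeholders.
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
    model.fit(X, y)
    return model

def estimate_total_time(model, features, ip_sat2, p_sat2, n_workgroups):
    """t_ML(N) = (IP_sat2 * N) / (Psat2 * IPS_est), cf. the formula above."""
    ips_est = float(model.predict(np.asarray(features).reshape(1, -1))[0])
    return (ip_sat2 * n_workgroups) / (p_sat2 * ips_est)
```

For the leave-one-out cross-validation described in Section 3.3, all rows produced by the kernel under test would be removed from X and y before calling train_ips_model.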
Fig. 10: Actions and information flow for both models.

To summarize, as illustrated in Figure 10, the processing routine to test an OpenCL kernel (that is not included in the training set) is as follows:
1) Based on occupancy and work-group size, analytically calculate the number of concurrently active work-groups, Psat.
2) Extract the data buffer needed by the first work-groups at both sampling points and copy it to the device memory.
3) Sample at Psat1 and Psat2 work-groups and record the performance counters.
4) Evaluate the linear model to obtain the estimated execution time and compute IPS_est.
5) Combine the data from the performance counters with the estimated execution time from the linear model to form a feature vector for the ML-based model. Then use the output of the model, the estimated saturated IPS, to compute the estimated total execution time of the kernel for the entire workload.

6 EVALUATION

6.1 Evaluation Setup

The proposed model has been developed for NVIDIA GPUs with caches. However, its simplicity and generality allow it to be applicable to any GPU that shares similar design concepts. To demonstrate that it can be easily adapted to other GPU architectures, we have evaluated the model on different architectures from the two most significant GPU vendors: NVIDIA and AMD. For NVIDIA, we evaluate the model on three NVIDIA GPUs from three architectural generations: the Kepler-based GeForce GTX 680, the Fermi-based GeForce GTX 580, and the GeForce GTX 280 as a representative for NVIDIA's second-generation GPU architecture without hardware caches. Currently, there are no CUDA/OpenCL drivers for the GTX 680 that support reading the performance counters from within OpenCL programs. For this reason we evaluate only the linear model on the GTX 680. For AMD, we evaluate the model on a recent GPU, the Radeon HD 6970, which comprises hardware caches.

To test the models, we use 70 kernels extracted from 39 OpenCL applications. The OpenCL applications stem from the Parboil [10], SNU NPB [23], SHOC [7], AMD [1] and NVIDIA [18] benchmark suites.
The selected kernels all satisfy two conditions: (1) the kernel is executed with a large number of work-groups (at least several hundreds), and (2) the kernel’s execution time is longer than one tenth of a second to minimize 10 Source AMD NVIDIA Parboil Shock SNU NPB Application (Kernels) Binomial (binomial) Blackscholes (blackscholes), BoxFilter (BoxRowsLmem), ConvolutionSeparable (convolutionRows, convolutionColumns), CopyComputeOverlap (VectorHypot), DXTCompress (compress), DotProduct (DotProduct), FDTD3d (FiniteDifferences), HiddenMarkov (ViterbiOneStep), Histogram (histrogram64), Matmul (matmul), MatVecMul (uncoal0, uncoal1, coal0, coal1, coal2, coal3), MedialFilter (ckMedian), Nbody (nbody), QuasiRandomGenerator (Quasi, InvCND), RadixSort (reorderDatakey, radixSortBlockKey), SobelFilter (ckSobel), SortingNetworks (sortLocal, sortLocal1, mergeGlobal, mergeLocal), Transpose (trans naive), Tridiagonal (pcr, cycle) CP (cp), Cutcp (lattice), LBM (StreamCollideX), Mri-q (ComputeQ GPU), Mri-gridding (binning, reorder, sort, rearrange, gridding), Sad (mb calc, calc 8, calc 16), RPES (computeX), Tpacf (tpacf) FFT (fft1D 512, fft1D 512), BFS (bfs), Sgemm (sgemmNN, sgemmNT), Spmv (spmv csr vector), Stencil2D (stencilKernel) BT (initialize2), CG (conj grad 2, conj grad 6), EP (ep), FT (cffts1, cffts2, cffts3, indexmap, evolve), LU (setiv), MG (resid, norm2u3, rprj3, psinv, interp) Input 32768 samples default large -s 4 class=B/C TABLE 2: OpenCL kernels the effect of small measurement fluctuations. Table 2 lists the applications, the kernels, the source of the application and the input data. All experiments are performed with the NVIDIA OpenCL driver 1.1 and AMD APP SDK v2.9. By executing each kernel at several saturation points, we extract a total of 683 data instances for NVIDIA GPUs and 578 data instances for AMD GPUs. We use the leave-one-out cross validation (see Section 3.3) technique to evaluate the ML model: for each kernel to be evaluated, all data sets generated by that kernel are removed from the data set before training the ML model. To estimate the execution time of a kernel, we sample at the second and the third saturation point. The kernel features and the performance counters are used to compute the execution time for the linear model. The result of linear model is combined with the recorded performance counters to estimate the final execution time using the ML-based model. 6.2 Performance estimation results 6.2.1 NVIDIA GPUs Figures 11 and 12 show the error rates of the the linear and the ML-based model in terms of estimated execution time compared to the actual execution time for NVIDIA GPUs. The error rates of the linear and ML-based model are, on average, 5.72% and 4.76% for the GTX 580 and 7.19% and 6.42% for the GTX 280, respectively. As expected, the linear model performs extremely well for the regular kernels, i.e., when all performance factors scale linearly. GPUs perform best when executing regular application, so it is not surprising that the performance counter data of 80% of all kernels we have encountered scale linearly. For certain kernels, the linear model over- or underestimates the execution time significantly. This comes from several sources. First, even though we have excluded very short running kernels, a few kernels still exhibit a rather short execution time and cause a high fluctuation in the measurements. The kernels belonging to this group include resid, rprj3, psinv, stencilKernel, tran_naive. These kernels show an average error of approximately 10%. 
Second, for some kernels we have observed a significant warm-up effect at the beginning of the execution. These effects are caused by the hardware caches, or special instruction scheduling policies that cause the GPU to not evenly distribute the work-groups across the SMs. The learning model observes these non-linear effects through the performance counters and learns to correct the result given by the linear model. On the GTX 580, the significant cases belonging to this type include ifft1D_512, BoxRowsLmem, Quasi, MergeLocal, integrate, StreamCollide, resid, norm2u3, and binning. The ML-based model achieves the biggest improvement for binning. This kernel has a large number of warps that are serialized (due to branch divergence) and warp re-issues (due to cache misses) during its execution. These serializations are unevenly distributed across the saturation points. The learning model observes this warm-up effect and is able to produce a much more accurate result. The kernel Quasi has high error rate with the linear model because during sampling, the GPU does not seem to evenly distribute the work-groups across the SMs. The GPU performance counters only report data for one SM, hence if the SM being sampled is assigned more work-groups than some other SMs the performance counter numbers are distorted. The ML-based model observes this behavior from the performance counter sm cta launch and achieves an improved result. The kernels ifft1D_512 and BoxRowsLmem have high error rates for the linear model because the performance factors, such as the number of L1 and L2 cache misses or the number of coalesced memory requests, do not scale linearly. On the GTX 280, the kernels falling into this category include binning, ComputeX, cffts3, ViterbiOneStep and reorderDataKeys. The third source for large errors are unevenly distributed workloads across the work-groups. This happens when a kernel contains a conditional branch or a loop whose iteration count depends on the value of some input data, such as the work-group ID or a work-item ID. The kernels that exhibit this behavior are binning, gridding and gen_hists, StreamCollide, uncoal1, integrate on the GTX 280 and gridding, lattice, rprj3, trans_naive, 11 45/43 35.00 Linear 30.00 ML Error rate (%) 25.00 20.00 15.00 10.00 5.00 0.00 Fig. 11: Error rates of the linear and ML-based model on the GTX 580 for the execution time of each kernel. 93/93 Linear 35.00 ML Error rate (%) 30.00 25.00 20.00 15.00 10.00 5.00 0.00 Fig. 12: Error rates of the linear and ML-based model on the GTX 280 for the execution time of each kernel. 95 85 35.00 Error rate (%) 30.00 25.00 20.00 15.00 10.00 5.00 0.00 Fig. 13: Error rates of the linear model on the GTX 680 for the execution time of each kernel. integrate on the GTX 580. On most of those kernels, the ML-based model performs slightly better than the linear model. This is due to the fact that the learning model attributes this imbalance to warm-up effects and thus computes a more precise throughput. Comparing the performance of the linear model on GTX 580 and GTX 280, we observe a substantial difference in accuracy between a number of kernels. This includes the kernels gridding, StreamCollide, calc_16, gen_hists, psinv, cffts3, setiv, ifft1D_512, BoxRowsLmem, ViterbiOneStep, uncoal1, StencilKernel and Quasi. There are several reasons for these differences: First, certain kernels have an imbalanced number of replays (across the work-groups) on the GTX 580 but are balanced in terms of replays on GTX 280. 
A replay is the re-execution of an instruction after a cache-miss [20]. The kernels falling into this category are psinv, BoxRowsLmem and ifft1D_512. For these kernels the linear model produces more accurate results on the GTX 280 than the GTX 580. Similarly, the linear model works better on the GTX 580 for kernels that have imbalanced replays on the GTX 280 but are balanced on the GTX 580. Included in this group are StreamCollide and uncoal1. Second, on the GTX 280 the requirements for memory coalescing are stricter than the GTX 580. For those kernels, calc_16, cffts3 and ViterbiOneStep, the warm-up effect is stronger on the GTX 280 and therefore the linear model performs better for the GTX 580 than the GTX 280. Third, due to different architectural parameters, the saturation points are not necessarily equal for the same kernel. gen_hists has a large workload imbalance in the last quarter of its work-groups. On the GTX 580, the second sampling point includes work-groups from the last quarter whereas on the GTX 280 the second sampling point includes less work-groups which do not exhibit workload imbalance. For this reason, the linear model has more information on the GTX 580 and is thus more accurate. Similarly, gridding is highly imbalanced in terms of the number of instructions 12 Fig. 14: Error rates of the linear and ML-based model on the Radeon HD 6970 for the execution time of each kernel. per work-group. The occupancy of gridding for the GTX 280 and the GTX 580 are different; therefore, the degree of imbalance in the number of instructions at the sampling points is different from the GTX 280 to the GTX 580. Finally, for kernels with a relatively small sampling execution time, the measurement variation is also a source of error, especially for kernels that have large number of work-groups such as psinv or stencialKernel. We note that the irregularity in warp serialization on the GTX 280 affects the result of the performance estimation less than the irregularity in memory access patterns on the GTX 580. This explains why the overall error rate of the linear model on the GTX 280 is lower than that on the GTX 580. We have implemented the model presented by Hong and Kim [11], an analytical model to estimate performance on a GTX 280. Hong’s model has an average error rate of 30%, almost five times as much as the ML-based model in Figure 12. Note that we use the arithmetic average to compare error rates, while Hong et al. report the geometric average in their paper. To demonstrate the portability of the proposed model, we evaluate the linear model on one of the most recent NVIDIA GPUs, a GTX 680 based on the Kepler architecture, to see how it adapts to a newer GPU. It is currently impossible to read the performance counters from OpenCL, thus we cannot apply the ML-based model to this architecture yet. The linear model obtains a slightly worse overall error rate of 10%. Figure 13 details the result on the GTX 680 for each kernel. There are two kernels that suffer from especially high error rates: gridding and setiv. The kernel gridding has an extremely uneven workload distribution across the work-groups as described above. There is no serious workload imbalance in setiv. However on the GTX 680, the number of compute units and memory units has been increased, therefore having only 32 work-items per work-group and 8 active work-groups per SM does not use all the GPU resources effectively. 
To confirm this conjecture we double the number of work-groups per sampling run, i.e., we sample at 2 × Psat and 4 × Psat. This dramatically reduces the error rate for setiv. Note that ML models perform better the more training data is available. We expect the ML-based model to perform better as we add more benchmarks.

6.2.2 AMD GPUs

Figure 14 shows the error rates of our model for the AMD HD 6970 GPU. The error rates of the linear and ML-based model are, on average, 8.10% and 7.91%, respectively. It is interesting that the linear model, although developed based on an analysis of a specific NVIDIA GPU, predicts quite well for most kernels running on the AMD HD 6970 GPU. This is because both NVIDIA GPUs and AMD GPUs use the concept of occupancy to indicate how many active groups of work-items can be executed concurrently. This group is called a warp in NVIDIA GPUs and a wave-front in AMD GPUs. The difference between warp and wave-front is their size, i.e., the number of work-items executed in a lock-step manner. This size is simply an input to the linear model. Because the sampling points embody the execution behavior of warps and wave-fronts on the hardware, the linear model can predict very well on both NVIDIA GPUs and AMD GPUs.

The linear model does not predict well on some kernels, including trans_naive, lattice, binning, gridding, setiv, gen_hist, evolve, compress and ViterbiOneStep, with an error of more than 20%. There are several reasons for this error. First, as explained in the result section for NVIDIA GPUs, there is a workload imbalance between work-groups for binning, gridding, gen_hist and trans_naive. Second, although there is no significant imbalance in the workload, some performance factors do not scale linearly (e.g., the cycles stalled due to a memory access) across different saturation points, as in trans_naive, lattice, setiv, compress, evolve and ViterbiOneStep.

On the Radeon HD 6970, the ML-based model predicts almost equally as well as the linear model. This is because there is a limited number of performance counters we can access using OpenCL. In particular, there are no performance counters related to hardware caches on AMD GPUs that can be accessed through the OpenCL interface. The performance counters used as inputs to the machine learning model on AMD are insufficient to capture the correlation between the execution time and the input performance factors, hence its performance is similar to that of the linear model.

TABLE 3: Performance for different error classes (error rates in %).

Error class |       GTX 280       |       GTX 580
            |   LM     ML   delta |   LM     ML   delta
0-5%        |  1.88   2.36  -0.48 |  2.03   2.13  -0.10
5-10%       |  6.65   3.13   3.50 |  7.59   6.57   1.01
>10%        | 21.24  18.67   2.57 | 18.46  13.41   5.06

Fig. 15: Average error rates of the linear and ML-based model at different saturation points: (a) NVIDIA GTX 580, (b) AMD Radeon HD 6970.

6.3 The ML-based model on different classes of kernels

Table 3 compares the linear model to the ML-based model in relation to the error class of the kernel for two NVIDIA GPUs. The error class 0-5% comprises all kernels for which the error of the linear model is 0 to 5%, and the error classes 5-10% and >10% are constructed accordingly. The results confirm that the ML-based model performs better on kernels for which the linear model has a relatively big estimation error. For regular kernels the linear model slightly outperforms the ML model. While the ML-based model is capable of adapting to irregularities in the kernels, this flexibility comes at the expense of a small degradation in accuracy for such kernels.
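The grouping behind Table 3 is simple to reproduce. The sketch below assumes per-kernel error rates are already available as arrays; the helper names and data layout are ours, not the paper's evaluation scripts.

```python
import numpy as np

def error_rate(estimated, measured):
    """Relative estimation error in percent."""
    return abs(estimated - measured) / measured * 100.0

def summarize_by_error_class(linear_err, ml_err):
    """Group kernels by the linear model's error, as in Table 3.

    linear_err, ml_err: per-kernel error rates in percent (numpy arrays).
    Returns mean linear/ML error and their difference per class.
    """
    classes = {"0-5%": (0.0, 5.0), "5-10%": (5.0, 10.0), ">10%": (10.0, np.inf)}
    rows = {}
    for name, (lo, hi) in classes.items():
        mask = (linear_err >= lo) & (linear_err < hi)
        lm, ml = linear_err[mask].mean(), ml_err[mask].mean()
        rows[name] = (lm, ml, lm - ml)  # positive delta: ML model is more accurate
    return rows
```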
TABLE 3: Performance for different error classes (error rates in %).

                     GTX 280                  GTX 580
Error class      LM      ML   delta       LM      ML   delta
0-5%            1.88    2.36  -0.48      2.03    2.13  -0.10
5-10%           6.65    3.13   3.50      7.59    6.57   1.01
>10%           21.24   18.67   2.57     18.46   13.41   5.06

Fig. 15: Average error rates of the linear and ML-based model at different saturation points: (a) NVIDIA GTX 580, (b) AMD Radeon HD 6970.

6.3 The ML-based model on different classes of kernels

Table 3 compares the linear model to the ML-based model in relation to the error class of the kernel for the two NVIDIA GPUs. The error class 0-5% comprises all kernels for which the error of the linear model is 0 to 5%, and the error classes 5-10% and >10% are constructed accordingly. The results confirm that the ML-based model performs better on kernels for which the linear model has a relatively large estimation error. For regular kernels the linear model slightly outperforms the ML-based model. While the ML-based model is capable of adapting to irregularities in the kernels, this flexibility comes at the expense of a small degradation in accuracy for such kernels.

6.4 The performance at different saturation points

Figure 15 shows the average error rates of the linear and ML-based model at different sampling points for (a) the NVIDIA GTX 580 and (b) the AMD Radeon HD 6970. The x-axis denotes the first sampling point at x multiples of the saturation point (Psat); the second sampling point is at (x+1)∗Psat. For the GTX 580, warm-up effects in the caches lead to a reduced accuracy when sampling at the first saturation point. From the second saturation point on, the warm-up effects are mostly covered, and the accuracy of the model remains similar even with a larger amount of sampling. For the Radeon HD 6970, the linear model achieves error rates of 8-11% for the first five saturation points. This implies that the model benefits from the architectural similarities between GPUs from different vendors. The ML-based model does not perform significantly better; the reason is that AMD GPUs provide far fewer performance counters than NVIDIA GPUs. For example, there is not sufficient information about the usage of the L1 and L2 caches. Without these performance counters, the ML-based model is unable to improve the accuracy of the estimation significantly. For both architectures, sampling at the second/third sampling point provides reasonable accuracy at the lowest possible overhead. The overhead of sampling compared to the total runtime of the kernels amounts to, on average, 8% for NVIDIA and 15% for AMD GPUs. It is important to note that the sampling overhead is independent of the input data size and will thus be significantly smaller for real-world kernels with very large input data sets.

7 CONCLUSIONS

We have presented a linear and an ML-based performance estimation model for GPUs with or without hardware caches. The wide applicability of the models is demonstrated by running 70 OpenCL kernels on three different generations of NVIDIA GPU architectures and one AMD GPU. The linear model outperforms existing models and achieves error rates below 10%. On modern GPUs, the complex effects of cached memory accesses are difficult to capture with a linear model. On NVIDIA GPUs, with performance counters reflecting cache usage and branch divergence, the ML-based model detects the correlation between these factors and the execution time and reduces the average estimation error to about 5%. The ML-based model performs especially well on benchmarks that exhibit non-linear throughput and are thus difficult to estimate for the linear model. On AMD GPUs, on the other hand, the ML-based model cannot detect this correlation due to the lack of certain important performance counters. The proposed models significantly outperform related work on various GPU architectures. They achieve very good accuracy for a wide range of benchmarks on different GPU architectures and can thus be used as performance estimation models in a dynamic workload-distribution framework.

ACKNOWLEDGEMENTS

This work was supported by the National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIP) (No. 2013R1A3A2003664) and the Ministry of Education, Science and Technology (No. 2012R1A1A1042938).
ICT at Seoul National University provided research facilities for this study.

REFERENCES

[1] AMD. ATI Stream Software Development Kit (SDK) v2.1, 2010.
[2] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W. W. Hwu. An adaptive performance modeling tool for GPU architectures. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010.
[3] Bradley J. Barnes, Barry Rountree, David K. Lowenthal, Jaxk Reeves, Bronis de Supinski, and Martin Schulz. A regression-based approach to scalability prediction. In Proceedings of the 22nd Annual International Conference on Supercomputing, 2008.
[4] R. Bitirgen, E. Ipek, and J. F. Martinez. Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, 2008.
[5] S. Collange, D. Defour, and D. Parello. Barra, a parallel functional GPGPU simulator. Technical Report hal-00359342, Université de Perpignan, 2009.
[6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[7] Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010.
[8] M. Fatica. Accelerating Linpack with CUDA on heterogeneous clusters. In Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009.
[9] Khronos OpenCL Working Group. The OpenCL Specification, Version 1.0. Khronos Group, 2009.
[10] The IMPACT Research Group. Parboil Benchmark Suite, 2009.
[11] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th International Symposium on Computer Architecture, June 2009.
[12] Sunpyo Hong and Hyesoon Kim. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010.
[13] Engin Ipek, Bronis R. de Supinski, Martin Schulz, and Sally A. McKee. An approach to performance prediction for parallel applications. In Euro-Par, Springer LNCS, 2005.
[14] W. Jia, K. A. Shaw, and M. Martonosi. Stargazer: Automated regression-based GPU design space exploration. In 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2012.
[15] W. Liu, W. Muller-Wittig, and B. Schmidt. Performance Predictions for General-Purpose Computation on GPUs. In International Conference on Parallel Processing, September 2007.
[16] NVIDIA. NVIDIA GeForce GTX 580.
[17] NVIDIA. NVIDIA GeForce GTX 680.
[18] NVIDIA. NVIDIA OpenCL SDK code samples.
[19] NVIDIA. NVIDIA Fermi Compute Architecture White Paper.
[20] NVIDIA. NVIDIA CUDA Profiler.
[21] NVIDIA. NVIDIA Compute - PTX: Parallel Thread Execution, ISA Version 1.4, March 2011.
[22] NVIDIA. OpenCL Programming Guide for the CUDA Architecture, Version 4.0, February 2011.
[23] Sangmin Seo, Gangwon Jo, and Jaejin Lee. Performance Characterization of the NAS Parallel Benchmarks in OpenCL. In Proceedings of the 2011 IEEE International Symposium on Workload Characterization, 2011.
[24] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.
[25] Top500.org. Top500 Supercomputer Sites - November 2013, 2013.
[26] V. Vapnik. The Nature of Statistical Learning. Springer, 1999.
[27] H. Wong, M. M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU microarchitecture through microbenchmarking. In 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 2010.
[28] Canqun Yang, Feng Wang, Yunfei Du, Juan Chen, Jie Liu, Huizhan Yi, and Kai Lu. Adaptive optimization for petascale heterogeneous CPU/GPU computing. In Proceedings of the IEEE International Conference on Cluster Computing, pages 19–28, 2010.
[29] Yao Zhang and J. D. Owens. A quantitative performance analysis model for GPU architectures. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, February 2011.
[30] A. Kerr, G. Diamos, and S. Yalamanchili. Modeling GPU-CPU workloads and systems. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010.
[31] D. Grewe and M. F. P. O'Boyle. A static task partitioning approach for heterogeneous systems using OpenCL. In Compiler Construction, 2011.
[32] C.-K. Luk, S. Hong, and H. Kim. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009.

Thanh Tuan Dao received the BS degree in computer science from VNU University of Engineering and Technology in 2008, and the Master's degree in computer science and engineering from Seoul National University in 2011. He is currently a PhD student at the Center for Manycore Programming, Seoul National University. His research interests include high-performance computing, compilers, and runtime systems for multicores/manycores.

Jungwon Kim received the BS degree in computer science and engineering and the PhD degree in electrical engineering and computer science from Seoul National University in 2006 and 2013, respectively. He is currently a postdoctoral research associate at Oak Ridge National Laboratory. His research interests include compilers, runtime systems for multicores/manycores, and scalable heterogeneous HPC computing.

Sangmin Seo received the BS degree in computer science and engineering and the PhD degree in electrical engineering and computer science from Seoul National University in 2007 and 2013, respectively. After obtaining his PhD degree, he spent half a year as the CEO of ManyCoreSoft Co., Ltd. He is currently a postdoctoral appointee in the Mathematics and Computer Science Division at Argonne National Laboratory. His research interests include high-performance computing, parallel programming models, compilers, and parallel runtime systems. He is a member of the IEEE and the IEEE Computer Society.

Bernhard Egger received the diploma in computer science from the Swiss Federal Institute of Technology in Zürich, Switzerland, and the PhD degree in computer science and engineering from Seoul National University, Korea. He then spent three years at Samsung Electronics' research institute before joining Seoul National University in 2011, where he is currently an assistant professor at the School of Computer Science and Engineering. His research interests include programming language design, compilers, and operating systems for heterogeneous manycore systems. He is a member of the IEEE and ACM. More information can be found at http://csap.snu.ac.kr/∼bernhard.
Jaejin Lee received the BS degree in physics from Seoul National University in 1991, the MS degree in computer science from Stanford University in 1995, and the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 1999. He is a professor in the Department of Computer Science and Engineering at Seoul National University, Korea, where he has been a faculty member since September 2002. Before joining Seoul National University, he was an assistant professor in the Computer Science and Engineering Department at Michigan State University. He was a recipient of an IBM Cooperative Fellowship and a fellowship from the Korea Foundation for Advanced Studies during his PhD studies. His research interests include compilers, computer architectures, embedded computer systems, and systems in high-performance computing. He is a member of the IEEE and ACM. More information can be found at http://aces.snu.ac.kr/∼jlee.