Adaptive performance prediction for integrated GPUs

2016, Proceedings of the 35th International Conference on Computer-Aided Design

ABSTRACT

Integrated GPUs have become an indispensable component of mobile processors due to the increasing popularity of graphics applications. The GPU frequency is a key factor both in application throughput and mobile processor power consumption under graphics workloads. Therefore, dynamic power management algorithms have to assess the performance sensitivity to the GPU frequency accurately. Since the impact of the GPU frequency on performance varies rapidly over time, there is a need for online performance models that can adapt to varying workloads. This paper presents a lightweight adaptive runtime performance model that predicts the frame processing time. We use this model to estimate the frame time sensitivity to the GPU frequency. Our experiments on a mobile platform running common GPU benchmarks show that the mean absolute percentage errors in frame time and frame time sensitivity prediction are 3.8% and 3.9%, respectively.

Adaptive Performance Prediction for Integrated GPUs

Ujjwal Gupta, Joseph Campbell, Umit Y. Ogras (Arizona State University); Raid Ayoub, Michael Kishinevsky, Francesco Paterna (Intel Corporation); Suat Gumussoy (IEEE Member)

ICCAD '16, November 07-10, 2016, Austin, TX, USA. DOI: http://dx.doi.org/10.1145/2966986.2966997

1. INTRODUCTION

Graphically intensive mobile applications, such as games, are now one of the most popular smartphone application categories. There are more than a quarter million games, which have led to several million downloads on Android devices alone [1]. When running many of these applications, the GPU accounts for more than 35% of the application processor's power consumption. It is not always viable to decrease the GPU frequency to reduce power consumption, since graphics performance is highly sensitive to the frequency. Therefore, there is a need for accurate performance models that can be used to control the GPU frequency judiciously.

The primary GPU performance metric is the number of frames that can be processed per second, since this number determines the display frame rate. Therefore, we use the time the GPU takes to process a frame as the performance metric. The frame time varies significantly across different time periods of an application, as shown in Figure 1. Furthermore, it is highly correlated with the GPU frequency and dependent on the target application. Hence, the frame time is a multivariate function of the frequency and workload, where the latter is captured by the performance counters. An accurate GPU performance model enables us to predict the change in performance as a function of the change in frequency.

Figure 1: The change in frame time for the Ice Storm application at (a) 200 MHz and (b) 489 MHz GPU frequencies.

The performance model needs to capture the impact of dynamic workload variations and have a low computational overhead. Combined with a power model, this sensitivity model can be integrated into dynamic power management algorithms to select the best GPU frequency.

This paper presents a systematic methodology for constructing a tractable runtime model for GPU frame time prediction. The proposed methodology consists of two major steps.
The first step is an extensive offline analysis to collect frame time and GPU performance counter data. This analysis enables us to construct a frame time model template and select the most significant feature set. Our model employs differential calculus to express the change in frame time as a function of the partial derivatives of the frame time with respect to the GPU frequency and the performance counters. The second step is an adaptive algorithm that learns the coefficients of the proposed model online. More specifically, we employ a lightweight recursive least squares (RLS) algorithm to predict the change in frame time dynamically. RLS is a good choice since the correlation between different frames decays quickly, unlike the fractal behavior observed at the macroblock level [19].

Besides frame time prediction, we use our model to predict the frequency sensitivity, which is defined as the derivative of the frame time with respect to the GPU frequency. This information can be utilized by dynamic power management algorithms, which often have to decide whether to change the frequency [2, 17]. To validate our approach, we performed experiments on a state-of-the-art mobile platform using both custom applications and commonly used graphics benchmarks [8]. The experiments show that the mean absolute percentage errors in frame time and frame time sensitivity prediction are 3.8% and 3.9%, respectively.

The major contributions of this work are:
• A methodology for collecting offline data and developing an adaptive GPU performance model,
• A concrete RLS-based adaptive runtime performance model,
• Extensive evaluations on a commercial platform using common GPU benchmarks.

Figure 2: (a) Total power consumption when the GPU is rendering the Art3 application at 60 FPS. (b) Zoomed portion, which shows three frames in the first 50 ms. (c) Frame time distribution for the kernel and power instrumentations for the Art3 application.

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 details the challenges and lays out the groundwork required for frame time prediction. Section 4 presents the techniques for offline analysis and online learning. Finally, Section 5 discusses the experimental results, and Section 6 concludes the paper.

2. RELATED RESEARCH

Dynamic power management techniques require accurate performance models to decide when and by how much the frequency can be slowed down to save power. Therefore, this work focuses on lightweight performance models that can guide DPM algorithms in conjunction with runtime power models [12, 16]. The authors in [3] proposed a framework to optimize CPU-GPU efficiency by classifying application phases as rendering and loading. The GPU frequency governor boosts the GPU frequency during the rendering phase to improve performance, while reducing the frequency during the loading phase to reduce power consumption. A more direct approach that governs GPU performance based on the CPU and GPU frequencies and utilizations is presented in [17]. However, this model relies on utilization instead of the performance counters, which provide a fine-grain measure of the workload. Kadjo et al. [14] use the performance counters, but learn the models using offline data only. Another work on performance modeling [4] uses an auto-regressive (AR) model for frame time prediction. The authors employ a tenth-order AR model whose weights were learned offline using ten minutes of frame time data for each application. Similarly, the authors in [5] utilize the least mean squares estimation technique to predict graphics workloads with a model whose features are based on prior frame times.
On the one hand, relying solely on offline data does not generalize well to other data sets, as it is not feasible to account for all possible workloads. On the other hand, online learning is challenging due to limited observability and computing resources. We address these concerns by providing an efficient technique for GPU performance prediction.

3. FRAME TIME CHARACTERIZATION

3.1 Challenges and Notation

The first step towards constructing a high-fidelity frame time model is to understand the dependence of the frame time on the GPU frequency and the workload. As mentioned before, the workload characteristics are captured by the performance counters x = [x_1, x_2, ..., x_N], where N is the total number of counters. All of these counters are functions of the frame complexity C, while some of them also depend on the GPU frequency f. Since the frequency changes in discrete time steps in practical systems, we characterize the frame time t_F in any given time step k using a multivariate function t_{F,k}(f_k, x_k(f_k)). Besides showing the dependency of the frame time on the frequency and the counters, this notation also reveals that the counters themselves can vary with the frequency.

There are two major challenges in the characterization of t_{F,k}(f_k, x_k(f_k)). The first challenge is to establish a trusted reference that provides a rich set of samples of this function. This set needs to provide the frame time for an exhaustive list of frequencies and counter values. The second and bigger challenge is to understand the sensitivity of the frame time to the frequency, i.e., to find the partial derivative of the frame time with respect to the frequency. This information is vital for dynamic power management algorithms to determine how the performance would be affected by a change in the GPU frequency. However, finding the frequency sensitivity is very challenging, since it requires decoupling the change in frame time due to the frequency from the change due to the frame complexity.
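To make the decoupling challenge concrete, the following toy sketch shows how differencing two runs in which both the frequency and the frame complexity changed yields a misleading sensitivity estimate, while freezing the frame recovers the true one. The frame-time function and all constants here are purely hypothetical, not measured values from the paper.

```python
# Toy illustration of the decoupling challenge: the frame time depends on both
# the GPU frequency f and the frame complexity C, so naively differencing two
# runs taken at different frequencies conflates the two effects unless the
# frame (i.e., its complexity) is held fixed. All constants are hypothetical.

def frame_time_ms(f_mhz, complexity):
    """Toy model: a frequency-scalable part plus an unscalable (memory) part."""
    scalable = 1000.0 * complexity / f_mhz   # shrinks as frequency grows
    unscalable = 0.05 * complexity           # memory/stall time, frequency-independent
    return scalable + unscalable

# Naive estimate: difference two runs where BOTH f and C changed.
t1 = frame_time_ms(200, complexity=4)
t2 = frame_time_ms(400, complexity=6)        # complexity drifted between runs
naive_sensitivity = (t2 - t1) / (400 - 200)  # ms per MHz

# Controlled estimate: freeze the frame (same C), vary only f.
t1c = frame_time_ms(200, complexity=4)
t2c = frame_time_ms(400, complexity=4)
controlled_sensitivity = (t2c - t1c) / (400 - 200)

print(naive_sensitivity, controlled_sensitivity)
```

The naive estimate underestimates the true sensitivity by roughly a factor of two in this toy setting, because the complexity increase partially masks the speedup from the higher frequency.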
In the rest of this section, we describe our solutions to address these challenges.

3.2 Frame Time and Counter Data Collection

Frame Time Measurement: Establishing the ground-truth frame time is crucial both for developing the models and for validating them later on benchmarks. Therefore, we modified Android's Direct Rendering Manager [6] driver to mark the times when the GPU starts and completes a new frame. This enables us to retrieve the frame time and frame count from the kernel at runtime.

Validation: To validate the correctness of our non-trivial modification, we also measured the platform power consumption using a data acquisition system. Figure 2a shows the total power consumed as a function of time when running a custom target application (Art3) at 60 frames per second (FPS). By maintaining low CPU activity, we know that the peaks in the power consumption occur due to GPU activity. For instance, the zoomed 50 ms window in Figure 2b shows three frames, as expected for 60 FPS and an approximately 6 ms frame time. Hence, we can test the accuracy of the frame time and frame count instrumentation by correlating it with the power measurements. Figure 2c shows the frame time probability density functions obtained by kernel instrumentation and by power measurements. We observe that our kernel instrumentation and the power measurements yield only a 3% difference in mean frame time. We also find that the kernel instrumentation is more practical and accurate than the power measurements, since it does not depend on external equipment and does not suffer from measurement noise.

Data Collection: We used the Intel GPU Tools [9] to log the counter values at runtime [11]. Our modified version of the kernel collects a trace in the following format:

Frame Time | Frame Count | GPU Frequency | Perf. Cntr 1 | Perf. Cntr 2 | ... | Perf. Cntr N
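A trace row in this format can be parsed with a few lines of Python. The sketch below is only illustrative: the sample values and the number of counters are hypothetical, and the actual kernel trace format may differ in detail.

```python
# Sketch of parsing one 50 ms trace interval in the format above.
# The sample line and the counter count N are hypothetical.

def parse_trace_line(line, n_counters):
    """Split a whitespace-separated trace row into named fields."""
    fields = line.split()
    return {
        "frame_time_ms": float(fields[0]),
        "frame_count": int(fields[1]),
        "gpu_freq_mhz": int(fields[2]),
        "counters": [float(v) for v in fields[3:3 + n_counters]],
    }

sample = "6.2 3 400 1.2e7 3.4e5 8.1e6"   # hypothetical row with N = 3 counters
row = parse_trace_line(sample, n_counters=3)
print(row["gpu_freq_mhz"], row["counters"])
```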
Each row corresponds to a 50 ms interval, which matches the rate at which the frequency governors change the GPU frequency. We also verified that this data collection does not induce any noticeable impact on the application performance.

3.3 Decoupling the Impact of Frequency and Workload

One way to isolate the changes due to the GPU frequency is to run the entire application repeatedly at each supported GPU frequency. Theoretically, the collected data could be used to identify the effect of the GPU frequency on the frame time. However, this approach is intractable for a number of reasons. First, there may not be a one-to-one correspondence between the frames in different runs. For example, consider an application that runs at 60 or 30 FPS depending on the GPU frequency. At the lower frame rate, the application will drop the 30 frames that it failed to render, rather than rendering them later. Second, even processing the same frame may take different amounts of time due to variations in the memory access time from one run to another, as shown in Figure 3. We observe that the frame time variance grows significantly even if the frame complexity changes only marginally. These challenges are aggravated in many GPU-intensive applications. Therefore, the most reliable approach to collect reference data is to vary the GPU frequency while freezing the workload, as described next.

Figure 3: The frame time distribution obtained when rendering the same frame repeatedly (variance = 0.017 ms and 0.022 ms for two iterations) and when rendering multiple similar frames (variance = 0.314 ms).

3.4 Data Collection Methodology

As mentioned in the previous section, a consistent apples-to-apples comparison is possible only if the same frame is frozen and rendered repeatedly. To facilitate reference data collection, we built two custom Android applications, Art3 and RenderingTest, as detailed in Section 5.1. These applications enable us to precisely control the frame content and the target frame rate. The proposed data collection methodology is shown in Figure 4. We first fix the CPU frequency to ensure the repeatability of the results. Then, we sweep the GPU frequency across the set of frequencies supported by the target system; on our target platform, we used the 9 supported frequencies shown in Figure 4 (200, 244, 289, 311, 355, 400, 444, 489, and 511 MHz). Each of these combinations was further repeated for 64 frame complexities, where the complexity is determined by the number and variety of features in a given frame. We note that different frame complexities enable us to exercise the performance counters in a controlled manner. Finally, we run each configuration multiple times to suppress random variations. In our experiments, we collected 80 samples for each configuration, which led to 2 × 9 × 64 × 80 = 92160 trace lines covering 1152 different configurations. The proposed methodology is applied to both of our custom applications, Art3 and RenderingTest.

Figure 4: The proposed methodology for collecting a rich set of training and test data: sweep the CPU frequency, the GPU frequency (200-511 MHz), and the frame complexity. Each frame is repeated n_r times for every configuration.

Our data set confirms that the frame time is a function of both the GPU frequency and the workload. For example, Figure 5 shows how the frame time changes with the GPU frequency at a CPU frequency of 1.3 GHz. Different curves on this plot show that increasing frame complexity implies a larger frame time, as expected. Similarly, Figure 6 shows the relation between the Rendering Engine Busy counter and the frame time. As the name implies, Rendering Engine Busy counts the number of cycles for which the rendering engine was active [11]. We observe that a larger cycle count (i.e., higher complexity) results in an almost linear increase in frame time. Different curves on this plot also show that this counter itself is a function of the frequency, since it counts busy clock cycles. In summary, our data set enables characterizing the multivariate function t_{F,k}(f_k, x_k(f_k)). We use this data at design time to construct a template for the frame time model. Then, our online learning algorithm updates the coefficients of this model to predict the frame time for arbitrary applications.

Figure 5: Frame time for the RenderingTest application with increasing GPU frequency at different frame complexities (2, 16, 32, and 64).

Figure 6: Frame time for the RenderingTest application with increasing complexity for four different GPU frequencies (244, 311, 400, and 511 MHz).

4. FRAME TIME PREDICTION

This section presents the proposed frame time prediction methodology. We first derive a mathematical model to express the change in frame time. Then, we describe the offline learning process needed to select the features that will be used during online learning. Finally, we present the proposed adaptive frame time prediction algorithm.

4.1 Differential Frame Time Model

The frame time at any given instant k can be obtained by summing the measured frame time at the previous instant k-1 and the change in the frame time. This change can be approximated as a function of the GPU frequency and performance counters using partial derivatives as follows:

dt_F(f_k, x_k(f_k)) = (∂t_F(f_k, x_k(f_k))/∂f_k) df_k + Σ_{i=1}^{N} (∂t_F(f_k, x_k(f_k))/∂x_i(f_k)) dx_{i,k}(f_k)    (1)

Equation 1 holds if the frequency and counters are continuous variables. Since they are discrete variables in practice, we can approximate the change in frame time as:

Δt_F(f_k, x_k(f_k)) ≈ (∂t_{F,k}/∂f_k) Δf_k + Σ_{i=1}^{N} (∂t_{F,k}/∂x_{i,k}) Δx_{i,k}    (2)

Note that ∂t_F/∂f_k is the partial derivative of the frame time with respect to the frequency. The frame time change due to ∂x_{i,k}(f_k)/∂f_k is included in the difference term Δx_{i,k}. This equation forms the basis of our mathematical model. The differential form is useful, since the current frame time is known and we are interested in the change. Moreover, it utilizes the differences of counters, which alleviates the need for feature normalization. Next, we analyze each term in detail to derive our frame time model.

Change due to the GPU frequency: In general, the part of the processing time confined within the GPU pipeline is inversely proportional to the frequency. However, memory access and stall times do not scale with the frequency. Therefore, the frame time is a nonlinear function of the GPU frequency, as shown in Figure 5. Using this observation, we can approximate the frame time t_F for a given workload (i.e., x) in terms of a frequency-scalable portion t_{F,s} and an unscalable portion t_{F,us} [2]. More specifically,

t_F(f_{k-1}, x) = t_{F,s}(f_{k-1}, x) + t_{F,us}(x)
t_F(f_k, x) = t_{F,s}(f_{k-1}, x) (f_{k-1}/f_k) + t_{F,us}(x)    (3)

Hence, the change in frame time when jumping from f_{k-1} to f_k can be found by subtracting the first line of Equation 3 from the second:

(∂t_{F,k}/∂f_k) Δf_k ≈ t_{F,s}(f_{k-1}, x) (f_{k-1}/f_k - 1) ≡ a_0 t_{F,k-1} (f_{k-1}/f_k - 1)    (4)

where t_{F,k-1} is the frame time from the previous instant k-1. We note that t_{F,k-1}(f_{k-1}/f_k - 1) can easily be calculated at runtime. Since the scalable portion of the frame time is in general not known, we express it through an unknown parameter a_0 that our online learning algorithm will learn at runtime.

Hardware performance counter change: The frame time changes linearly with many hardware performance counters, such as the one shown in Figure 6. If a counter causes a nonlinear change in frame time, it can be treated as piecewise linear. Thus, we express the second term of Equation 2, i.e., the change in frame time with the counters, as:

Δt_F(x_k) ≈ Σ_{i=1}^{N} (∂t_{F,k}/∂x_{i,k}) Δx_{i,k} ≡ Σ_{i=1}^{N} a_i Δx_{i,k}    (5)

where the a_i are coefficients that change at runtime as a function of the workload. Therefore, they are learned online.

By combining Equation 4 and Equation 5, we can rewrite our mathematical model in Equation 2 as:

Δt_{F,k}(f_k, x_k(f_k)) ≈ a_0 t_{F,k-1} (f_{k-1}/f_k - 1) + Σ_{i=1}^{N} a_i Δx_{i,k}(f_k)    (6)

This equation reveals that the variation in frame time is a combined effect of the change in the GPU frequency (the first term) and the changes in the counters, which reflect the workload (the summation term). We use Equation 6 for online frame time prediction. The terms t_{F,k-1}(f_{k-1}/f_k - 1) and Δx_{i,k}(f_k) ∀i form the feature set h_k, while the parameters a ∈ R^{N+1} are learned online.

4.2 Feature Selection

Real-time prediction requires an extremely efficient learning algorithm to facilitate fast evaluation of a GPU frequency change. One approach to reducing the overhead of regression is dimensionality reduction on the input data. The goal of this approach is to reduce the complexity of the data and speed up computation while maintaining good prediction accuracy. In addition to improving algorithm efficiency, this can help remove features that either duplicate information in the output or do not change with our parameters. There are several reduction techniques, including Least Absolute Shrinkage and Selection Operator (Lasso) regression, Sequential Feature Selection (SFS), and Principal Component Analysis (PCA), which reduce the feature size of the model by selecting the most representative set of features. The choice of a specific feature selection technique can depend on the data and the application area.

Lasso regression: We used Lasso regression to minimize the mean squared error (MSE) with a bound on the l1 norm of the parameters a_i [7]. The results from Lasso regression are highly sparse due to the l1 nature of the bound.
Therefore, if less sparsity is required, it is also possible to use elastic nets or ridge regression by varying the distribution of l1 and l2 norm penalties on the learning parameters. For P samples the Lasso regression can be performed by minimizing the MSE between the actual change in frame time ∆tF,k and using the estimate from Equation 6 after adding a l1 norm penalty as: â = argmin a ✓ ◆ P  X fk−1 ∆tF,k − a0tF,k−1 −1 fk k=1 − N X i=1 2 ai ∆xi,k (fk ) +λ N X |aj | (7) j=0 By increasing the value of λ, less features can be selected at the expense of accuracy. An acceptable loss in accuracy is about one standard error more than the minimum MSE. Sequential Feature Selection (SFS): The SFS algorithm is a heuristic that adds features to an empty selection set in a stepwise manner to minimize the MSE in the prediction of a variable like GPU frame time. The result from SFS is close to the result of Lasso regression, but SFS is completely as O(M 2 ) [18]. Nonetheless, feature selection minimizes the size of the feature set to reduce the complexity. In particular, when the number of features shrinks from 38 to 4, the computational complexity reduces by about 90 times. Furthermore, matrix inversions are the main source of complexity in many algorithms, including RLS. Our solution is to use the co-variance form of RLS which does not perform matrix inversion. The value hTk Pk−1 hk in Equation 9 evaluates to a scalar, eliminating the overhead of the inversion operation. Actual Δ#",! !! Feature set $ Adaptive Δ#̂",! % !! &'!%& Filter Estimate Δ&'! Update algorithm Error (̃! % Δ#",! * Δ#̂",! Figure 7: Adaptive filtering approach showing the update in parameters ai based on error between the actual change in frame time and prediction. oblivious to the multi-collinearity in the feature set. Hence, it is not an ideal methodology when the features are correlated. 
However, it can be faster to use in the case where the feature set is very large and known to have mostly uncorrelated features [20]. Principle Component Analysis: The PCA algorithm can help remove the low variance dimensions by centering, rotating and scaling the data along the eigenvectors. The eigenvectors corresponding to larger eigenvalues are retained and the rest are pruned. The retained eigenvectors are then used for transforming the original data [13]. One drawback of PCA is that it gives features in the transformed domain. We implemented Lasso, SFS and PCA. In what follows, we present the results obtained with Lasso, because our goal is to achieve high sparsity on a correlated feature set while preserving the original meaning of the features. Thus, during the learning phase we will regress on M feature subset, where M << N + 1, instead of N + 1 features. Note that a trivial method to choose the number of features can lead to an increase in overhead or poor predictions. 4.3 4.4 Frame Time Sensitivity Previous section explained how we predict the change in frame time ∆tF,k (fk−1 → fk ) by continuously learning bk−1 and our feature set. DPM algorithms the parameters a often need to evaluate the impact of a frequency change on performance before making any decision. This information together with power sensitivity to frequency can help DPM algorithms to make better decisions. This section explains how our frame time prediction technique is used for this purpose. As an example, consider a scenario where the GPU frequency at time k is fk = 400 MHz. Suppose that a DPM algorithm needs to predict the change in frame time when the frequency goes from fk = 400 MHz to a candidate frequency fnew = 444 MHz. Before finalizing this decision, the DPM algorithm needs to evaluate the corresponding change in frame time, i.e., ∆tF,k (fk → fnew ) using Equation 6. 
In this equation, the frequency change affects the first term 400 − 1 and only the counters that are a function of the 444 frequency. To make the latter more explicit, we can write the change in counters due to the GPU frequency f and the frame complexity C as: ∂xi,k ∂xi,k ∆fk + ∆C, f or 1 ≤ i ≤ N (11) ∂f ∂C Since the frame sensitivity is calculated for a given frame, the change in complexity ∆C = 0, and Equation 6 can be written as: ✓ ◆ ◆ ✓ ∆xi,k ≈ Online Learning The parameters in Equation 6 can be learned offline and then used at runtime. However, it is hard to generalize offline learning to all possible applications that would be executed by the system. Moreover, the workload can change as a function of user activity. Therefore, the learning mechanism should not completely rely on offline learning. We employ an adaptive algorithm to learn the parameters of the frame time model. In particular, we use the Recursive Least Squares estimation technique [15]. RLS algorithm updates the parameters ai in Equation 6 in each prediction interval, as described in Figure 7, using the following set of equations: âk = âk−1 + Gk [∆tF,k (fk , xk (fk )) − hTk âk−1 ] (8) Gk = Pk−1 hk [hTk Pk−1 hk + 1]−1 (9) Pk = [I − Gk hTk ]Pk−1 (10) The update rule given in Equation 8 computes the prediction error by subtracting the frame time prediction from the actual change in frame time. Note that online learning would not be possible without our kernel instrumentation, which provides reliable reference measurement at runtime (∆tF,k (fk , xk (fk ))). Equation 9 and Equation 10 update the gain Gk and covariance Pk matrices using the feature vector. We refer the reader to [15] for details of the RLS algorithm. 
Computational complexity: RLS is well known for giving good predictions in the signal processing field, however, its computational complexity grows with number of features ∆tF (fk→fnew ) ≈ a0 tF,k−1 fk fnew N X ∂xi,k ai −1 + (fnew −fk ) ∂f i=1 (12) In Equation 12, fk , fnew , and ai are known at time step ∂x k. The only unknown value is ∂fi,k , which is zero for frequency independent counters. To model the derivative of the frequency dependent counters with respect to the GPU frequency, we can use a nonlinear function of frequency and frequency independent counters. Then, this model can also employ an online learning, such as RLS, or it can be learned offline. Subsequently, Equation 12 can be used to predict the change in frame time for the new candidate frequency as: dtF df k ≈ ∆tF (fk → fnew ) fnew − fk (13) 5. EXPERIMENTAL RESULTS This section first describes the experimental setup and the results of offline feature selection. Then, we demonstrate the accuracy of the proposed online frame time prediction technique, and its potential impact on DPM algorithms. 5.1 Experimental Setup We performed our experiments on the Minnowboard MAX platform [10] running Android 5.1 operating system with the Feature Selection using Lasso Regression We applied Lasso regression with 100–fold cross-validation on our large dataset collected from the RenderingTest application. Figure 8a shows the change in mean squared error between the predicted and measured frame time of the GPU. As λ in Equation 7 increases, the penalty on the cost function increases leading to higher MSE. The minimum value, λmin = 5.1 × 10−4 uses all the features, as shown in Figure 8b. To shrink the model, a good choice is λsel = 1.6 × 10−1 for which the performance in terms of expected generalization error is about one standard error of the minimum. In our experiments, the number of features for λsel turns out to be four. 
The four selected features are change in the frequency term from Equation 6 and change in the Aggregate Core Array Active, Slow Z Test Pixels Failing, and Rendering Engine Busy counters. The Aggregate Core Array Active counter gives the sum of all cycles on all the GPU cores spent actively executing instructions. The Slow Z Test Pixels Failing counter gives the pixels that fail the slow check in the GPU. Neither of these counters depends on the frequency; they are functions of only the frame complexity. However, the Ren- Number of features 10 1 λsel λmin 10 0 10 -2 (a) Lambda 10 0 5 0 (b) 10 -2 10 0 Lambda Figure 8: Cross-validated LASSO regression result for; (a) the change in mean squared error of the frame time prediction with increasing λ values, and (b) the change in the number of selected features with increasing λ values. dering Engine Busy counter changes with frame complexity, as well as frequency. 5.3 Online Frame Time Prediction We validated our frame time prediction approach first on the RenderingTest application to test the corner cases. Figure 9 shows the comparison between the actual and the predicted frame time. During the first 5 seconds, both the GPU frequency and frames change randomly. We observe that the proposed online model successfully keeps up with the rapid changes. In order to test our approach under corner cases, we enforced a saw-tooth pattern during the remaining duration of the application. More precisely, the GPU frequency starts at 200 MHz, and the complexity increases from 1 to 64 in increments of one (the first tooth). Then, the same iterations are repeated for 9 supported GPU frequencies. Figure 9 demonstrates that we achieve very good accuracy when the frequency stays constant for a period of time. There is a spike when the complexity jumps suddenly from 64 to 1. However, the RLS reacts quickly and maintains a high accuracy. Overall, the mean absolute percentage error between the real and predicted frame time values is 2.1%. 
We obtained similar levels of accuracy for Art3 and standard benchmarks. In particular, Figure 10 shows the actual and predicted frame times for 3DMark’s Ice Storm benchmark at two different GPU frequencies. We achieved a high prediction accuracy with the mean absolute error of 2.8% and 7.9% for the GPU frequencies 200 MHz and 489 MHz, respectively. Similarly, the actual and predicted frame time for the BrainItOut gaming application with fixed GPU frequency is shown in Figure 11. This interactive game requires frequent user inputs, the frame time exhibits more sudden changes compared to other applications. Our frame time prediction matches closely to the actual frame time with the median and mean absolute percentage errors of 2.2% and 9.1%, respectively. Note that, the higher mean absolute error value for the BrainItOut application is due to a few outliers in the frame time. This is confirmed from the very low median absolute percentage error value of the benchmark. The frame time prediction for all of the benchmarks runFrame time (ms) 5.2 10 MSE kernel modifications mentioned in Section 3.2. This platform has two CPU cores and one GPU, whose frequency can take the values listed in Figure 4. The GPU frequency is readily available from the kernel file system. In addition to this, we used the Intel GPU Tools as an external module to the Android system to trace the GPU performance counters. Standard Benchmarks: We validated the proposed frame time prediction technique using the following commonly used GPU benchmarks: Nenamark2, BrainItOut, and 3DMark (both the Ice Storm and Slingshot scenarios). Custom Benchmarks: The accuracy of the frame time prediction can be tested without any limitations, since our frame time prediction technique works for any Android app that can run on the target platform. 
However, validating the sensitivity prediction (i.e., the derivative of the frame time with respect to the frequency) requires reference measurements taken at different frequencies. This golden reference cannot simply be collected by running the whole application at different frequencies, for the reasons detailed in Section 3.2. Therefore, we also developed the RenderingTest and Art3 applications, which enable us to control the number of times each frame is repeated. The RenderingTest application accepts two inputs that specify the number of cubes rendered in the frame and the number of times the same frame is processed. By changing the number of cubes, we control the frame complexity. In our experiments, we swept the number of cubes from 1 to 64 and repeated each frame 80 times. The cubes were rendered at a maximum of 60 FPS with vertex shaders and depth buffering enabled. Since we used the RenderingTest application for offline characterization, we developed one more custom application, called Art3, which renders pyramids with a different rendering pipeline. The RenderingTest application renders each cube with its own memory buffer, while Art3 concatenates all pyramids into the same memory buffer before rendering. The pyramids are not constrained by an FPS limit, but they are also rendered with vertex shaders and depth buffering. These two applications allow us to compute and store the reference sensitivities, so that they can be used as the golden reference to validate our online frequency sensitivity predictions.

Figure 9: Frame time prediction for the RenderingTest app.

Figure 10: Frame time prediction for the 3DMark Ice Storm application running at (a) 200 MHz, (b) 489 MHz.
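The reference-sensitivity computation that these custom apps enable can be sketched with finite differences. In the sketch below, the frequency list is only the subset of levels quoted in the text (the paper supports 9, listed in Figure 4), and `measure_frame_time_ms` is a synthetic stand-in for an instrumented run, not the paper's actual measurement path.

```python
import numpy as np

# Golden-reference sensitivity via finite differences. Because the
# custom apps (RenderingTest, Art3) can replay the *same* frame many
# times, the frame time can be measured at adjacent frequency levels
# for an identical workload.
FREQS_MHZ = np.array([200.0, 311.0, 355.0, 400.0, 444.0, 489.0])  # subset quoted in the text

def measure_frame_time_ms(freq_mhz, complexity, repeats=80):
    """Placeholder for an instrumented run: mean frame time over
    `repeats` replays of one frame. Here we synthesize a plausible
    1/f trend instead of driving real hardware."""
    return complexity * 40.0 / freq_mhz  # synthetic model, illustration only

def reference_sensitivity(complexity):
    """Forward-difference dt/df between adjacent frequency levels."""
    t = np.array([measure_frame_time_ms(f, complexity) for f in FREQS_MHZ])
    return (t[1:] - t[:-1]) / (FREQS_MHZ[1:] - FREQS_MHZ[:-1])  # ms per MHz

sens = reference_sensitivity(complexity=32)
print(np.round(sens, 4))  # negative, shrinking toward zero at high f
```

The resulting negative, diminishing slopes match the qualitative behavior the paper reports for the frame time sensitivity (Figure 6).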
Figure 11: Frame time prediction for the BrainItOut application running at 200 MHz.

Figure 12: Median and mean absolute percentage errors in the frame time for the Android applications.

The frame time prediction for all of the benchmarks running over all GPU frequencies is summarized in Figure 12. The average median and mean absolute errors across all the benchmarks are 1.3% and 3.8%, respectively. We also compared our approach with an offline method, in which all the model parameters are learned at design time and remain constant at runtime. Figure 13 shows the median absolute percentage errors for online (dashed line) and offline (solid line) learning at different training ratios. When we run all the benchmarks one after the other with our online learning mechanism, we get an error of 1.5%. However, running the same benchmarks with offline-learned parameters leads to higher errors. As shown in the figure, the difference between the offline and online errors decreases as the training ratio approaches one, i.e., when the training set equals the test set. This shows that offline learning leads to higher error unless the model can be trained on all the applications. Notably, the prediction error of our approach stays flat, since the same set of features is selected even with a smaller training set.

5.4 Potential Impact for Dynamic Power Management

In this section, we demonstrate the accuracy of our frame time sensitivity prediction presented in Section 4.4. In our feature set for frame time prediction, only the Rendering Engine Busy counter is a function of frequency.
After performing extensive analysis, we modeled the frequency dependence of this counter, $\partial x_{dep}/\partial f$, empirically as a function of the frequency $f$ and two frequency-independent counters, HIZ Fast Z Test Pixels Passing and 3D Render Target Writes:

$$\frac{\partial x_{dep}}{\partial f} \approx \sum_{i=1}^{2} \alpha_i\, x_{indep,i} + \beta_0 f + \beta_1 \sqrt{f} \qquad (14)$$

Then, we applied offline learning to characterize the α and β coefficients in this equation. We could also use online learning, but we opted for offline learning for three reasons. First, we observed that the change in the counters as a function of the frequency is much less dynamic than the frame time. Second, it is harder to obtain a clean reference for $\partial x_{dep}/\partial f$ at runtime, unlike the frame time, which is obtained through instrumentation. Finally, this choice implies less computational overhead at runtime. Figure 14 shows the real and predicted values of the derivative of this counter with respect to frequency for the RenderingTest application. The root mean squared error of our prediction is 0.03, while the data range is [−0.6, 0.4]. Thus, Equation 14 provides a good approximation of this derivative.

Figure 13: Comparison of the median absolute percentage error in frame time for all Android applications combined.

Figure 14: Offline prediction of the derivative of the Rendering Engine Busy counter with respect to the GPU frequency.

To assess the accuracy of our sensitivity prediction, we predict the change in frame time that results from increasing (or decreasing) the frequency. Then, we compute the frame time sensitivity using Equation 13. We started by changing the frequency by one level according to the supported GPU frequencies listed in Figure 4, e.g., changing $f_{GPU}$ from $f_k$ = 400 MHz to $f_{new}$ = 444 MHz or $f_{new}$ = 355 MHz.
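The offline characterization of Equation 14 amounts to an ordinary least-squares fit. The sketch below fits the α and β coefficients on synthetic data: the counter values, the frequency normalization, and the "true" coefficients are all invented for illustration.

```python
import numpy as np

# Offline least-squares characterization of Equation 14:
#   d(x_dep)/df  ≈  a1*x1 + a2*x2 + b0*f + b1*sqrt(f)
# x1, x2 stand in for the two frequency-independent counters
# (HIZ Fast Z Test Pixels Passing, 3D Render Target Writes);
# all training data below is synthetic.
rng = np.random.default_rng(1)
n = 500
x1, x2 = rng.random(n), rng.random(n)
f = rng.choice([200, 311, 355, 400, 444, 489], size=n) / 489.0   # normalized frequency
true_deriv = 0.3 * x1 - 0.5 * x2 + 0.2 * f - 0.4 * np.sqrt(f)    # invented ground truth
y = true_deriv + 0.01 * rng.normal(size=n)                       # noisy "measurements"

A = np.column_stack([x1, x2, f, np.sqrt(f)])   # design matrix
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # [a1, a2, b0, b1]
rmse = float(np.sqrt(np.mean((A @ coef - true_deriv) ** 2)))
print(np.round(coef, 2), round(rmse, 4))       # fit error well below the noise level
```

Since the fit is done once at design time, its cost does not add to the runtime overhead; only the fixed coefficients are used online.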
Figure 15 shows the predicted and actual frame times when the new frequency $f_{new}$ is one level higher. The mean absolute percentage error for this prediction is 1.5%. We observed similar results when $f_{new}$ is one level lower. One might argue that the high prediction accuracy is due only to single frequency jumps, such as 400 MHz to 444 MHz. Therefore, we repeated our experiments with multiple-level frequency jumps. For example, if the current frequency is 200 MHz, then a frequency jump of three levels implies that $f_{new}$ is 311 MHz. Figure 16 shows that the accuracy indeed degrades, but even when the jump is six frequency levels, the error is less than 7.5%.

Figure 15: Predicted and actual frame times for the RenderingTest application when $f_{new}$ is one level higher.

Figure 16: Frame time prediction error in the RenderingTest application for multiple frequency jumps.

We present the accuracy in predicting the derivative of frame time with respect to GPU frequency for the RenderingTest application in Figure 17. The root mean squared errors of these predictions are 4.0 × 10−3 and 4.4 × 10−3 for frequency jumps of one level higher and lower, respectively. As seen from this plot, the slope starts with a negative value and then diminishes to zero as the frequency increases. This is consistent with the observation in Figure 6.

Figure 17: Sensitivity of frame time to frequency for the RenderingTest app with $f_{new}$ one level (a) higher, (b) lower.

In addition to the RenderingTest application, we also ran Art3 to measure the frame time sensitivity. Figure 18 shows that the predicted derivative of frame time with respect to GPU frequency follows the reference values closely. In particular, the root mean squared errors for the frame time sensitivity to frequency were 2.3 × 10−3 and 2.7 × 10−3 for frequency jumps of one level higher and lower, respectively.
6. CONCLUSION AND FUTURE WORK

In this paper, we proposed a methodology that combines offline data collection and online learning. Using this methodology, we constructed an RLS-based adaptive runtime performance model. Extensive evaluations on a commercial platform using common GPU benchmarks resulted in average mean absolute errors of 3.8% in frame time and 3.9% in frame time sensitivity prediction. This high-accuracy model can help predict the sensitivity of the frame processing time to frequency, which is important for DPM algorithms. As future work, we plan to integrate the proposed runtime model into a sophisticated DPM algorithm.

Acknowledgments: This work was supported in part by Strategic CAD Labs, Intel Corporation, and the National Science Foundation under Grant No. CNS-1526562.

7. REFERENCES

[1] App Tornado. App Brain. http://www.appbrain.com/, accessed July 20, 2016.
[2] R. Z. Ayoub et al. OS-level Power Minimization under Tight Performance Constraints in General Purpose Systems. In Proc. of the Intl. Symp. on Low-power Electronics and Design, pages 321–326, 2011.
[3] W.-M. Chen, S.-W. Cheng, P.-C. Hsiu, and T.-W. Kuo. A User-Centric CPU-GPU Governing Framework for 3D Games on Mobile Devices. In Proc. of ICCAD, pages 224–231, 2015.
[4] B. Dietrich and S. Chakraborty. Lightweight Graphics Instrumentation for Game State-Specific Power Management in Android. Multimedia Systems, 20(5):563–578, 2014.

Figure 18: Sensitivity of frame time to frequency for the Art3 application with $f_{new}$ one level (a) higher, (b) lower.

[5] B. Dietrich et al. LMS-based Low-complexity Game Workload Prediction for DVFS. In Proc. of the Intl. Conf. on Computer Design, pages 417–424, 2010.
[6] R. Faith. The Direct Rendering Manager: Kernel Support for the Direct Rendering Infrastructure, 1999. http://dri.sourceforge.net/doc/drm_low_level.html, accessed July 20, 2016.
[7] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1.
Springer Series in Statistics, Berlin, 2001.
[8] Futuremark. http://www.futuremark.com/benchmarks, accessed July 20, 2016.
[9] Intel Corp. Intel GPU Tools. http://01.org/linuxgraphics/gfxdocs/igt/, accessed July 20, 2016.
[10] Intel Corp. Minnowboard. http://www.minnowboard.org/, accessed July 20, 2016.
[11] Intel Corp. Open Source HD Graphics Programmers' Reference Manual, June 2015.
[12] T. Jin, S. He, and Y. Liu. Towards Accurate GPU Power Modeling for Smartphones. In Proc. of the 2nd Workshop on Mobile Gaming, pages 7–11, 2015.
[13] I. Jolliffe. Principal Component Analysis. Wiley Online Library, 2002.
[14] D. Kadjo, R. Ayoub, M. Kishinevsky, and P. V. Gratz. A Control-Theoretic Approach for Energy Efficient CPU-GPU Subsystem in Mobile Platforms. In Proc. of DAC, pages 62:1–62:6, 2015.
[15] J. M. Mendel. Lessons in Estimation Theory for Signal Processing, Communications, and Control. Pearson Education, 1995.
[16] H. Nagasaka et al. Statistical Power Modeling of GPU Kernels Using Performance Counters. In Proc. of the Intl. Green Computing Conf., pages 115–122, 2010.
[17] A. Pathania, A. E. Irimiea, A. Prakash, and T. Mitra. Power-Performance Modelling of Mobile Gaming Workloads on Heterogeneous MPSoCs. In Proc. of DAC, pages 201:1–201:6, 2015.
[18] A. H. Sayed. Fundamentals of Adaptive Filtering. John Wiley & Sons, 2003.
[19] G. V. Varatkar and R. Marculescu. On-chip Traffic Modeling and Synthesis for MPEG-2 Video Applications. IEEE Trans. on Very Large Scale Integration Systems, 12(1):108–119, 2004.
[20] D. Ververidis and C. Kotropoulos. Sequential Forward Feature Selection with Low Computational Cost. In Proc. of the European Signal Processing Conf., pages 1–4, 2005.