Online learning for adaptive optimization of heterogeneous SoCs

2018, Proceedings of the International Conference on Computer-Aided Design

Energy efficiency and performance of heterogeneous multiprocessor systems-on-chip (SoC) depend critically on utilizing a diverse set of processing elements and managing their power states dynamically. Dynamic resource management techniques typically rely on power consumption and performance models to assess the impact of dynamic decisions. Despite the importance of these decisions, many existing approaches rely on fixed power and performance models learned offline. This paper presents an online learning framework to construct adaptive analytical models. We illustrate this framework for modeling GPU frame processing time, GPU power consumption and SoC power-temperature dynamics. Experiments on Intel Atom E3826, Qualcomm Snapdragon 810, and Samsung Exynos 5422 SoCs demonstrate that the proposed approach achieves less than 6% error under dynamically varying workloads.

(Invited Paper)

Ganapati Bhat 1, Sumit K. Mandal 1, Ujjwal Gupta 2, Umit Y. Ogras 1
{gmbhat, skmandal, umit}@asu.edu, [email protected]
1 School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA
2 Intel Corporation, Hillsboro, OR, USA

ACM Reference Format: Ganapati Bhat, Sumit K. Mandal, Ujjwal Gupta, Umit Y. Ogras. 2018. Online Learning for Adaptive Optimization of Heterogeneous SoCs. In ICCAD ’18, November 2018, San Diego, CA, USA. https://doi.org/10.1145/3240765.3243489

1 INTRODUCTION
Heterogeneous architectures are recognized as the primary instrument to bridge the energy efficiency of application-specific hardware with the programmability of general-purpose processors [21, 32]. Indeed, integrating custom accelerators and general-purpose cores delivers programmable SoCs with superior performance and significantly lower power footprint compared to homogeneous architectures.
This capability has already been successfully illustrated by mobile platforms, which are used by more than half of the world’s population to run a large variety of apps, such as phone calls, video conferencing, navigation, and games [38]. Hence, heterogeneous architectures can enable new application domains ranging from biomedical and environmental sensing to mobile applications and all the way up to big data analytics [34].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. ICCAD ’18, November 5–8, 2018, San Diego, CA, USA. © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-5950-4/18/11...$15.00. https://doi.org/10.1145/3240765.3243489

Harvesting the full potential of heterogeneous SoCs is challenged by the tension between energy efficiency and development cost. While form factors and specific system requirements vary, maximizing performance under tight energy and power-density constraints is a common objective. However, application development, let alone aggressive optimization, is notoriously difficult and time consuming when utilizing highly specialized accelerators. The optimization problem is exacerbated by dynamic variations of application workloads and operating conditions [4, 6].

In practice, active applications are scheduled to the processing elements (PEs) by the resource management techniques implemented in OS kernels, as shown in Figure 1.
At the same time, the sleep and active power states of the PEs are managed by the power management drivers [20]. These decisions are made in discrete time intervals, which are typically 50–100 ms long. Existing governors implemented in commercial platforms typically rely on PE utilizations to make these decisions [24, 27]. However, utilization alone does not reveal the sensitivity of the system power and performance to the control knobs, such as the operating frequency and sleep states. For example, we cannot quantify the change in the frame processing time of a GPU as a function of the change in frequency by using only the GPU utilization. In contrast, the sensitivity of system power and performance to the control knobs, e.g., the partial derivative of processing time with respect to frequency, can help in determining the optimal actions, as illustrated in Figure 1. For instance, a “Jacobian matrix” for heterogeneous systems can be constructed by computing the sensitivity of system power and performance to the frequency of individual resources. Then, these sensitivity models can be used by adaptive resource and power management techniques to achieve maximum energy efficiency.

Adaptive optimization techniques need to learn the sensitivity models online, since the effectiveness of offline models is severely limited. First, a significant amount of manual effort is required for developing and maintaining offline models. Second, one cannot account for emerging application workloads that are not available at the time of development. Even exploring different combinations of known applications at design time is costly. Finally, offline models cannot capture runtime variations in workload and operating conditions, which are unknown at design time. Therefore, this paper presents online techniques for learning models that can be used by the dynamic resource and power management algorithms to utilize the PEs in heterogeneous SoCs most effectively.
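To make the sensitivity ("Jacobian") matrix discussed above more concrete, the following Python sketch approximates the partial derivatives of a set of metrics with respect to the PE frequencies by central finite differences. It is only an illustration under stated assumptions: `toy_metrics` is a made-up stand-in for a learned model, its coefficients are hypothetical, and the names `sensitivity_matrix` and `rel_step` do not come from the paper.

```python
def sensitivity_matrix(model, freqs, rel_step=1e-3):
    """Approximate J[m][n] = d(metric_m) / d(f_n) by central differences."""
    base = model(freqs)
    jac = [[0.0] * len(freqs) for _ in base]
    for n, f in enumerate(freqs):
        h = rel_step * f
        up = list(freqs)
        up[n] = f + h
        dn = list(freqs)
        dn[n] = f - h
        m_up, m_dn = model(up), model(dn)
        for m in range(len(base)):
            jac[m][n] = (m_up[m] - m_dn[m]) / (2.0 * h)
    return jac


def toy_metrics(freqs):
    """Hypothetical model: [frame time (ms), power (W)] for [f_cpu, f_gpu] in GHz."""
    f_cpu, f_gpu = freqs
    frame_time = 4.0 / f_cpu + 6.0 / f_gpu          # made-up coefficients
    power = 0.5 * f_cpu ** 3 + 0.8 * f_gpu ** 3     # made-up coefficients
    return [frame_time, power]


# Row 0: sensitivity of frame time; row 1: sensitivity of power.
J = sensitivity_matrix(toy_metrics, [1.0, 0.5])
```

In this toy setting, the first row of `J` shows that the frame time reacts more strongly to the GPU frequency than to the CPU frequency, which is the kind of information a governor based on utilization alone does not expose.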
The proposed framework is illustrated for modeling GPU frame processing time, GPU power consumption and SoC power-temperature dynamics. The effectiveness of these models is demonstrated by running dynamically varying workloads on Intel Atom E3826, Qualcomm Snapdragon 810, and Samsung Exynos 5422 SoCs.

Figure 1: The use of the adaptive models (right-most block). Dynamic resource management techniques schedule applications to PEs, and manage the power states periodically in discrete time intervals. Adaptive models guide these decisions by expressing the sensitivity of power, performance and temperature metrics to the control knobs. For instance, the first row of the sensitivity matrix gives the partial derivative of frame processing time ($t_f$) with respect to the operating frequencies of the PEs ($f_1, \ldots, f_N$).

This paper is a part of the ICCAD 2018 Special Session on “Managing Heterogeneous Many-cores for High-Performance and Energy-Efficiency”. The other two papers of this special session are: “Dynamic Resource Management for Heterogeneous Many-Cores” [14] and “Hybrid On-Chip Communication Architectures for Heterogeneous Manycore Systems” [18].

The rest of this paper is organized as follows. Section 2 overviews the proposed online learning framework. Sections 3 and 4 illustrate this framework for GPU frame processing time and power consumption modeling, respectively. Section 5 presents an online analysis technique for power-temperature dynamics using a multi-input multi-output model. Finally, Section 6 concludes the paper.

2 ADAPTIVE MODELING FRAMEWORK
The quintessential question in dynamic resource and power management is to quantify how the control knobs affect the metric of interest. For instance, when the operating frequency of a PE is increased, it is well known that the execution time will decrease, while the power consumption becomes larger.
However, it is hard to take an action unless we can accurately predict by how much the execution time, performance, and other dependent metrics, such as energy and temperature, will change. Note that these quantities can be measured at runtime through sensors (e.g., current and temperature sensors) or OS/firmware instrumentation (e.g., power meters and performance monitors). However, measurements at a given configuration are not sufficient to take dynamic resource and power management actions. We still need to predict the impact of a potential action, such as increasing the CPU operating frequency, before committing to it.

The core of the proposed adaptive modeling framework is the set of analytical models for system performance, power consumption and temperature, as shown in Figure 2. The analytical models express these metrics in terms of the hardware configuration and the states of the PEs. The hardware configuration includes the set of active processing elements and their operating frequencies. The definition of the state depends on the PE. For example, a CPU core can be characterized by the number of instructions retired, memory accesses, cache hits, and active cycles in a given amount of time. Similarly, the state of the GPU can include the number of frames processed, pixels shaded, and utilization. These quantities can be obtained using performance monitors, such as Simpleperf [9], and by instrumenting the target OS/firmware. Then, they are used to generate features, which are fed to the analytical models. Feature generation can be as simple as normalizing the measured values within a given range, arithmetic operations (e.g., normalizing with the number of instructions), or more complex transformations. At the end, the analytical models produce system performance, power consumption and temperature estimates, as illustrated in Figure 2. The analytical models can be used for “what-if” analysis by dynamic optimization techniques at runtime.
For example, the impact of increasing (or decreasing) the operating frequency of one of the PEs can be predicted without actually changing the frequency. However, it is hard to maintain the accuracy of open-loop models designed offline, as the workload changes at runtime. Therefore, we propose to employ a closed-loop approach that continuously improves the models by taking advantage of the abundant data collected at runtime. More precisely, the quantities predicted at time $k$, such as the processing time $t_F(k+1)$, are compared to the actual values in the next interval. Then, the error between the predicted and measured values is used to improve the model using adaptive algorithms, such as recursive least squares estimation [35], as illustrated in Figure 2. In this way, we maintain the accuracy by adapting the analytical models to dynamically varying workloads. This approach is broadly applicable to power consumption and performance modeling. For illustration purposes, we describe its application to GPU frame processing time and power consumption in the following two sections.

Figure 2: Overview of the proposed online learning framework. Raw performance monitors, such as PE utilizations, frequencies and memory access statistics, are collected through OS and firmware instrumentation. Features generated from this data are used to predict performance metrics, such as processing time, power consumption and temperature. Then, measured values are used to compute prediction errors, which are used to correct the model coefficients.

3 ADAPTIVE GPU PERFORMANCE MODEL
Increased demand for graphics applications in mobile systems has led to tight integration of GPUs in SoCs [26]. Since the maximum achievable frame rate and power consumption depend critically on the GPU frequency, dynamic power management algorithms have to assess the performance sensitivity to the GPU frequency accurately. To address this need, a number of performance models have been proposed [7, 8, 19, 31].
However, these models do not generalize well to a larger set of workloads due to offline training and coarse-grain inputs, such as utilization. In contrast, the online learning framework presented in Section 2 can be used to construct a lightweight adaptive runtime GPU performance model with fine-grain inputs, as described next.

3.1 Model Templates for Frame Processing Time
The maximum achievable frame rate in interval $k$ is given by the reciprocal of the frame processing time, which is a multivariate function of the GPU frequency and workload. Suppose that the frequency is denoted by $f_k$, and the workload is characterized by GPU performance counters $x_{i,k}(f_k)$, $1 \le i \le N$, where $N$ is the number of performance counters. We can measure the frame processing time in the previous interval $(k-1)$ at runtime by instrumenting the GPU driver. Hence, the frame processing time in the next interval can be written as:

$$t_{F,k}(f_k, \mathbf{x}_k(f_k)) = t_{F,k-1}(f_{k-1}, \mathbf{x}_k(f_{k-1})) + \Delta t_{F,k}(f_k, \mathbf{x}_k(f_k)) \tag{1}$$

where $\Delta t_{F,k}(f_k, \mathbf{x}_k(f_k))$ is the change in frame processing time due to frequency and workload. Thus, we can find the frame processing time in interval $k$ by measuring the previous value at the end of each interval and modeling $\Delta t_{F,k}(f_k, \mathbf{x}_k(f_k))$. The change in frame processing time can be expressed as:

$$\Delta t_{F,k}(f_k, \mathbf{x}_k(f_k)) \approx a_0\, t_{F,k-1} \left( \frac{f_{k-1}}{f_k} - 1 \right) + \sum_{i=1}^{N} a_i\, \Delta x_{i,k}(f_k) \tag{2}$$

where the coefficient $a_0$ denotes the sensitivity to the GPU frequency, and $a_1, \ldots, a_N$ are the sensitivities to performance counters $x_1, \ldots, x_N$, respectively. Equation 2 enables us to determine the frequency sensitivity that can be used by dynamic power management algorithms. For example, when $a_0 \approx 0$, changing the frequency does not impact the performance, while $a_0 \approx 1$ shows high frequency sensitivity. In this example, the terms $t_{F,k-1}\left(\frac{f_{k-1}}{f_k} - 1\right)$ and $\Delta x_{i,k}(f_k)\ \forall i$ form the feature set $\mathbf{h}_k$. These features can be either selected online [10], or determined offline by using techniques such as Lasso regression [11].

3.2 Online Learning for Frame Processing Time
Parameters $\mathbf{a} \in \mathbb{R}^{N+1}$ can be learned online using a variety of algorithms, such as recursive least squares (RLS) or least mean squares (LMS) [35]. For example, update equations for the information form of RLS with exponential forgetting can be written as:

$$\text{Correlation Matrix Update:} \quad \mathbf{R}_k = \lambda \mathbf{R}_{k-1} + \mathbf{h}_k \mathbf{h}_k^T \tag{3}$$

$$\text{Prediction Error:} \quad e_k = \Delta t_{F,k} - \mathbf{a}_{k-1}^T \mathbf{h}_k \tag{4}$$

$$\text{Correction Equation:} \quad \mathbf{a}_k = \mathbf{a}_{k-1} + \mathbf{R}_k^{-1} \mathbf{h}_k e_k \tag{5}$$

where $\lambda \in (0, 1]$ is the exponential forgetting factor, $\mathbf{R}_k$ is the correlation matrix updated in each iteration, and $e_k$ is the prediction error. The features $\Delta x_{i,k}(f_k)\ \forall i$ used by the model can be selected online [10], or determined offline by solving an $\ell_2$ regularized cost function [11].

To illustrate this approach, we run Nenamark2 and Mobile-bench [28, 30] applications sequentially on the Intel Minnowboard platform [17]. The predicted frame processing time follows the measured value closely, as shown in Figure 3. We observe a sudden change in frame time when the Nenamark2 benchmark completes and Mobile-bench starts at time t = 15 s. Our adaptive model is able to track the frame processing time accurately even during this transition. Similarly, we increase the frequency from 355 MHz to 511 MHz after Mobile-bench finishes at time t = 29 s. The model quickly adapts its coefficients to track fast changes in frame time, as shown in Figure 3. Overall, the proposed adaptive model yields a mean absolute percentage error (MAPE) of only 4.1%.

[Plot: actual and predicted frame time (ms) versus time (s), with regions labeled A1: 355 MHz, A1: 511 MHz, A2: 355 MHz, and A2: 511 MHz.]
Figure 3: Adaptive frame time prediction result for Nenamark2 (A1) and Mobile-bench (A2) benchmarks running on Minnowboard with two distinct frequencies of 355 MHz and 511 MHz.
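The RLS update in Equations 3 to 5 and the what-if use of Equations 1 and 2 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the class and function names, the initialization $\mathbf{R}_0 = \delta \mathbf{I}$, the forgetting factor value, and the synthetic noise-free data (generated from made-up coefficients 0.8 and 0.05) are all assumptions. The linear system $\mathbf{R}_k \mathbf{g} = \mathbf{h}_k$ is solved directly with a small Gaussian elimination routine instead of forming the inverse.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


class RLS:
    """Information-form RLS with exponential forgetting (Equations 3 to 5)."""

    def __init__(self, n_features, forgetting=0.98, delta=1e-3):
        self.lam = forgetting
        self.a = [0.0] * n_features
        # R_0 = delta * I is an assumed initialization that keeps R invertible.
        self.R = [[delta if i == j else 0.0 for j in range(n_features)]
                  for i in range(n_features)]

    def predict(self, h):
        return sum(ai * hi for ai, hi in zip(self.a, h))

    def update(self, h, target):
        n = len(h)
        for i in range(n):                        # Equation 3
            for j in range(n):
                self.R[i][j] = self.lam * self.R[i][j] + h[i] * h[j]
        e = target - self.predict(h)              # Equation 4
        gain = solve(self.R, h)                   # R_k^{-1} h_k
        self.a = [ai + gi * e for ai, gi in zip(self.a, gain)]  # Equation 5
        return e


def features(t_prev, f_prev, f_new, dx):
    """Feature vector h_k of Equation 2."""
    return [t_prev * (f_prev / f_new - 1.0)] + list(dx)


def predict_frame_time(t_prev, f_prev, f_new, dx, a):
    """Equations 1 and 2: what-if frame time at a candidate frequency."""
    h = features(t_prev, f_prev, f_new, dx)
    return t_prev + sum(ai * hi for ai, hi in zip(a, h))


# Synthetic, noise-free samples generated from made-up "true" coefficients.
rng = random.Random(0)
true_a = [0.8, 0.05]
model = RLS(n_features=2)
for _ in range(200):
    t_prev = rng.uniform(10.0, 30.0)
    f_prev, f_new = rng.choice([(355.0, 511.0), (511.0, 355.0), (355.0, 355.0)])
    h = features(t_prev, f_prev, f_new, [rng.uniform(-10.0, 10.0)])
    delta_t = sum(a * x for a, x in zip(true_a, h))
    model.update(h, delta_t)

# What-if query: frame time if the GPU moved from 355 MHz to 511 MHz.
t_if = predict_frame_time(20.0, 355.0, 511.0, [0.0], model.a)
```

On real traces the targets are noisy and the workload drifts, so the choice of forgetting factor trades tracking speed against noise sensitivity; the value used here is only a placeholder.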
Furthermore, we observe that the Nenamark2 application (A1) has higher frequency sensitivity (larger $a_0$ coefficient in Equation 2) than Mobile-bench. This is also evident in Figure 3, since the reduction in frame time is larger for Nenamark2 when the GPU frequency increases from 355 MHz to 511 MHz. The frequency sensitivity information can be utilized by dynamic power management algorithms [22]. For example, Nenamark2 can run at a higher frequency to avoid performance loss, while the Mobile-bench application can run at a lower frequency to save power without significant performance penalty.

4 ADAPTIVE POWER CONSUMPTION MODEL
This section illustrates the application of the proposed online learning framework to power consumption. Maintaining an accurate power consumption model for the entire SoC and different power domains is critical for two reasons. First, the power consumption should not exceed the thermal design power to avoid temperature violations [3, 29]. Second, the total power budget needs to be distributed among different processing elements optimally to avoid performance bottlenecks [12, 19]. Thus, dynamic resource management techniques should have access to accurate power consumption models that guide power management decisions.

4.1 Power Consumption Model Templates
The general model template used for power consumption of a core can be written as:

$$P = P_{dynamic} + P_{leakage} = \alpha C V^2 f + V I_{leakage} \tag{6}$$

where $\alpha$ is the activity factor, $C$ is the switching capacitance, $V$ is the operating voltage, $f$ is the operating frequency, and $I_{leakage}$ is the leakage current [5, 39]. We can further express the leakage current in terms of the temperature and technology parameters as:

$$I_{leakage} = A_s \frac{W}{L} \left( \frac{kT}{q} \right)^2 e^{\frac{q (V_{GS} - V_{th})}{nkT}} + I_{gate} = c_1 T^2 e^{\frac{c_2}{T}} + I_{gate} \tag{7}$$

where $A_s$ is a technology dependent constant, $L$ and $W$ are channel length and width, $k$ is the Boltzmann constant, $T$ is the temperature, $q$ is the electron charge, $V_{GS}$ is the gate to source voltage, $V_{th}$ is the threshold voltage, $n$ is the sub-threshold swing coefficient, and $I_{gate}$ is the gate leakage [25, 33]. The technology dependent constants and other constants in Equation 7 can be combined to get parameters $c_1$ and $c_2$ [3].

Our goal is to estimate the power consumption at runtime, e.g., in a given control interval $k$. To achieve this goal, we first characterize the leakage power parameters offline. These parameters can be estimated by performing nonlinear regression using traces obtained at multiple temperatures [3]. Once the $c_1$, $c_2$, and $I_{gate}$ parameters are estimated, the leakage current can be found at runtime by plugging the temperature into Equation 7. Thus, the total power consumption at time $k$ can be written as:

$$P_k = \alpha_k C V_k^2 f_k + V_k \left( c_1 T_k^2 e^{\frac{c_2}{T_k}} + I_{gate} \right) \tag{8}$$

Unlike the leakage power, the dynamic power component (the first term) depends heavily on the workload. Therefore, we model the switching activity $\alpha_k$ as a function of the workload, and employ the proposed online learning framework.

4.2 Online Learning for Dynamic Power
The switching activity $\alpha_k$ is a function of the workload and frequency, similar to the frame processing time modeled in Section 3. Therefore, we first express it as $\alpha_k(f_k, \mathbf{x}_k(f_k))$, where $x_{i,k}(f_k)$, $1 \le i \le N$, denote the performance counters. For example, the CPU power consumption can be modeled by using the number of instructions retired, clock cycles, page walks, power state residencies, memory bus accesses, level two cache accesses, operating frequency, and utilization [20, 31]. To align with Section 3, we illustrate the proposed approach for GPU power consumption modeling. The switching activity can be expressed as a function of the GPU frequency and the performance counters $x_{i,k}$ listed in Table 1 as:

$$\alpha_k(f_k, \mathbf{x}_k(f_k)) \approx \sum_{i=0}^{N} a_i\, x_{i,k}(f_k) \tag{9}$$

Table 1: Performance counters used in this work
  GPU Capacity
  GPU Utilization
  Frame Count
  CPU Cycles per Instruction
  L2 References per Instruction
  L2 Misses per Instruction
  Branch Misses per Instruction
  Per Core CPU Utilization

These counters are observed at runtime by instrumenting the drivers in the OS. The GPU capacity in Table 1 is found as:

$$GPU_{capacity,k} = GPU_{utilization,k} \cdot \frac{f_k}{f_{gpu,max}} \tag{10}$$

where $f_{gpu,max}$ is the maximum possible GPU frequency. We utilized this feature, since it is used by the default GPU drivers. At runtime, we employ Equation 9 to approximate the switching activity. Given the switching activity, the power consumption can be predicted using Equation 8 for any frequency/voltage pair. These predictions, along with performance models, can be used to make power management decisions [11].

For illustration, we employ the Nexus 6P smartphone [16], which allows controlling the CPU and GPU frequencies independently. We run Angry Birds and Rendering Test applications back to back to evaluate the power consumption model. More precisely, we employ Equation 9 to predict the power consumption in the next interval. After the actual power consumption is measured, we compute the prediction error. Then, we employ the RLS algorithm using Equations 3 to 5 to maintain the accuracy of the model. We observe that the proposed adaptive model tracks the GPU power consumption accurately throughout the 120 s experiment, as shown in Figure 4. In particular, the power consumption drops sharply around t = 20 s, when the execution of Angry Birds completes and Rendering Test starts. The adaptive model successfully tracks the measured power consumption with only one-sample overshoot at the time of transition.
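The power model in Equations 7 to 10 can be written down in a few lines of Python. The sketch below is illustrative only: the function names are not from the paper, and every numeric value in the example call (coefficients, capacitance, voltage, leakage parameters) is a made-up placeholder rather than a characterized value for any real SoC.

```python
import math


def leakage_current(temp_k, c1, c2, i_gate):
    """Equation 7 (compact form): leakage current at absolute temperature temp_k."""
    return c1 * temp_k ** 2 * math.exp(c2 / temp_k) + i_gate


def switching_activity(coeffs, counters):
    """Equation 9: linear combination of performance counters."""
    return sum(a * x for a, x in zip(coeffs, counters))


def gpu_capacity(utilization, freq, freq_max):
    """Equation 10: utilization scaled by the relative operating frequency."""
    return utilization * freq / freq_max


def total_power(alpha, cap, volt, freq, temp_k, c1, c2, i_gate):
    """Equation 8: dynamic power plus temperature-dependent leakage power."""
    dynamic = alpha * cap * volt ** 2 * freq
    leakage = volt * leakage_current(temp_k, c1, c2, i_gate)
    return dynamic + leakage


# Hypothetical example: two counters (GPU capacity and a normalized frame count).
counters = [gpu_capacity(0.5, 300e6, 600e6), 0.4]
alpha = switching_activity([0.6, 0.2], counters)          # made-up coefficients
p_est = total_power(alpha, cap=1e-9, volt=1.0, freq=300e6,
                    temp_k=330.0, c1=1e-3, c2=-3000.0, i_gate=1e-3)
```

Because the leakage parameters are characterized offline, only the coefficients passed to `switching_activity` would need to be adapted at runtime, for example with an RLS update of the kind given in Equations 3 to 5.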
The model also captures the variations within each application, and achieves a MAPE of only 5.5%.

[Plot: actual and predicted GPU power (W) versus time (s) over the 120 s experiment.]
Figure 4: Online learning and estimation of the GPU power consumption while running Angry Birds and a custom Rendering Test app on the Nexus 6P smartphone. The MAPE for this workload is 5.5%.

5 POWER–TEMPERATURE DYNAMICS
Increasing power density drives the chip temperature up through thermal resistance and capacitance networks [3, 15]. In turn, higher temperature leads to an exponential increase in the leakage current, as revealed by Equation 7. This relation gives rise to a positive feedback that continues until the power-temperature dynamics reach a stable steady state, or a thermal runaway occurs [2, 23]. If the chip temperature exceeds thermally safe limits, power management drivers reduce the operating frequencies or power down active PEs. However, these actions severely degrade the performance and quality of service (QoS) delivered to the user. Therefore, it is critical to maintain accurate thermal models, in addition to the performance and power models, to predict the impact of power management decisions before committing them.

5.1 Temperature Model Template
Thermal modeling has recently received significant attention due to its importance in dynamic thermal and power management [1, 5, 15, 36, 37]. The power-temperature dynamics are described as:

$$C_t \frac{dT}{dt} = -G_t T + P \tag{11}$$

where $C_t$ is the thermal capacitance, $G_t$ is the thermal conductance, $T$ is the temperature, and $P$ is the power consumption. Since the power management decisions are typically made at discrete time intervals, most studies discretize Equation 11 to obtain:

$$T[k+1] = A\, T[k] + B\, P[k] \tag{12}$$

where $T[k]$ and $T[k+1]$ are vectors that denote the temperature at each thermal hotspot in time intervals $k$ and $k+1$, respectively. The matrix $A$ captures the impact of $T[k]$ on $T[k+1]$. Similarly, matrix $B$ models the impact of each power source on each thermal hotspot [3]. These matrices can be characterized by starting off with continuous time models, such as those in HotSpot [15], and discretizing them [36]. However, detailed thermal network models are not available for most modern SoCs. Therefore, recent studies employ system identification methods to directly identify matrices $A$ and $B$ [1, 3]. System identification methods are effective even when detailed floorplan information is not available. After the $A$ and $B$ matrices are characterized, Equation 12 can be used to predict the temperature in the next control interval given the current temperature and power consumption. This information is useful to check if the temperature constraints will be violated in future control intervals. Dynamic resource management techniques can take appropriate actions with the help of these models [3].

5.2 Online Power-Temperature Analysis
Evaluating the longer term behavior of the power-temperature dynamics is useful to analyze stability and predict potential thermal runaway. However, Equation 12 predicts the temperature only in the next control interval. While it can also be used iteratively to predict the temperature in future intervals, the prediction error increases with the prediction interval [2, 3]. Hence, Equation 12 cannot be used alone to find the steady-state temperature at runtime. Our goal is to predict the steady state fixed point, which is defined as the power consumption and temperature that the system will converge to if the current operating conditions are maintained [2]. The fixed point is a function of the current operating frequency, voltage and temperature, as well as the workload. It can be calculated by solving the system of equations:

$$T_{fix}[k] = A\, T_{fix}[k] + B\, P[k] \tag{13}$$

where $T_{fix}[k]$ is the fixed point temperature evaluated at time $k$.
We have the same temperature on both sides of the equation, since the system is in a steady state when the fixed point is reached. The power consumption of each processing element in $P[k]$ can be written as:

$$P_i[k] = \alpha_i C_i V_i^2 f_i + V_i \left( c_1 T_i^2[k]\, e^{\frac{c_2}{T_i[k]}} + I_{gate} \right), \quad 1 \le i \le M \tag{14}$$

where $M$ is the number of processing elements in the system. Due to the nonlinear dependency between the power consumption and temperature, solving this set of equations is challenging at runtime. Therefore, we constructed an efficient fixed point computation algorithm that can be implemented in power management drivers [2]. Using this technique, we compute the power consumption and temperature fixed point at time $k$ in less than 100 μs, as detailed in [2]. Since the workload, frequency and voltage change in every interval, the fixed point may change dynamically. Therefore, we repeat this process in each control interval to maintain an up-to-date and accurate prediction.

To illustrate this analysis, we run the CRC32 benchmark on the Odroid-XU3 [13] board that includes a Samsung Exynos-5422 SoC. When the benchmark starts running, the board has a low power consumption of about 0.25 W, as shown in Figure 5. At this time, the fixed point temperature is predicted as only 50 °C, as shown by the red marker with a black border. In the next few time intervals, the power consumption rises to about 1.8 W, as shown by the red line in the figure. As a result of this increase, the fixed point prediction is updated to 85 °C. As the benchmark continues to run, the temperature of the system rises due to the increase in the leakage power consumption. The fixed point is continuously updated as the workload and temperature vary. These fixed point predictions are shown using red markers with black borders in Figure 5. We observe that the predictions are clustered around 88 °C, as the variation in the power consumption is only about 0.2 W.
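To illustrate how Equations 12 to 14 interact, the following Python sketch iterates a scalar (single hotspot, single PE) version of the model until the temperature stops changing. It is deliberately naive and is not the efficient fixed point algorithm of [2]. Several assumptions are made for the sake of a runnable example: the state is taken to be the temperature rise above an assumed ambient of 298 K, the coefficients `a` and `b` and all power parameters are made-up values, and a rise above `limit` is simply reported as a runaway.

```python
import math

T_AMB = 298.0  # ambient temperature in kelvin (assumed value)


def total_power(theta, p_dyn, volt, c1, c2, i_gate):
    """Equation 14 for one PE: dynamic power plus temperature-dependent leakage."""
    temp = T_AMB + theta  # absolute temperature in kelvin
    return p_dyn + volt * (c1 * temp ** 2 * math.exp(c2 / temp) + i_gate)


def predict_next(theta, power, a, b):
    """Scalar form of Equation 12: temperature rise in the next interval."""
    return a * theta + b * power


def fixed_point(p_dyn, volt, c1, c2, i_gate, a, b,
                tol=1e-6, max_iter=10000, limit=200.0):
    """Iterate Equation 12 with Equation 14 until Equation 13 holds.

    Returns (temperature in kelvin, power in watts), or None when the
    iteration exceeds `limit` or fails to converge (treated as runaway).
    """
    theta = 0.0
    for _ in range(max_iter):
        power = total_power(theta, p_dyn, volt, c1, c2, i_gate)
        nxt = predict_next(theta, power, a, b)
        if nxt > limit:
            return None
        if abs(nxt - theta) < tol:
            return T_AMB + nxt, total_power(nxt, p_dyn, volt, c1, c2, i_gate)
        theta = nxt
    return None


# Hypothetical parameters: 1.5 W of dynamic power and a 30 K/W steady-state gain.
result = fixed_point(p_dyn=1.5, volt=1.0, c1=1e-2, c2=-3000.0, i_gate=0.0,
                     a=0.9, b=3.0)
```

A plain iteration like this needs many steps when `a` is close to one, which is one reason a dedicated algorithm is attractive inside a power management driver where the computation is repeated in every control interval.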
We also see that the measured temperature at the end of the experiment reaches about 87 °C, which is within 1 °C of the predicted fixed point, thus showing the accuracy of our prediction algorithm.

[Plot: temperature (°C) versus power consumption (W), showing simulation, measurement, and analytical fixed point prediction.]
Figure 5: Measurement, simulation and prediction of temperature while running the CRC benchmark on the Odroid-XU3 board that includes a Samsung Exynos-5422 SoC.

6 CONCLUSION
Heterogeneous SoCs have the capability to bridge the gap between the energy efficiency of application-specific hardware and general-purpose processors. In order to achieve this goal, dynamic resource managers in heterogeneous SoCs need adaptive models for performance, power consumption and temperature of various processing elements in the SoC. This paper presented a general methodology for online learning of adaptive performance, power and thermal models. Specifically, we illustrated online learning of GPU frame processing time, GPU power consumption and power-temperature dynamics of an SoC. Experiments on state-of-the-art industrial platforms show that the proposed approach is able to model the metrics of interest with less than 6% modeling error.

Acknowledgements: This work was supported partially by National Science Foundation (NSF) grants CNS-1526562, Semiconductor Research Corporation (SRC) task 2721.001, and Strategic CAD Labs, Intel Corporation.

REFERENCES
[1] F. Beneventi, A. Bartolini, A. Tilli, and L. Benini. An Effective Gray-Box Identification Procedure for Multicore Thermal Modeling. IEEE Trans. Comput., 63(5):1097–1110, 2014.
[2] G. Bhat, S. Gumussoy, and U. Y. Ogras. Power-Temperature Stability and Safety Analysis for Multiprocessor Systems. ACM Trans. Embedd. Comput. Syst., 16(5s):145, 2017.
[3] G. Bhat, G. Singla, A. K. Unver, and U. Y. Ogras. Algorithmic Optimization of Thermal and Power Management for Heterogeneous Mobile Platforms. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 26(3):544–557, 2018.
[4] P. Bogdan, R. Marculescu, S. Jain, and R. T. Gavila. An Optimal Control Approach to Power Management for Multi-Voltage and Frequency Islands Multiprocessor Platforms under Highly Variable Workloads. In Proc. of the Int. Symp. on Networks on Chip, pages 35–42, 2012.
[5] D. Brooks, R. P. Dick, R. Joseph, and L. Shang. Power, Thermal, and Reliability Modeling in Nanometer-Scale Microprocessors. IEEE Micro, 27(3):49–62, 2007.
[6] E. Del Sozzo et al. Workload-aware Power Optimization Strategy for Asymmetric Multiprocessors. In Proc. of the Conf. on Design, Autom. and Test in Europe, pages 531–534, 2016.
[7] B. Dietrich and S. Chakraborty. Lightweight Graphics Instrumentation for Game State-Specific Power Management in Android. Multimedia Systems, 20(5):563–578, 2014.
[8] B. Dietrich et al. LMS-based Low-complexity Game Workload Prediction for DVFS. In Proc. of the Int. Conf. on Comput. Design, pages 417–424, 2010.
[9] Google. Simpleperf. https://developer.android.com/ndk/guides/simpleperf, Accessed 08/18/2018, 2018.
[10] U. Gupta, M. Babu, R. Ayoub, M. Kishinevsky, F. Paterna, and U. Y. Ogras. STAFF: Online Learning with Stabilized Adaptive Forgetting Factor and Feature Selection Algorithm. In Proc. of Design Autom. Conf., page 6, 2018.
[11] U. Gupta et al. An Online Learning Methodology for Performance Modeling of Graphics Processors. IEEE Trans. Comput., 2018. DOI:10.1109/TC.2018.2840710.
[12] U. Gupta et al. Dynamic Power Budgeting for Mobile Systems Running Graphics Workloads. IEEE Trans. Multi-Scale Comput. Syst., 4(1):30–40, 2018.
[13] Hardkernel. Platforms, ODROID-XU3, 2017. http://www.hardkernel.com/main/products/prdt_info.php?g_code=G143452239825, Accessed 08/22/2018.
[14] J. Henkel, J. Teich, S. Wildermann, and H. Amrouch. Dynamic Resource Management for Heterogeneous Many-Cores. In Proc. of the Int. Conf. on Comput.-Aided Design, Nov. 2018.
[15] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M. R. Stan. HotSpot: A Compact Thermal Modeling Methodology for Early-Stage VLSI Design. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 14(5):501–513, 2006.
[16] Huawei. Huawei Nexus 6P Smartphone, 2015. https://www.gsmarena.com/huawei_nexus_6p-7588.php, Accessed 08/22/2018.
[17] Intel Corp. Minnowboard, 2016. http://www.minnowboard.org/, Accessed 08/22/2018.
[18] B. K. Joardar, J. R. Doppa, P. P. Pande, D. Marculescu, and R. Marculescu. Hybrid On-Chip Communication Architectures for Heterogeneous Manycore Systems. In Proc. of the Int. Conf. on Comput.-Aided Design, Nov. 2018.
[19] D. Kadjo, R. Ayoub, M. Kishinevsky, and P. V. Gratz. A Control-Theoretic Approach for Energy Efficient CPU-GPU Subsystem in Mobile Platforms. In Proc. of the Design Autom. Conf., pages 62:1–62:6, 2015.
[20] D. Kadjo, U. Ogras, R. Ayoub, M. Kishinevsky, and P. Gratz. Towards Platform Level Power Management in Mobile Systems. In IEEE Int. System-on-Chip Conf., pages 146–151, 2014.
[21] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM J. of Res. and Develop., 49(4.5):589–604, 2005.
[22] R. G. Kim et al. Imitation Learning for Dynamic VFI Control in Large-Scale Manycore Systems. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 25(9):2458–2471, 2017.
[23] W. Liao and L. He. Coupled Power and Thermal Simulation with Active Cooling. In Proc. of the Int. Workshop on Power-Aware Comput. Syst., pages 148–163, 2003.
[24] Linux Kernel. The Interactive Governor. https://android.googlesource.com/kernel/common/+/a7827a2a60218b25f222b54f77ed38f57aebe08b/Documentation/cpu-freq/governors.txt, Accessed 08/14/2018, 2016.
[25] Y. Liu, R. P. Dick, L. Shang, and H. Yang. Accurate Temperature-Dependent Integrated Circuit Leakage Power Estimation is Easy. In Proc. of the Conf. on Design, Autom. and Test in Europe, pages 1526–1531, 2007.
[26] X. Ma, Z. Deng, M. Dong, and L. Zhong. Characterizing the Performance and Power Consumption of 3D Mobile Games. Computer, 46(4):76–82, 2013.
[27] T. S. Muthukaruppan, M. Pricopi, V. Venkataramani, T. Mitra, and S. Vishin. Hierarchical Power Management for Asymmetric Multi-Core in Dark Silicon Era. In Proc. of the Design Autom. Conf., pages 1–9, 2013.
[28] Nena Innovation. Nenamark2. https://nena.se/nenamark/, Accessed 08/18/2018, 2018.
[29] S. Pagani, H. Khdr, J.-J. Chen, M. Shafique, M. Li, and J. Henkel. Thermal Safe Power (TSP): Efficient Power Budgeting for Heterogeneous Manycore Systems in Dark Silicon. IEEE Trans. Comput., 66(1):147–162, 2017.
[30] D. Pandiyan, S.-Y. Lee, and C.-J. Wu. Performance, Energy Characterizations and Architectural Implications of an Emerging Mobile Platform Benchmark Suite-Mobilebench. In Int. Symp. Workload Characterization, pages 133–142, 2013.
[31] A. Pathania, A. E. Irimiea, A. Prakash, and T. Mitra. Power-Performance Modelling of Mobile Gaming Workloads on Heterogeneous MPSoCs. In Proc. of the Design Autom. Conf., pages 201:1–201:6, 2015.
[32] E. Rotem. Intel Architecture, Code Name Skylake Deep Dive: A New Architecture to Manage Power Performance and Energy Efficiency. In Intel Dev. Forum, 2015.
[33] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand. Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits. Proc. IEEE, 91(2):305–327, 2003.
[34] M. Saecker and V. Markl. Big Data Analytics on Modern Hardware Architectures: A Technology Survey. In Eur. Bus. Intell. Summer School, pages 125–149, 2012.
[35] A. H. Sayed. Fundamentals of Adaptive Filtering. John Wiley & Sons, 2003.
[36] S. Sharifi, D. Krishnaswamy, and T. S. Rosing. PROMETHEUS: A Proactive Method for Thermal Management of Heterogeneous MPSoCs. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., pages 1110–1123, 2013.
[37] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan. Temperature-Aware Microarchitecture: Modeling and Implementation. ACM Trans. Archit. Code Optim., 1(1):94–125, 2004.
[38] Statista. Number of Mobile Phone Users Worldwide From 2013 to 2019. https://www.statista.com/statistics/274774/forecast-of-mobile-phone-users-worldwide/, Accessed 08/22/2018.
[39] N. Weste and D. Harris. CMOS VLSI Design: A Circuits and Systems Perspective. 2010.