Future performance challenges in nanometer design

2001, Proceedings of the 38th conference on Design automation - DAC '01

We highlight several fundamental challenges to designing high-performance integrated circuits in nanometer-scale technologies (i.e. drawn feature sizes < 100 nm). Dynamic power scaling trends lead to major packaging problems. To alleviate these concerns, thermal monitoring and feedback mechanisms can limit worst-case dissipation and reduce costs. Furthermore, a flexible multi-Vdd + multi-Vth + re-sizing approach is advocated to leverage the inherent properties of ultrasmall MOSFETs and limit both dynamic and static power. Alternative global signaling strategies such as differential and low-swing drivers are recommended in order to curb the power requirements of cross-chip communication. Finally, potential power delivery challenges are addressed with respect to ITRS packaging predictions.

Dennis Sylvester and Himanshu Kaul
University of Michigan, Ann Arbor, MI 48109-2122

1. INTRODUCTION
Many challenges confront device engineers, circuit designers, system-level architects, and electronic design automation (EDA) tool developers in nanometer (sub-0.1µm) design. They can be broadly categorized as speed, power, reliability, and variability challenges. Specific examples include soft error rates (reliability), increasing Vth fluctuations across a large die (variability), full-chip inductance extraction (reliability/signal integrity), rising global interconnect latency (delay), and distributing Vdd/GND stably despite large current transients and massive supply currents (power). This paper will center on power-related challenges for high-performance IC design (e.g., for desktop microprocessor (MPU) applications) in the 50nm and 35nm technology nodes at the end of the ITRS.
Our discussion will highlight key challenges facing designers and EDA developers, existing or proposed solutions to these challenges, and new ideas that may help circumvent the biggest challenges. This paper does not address such important issues as difficulties in synchronization at extremely high clock rates, the impact of growing process variability, signal integrity, etc. We focus on power-related issues because power consumption has more widespread implications than the above issues. For instance, limitations in power management capabilities can fundamentally restrict performance¹. In Section 2, we see that power-related packaging limitations place bounds on die area and integration density. Removing these limits by better packaging/cooling or other methods improves overall performance, not merely power management. Furthermore, static power dissipation and transistor drive current are linked: it is in large part transistor drive current that enables high speed ICs. In general, any roadblocks to the long-standing trends of rising transistor density, die sizes, and clock frequency/throughput can be seen as challenges to performance. Whether they are commonly viewed as reliability problems, signal integrity, power management, etc., the consequence of such challenges is to limit IC performance.

¹ That is, any important design metric such as clock speed/throughput, integration density, power dissipation, reliability/yield, etc.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2001, June 18-22, 2001, Las Vegas, Nevada, USA. Copyright 2001 ACM 1-58113-297-2/01/0006…$5.00.

To summarize, while the submicron and deep submicron regimes concentrated on maintaining device and circuit speed improvements despite shrinking supply voltages, nanometer design will be most concerned with limiting power consumption while sustaining throughput and reliability. In the following, Section 2 examines dynamic power, as well as tradeoffs and possible ways to reduce power limitations on performance. Section 3 explores static power consumption’s increasing importance, even in desktop (non-portable) applications. Section 4 examines difficulties in distributing power to increasingly larger ICs under restrictive performance targets (e.g. IR drop < 5-10% of a shrinking Vdd). Throughout, we refer to the 2000 update of the ITRS and highlight important trends and key deviations required to continue relatively unabated along the roadmap. We also use predictive MOS SPICE models [2] as well as a realistic 50nm device model extracted from rigorous process and device simulations [3].

2. DYNAMIC POWER

2.1 Packaging Limitations
With the forecasted increase in MPU power consumption, IC packaging will bear the burden of dissipating even more heat in the future. A package’s ability to remove waste heat is defined by the junction-to-ambient thermal resistance (θja), expressed as:

$$\theta_{ja} = \frac{T_{chip} - T_{ambient}}{P_{chip}} \qquad (1)$$

In (1), Tchip is the on-die junction temperature, Tambient is the ambient (outside package) temperature, and Pchip is the maximum IC power consumption. Given a packaging solution with a fixed θja and an MPU design consuming Pchip, the resultant on-die temperature can be calculated using (1). Alternatively, when considering packaging solutions for a new MPU, the maximum allowable θja can be determined based on constraints for maximum on-die temperature (this is typically limited to ensure correct operation of the MPU). Currently, IC operation frequently pushes the junction temperature beyond 100°C while Tambient is approximately 45°C.
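
As a rough illustration of how (1) is used in both directions, the following Python sketch computes the junction temperature for a given package and the maximum allowable θja for a given temperature limit. The function names and the 70 W power level are illustrative assumptions rather than roadmap values; the 100°C and 45°C temperatures are the ones quoted above.

```python
def junction_temperature(theta_ja, p_chip, t_ambient):
    """Solve (1) for T_chip in deg C; theta_ja in deg C/W, p_chip in W."""
    return t_ambient + theta_ja * p_chip


def max_theta_ja(t_chip_max, t_ambient, p_chip):
    """Largest theta_ja (deg C/W) that keeps the junction at or below t_chip_max."""
    if p_chip <= 0:
        raise ValueError("p_chip must be positive")
    return (t_chip_max - t_ambient) / p_chip


# Temperatures quoted in the text; the 70 W power level is a hypothetical example.
T_CHIP_MAX = 100.0  # deg C
T_AMBIENT = 45.0    # deg C
P_CHIP = 70.0       # W (assumed for illustration)

theta_limit = max_theta_ja(T_CHIP_MAX, T_AMBIENT, P_CHIP)
print(f"max theta_ja = {theta_limit:.3f} C/W")

# Round trip: a package exactly at the limit puts the junction at T_CHIP_MAX.
print(f"T_chip at limit = {junction_temperature(theta_limit, P_CHIP, T_AMBIENT):.1f} C")
```

Because the allowable θja is inversely proportional to Pchip in (1), any fractional reduction in the power that the package must be designed for relaxes the thermal resistance requirement by the reciprocal factor.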
With Pchip rising, packaging technology must improve (meaning θja must decrease) to meet heat dissipation demands. Consistent reduction of thermal junction resistance requires advanced cooling techniques such as larger, more powerful (and louder) fans, liquid cooling, etc. Furthermore, to ensure reliability in nanometer scale MPUs, the ITRS calls for a reduction in junction temperature (from 100°C in 1999 to 85°C in 2002). Due to cost constraints, achieving the corresponding θja values for packaging is considered a barrier to scaling – the materials needed are currently unknown. Presently, θja values range from 0.6 to 1 °C/W for the workstation/desktop processor markets [4]. ITRS projections call for a θja of 0.25 °C/W in 3 years – requiring improvements in the CPU package (ceramics, etc.) as well as heat sinking technology. Allowing Tchip to rise allows for less complex and expensive packaging solutions to be used, but this adversely affects circuit performance with respect to leakage current and device reliability. Some packaging experts believe cooled systems are the best alternative for packaging high power density nanometer microprocessor designs. The advantages of cooling the ambient and junction temperatures are well documented: improved voltage scalability due to reduced leakage currents, higher carrier mobilities, lower interconnect resistances, and improved reliability [5]. However, as a reference point, current vapor compression based refrigeration techniques are expensive, on the order of $1 per watt cooled. Such measures for desktop applications in the next decade will likely not be needed, due to improved heat sinking technology and evolving low power design techniques applied to high-end processors. The above packaging-constrained system perspective leads into the concept of dynamic thermal management [6]. Thermal management techniques can take a number of forms. 
Transmeta’s approach dynamically varies the supply voltage when the CPU is not heavily loaded. Simpler techniques can be used with only minor changes to a straightforward processor implementation. An example is the thermal monitor in Intel’s Pentium 4 design [7], which has an on-chip temperature sensor (a diode with a fixed voltage across it) along with a reference current source and current comparator to determine when the on-die temperature exceeds a given value. This temperature corresponds to a power dissipation level for the microprocessor (determined by (1)). When this temperature (and hence the corresponding power level) is exceeded, the internal clock frequency is reduced, limiting both power and performance/throughput.

The importance of dynamic thermal management techniques lies in their ability to reduce Pchip in (1) to the effective worst-case power dissipation rather than the theoretical worst-case [6]. The effective worst-case power consumption, as found by running power-hungry applications, is about 75% of the theoretical worst-case, which is determined using synthetic input code sequences that are not realized in practice [7,8]. This difference has major implications for packaging costs and design flexibility. Small increases in the maximum power can lead to significantly more exotic, expensive cooling techniques. For example, Intel engineers found that a rise in power consumption from 65 to 75 W would triple cooling costs due to the need for additional heat pipe technology to achieve the required θja [7]. With an effective 25% reduction in Pchip, the allowable θja is 33% higher (1/0.75 ≈ 1.33), translating to less expensive heat sinking, quieter and smaller fans, and avoidance of refrigerated or liquid-cooled solutions.

2.2 Global Signaling
Propagation of global signals across a large die in a shrinking clock period is one of the foremost challenges in nanometer design [1,9,10].
It appears likely that global signaling will use a slower clock than localized logic such as datapaths (despite the fact that multi-cycle nets can be broken up using latches). For example, a recent Intel microprocessor clocks the integer ALUs at a higher rate than other sections of the design. Even with relaxed timing constraints on global communication, substantial power is consumed to achieve the desired global clock speeds. [9] demonstrates that using unscaled top level wiring, ITRS projected global clock frequencies can be met. Based on the current signaling paradigm of inserting large CMOS buffers along an RC line, this requires over 50 W of power in the nanometer regime. The proliferation of repeaters (nearly 10^6 required at 50-nm compared to about 10^4 in a large 180nm microprocessor [11]) heightens difficulties in power distribution and floorplanning².

² Repeater clusters constrain repeater placement to ease floorplanning and simplify insertion of repeaters late in the design. Resulting power densities can exceed 100 W/cm², complicating power distribution.

An alternative is to use advanced signaling strategies such as differential and/or low-swing drivers and receivers for global communication [12]. In many cases, these approaches can lead to power and delay savings due to smaller voltage transitions as well as major reductions in the magnitude of power grid current transients. For instance, the Alpha 21264 uses differential low-swing buses to communicate between functional units [8]. Worst-case power for these buses was reduced significantly by limiting the voltage swing to 10% of Vdd. Differential signaling increases routing area, but the increase may be less than the expected factor of 2 due to the use of shield wires in global signaling to limit coupling from neighboring signals on long lines.
Furthermore, shielding may be insufficient to limit inductively coupled noise, whereas low-swing differential signaling creates less noise and is more noise immune than single-ended full-swing CMOS [13]. While further study is necessary to determine worst-case noise behavior and tolerable voltage swings, the Alpha design demonstrates that the approach is already viable today. With trends indicating rising power consumption for global communication, the use of alternative signaling strategies will likely increase.

2.3 Library Optimization
While most high performance microprocessors rely heavily on custom design, library optimization can still enhance performance in these applications. System complexity and the resulting design productivity needs mean that some components of nearly every IC design will draw from a cell library. Advances in library generation, and synthesis tools that take advantage of improved libraries, can together yield more automated, less expensive design flows. Recent work claims libraries are one important reason that custom designs are significantly faster (6-8X) than counterpart ASIC designs [14,15]. For instance, [15] asserts that the lowest performance level (smallest) gates in modern libraries are nearly 10X larger than minimum-sized gates, leading to major power increases due to overdriving small loads. However, most current libraries contain a large number of drive strengths, including some very near minimum size. As evidence, we cite the same 180 nm library as [15]: the smallest standard cell inverter has an input capacitance of just 1.5fF (smaller than the custom gate in [15]) and the smallest inverter with balanced rise/fall delays has an input capacitance of 6.6fF [16]. Other leading-edge libraries contain a rich set of drive strengths (e.g. 11 2-input NANDs, 16 inverter sizes), dual output polarities, and single pin inverted inputs on NAND/NOR’s.
This recent increase in library complexity seems to be closing the gap slightly between custom designed cells and those from libraries. However, more work needs to be done: a recent study [17] demonstrates the potential of on-the-fly cell generation layered on top of a pre-existing rich library. Results show 15-22% power reductions with fixed timing, and one design achieved 13.5% speed gains and 18% power reduction. In these cases overnight optimizations created hundreds of new cells, adding flexibility to the original library and more closely approximating a custom design approach. The new cells serve to exactly match load conditions (limiting overdrive of small capacitances) and allow for imbalanced P/N sizing if advantageous.

2.4 Multiple-Vdd
Multiple supply voltages on a chip will be one of the most valuable tools for designers to fight the rise of dynamic power in nanometer design. Only a few designs based on this concept, all with relatively low clock speeds, have been reported [18,19]. However, results are promising, and the slow acceptance in high-performance MPUs seems primarily due to a lack of urgency in dynamic power reduction. The general idea most often applied is that of clustered voltage scaling (CVS) [20]. With two Vdd levels (Vdd,h and Vdd,l), the circuit is partitioned so that non-critical gates run at Vdd,l and only critical gates use Vdd,h. Level conversions, performed when gates running at Vdd,l fan-out to gates at Vdd,h, are reduced by clustering Vdd,l and Vdd,h gates together to minimize the number of such interactions. Analysis indicates that Vdd,l should be around 0.6 to 0.7 times Vdd,h to maximize power savings. The dynamic power reduction by using two Vdd levels is readily calculated if one can estimate the fraction of cells that can be assigned to Vdd,l. Existing media processor designs that use CVS report that ~75% of all gates can tolerate Vdd,l without altering the critical path delay.
Similarly, path slack distributions for high-end MPUs show that over half of all timing paths commonly use less than half the clock cycle [21,22]. Using Vdd,l = 0.65 * Vdd,h, this yields a 45-50% dynamic power reduction, considering 8-10% additional level conversion power. In [18], area overhead due to constrained cell placement, level converters, and added power grid routing was found to be 15%. The impact of post-synthesis transistor re-sizing on multi-Vdd processes is discussed in Section 3.3. The key challenges to the use of multiple supplies on a chip lie in minimizing area overhead and providing EDA tool support for Vdd cell selection, placement given new clustering constraints, dual power grid routing, and enhanced library generation capabilities. In Section 3.3 we describe the major improvements that can be achieved in the delay vs. Vdd design space by use of multiple threshold voltages. With this new concept, the idea of using multiple supplies on a chip becomes much more powerful.

3. STATIC POWER

3.1 ITRS Projections & Analysis
The ITRS predicts an increase in MOSFET off current (Ioff) by a factor of 2 per generation. The author of [23] projects a 5X rise in Ioff/generation. Figure 1 shows the relative importance of static and dynamic power for an inverter driving a fan-out of 4 with an average interconnect load. 70 nm and 50 nm technologies are explored; results indicate that for logic with switching activities on the order of 0.01 to 0.1, static power can approach and exceed 10% of dynamic power.

Figure 1. The ratio of static power consumption to dynamic power for an inverter with fan-out of 4 and average wiring load. Temperature is 85°C. (Plot: Pstatic/Pdynamic on a log scale versus switching activity factor; curves for 70nm at Vdd=0.9V, 50nm at Vdd=0.7V, and 50nm at Vdd=0.6V.)

The ITRS calculates the expected increase in static power consumption due to Ioff and sets constraints to limit static power to 10% of the maximum power dissipation of the MPU. Hence at 35 nm, an MPU can draw 30 A of current in standby. Even with this mild restriction on static power consumption, the reduction needed by circuit/architecture innovations reaches 98% at the end of the roadmap [1]. Unchecked, static power would reach kilowatt levels, dwarfing dynamic power. Circuit and architectural techniques have been proposed to reduce standby power. These approaches will become standard in low-power applications and experience with these designs will ease integration into high performance ICs. Some of these techniques are described in the following section.

To give further perspective on Ioff scaling, we examined recent literature on advanced CMOS processes, noting the Ion, Ioff, Vdd, and Tox (oxide thickness) values. Results are summarized in Table 1. The key point of this table is that, while very good Ion/Ioff characteristics are achieved, there are no examples of sub-1 V technologies that come close to meeting ITRS expectations. For instance, the 70nm technologies described in [26,28] offer leakage currents below that projected by the ITRS with Ion values slightly lower than forecast. However, the Vdd value required to achieve this performance is 1.2 V, not 0.9 V as expected for 70nm. This Vdd increase gives a 78% rise in dynamic power.

Table 1. Recent NMOS device results, compared with ITRS projections.

Ref     ITRS node (nm)   Tox (Å)             Vdd (V)   Ion (µA/µm)   Ioff (nA/µm)
[24]    50-70            18 (electrical)     0.85      514           100
[25]    100              21 (electrical)     1.2       860           10
[26]    70               25 (electrical)     1.2       697           10
[27]    100              27 (electrical)     1.2       800           10
[28]    70               32 (electrical)     1.2       650           3
[29]    100              13 (physical)       1.0       723           16
ITRS    100              12-15 (physical)    1.2       750           13
ITRS    70               8-12 (physical)     0.9       750           40
ITRS    50               6-8 (physical)      0.6       750           80

While the current literature is not scalable to sub-1 V supplies, there will be improvements when these processes come online (2-5 years).
Looking at historical references, reports of pre-production technologies tend to underestimate Ion by ~20% compared to actual performance several years later [30,31]. Unfortunately, most of the gains in Ion from R&D to production have been obtained from aggressive oxide scaling. This performance “lever” may be approaching the end of its usefulness; even with high-dielectric materials, maintaining current scaling trends for the effective oxide thickness faces a number of barriers in nanometer design. This point is further described using a set of compact MOSFET I-V expressions to project the scaling of Ion and Ioff in nanometer scale processes [32]. Ion is expressed as:

$$I_{on} = I_{dsat0}\left[1 + \frac{2 I_{dsat0} R_s}{V_{dd} - V_{th}} - \frac{I_{dsat0} R_s}{V_{dd} - V_{th} + E_{sat} L_{eff}}\right]^{-1} \qquad (2)$$

Rs is the parasitic source resistance (set according to [1]), Esat is the lateral electric field required to saturate the carrier velocity, and Leff is the effective gate length (final, as-etched dimension in [1]). Idsat0 is:

$$I_{dsat0} = \frac{W \mu_{eff} C_{oxe}}{2 L_{eff}} \cdot \frac{(V_{dd} - V_{th})^2}{1 + (V_{dd} - V_{th}) / (E_{sat} L_{eff})} \qquad (3)$$

Here µeff is the effective mobility, which is a function of gate voltage and Tox. Coxe is the electrical oxide capacitance, described later. Off current (per unit width) is estimated as [33]:

$$I_{off} = 10 \times 10^{-V_{th} / 85\,\mathrm{mV}} \ \ \mu\mathrm{A}/\mu\mathrm{m} \qquad (4)$$

85 mV is the assumed subthreshold swing parameter throughout scaling (taken at room temperature to match [1])³. An analytical evaluation of the ITRS on/off current projections is summarized in Table 2. The Vth for each technology is set to meet 750µA/µm for Ion. We make the following observations:
1. Including electrical oxide thickness is important and should be considered in the ITRS. Electrical oxide thickness reflects the finite inversion layer thickness (i.e. the inversion layer is not a sheet of charge located at the Si/SiO2 interface) and gate depletion effects (GDE) [32]. The net effect is that the oxide appears ~0.7 nm thicker than the physical oxide layer.
Advanced gate materials may limit the contribution of GDE; however, the quantization of the inversion layer will be unaffected. An analysis ignoring GDE but incorporating inversion layer thickness (denoted “metal gate” in Table 2) shows Ioff decreases by 78% at 35 nm. Enhanced current resulting from a thinner effective gate oxide allows a 55 mV increase in Vth, significantly reducing Ioff.
2. A 0.6 V supply voltage for 50 nm high-performance parts will make it difficult to achieve the desired Ion/Ioff targets. A Vdd of 0.7 V is more realistic (given that Vdd for 35 nm is projected as 0.6V), reducing off current by nearly 7X but increasing dynamic power by 36%. Extracted 50 nm device parameters support this: simulations demonstrate a marked increase in Ioff at Vdd=0.6 V to meet ITRS Ion (Ioff = 2.6 µA/µm at 0.6 V, 430 nA/µm for 0.7 V).
3. The projected Ioff from the models is 3 nA/µm for 180 nm, rising to 456 nA/µm for 35 nm. The increase of 152X is markedly higher than the ITRS value of 23X⁴. Furthermore, the leakage current at 35 nm here is 2.9X larger than ITRS projections. This translates to additional static power reduction required by circuit design techniques. In general, the 2X increase in Ioff/generation listed in [1] allows just a 25mV drop in Vth in each technology. Following this constraint, the models show a 16% loss in Ion by the end of the roadmap⁵. We note, however, that the 152X increase in Ioff across the roadmap is much less than predicted by [23] which anticipates a 3125X rise by 35nm.

Table 2. Analytical model results for Ioff scaling. Values in ( ) for 50nm are results for Vdd=0.7V.

ITRS node (nm)                  180    130    100    70     50             35
Coxe (normalized)               1      1.23   1.45   1.68   2.13           2.46
Cox (physical)                  1      1.32   1.67   2.08   3.13           4.17
Vth required to meet Ion (V)    0.3    0.29   0.22   0.14   0.04 (0.12)    0.11
Ioff (nA/µm)                    3      4      26     210    3205 (432)     456
Ioff (metal gate) (nA/µm)       1      1.4    8.7    55     666 (100)      103
ITRS Ioff projections (nA/µm)   7      10     16     40     80             160

³ Technologies such as fully-depleted SOI may reduce this value considerably (i.e. by 20%), making lower thresholds feasible given fixed Ioff constraints.
⁴ The slope of Ioff vs. technology is larger for the models as well, meaning a fast rise in leakage may be ahead.
⁵ This includes a reduction of 37% at 50 nm (19.6% if Vdd = 0.7 V).

3.2 Multiple-Vth Approaches
Several approaches have been developed to reduce CMOS static power consumption. This section briefly highlights several of these techniques that use multiple thresholds on a single chip to limit Ioff.

3.2.1 MTCMOS and variants
Multi-Threshold CMOS (MTCMOS) gates a high-Vth transistor with a sleep mode signal to virtually eliminate leakage current in idle states [34]. The sleep transistor is placed between ground and fast low-Vth CMOS logic. As it is in series, it adds delay, which can be reduced by increasing its area. Disadvantages include no leakage reduction in active mode, increased device area, and additional overhead for routing sleep signals. Other related techniques include dual-Vth domino logic [35], substrate biasing to modify Vth in standby [36], and using negative NMOS gate voltages to bias the devices further into cut-off [37]. A single-threshold leakage reduction technique combines the concepts of sleep transistors and state dependent leakage [38]. All these techniques trade off area to limit static power and most only reduce leakage in standby mode. In practice, they are currently limited to portable applications such as notebook processors. Also, some of the proposed methods do not scale well, for example the use of domino logic and substrate bias controlled Vth (body bias is less effective at controlling Vth in scaled devices). Dual Vth insertion, described next, is the only technique used in current high-end MPUs.

3.2.2 Dual-Vth
Recently, circuit designers gained access to multiple threshold voltages on a single IC to select between gates that use high or low thresholds. The impact of Vth on the delay and power of gates such as inverters and NANDs is profound. As seen in (4), a reduction in Vth (with constant Vdd) exponentially increases off current and roughly linearly reduces propagation delay. An additional threshold adjust ion implantation step allows designers to choose from a wider range within the power-performance design envelope. Gates located on critical paths can be assigned fast low Vth, while gates that are not timing critical can tolerate high Vth and slower response times. Algorithms have been developed to optimally assign gates to either high or low threshold voltages [22,39]. Typical results show leakage power reductions of 40-80% with minimal penalty in critical path delay compared to all low-Vth implementations.

It is instructive to examine the scaling properties of a dual-Vth approach to limiting Ioff. Based on (2)-(4), we consider two NMOS devices in the same technology with thresholds offset by 100 mV. The high-Vth device has its Vth set so that Ion is 750 µA/µm. Figure 2 shows the increase in Ion for the low-Vth device. The relative difference in Ioff between the two devices will remain constant throughout the roadmap (at about a 15X increase in Ioff for 100 mV reduction in Vth). Given that the off current change is constant, the steady improvement in Ion with scaling demonstrates that the dual-Vth (or multi-Vth) approach to leakage reduction is inherently scalable. Figure 2 also shows the resulting Ioff increase for Ion to rise 20% beyond the high-Vth case. At 35 nm, just a 7X rise in Ioff is required to yield 20% drive current improvement, compared with a factor of 54X today. Published data from [21,40] validate the models, as seen in Figure 2.

Figure 2. Ion increases more rapidly with a 100mV change in Vth for scaled technologies. Ioff penalty for 20% Ion gain reduces with scaling. (Plot: Ion increase (%) with a 100mV Vth reduction, and normalized Ioff increase needed to achieve a 20% rise in Ion, versus technology node (nm), with published data points.)

3.3 Scalable Dynamic/Static Power Approach
The combination of multiple Vdd’s, multiple Vth’s, and intra-cell size and Vth assignments points to a highly flexible, scalable, cost-effective design approach to dynamic and static power minimization. With two voltage supply values available, different Vth’s will allow designers or EDA tools to choose to emphasize speed, standby power, or dynamic power. Figure 3 demonstrates the potential of the multi-Vdd + multi-Vth approach. In 35nm technology, a reduction in Vdd from nominal (0.6V) to 0.2V incurs a severe delay penalty (normalized delay is 3.7X that at 0.6V). However, by reducing Vth in the gates using 0.2V supplies, the delay increase is less than 30% while dynamic power is 89% lower and static power is constant. These compelling results are the product of two powerful ideas: 1) MOSFET drive current at sub-1V Vdd is very sensitive to Vth, so small reductions in Vth achieve major current gains. 2) Static power decays roughly quadratically with Vdd reductions (given a fixed Vth) due to shrinking Ioff and a smaller Vdd value. Figure 3 harnesses these two concepts by slowly reducing Vth as Vdd is dropped so that Ioff rises at the same rate Vdd is shrinking, keeping Pstatic constant. Figure 4 shows that the vast improvements in dynamic power and a constant Pstatic push the ratio of Pdynamic/Pstatic towards 1 for low switching activity gates at Vdd=0.2V⁶.
If a constraint is set that Pdynamic must be 10X larger than Pstatic (as in the ITRS), a Vdd of about 0.44V is attainable, providing 46% dynamic power reduction. More options are available; Figure 3 shows that if threshold voltage is scaled less aggressively than required to maintain constant Pstatic, delay increases more quickly but remains reasonable at 1/3 the nominal Vdd value. In this scenario, the static power is being reduced linearly with Vdd so that Pstatic is 1/3 that of a gate using Vdd=0.6V.

Now, consider post-synthesis transistor re-sizing, which reduces power by downsizing transistors off critical paths [21]. As a result, more paths approach criticality; this makes the application of multi-Vdd approaches less advantageous since fewer cells than assumed above (75%) can move to Vdd,l. This point highlights the sub-optimal nature of today’s low power design techniques. If, before transistor re-sizing, slack distributions demonstrate a large number of paths with significant slack, the current approach is to downsize the corresponding cells, slowing down those paths. This approach provides a sublinear reduction in power with respect to the size reduction (sublinear since interconnect capacitance will not scale down and represents a constant factor in the total capacitance). Instead of such re-sizing efforts, a lower supply voltage could be used, providing a quadratic drop in power. Leakage power is also significantly reduced in this case, both through the lower Vdd itself and through the accompanying decrease in Ioff. The combination of multiple Vdd’s, multiple Vth’s, and transistor re-sizing needs to be harnessed in future EDA tools to achieve excellent power/performance results.
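
The dynamic power percentages quoted in this section and in Section 3.1 follow from the usual assumption that switching power scales with the square of the supply voltage when capacitance, activity, and frequency are held fixed. The Python sketch below reproduces them and contrasts a supply reduction with transistor downsizing. The function names are hypothetical, and the lumped-capacitance resizing model with a 50% wire fraction is an assumed illustration, not a figure from the paper.

```python
def dynamic_power_ratio(vdd_new, vdd_ref):
    """P_dyn(vdd_new) / P_dyn(vdd_ref), assuming P_dyn is proportional to
    C * Vdd^2 * f with capacitance, activity, and frequency held fixed."""
    return (vdd_new / vdd_ref) ** 2


def resized_power_ratio(size_scale, wire_fraction):
    """Power ratio after scaling device widths by size_scale when a fraction
    wire_fraction of the switched capacitance is interconnect that does not
    shrink (assumed lumped-capacitance model)."""
    return wire_fraction + (1.0 - wire_fraction) * size_scale


# (new Vdd, reference Vdd) pairs taken from the text.
for vdd_new, vdd_ref in [(0.44, 0.6), (0.2, 0.6), (1.2, 0.9), (0.7, 0.6)]:
    change = (dynamic_power_ratio(vdd_new, vdd_ref) - 1.0) * 100.0
    print(f"Vdd {vdd_ref} V -> {vdd_new} V: dynamic power {change:+.0f}%")

# Halving device sizes with half the load in wires (assumed values).
resize_change = (resized_power_ratio(0.5, 0.5) - 1.0) * 100.0
print(f"resize to 0.5x: dynamic power {resize_change:+.0f}%")
```

Under these assumptions the four supply changes give roughly -46%, -89%, +78%, and +36%, matching the values quoted for 0.44 V and 0.2 V operation here and for the 1.2 V and 0.7 V cases in Section 3.1. The resizing case shows why the saving is sublinear: with half the capacitance in wires, halving the devices removes only a quarter of the switched capacitance.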
Combining the above multi-Vdd + multi-Vth optimization strategy with the on-the-fly cell generation approach of Section 2.3, designers and EDA tools can fully explore the design space of dynamic power, static power, and timing slack. One example of unique gate layouts that could help face the power challenges of nanometer design is the use of different Vth’s inside a cell. Particularly, the use of different threshold transistors in a stacked arrangement can give fairly substantial leakage savings with minimal delay penalties. Furthermore, the state dependence of leakage can be leveraged in cases with stacked multi-Vth’s without additional sleep transistors that sacrifice area and dynamic power.

(Figure 3 plot: normalized delay versus Vdd (V) in 35-nm technology, nominal Vdd = 0.6V; curves: constant Vth (0.11V); scaled Vth, constant Pstatic; conservatively scaled Vth.)

4. POWER DISTRIBUTION
Flip-chip and grid array packaging allows distribution of Vdd/GND and signals throughout a die, rather than just at the periphery. This increased flexibility makes it substantially easier to meet power grid constraints such as a 10% IR drop limit. However, in this section we show that current ITRS projections for power/grid pad connectivity in nanometer designs do not fully take advantage of grid array capabilities and lead to power distribution problems. Based on BACPAC models [41], we examine the scalability of typical power grid distribution in the face of quickly rising chip current supplies. Hot-spots are considered since uniform power density assumptions are overly optimistic. A hot-spot is defined to have a localized power density four times larger than a uniform power density approximation (given by Pchip / Achip)⁷. Figure 5 shows the required power rail width (normalized to minimum top-level metal width) to ensure <10% IR drop in “hot-spots” of a design in scaled technologies using the minimum allowable bump pitch.
This figure focuses on top-level routing only, assuming that the remainder of the power grid is under the designers control whereas the top-level granularity is technology-limited8. 35 nm is less restricted than 50 nm due to a reduction in power density at 35 nm9. In general, while the trend seems alarming (roughly quadratic increase in power rail linewidth, normalized to minimum allowable linewidth), even 35 nm results are manageable, in that Vdd and GND rails that are 16X minimum width will consume less than 4% of top-level routing resources (based on 80 µm bump and power-grid pitch). The total routing resources consumed due to power routing is around 17-20% as a constant factor of 16% is used to reflect the need for large metal “landing pads” for the bumps. The continued reductions in bump pitch allow Vdd/GND to be supplied at finer granularities where it is most needed. 0.7 100 Linewidth (norm. to Wmin) Figure 3. Delay increase due to Vdd reduction can be effectively offset by reducing Vth. 100 Pdynamic / Pstatic Switching activity = 0.1 10 10 1 0.2 , 0.3 0.4 0.5 0.6 50 Minimum bump pitches 40 100 30 20 10 10 180 130 100 70 50 35 0 Technology Node (nm) 1 Figure 5. IR drop scaling trends based on minimum allowable bump pitch (open symbols) and ITRS bump/pad count projections (solid symbols). 0.7 Vdd (V) Figure 4. The ratio of dynamic to static power drops when using low Vth’s to reduce delay penalty in low Vdd gates. 35-nm technology is modeled. 6 1000 1 Constant Vth Scaled Vth, Pstatic constant Conservatively scaled Vth 60 Routing resources used Minimum Linewidth, <10% IR drop % Routing Resources Used V dd (V) Pdynamic is calculated using a fan-out of 4 and an average wiring load. Gates are inverters with Wn/L=4,Wp/L=8. 7 The factor of four stems from estimating that half the chip area is consumed by memory (having about 1/10th the power density of logic) and that certain logic areas may have twice the power density of others. 
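The factor of four in footnote 7 can be checked with a short calculation. The sketch below assumes one particular reading of the footnote: memory covers half the die at one tenth of the typical logic power density, and a hot-spot is logic running at twice the typical logic density. The function and parameter names are illustrative and do not come from the paper.

```python
def hotspot_factor(memory_area_fraction=0.5,
                   memory_density_ratio=0.1,
                   hot_logic_ratio=2.0):
    """Ratio of hot-spot power density to the chip average P_chip / A_chip.

    All densities are normalized to the typical logic density (= 1.0).
    This is an assumed interpretation of the footnote, not the authors' model.
    """
    logic_area_fraction = 1.0 - memory_area_fraction
    # Area-weighted average density over memory and typical logic.
    average_density = (memory_area_fraction * memory_density_ratio
                       + logic_area_fraction * 1.0)
    return hot_logic_ratio / average_density


if __name__ == "__main__":
    print(f"hot-spot / average power density: {hotspot_factor():.2f}")
```

With the default values the ratio comes out near 3.6, which is consistent with the rounded factor of four used for the hot-spot definition.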
8 Meaning that the chip’s access to Vdd/GND is limited by how often connections can be made to the off-chip supplies.

9 Total power at 50 nm increases only slightly while the area jumps 15%.

However, ITRS projections for microprocessor pad counts do not correspond to the minimum achievable bump pitch. For instance, a bump pitch of 80 µm is estimated to be attainable at 35 nm, but the number of bumps actually used is 4416, translating to an effective bump pitch of 356 µm. Since IR drop is strongly dependent on the periodicity of power connections, this large bump pitch results in a staggering increase in the wiring resources needed to keep IR drop within limits. Figure 5 also shows the required power rail widths under the ITRS assumptions of bump/pad count. At 35 nm, the required line width is over 2000X the minimum allowable; this is the result of a roughly constant bump pitch of around 350 µm throughout the roadmap. More Vdd and GND connections will be required, and advances in technology should be leveraged rather than consuming extra routing resources. In addition, with just 1500 Vdd bumps at 35 nm, ITRS bump current capability projections are incompatible with the worst-case current draw of 300A in such a design. This also points to the need for more Vdd/GND connections at the chip-to-package level.

Finally, rising supply currents and the use of sleep or standby modes to reduce power have potential consequences for power distribution. Awakening from standby results in large current transients, placing an extreme burden on the power distribution network to limit inductive noise. Using the minimum bump pitch will help here as well, providing a low inductance path to each gate on the chip. Alternate logic styles may minimize current transients and provide superior power-delay characteristics.
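The bump figures quoted in this section can be cross-checked with simple arithmetic. The sketch below assumes bumps on a uniform square grid, so that the effective pitch is the square root of the die area per bump, and it assumes the supply current is shared equally among the Vdd bumps. Neither assumption is stated in the paper, and the function names are illustrative.

```python
def implied_die_area_mm2(bump_count, pitch_um):
    """Die area implied by bump_count bumps on a square grid of pitch_um."""
    pitch_mm = pitch_um / 1000.0
    return bump_count * pitch_mm ** 2


def bumps_at_pitch(area_mm2, pitch_um):
    """Number of square-grid bump sites available on area_mm2 at pitch_um."""
    pitch_mm = pitch_um / 1000.0
    return area_mm2 / pitch_mm ** 2


def current_per_bump(total_current_a, vdd_bump_count):
    """Average current per Vdd bump, assuming equal current sharing."""
    return total_current_a / vdd_bump_count


if __name__ == "__main__":
    area = implied_die_area_mm2(4416, 356)   # ITRS bump count, effective pitch
    dense = bumps_at_pitch(area, 80)         # minimum attainable pitch at 35 nm
    print(f"implied die area: {area:.0f} mm^2")
    print(f"bump sites at 80 um pitch: {dense:.0f} ({dense / 4416:.1f}X more)")
    print(f"current per Vdd bump: {current_per_bump(300, 1500):.2f} A")
```

Under these assumptions, 4416 bumps at a 356 µm pitch imply a die of roughly 560 mm², which would accommodate about 20X as many bump sites at the 80 µm pitch, and 300A spread over 1500 Vdd bumps averages 0.2A per bump.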
One such logic style is MOS current mode logic (MCML), which burns static power but yields much smaller current transients while providing comparable performance and lower total power in high-activity circuitry such as datapaths [42]. If a point is reached where static CMOS leakage currents are intractable, current steering logic families such as MCML may provide solutions.

5. CONCLUSIONS

The main points of this paper are:

1. Power management techniques such as on-chip temperature monitors and multiple voltage supplies will reduce dynamic power, enabling cheaper packaging and higher integration densities.
2. Alternative techniques to CMOS repeaters for global signaling need to be investigated and mated with EDA tools (similar to buffer insertion tools today but using different primitive components) to minimize power consumed in global communications.
3. A multi-layered approach to power reduction (both dynamic and static) is described, combining multiple threshold and supply voltages with flexible gate layouts using different thresholds and device sizes within a gate. Non-critical gates are first assigned to a reduced Vdd, followed by sizing and Vth selection to reduce power most efficiently.
4. Power distribution will be manageable from the standpoint of IR drop, given changes in the ITRS to take advantage of technological advancements in flip-chip packaging. However, large current transients may be exacerbated by the use of sleep/standby modes.

6. ACKNOWLEDGMENTS

The authors thank Kurt Keutzer, Andrew Kahng, and Dave Chinnery for valuable comments, Pin Su, Charles Kuo, and Min She for device models, Richard Hamilton for packaging discussions, and Philippe Hurat and Martin Lefebvre at Cadabra Design.

7. REFERENCES

[1] http://public.itrs.net, ITRS, 2000 update.
[2] Y. Cao, et al., “New paradigm of predictive MOSFET and interconnect modeling for early circuit simulation,” Proc. CICC, pp. 201-204, 2000.
[3] Personal communication, Pin Su & Charles Kuo.
[4] R. Viswanath, et al., “Thermal performance challenges from silicon to systems,” Intel Technology Journal, 3rd quarter, 2000.
[5] I. Aller, et al., “CMOS circuit technology for sub-ambient temperature operation,” Proc. ISSCC, pp. 214-215, 2000.
[6] D. Brooks and M. Martonosi, “Dynamic thermal management for high-performance microprocessors,” Proc. High-Performance Comp. Arch., 2001.
[7] S.H. Gunther, et al., “Managing the impact of increasing microprocessor power consumption,” Intel Technology Journal, 1st quarter, 2001.
[8] M.K. Gowan, et al., “Power considerations in the design of the Alpha 21264 microprocessor,” Proc. DAC, pp. 726-731, 1998.
[9] D. Sylvester and K. Keutzer, “Getting to the bottom of deep submicron II: A global wiring paradigm,” Proc. ISPD, pp. 193-201, 1999.
[10] R. Ho, K. Mai, H. Kapadia, and M. Horowitz, “Interconnect scaling implications for CAD,” Proc. ICCAD, pp. 425-429, 1999.
[11] R. McInerney, et al., “Methodology for repeater insertion in the Itanium microprocessor,” Proc. ISPD, pp. 99-104, 2000.
[12] H. Zhang, et al., “Low-swing on-chip signaling techniques: effectiveness and robustness,” IEEE Trans. VLSI Systems, pp. 264-272, Jun. 2000.
[13] Y. Massoud, et al., “Differential signaling in crosstalk avoidance strategies for physical synthesis,” Proc. TAU, 2000.
[14] D.G. Chinnery and K. Keutzer, “Closing the gap between ASIC and custom: an ASIC perspective,” Proc. DAC, pp. 637-641, 2000.
[15] W.J. Dally and A. Chang, “The role of custom design in ASIC chips,” Proc. DAC, pp. 643-647, 2000.
[16] IBM SA-27E ASIC standard cell datasheet.
[17] P. Hurat, “Beyond physical synthesis,” SNUG Europe 2001.
[18] K. Usami, et al., “Automated low-power technique exploiting multiple supply voltages applied to a media processor,” IEEE J. Solid-State Circ., pp. 463-472, Mar. 1998.
[19] M. Takahashi, et al., “A 60-mW MPEG4 video codec using clustered voltage scaling with variable supply-voltage scheme,” IEEE J. Solid-State Circ., pp. 1772-1780, Nov. 1998.
[20] K. Usami and M. Horowitz, “Cluster voltage scaling technique for low power design,” Proc. ISLPED, pp. 3-8, 1995.
[21] C. Akrout, et al., “A 480MHz RISC microprocessor in a 0.12µm Leff CMOS technology with copper interconnects,” IEEE J. Solid-State Circ., pp. 1609-1616, Nov. 1998.
[22] S. Sirichotiyakul, et al., “Standby power minimization through simultaneous threshold voltage and circuit sizing,” Proc. DAC, pp. 436-441, 1999.
[23] S. Borkar, “Design challenges of technology scaling,” IEEE Micro, pp. 23-29, Jul-Aug 1999.
[24] R. Chau, et al., “30nm physical gate length CMOS transistors with 1.0ps NMOS and 1.7ps PMOS gate delays,” Proc. IEDM, pp. 45-48, 2000.
[25] S. Song, et al., “CMOS device scaling beyond 100nm,” Proc. IEDM, pp. 235-238, 2000.
[26] H. Wakabayashi, et al., “45-nm gate length CMOS technology and beyond using steep halo,” Proc. IEDM, pp. 49-52, 2000.
[27] M. Mehrotra, et al., “A 1.2V, sub-0.09µm gate length CMOS technology,” Proc. IEDM, pp. 419-422, 1999.
[28] I.Y. Yang, et al., “Sub-60nm physical gate length SOI CMOS,” Proc. IEDM, pp. 431-434, 1999.
[29] A. Ono, et al., “A 70nm gate length CMOS technology with 1.0V operation,” VLSI Symp. Tech., pp. 14-15, 2000.
[30] M. Rodder, et al., “A scaled 1.8V, 0.18µm gate length CMOS technology: device design and reliability considerations,” Proc. IEDM, pp. 415-418, 1995.
[31] L. Su, et al., “A high-performance 0.08µm CMOS,” VLSI Symp. Tech., pp. 12-13, 1996.
[32] K. Chen and C. Hu, “Performance and Vdd scaling in deep submicrometer CMOS,” IEEE J. Solid-State Circ., pp. 1586-1589, Oct. 1998.
[33] C. Hu, “Device and technology impact on low power electronics,” in Low Power Design Methodologies, ed. Jan Rabaey, Kluwer, pp. 21-35, 1996.
[34] S. Mutoh, et al., “1V Multi-Threshold CMOS DSP with an efficient power management technique for mobile phone application,” Proc. ISSCC, pp. 168-169, 1996.
[35] J.T. Kao and A.P. Chandrakasan, “Dual-threshold voltage techniques for low-power digital circuits,” IEEE J. Solid-State Circ., pp. 1009-1018, Jul. 2000.
[36] T. Kuroda, et al., “A 0.9V, 150MHz, 10mW, 4mm², 2-DCT core processor with variable VT scheme,” IEEE J. Solid-State Circ., pp. 1770-1778, Nov. 1996.
[37] H. Kawaguchi, et al., “A CMOS scheme for 0.5V supply voltage with picoampere standby current,” Proc. ISSCC, pp. 192-193, 1998.
[38] M.C. Johnson, et al., “Leakage control with efficient use of transistor stacks in single threshold CMOS,” Proc. DAC, pp. 442-445, 1999.
[39] L. Wei, et al., “Design and optimization of dual-threshold circuits for low-voltage low-power applications,” IEEE Trans. VLSI Systems, pp. 16-24, Mar. 1999.
[40] S. Tyagi, et al., “A 130nm generation logic technology featuring 70nm transistors, dual-Vt transistors and 6 layers of Cu interconnects,” Proc. IEDM, pp. 567-570, 2000.
[41] http://www-device.eecs.berkeley.edu/~dennis/BACPAC, see also: http://vlsicad.cs.ucla.edu/GSRC/GTX
[42] J.M. Musicer and J. Rabaey, “MOS current mode logic for low power, low noise CORDIC computation in mixed-signal environments,” Proc. ISLPED, pp. 102-107, 2000.