Overcoming the Power Wall: Connecting Voltage Domains in Series
Pradeep S. Shenoy, Sai Zhang, Rami A. Abdallah, Philip T. Krein, and Naresh R. Shanbhag
ECE Dept., University of Illinois at Urbana-Champaign
Urbana, IL
[email protected], [email protected]
Abstract—This paper analyzes an alternative system-level
power configuration for digital circuits. To overcome the power
wall and enable low supply voltages for digital circuits, the
voltage domains of microprocessors or other digital circuit
elements are connected in series. This enhances overall system
efficiency and performance by allowing both the digital circuits
and the power delivery circuits to operate in their region of
highest efficiency. Power consumption is dramatically reduced
because multiple, independent voltage levels enable each
processor to operate at its local minimum energy point. A
comparative analysis of series and parallel voltage domains
concludes that series voltage domains consume less power.
Voltage regulation schemes ranging from software and firmware
to hardware are presented. Some of the challenges
and opportunities of series circuits are discussed.
Keywords—voltage regulation, series circuits, microprocessor
loads, multi-core, many-core, power delivery, switching converters
I. INTRODUCTION
The trend toward parallelism to increase computational
throughput using many-core microprocessors (i.e. tens or
hundreds of cores) motivates a renewed look at power delivery.
Even advanced power delivery designs that allow inactive
cores to independently reduce supply voltage or even shut
down [1], [2] may not be effective for system-level energy
reduction. Some reduction in power consumption may be
obtained by placing fast, on-chip switching converters in the
power path [3], [4]. However, chip-level converters tend to be
less efficient than discrete designs. They also add another
power conversion stage. While these approaches have some
merits, it is not clear whether the overall system efficiency
improves. On-chip conversion places a higher thermal load on
the processor package. A nonlinear breakthrough in power
delivery has been sought [5].
With increasing demand for high performance and low cost
in modern computing systems, processes continue to scale
down for faster devices and smaller area. At the same time,
supply voltage scales down. According to the International
Technology Roadmap for Semiconductors (ITRS), supply
voltage will fall below 600 mV by 2024 [6], as shown in Fig. 1.
However, the power consumption of high performance
computing systems tends to remain the same, limited by heat
dissipation from a given package area. Heat removal is not the
only challenge associated with the so-called power wall [7].
Low-voltage power is challenging and costly to supply. In
traditional computing systems, digital loads in parallel share a
common supply voltage. In multicore microprocessors (which
may draw 100 W or more) with supplies below 1 V, the high
current places a heavy burden on dc-dc converters in the
system [8]. As supply voltages decrease, the effective load
impedance decreases in proportion to V_dd^2, complicating power
delivery. The bus capacitance required to keep the dynamic
variation of the supply voltage within bounds also increases as
V_dd drops.
Connecting the voltage domains of digital circuits in series,
shown in Fig. 2, has the potential to be the needed
breakthrough. The underlying concept was proposed in [9],
[10], in which a few digital loads are connected in series and
the total supply voltage is increased. In this paper, the basic
concept is extended to large numbers of series-connected
digital gates, cores, chips, or even boards. The load current
provided by the power supply drops in proportion to the
number of series blocks, which typically increases the
efficiency of power delivery. Here, we also seek to support
multiple voltage domains within a series structure, such that
each digital device or core can operate at its highest efficiency.
Figure 1. Supply voltage scaling trends projected by the ITRS [6].
Figure 2. Voltage domains of digital circuits connected in series.
The idea of using multiple voltage domains that enable
functional blocks to operate at independent voltage levels to
reduce overall power consumption has been attempted in parallel
configurations [11]. Noise performance is known to be superior
with stacked load configurations [12]. Here, the technique is
not physical stacking (i.e. 3D stacking) of integrated circuits,
but rather electrical series connection of multiple voltage
domains. Of course, this approach does not exclude 3D
stacking [13]. It also applies to non-computational circuits such
as memory [14].
This research considers a systems-oriented viewpoint, in
which a series connection is extended to large numbers of
modules and can be considered across the range from multiple
function modules within a single IC up to interconnected
boards in a data center. This paper first examines the minimum
energy operating point of a CMOS circuit. The series
connected voltage domain approach is then presented,
analyzed, and compared to the conventional parallel load case.
A hierarchy of voltage regulation schemes for the intermediate
nodes is also discussed.
II. CORE ENERGY OPTIMAL OPERATING POINT
Reducing energy in ultra low-power (ULP) systems such as
medical portable processors and implants, distributed sensor
networks, and active RFIDs is vital for extended battery and
design lifetime. Aggressive voltage scaling, where supply
voltage (V_dd) is lowered to trade performance and clock
frequency for energy, is a typical approach in ULP
applications. The intent is to take advantage of modest
performance requirements that can be met with clock rates
ranging from a few hundred Hz to a few kHz. Reducing V_dd
results in a quadratic reduction in dynamic energy at the
expense of increased delay. However, as V_dd drops below the
device threshold voltage (V_t) into the subthreshold region,
leakage energy starts to increase due to the exponential
increase in delay and becomes comparable to dynamic energy.
This trade-off between leakage and dynamic energy results in a
minimum total energy point at an optimal V_dd < V_t. Several studies have
addressed this trade-off [15]-[17], and proposed circuit-level
design techniques [18], [19] to reduce leakage and lower the
optimal energy point. Recently, various subthreshold ASICs
[20] and general-purpose processors [16] have emerged for
ULP applications.
A. Minimum Energy Operation
The dominant forms of energy consumed per clock cycle in
subthreshold designs are dynamic energy, E_dyn, and leakage
energy, E_lkg. Dynamic energy is caused by parasitic capacitance
charging and discharging with a switching current, I_ON, and
leakage energy is due to the leakage current, I_OFF. Given a
processing element with N processing nodes, each with
capacitance C, switching activity factor α, operating frequency
f, and supply voltage V_dd, the total energy per clock cycle, E_o,
is expressed as

E_o = E_{dyn} + E_{lkg}, \qquad E_{dyn} = \alpha N C V_{dd}^{2}, \qquad E_{lkg} = \frac{N I_{OFF,V_{dd}} V_{dd}}{f}    (1)
The subthreshold current as a function of the gate-to-source and
drain-to-source voltages is given by

I_{SUB}(V_{GS}, V_{DS}) = I_o \, 10^{\frac{V_{GS} - V_{th} + \eta V_{DS}}{S}} \left( 1 - e^{-\frac{V_{DS}}{V_T}} \right)    (2)

where I_o is the reference current and is proportional to the
transistor W/L ratio, S is the swing factor, η is the DIBL
coefficient, V_th is the threshold voltage, and V_T is the thermal
voltage (26 mV at room temperature). Using (2), the switching
and leakage currents for an NMOS transistor are
I_{ON,V_{dd}} = I_{SUB}(V_{dd}, V_{dd}) and I_{OFF,V_{dd}} = I_{SUB}(0, V_{dd}), respectively.
Assuming the processing element critical path consists of L
processing nodes, each with capacitance C, the operating
frequency f is given by

f = \frac{I_{ON,V_{dd}}}{\beta L C V_{dd}}    (3)

where β is a fitting parameter. Due to the exponential
dependence of I_ON on V_dd in (2), the subthreshold frequency
decreases exponentially as V_dd is reduced, leading to an
exponential increase in leakage energy. This can be seen
by substituting (3) into (1) to get
E_{lkg} = \beta N L C V_{dd}^{2} \, \frac{I_{OFF,V_{dd}}}{I_{ON,V_{dd}}} = \beta N L C V_{dd}^{2} \, 10^{-\frac{V_{dd}}{S}}    (4)
and the total subthreshold energy is given by

E_o = N C V_{dd}^{2} \left( \alpha + \beta L \, 10^{-\frac{V_{dd}}{S}} \right)    (5)
Therefore, reducing V_dd into the subthreshold region decreases
E_dyn but increases E_lkg exponentially, such that an optimum
subthreshold operating voltage exists.
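To make the trade-off in (5) concrete, the minimal Python sketch below sweeps the supply voltage, evaluates the dynamic and leakage terms, and reports the resulting minimum energy point. All parameter values (N, L, C, S, α, β and the sweep range) are illustrative placeholders, not values taken from this paper.

```python
import numpy as np

# Illustrative placeholder parameters (not taken from the paper)
N = 1000        # number of processing nodes
L = 23          # critical-path depth in processing nodes
C = 1e-15       # per-node capacitance [F]
S = 0.1         # subthreshold swing factor [V/decade]
alpha = 0.3     # switching activity factor
beta = 1.0      # delay fitting parameter

vdd = np.linspace(0.15, 0.60, 500)                      # supply voltage sweep [V]
e_dyn = alpha * N * C * vdd**2                          # dynamic term of Eq. (5)
e_lkg = beta * N * L * C * vdd**2 * 10.0**(-vdd / S)    # leakage term, cf. Eq. (4)
e_tot = e_dyn + e_lkg                                   # total energy, Eq. (5)

i_opt = int(np.argmin(e_tot))
print(f"Minimum-energy supply voltage: {vdd[i_opt]:.2f} V")
print(f"Energy per cycle at that point: {e_tot[i_opt] * 1e15:.2f} fJ")
```

With any such parameter set, the dynamic term falls and the leakage term rises as V_dd is reduced, so the sweep exhibits the interior minimum described above; the precise optimum voltage depends on the assumed technology parameters.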
B. Subthreshold Multiply-Accumulate (MAC) Unit
The subthreshold energy behavior can be illustrated through
a 23-bit output multiply-accumulate (MAC) unit, which is
widely used in various applications. The MAC unit computes
y[n] = y[n-1] + x_1[n] x_2[n], where x_1 and x_2 are 10-bit input
operands and n is the clock-cycle/time index. We use a ripple-
carry based architecture with a 1-bit full adder as the building
block/processing node to study the energy profile per MAC
operation (shown in Fig. 3).

Figure 3. Ripple-carry based architecture of the 23-bit output MAC unit.

Fig. 4 shows the MAC unit energy profile based on the analytic
model in (5) and SPICE simulations in a 130-nm commercial
CMOS process at different switching activity factors α. The
analytical model agrees closely with the SPICE simulations. As
the voltage is reduced, E_lkg increases while E_dyn decreases, and
the minimum energy point is reached at V_dd,opt = 0.33 V for
α = 0.3. Lowering α not only decreases the total energy through
the reduction in E_dyn but also pushes the optimum V_dd to higher
values, since the contribution of E_dyn to the total system energy
E_o relative to E_lkg decreases.
III. POWER CONSUMPTION ANALYSIS
While linear regulators offer tight voltage control and can
be designed for fast transient response, the energy loss penalty
is too large to consider in most designs, particularly for
subthreshold applications. There are two types of dc-dc
converters generally used in digital circuit supplies. Switched
capacitor (SC) converters use capacitors and a switch network
to transfer charge and convert voltage levels. They can achieve
high conversion ratios and are relatively easy to integrate on
chip. Non-isolated buck dc-dc converters are more common in
low-voltage applications but require an inductor which is
difficult to integrate. Conduction loss and switching loss
reduce the efficiency of any switching converter, especially when
the current level is high.
A. Buck Converter Loss Model
Modern high performance circuits usually use buck
converters to deliver high power at low output voltage. A
simple model is shown in Fig. 5. There are two main loss
components associated with buck converters [21]. Conduction
loss can be modeled as
P_{cond} = \left( I_{load}^{2} + \frac{\Delta I_{load}^{2}}{12} \right) \left( r_L + D r_1 + (1-D) r_2 \right)    (6)

where I_load, ΔI_load, and D represent the load current, the load
current ripple, and the high-side switch duty ratio, and r_L is the
inductor series resistance (Fig. 5). Resistances r_1 and r_2
represent the on-state resistance of the high-side and low-side
switches, respectively. Switching loss can be modeled as

P_{sw} = \frac{1}{2} I_{load} V_{in} (t_{on} + t_{off}) f_{sw} + \frac{1}{2} C_{OSS} V_{in}^{2} f_{sw}    (7)

where V_in, t_on, t_off, f_sw, and C_OSS are the input voltage, the switch
turn-on time, the switch turn-off time, the switching frequency,
and the effective output capacitance of the MOSFET. Both
power loss contributions increase as load current increases. A
parallel connected load will have high load current at low
output voltage and might not be the best configuration.
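To show how (6) and (7) are evaluated, the sketch below estimates buck converter conduction and switching loss for a low-voltage, high-current rail and a higher-voltage, low-current rail of roughly equal power. The component values (switch resistances, inductor resistance, switching times, frequency, output capacitance) and input voltages are assumptions chosen only for illustration.

```python
def buck_losses(i_load, v_in, v_out, di_load=0.2,
                r1=2e-3, r2=1.5e-3, r_l=1e-3,
                t_on=10e-9, t_off=10e-9, f_sw=500e3, c_oss=2e-9):
    """Estimate buck converter losses per Eqs. (6) and (7).
    All default component values are illustrative assumptions."""
    d = v_out / v_in                                   # ideal high-side duty ratio
    # Conduction loss, Eq. (6)
    p_cond = (i_load**2 + di_load**2 / 12.0) * (r_l + d * r1 + (1.0 - d) * r2)
    # Switching loss, Eq. (7)
    p_sw = (0.5 * i_load * v_in * (t_on + t_off) * f_sw
            + 0.5 * c_oss * v_in**2 * f_sw)
    return p_cond, p_sw

# Hypothetical operating points delivering roughly 30 W each
for name, i_load, v_in, v_out in [("0.3 V / 100 A rail", 100.0, 12.0, 0.3),
                                  ("19.2 V / 1.6 A rail", 1.6, 48.0, 19.2)]:
    p_cond, p_sw = buck_losses(i_load, v_in, v_out)
    print(f"{name}: Pcond = {p_cond:.2f} W, Psw = {p_sw:.2f} W")
```

With these assumed values the conduction term dominates the high-current case, consistent with the observation above that both loss contributions grow with load current.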
B. Power Model for Parallel Connected Loads
A parallel connected load is shown in Fig. 6(a). Here N
cores are running at different speeds, so the optimum voltage of
each core differs and is denoted by V_dd,i. In a parallel
configuration, all cores must have the same voltage, set to the
maximum value of all the optimum operating voltages:

V_{dd} = \max_{i=1 \ldots N} \left( V_{dd,i} \right)    (8)

The load current flowing in core i can be expressed as

I_{load,i} = \alpha_i C_{load,i} V_{dd} f_{sw,i}    (9)

The power consumption of N cores computing in parallel is

P_{out,parallel} = V_{dd} \sum_{i=1}^{N} I_{load,i} = \sum_{i=1}^{N} \alpha_i C_{load,i} V_{dd}^{2} f_{sw,i}    (10)
C. Power Model for Series Connected Loads
Parallel connections require every core to have the same
V_dd, which is not ideal for cores running at slow frequencies.
Additional cascaded converters can be used to enable a per-core
V_dd, but the area and power overhead can reduce the energy
savings. As shown in Fig. 6(b), one way to enable a per-core V_dd
and reduce the total output current is to connect cores in series
so that each core can have its own V_dd, and the computational
load is balanced so that all cores consume the same current. In
this case, the load current can be expressed as

I_{load} = \alpha_1 C_{load,1} V_{dd,1} f_{sw,1} = \cdots = \alpha_N C_{load,N} V_{dd,N} f_{sw,N}    (11)
Each core can be configured to its optimum supply voltage
V_dd,i, so the total power consumption of N cores in series is

P_{out,series} = \sum_{i=1}^{N} I_{load,i} V_{dd,i} = \sum_{i=1}^{N} \alpha_i C_{load,i} V_{dd,i}^{2} f_{sw,i}    (12)

Figure 4. Subthreshold energy of the 23-bit output MAC at α = 0.3 and 0.1.

Figure 5. Model of a buck converter showing the on-state resistance of the switches and the inductor series resistance.

Figure 6. (a) Parallel connected load configuration. (b) Series connected load configuration.
D. Comparison
It is straightforward to show that the series configuration
consumes less power than the parallel connected loads because of
the independent voltage domains. Since V_dd in the parallel
configuration is the maximum of all the per-core optimum
voltages V_dd,i, comparing (10) and (12) shows that

P_{out,series} < P_{out,parallel}    (13)

Moreover, in the series configuration, the load current I_load
supplied by the power delivery circuits is roughly N times
smaller than in the parallel configuration. Since the power loss
in the dc-dc converter,

P_{loss} = P_{cond} + P_{sw}    (14)

increases with load current, it follows that

P_{loss,series} < P_{loss,parallel}    (15)

Hence, both the power consumed in the digital circuit load and the
power loss in the power supply are lower in the series connected
system.
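The inequality in (13) is easy to check numerically. The sketch below evaluates (10) and (12) for a handful of cores with hypothetical per-core optimum voltages, activity factors, capacitances, and clock frequencies; all values are illustrative assumptions only.

```python
import numpy as np

# Hypothetical per-core parameters (illustrative only)
vdd_opt = np.array([0.30, 0.35, 0.40, 0.50])    # per-core optimum supply [V]
alpha   = np.array([0.10, 0.15, 0.20, 0.30])    # switching activity factors
c_load  = np.full(4, 1e-9)                      # switched capacitance per core [F]
f_sw    = np.array([50e6, 80e6, 120e6, 200e6])  # per-core clock frequency [Hz]

# Parallel: every core runs at the maximum optimum voltage, Eqs. (8) and (10)
vdd_par = vdd_opt.max()
p_parallel = np.sum(alpha * c_load * vdd_par**2 * f_sw)

# Series: every core runs at its own optimum voltage, Eq. (12)
p_series = np.sum(alpha * c_load * vdd_opt**2 * f_sw)

print(f"P_out,parallel = {p_parallel * 1e3:.2f} mW")
print(f"P_out,series   = {p_series * 1e3:.2f} mW")
print(f"Series/parallel ratio = {p_series / p_parallel:.2f}")
```

Note that a physical series stack must additionally satisfy the equal-current condition (11), for example through the load-balancing techniques of Section IV; the sketch only compares the two power expressions.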
IV. REGULATION OF SERIES CONNECTED VOLTAGE
DOMAINS
A key challenge in series voltage domains is regulating the
intermediate node voltages. Voltage domains should be
regulated to avoid computational errors and limit stress on the
transistors. The digital circuits in each voltage domain may not
naturally draw the same current at the desired operating
voltage. Series circuits, however, must conduct the same
current through each element. This causes the voltage at each
domain to deviate from the desired level. Just like a simple
voltage divider circuit, the steady-state voltage of each domain
is determined by the effective impedance of the load elements
at that domain.
The tradeoff is clear when a potential design is considered.
For example, a 64-core processor designed to work at 30 W
and 0.3 V requires power at 0.3 V and 100 A when connected
in parallel (shown in Fig. 7 (a)). In series, it requires 19.2 V
and only 1.6 A (shown in Fig. 7 (b)). This is an increase in the
effective load impedance from 3 mΩ to 12.3 Ω (about 4000
times larger). The latter power supply will be much more
efficient, with better transient response and lower node-by-
node voltage ripple. However, the higher operating voltage
introduces 63 intermediate voltage nodes that need regulation.
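The figures quoted above follow directly from the 30 W budget; the short sketch below reproduces the arithmetic for the parallel and series cases.

```python
n_cores, p_total, v_core = 64, 30.0, 0.3   # 64-core processor, 30 W, 0.3 V per core

# Parallel: one 0.3 V rail carrying the full current
i_parallel = p_total / v_core              # = 100 A
r_parallel = v_core / i_parallel           # = 3 mOhm

# Series: core voltages stack, current drops by the number of cores
v_series = n_cores * v_core                # = 19.2 V
i_series = p_total / v_series              # ~ 1.56 A (about 1.6 A)
r_series = v_series / i_series             # ~ 12.3 Ohm

print(f"Parallel: {v_core} V, {i_parallel:.0f} A, {r_parallel * 1e3:.0f} mOhm")
print(f"Series:   {v_series} V, {i_series:.2f} A, {r_series:.1f} Ohm")
print(f"Impedance ratio ~ {r_series / r_parallel:.0f}x")
```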
There are several techniques that can be used to actively
regulate the voltage domains. Broadly speaking, the goal is to
manage the effective impedance of each load in software first
and only utilize hardware mechanisms as a backup. This
proactive approach reduces loss in the power delivery circuits.
A hierarchical list of techniques (shown pictorially in Fig. 8) is
as follows:
1. Compiler system/scheduler
2. Run-time system/scheduler
3. Distributed hardware schedule implementer
4. Clock rates
5. Communications (local and external)
6. Task swapping
7. External exchange (local power electronics)
8. Tolerance
9. Global supply
10. Protection
The first line of defense in voltage regulation occurs at the
software level. If the desired voltages of the voltage domains are
equal, the operating system would manage software threads at
compile time or run time to evenly balance the computational
load of each element. The operating system could also dispatch
activity unevenly if unequal voltage levels are desired.
Another regulation layer is at the firmware level. A task
manager could distribute operations among computational
elements without the knowledge of the operating system. Task
swapping and task rotation are possible techniques. Clock rates
can be adjusted to throttle the activity at each voltage domain
and control the effective load impedance. It is important to
keep in mind that, even though Fig. 8 abstractly displays one
circuit at each voltage level, a voltage domain could consist of
several elements in parallel. The activity of all the elements
could be managed to control the effective load impedance.
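As one hypothetical illustration of clock-rate based regulation, the sketch below applies a simple proportional rule: a domain whose node voltage sits above its target has too high an effective impedance, so its clock rate (and hence activity) is nudged upward to pull the voltage back down, and vice versa. The control law, gain, and limits are assumptions for illustration, not a scheme specified in this paper.

```python
def adjust_clocks(v_measured, v_target, f_clk, k_p=0.5, f_min=1e6, f_max=100e6):
    """One step of a hypothetical proportional load-balancing loop.

    In a series stack, a domain whose voltage is above its target has too
    high an effective impedance, so its clock rate (activity) is increased
    to pull the node voltage back down, and vice versa.
    """
    f_next = []
    for v, vt, f in zip(v_measured, v_target, f_clk):
        error = v - vt                          # positive: domain voltage too high
        f_new = f * (1.0 + k_p * error / vt)    # raise clock to lower impedance
        f_next.append(min(max(f_new, f_min), f_max))
    return f_next

# Example: three series domains targeting 0.3 V each
v_meas = [0.33, 0.30, 0.27]
v_tgt  = [0.30, 0.30, 0.30]
f_clk  = [50e6, 50e6, 50e6]
print([f"{f / 1e6:.1f} MHz" for f in adjust_clocks(v_meas, v_tgt, f_clk)])
```

In practice such a loop would be one layer in the hierarchy above, acting only after compile-time and run-time scheduling have balanced the bulk of the load.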
If the preceding methods are unable to maintain voltages at
the specified targets, hardware based regulation approaches can
be used. This includes using local dc-dc converters to actively
regulate the voltage [22]. These converters could be integrated
on-chip or be placed off chip.

Figure 7. Comparison of a 64-core, 30 W processor with cores connected in (a) parallel and in (b) series. The effective impedance of series cores is about 4000 times larger than that of parallel cores.

Figure 8. Various methods of load balancing and voltage regulation using software, firmware, and hardware.

Because these circuits process
the power delivered to the digital load and incur power loss, it
is preferable to avoid voltage regulation by power delivery
circuits when possible. It may be unavoidable in some
situations though. A unique advantage of series connected
loads is that the dc-dc converters only need to process a
differential fraction of the load power. Instead of processing the
entire load current (as would be needed in standard approaches
with parallel loads) the voltage regulators only need to provide
the difference in current between voltage domains. The effect
on voltage overshoot/undershoot caused by load transients will
also be less severe since load current step magnitude will be
substantially reduced (e.g. by a factor of 1/n where n is the
number of voltage domains). In extreme situations, Zener
diodes can protect the digital circuits from overvoltage
conditions. The experimental prototype shown in Fig. 9 has
been used to verify some of the regulation schemes described
above.
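A rough way to quantify the differential-processing advantage is sketched below: in the parallel case the converters process every domain's full current, while in the series case local regulators only supply or absorb each domain's deviation from the common stack current (taken here as the mean). The domain currents and voltage are hypothetical values used only for illustration.

```python
# Hypothetical per-domain currents in a 4-domain, 0.3 V-per-domain stack [A]
i_domain = [1.4, 1.6, 1.5, 1.7]
v_domain = 0.3

# Parallel: the converters process the full load current of every domain
p_parallel_converter = v_domain * sum(i_domain)

# Series: the stack carries a common current; local regulators only supply
# or absorb each domain's deviation from that common current
i_stack = sum(i_domain) / len(i_domain)
p_series_converter = v_domain * sum(abs(i - i_stack) for i in i_domain)

print(f"Power processed by converters, parallel: {p_parallel_converter:.2f} W")
print(f"Power processed by converters, series:   {p_series_converter:.3f} W")
```

The closer the domains are balanced by the software and firmware layers, the smaller the residual power the local regulators must handle.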
V. FUTURE WORK
While series connected voltage domains have considerable
promise, level shifting and inter-domain communications
complicate implementation. If viable solutions can be found,
this paradigm becomes attractive. That said, level conversion is
standard in modern SOCs, and microprocessors exist which
already employ multiple voltage domains. The inter-core
communication bandwidth in a series connected system should
be close to that in a typical multicore chip. Viable
communication techniques include, but are not limited to,
capacitively-driven wires [23], optical interconnects [24], and
even wireless transmission [25].
What remains to be seen is whether additional communication
would be needed. This may motivate rethinking computer
architecture to reduce communication or to limit it so that
latency and power overhead have minimal impact. Similarly,
this paradigm may lend itself to applications where there is
little to no interdependence in the data and computations can be
made in parallel without significant interaction. The goal may
be to develop an initial mapping that is done in such a way as
to balance the workloads across the processors and limit
communication. Interestingly enough, it may turn out that
communication power overhead decreases since a dominant
factor is the signal swing (V_max − V_min) on the line. If lower
voltages are enabled through series connections, the signal
swing would be lower thereby reducing power overhead.
This position paper aims to lay down a vision for series
connected voltage domains and is not exhaustive. Certain
aspects of this vision are elaborated upon, and others are
enumerated and may require further study. Other challenges
include efficient and effective regulation of intermediate node
voltages, coordination between digital circuits in different
voltage domains, latency, and packaging. The potential gains
are substantial and worth the effort to work towards a complete
solution.
VI. CONCLUSION
A paradigm shift in system-level design of digital circuits is
driven by power concerns. Currently available options will not
be sufficient to overcome the power wall. Series connection
of digital circuit voltage domains has great potential to be the
nonlinear breakthrough required. This approach reduces power
consumption in digital loads by allowing each element to
operate at lower voltages or even its minimum energy
operating point. Power delivery complexity and loss is reduced
substantially as conducted current decreases. In turn, the
number of pins dedicated to power is reduced. The overall
system is able to operate efficiently over a variety of
conditions.
A key to enabling this approach is computational load
management and regulation of the intermediate node voltages.
A variety of techniques ranging from software based to power
electronics are feasible. A general benefit of a series connection
it that it facilitates substantial voltage reduction. This technique
is especially suited for many-core processors that are able to
run at very low voltage (e.g. 0.3 V).
REFERENCES
[1] D.E. Lackey et al., Managing power and performance for system-on-
chip designs using Voltage Islands, in Proc. ACM/IEEE Int. Conf.
Computer Aided Design, 2002, pp. 195- 202.
[2] J. Hu, Y. Shin, N. Dhanwada, and R. Marculescu, Architecting voltage
islands in core-based system-on-a-chip designs, in Proc. IEEE Int.
Symp. Low Power Electron. Design, 2004, pp. 180-185.
[3] H.P. Le, M. Seeman, S.R. Sanders, V. Sathe, S. Naffziger, and E. Alon,
A 32nm fully integrated reconfigurable switched-capacitor DC-DC
converter delivering 0.55W/mm2 at 81% efficiency, in Proc. IEEE Int.
Solid-State Circuits Conf., 2010, pp. 210-211.
[4] W. Kim, D.M. Brooks, and G.Y. Wei, A fully-integrated 3-level
DC/DC converter for nanosecond-scale DVS with fast shunt regulation,
in Proc. IEEE Int. Solid-State Circuits Conf., 2011, pp. 268-270.
[5] S.L. Smith, Power Optimization in the Connected World, presented at
the IEEE Int. Conf. Energy Aware Computing, Cairo, Egypt, Dec. 2010.
[6] 2009 ITRS, Overall Roadmap Technology Characteristics (Key Roadmap
Drivers), available, http://www.itrs.net/Links/2009ITRS/Home2009.htm
[7] T. Kuroda, Low-power, high-speed CMOS VLSI design, in Proc.
IEEE Int. Conf. Computer Design: VLSI in Computers and Processors
(ICCD'02), 2002, pp. 310-315.
[8] Intel Corp., Voltage regulator module (VRM) and enterprise voltage
regulator-down (EVRD) 11.1 design guidelines, available,
www.intel.com/assets/PDF/designguide/321736.pdf, Sept. 2009.
[9] S. Rajapandian, X. Zheng, and K.L. Shepard, Implicit DC-DC
downconversion through charge-recycling, IEEE J. Solid-State
Circuits, vol. 40, no. 4, pp. 846- 852, Apr. 2005.
[10] S. Rajapandian, K.L. Shepard, P. Hazucha, and T. Karnik, High-voltage
power delivery through charge recycling, IEEE J. Solid-State Circuits,
vol. 41, no. 6, pp. 1400-1410, June 2006.
Figure 9. Experimental platform with microcontrollers, communication
I/Os, and voltage regulation circuits.
[11] M. Nakai et al., Dynamic voltage and frequency management for a
low-power embedded microprocessor, IEEE J. Solid-State Circuits,
vol. 40, no. 1, pp. 28-35, Jan. 2005.
[12] G. Jie and C.H. Kim, Multi-story power delivery for supply noise
reduction and low voltage operation, in Proc. IEEE Int. Symp. Low
Power Electron. Design, 2005, pp. 192-197.
[13] P. Jain, T.H. Kim, J. Keane, and C.H. Kim, A multi-story power
delivery technique for 3D integrated circuits, in Proc. Int. Symp. Low
Power Electron. Design, 2008, pp. 57-62.
[14] A.C. Cabe, Z. Qi, and M.R. Stan, Stacking SRAM banks for ultra low
power standby mode operation, in Proc. ACM/IEEE Design
Automation Conf., 2010, pp. 699-704.
[15] B. Zhai, S. Hanson, D. Blauw, and D. Sylvester, Analysis and
mitigation of variability in subthreshold design, in Proc. Int. Symp.
Low-Power Elec. Des., 2005, pp. 20-25.
[16] B. Zhai et al., Energy efficient subthreshold processor design, IEEE
Trans. VLSI Sys., vol. 17, no. 8, Aug. 2009, pp. 1127-1137.
[17] L. Nazhandali et al., Energy optimization of subthreshold-voltage
sensor network processors, in Proc. ACM ISCA, 2005.
[18] J. Kwong and A. Chandrakasan, Variation-driven device sizing for
minimum energy sub-threshold circuits, in Proc. Int. Symp. Low Power
Electron. Des., 2006, pp. 8-13.
[19] S. Mutoh et al., 1-V power supply high-speed digital circuit technology
with multi-threshold voltage CMOS, IEEE J. of Solid-State Circ., vol.
30, pp. 847 - 853, Aug. 1995.
[20] A. Wang and A. Chandrakasan, A 180-mV subthreshold FFT processor
using a minimum energy design methodology, IEEE J. Solid-State
Circ., vol. 40, no. 1, pp. 310-319, Jan. 2005.
[21] P.S. Shenoy, V.T. Buyukdegirmenci, A.M. Bazzi, and P.T. Krein,
System Level Trade-Offs of Microprocessor Supply Voltage
Reduction, in Proc. IEEE Int. Conf. Energy Aware Comp., 2010.
[22] P.S. Shenoy, I. Fedorov, T. Neyens, and P.T. Krein, Power Delivery for
Series Connected Voltage Domains in Digital Circuits, in Proc. IEEE
Int. Conf. Energy Aware Comp., 2011.
[23] R. Ho et al., High-Speed and Low-Energy Capacitively-Driven On-
Chip Wires, in Proc. IEEE Int. Solid-State Circuits Conf., 2007,
pp.412-612.
[24] M. Haurylau et al., On-Chip Optical Interconnect Roadmap:
Challenges and Critical Directions, IEEE J. Sel. Topics Quantum
Electron., vol. 12, no. 6, pp. 1699-1705, Nov.-Dec. 2006.
[25] C. Hu, R. Khanna, J. Nejedlo, K. Hu, H. Liu, and P.Y. Chiang, A 90
nm-CMOS, 500 Mbps, 35 GHz Fully-Integrated IR-UWB Transceiver
With Multipath Equalization Using Pulse Injection-Locking for Receiver
Phase Synchronization, IEEE J. Solid-State Circuits, vol. 46, no. 5, pp.
1076-1088, May 2011.