Power and Cooling Capacity Management for Data Centers: White Paper #150
By Neil Rasmussen
problems with power and cooling infrastructure including overheating, overloads, and loss
of redundancy. The ability to measure and predict power and cooling capability at the rack
enclosure level is required to ensure predictable performance and optimize use of the
physical infrastructure resource. This paper describes the principles for achieving power
©2007 American Power Conversion. All rights reserved. No part of this publication may be used, reproduced, photocopied, transmitted, or 2
stored in any retrieval system of any nature, without the written permission of the copyright owner. www.apc.com WP #150 Rev 0
Introduction
According to Gartner Inc., most data center operators are unaware of the loading and current power and
cooling capability of their data centers, even at a total bulk level. Installing equipment that exceeds the
design density of the data center, and the resultant stresses on the power and cooling systems, are causing
downtime from overloads, overheating, and loss of redundancy.
CAPACITY MANAGEMENT
The IT Infrastructure Library (ITIL) defines capacity management as the discipline that ensures infrastructure is provided at the right time in the right volume at the right price, and that it is used in the most efficient manner. The critical success factors are:
• Providing accurate capacity forecasts
• Providing appropriate capacity to meet business needs
[Sidebar: ITIL definition. Providing infrastructure at the RIGHT TIME, in the RIGHT AMOUNT, at the RIGHT PRICE, and USED EFFICIENTLY]
This involves input from many areas of the business to identify what IT systems are (or will be) required,
what power and cooling infrastructure is required to support these IT loads, what level of contingency will be
needed, and what the cost of this infrastructure will be.
This paper applies the ITIL view of capacity management specifically to the problem of power, cooling, and
space capacity of data centers. A model is described for the quantification of power and cooling supply,
demand, and the different types of capacity that must be managed. This model can serve as a framework
for describing a capacity management system, or for establishing service level management guidelines.
[Figure: ITIL processes. ITIL Service Delivery processes (focused on business owners): Availability Management, IT Service Continuity Management, IT Financial Management, Capacity Management. ITIL Service Support processes (focused on end users): Problem Management, Change Management, Release Management, Configuration Management.]
Background
The ability to establish the power and cooling capability at a specific rack is extremely rare. Data center
operators typically do not have the information they need to effectively deploy new equipment at the rate
required by the business, and are unable to answer simple questions such as:
• Where in my data center should I deploy the next server so I don’t impact the availability of existing
equipment?
• From a power and cooling availability standpoint, where is the best location to deploy the proposed
IT equipment?
• Will I be able to install new equipment without negatively impacting my safety margins such as
redundancy and backup runtime?
• Will I still have power or cooling redundancy under fault or maintenance conditions?
• Can I deploy new hardware technology, such as blade servers, using my existing power and cooling
infrastructure?
The inability to answer these simple questions is common but unacceptable. For data centers which are
grossly over-designed or under-utilized, the safety margins can allow successful operation with only a
primitive understanding of overall system performance. The compromise in availability due to this lack of
knowledge may result in a small but tolerable amount of downtime. While not the most economically efficient,
in the short term oversizing provides a safety margin until such a time as the available capacity equals
capacity utilized. However, three factors are currently placing stresses on data centers which are in turn
exposing the inadequacies of current operating methods:
Each of these factors leads to pressure to operate data centers in a more predictable manner.
High-density IT equipment
IT equipment drawing more than 10 kW per rack enclosure can be considered high density. Fully populated racks of servers can draw from 6 kW to 35 kW per rack. Yet the vast majority of data centers today are designed for a power density of less than 2 kW per rack. As mentioned earlier, more and more users are installing equipment that exceeds the design density of their data centers and the resultant stresses on the power and cooling systems can cause downtime from overloads, overheating, and loss of redundancy. Data center operators need better information regarding how and where to reliably deploy this equipment in both existing and new data centers.
[Sidebar: Data center stress #1, high-density IT equipment: overloads, overheating, loss of redundancy]
While having power and cooling supply and demand information at the room or facility level helps, it does not
provide sufficiently detailed information to answer the questions about specific IT equipment deployments.
On the other hand, providing power and cooling supply and demand information at the IT device level is
unnecessarily detailed and intractable. An effective and practical level at which to measure and budget
power and cooling capacity is at the rack level, and this paper utilizes that approach (Figure 2).
Figure 2 – Level of control for capacity management
The model described in this paper quantifies power and cooling supply and demand at the rack level in four
important ways:
This information allows a complete description of the current status of a data center’s power and cooling at the
rack level.
The maximum power and cooling demand is always greater than or equal to the actual power and cooling
demand and is critical information for capacity management.
be measured by the power distribution system or it can be measured by the IT equipment itself, and the
reported power consumed by the set of IT devices within a rack can be summed to obtain the rack power.
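The summation described above can be sketched in a few lines. This is a minimal illustration only; the function name and the sample readings are hypothetical and do not come from any particular management tool.

```python
from typing import Iterable


def rack_power_kw(device_power_watts: Iterable[float]) -> float:
    """Sum per-device power readings (in watts) into a rack total (in kW)."""
    return sum(device_power_watts) / 1000.0


# Hypothetical readings reported by four IT devices in one rack, in watts.
readings_w = [350.0, 420.0, 275.0, 410.0]
print(f"Rack power: {rack_power_kw(readings_w):.3f} kW")
```

The same total could instead be read from a metered rack power distribution unit; the sketch only covers the case where the IT devices report their own consumption.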
It is an important function of a capacity management system to recognize when the current actual supply is
not the same as the design value, and to diagnose the source of the constraints of the system that are
preventing realization of the design supply capacity.
The actual power supply at a given rack is determined by knowing the available branch circuit capacity to the
rack, constrained by the availability of unutilized power of upstream sources such as PDUs and UPS. In
some cases, the available capacity is further constrained by the design or configuration of the power system.
For example, a modular system might not be fully populated or the design may call for dual power feeds.
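One way to picture this constraint is to treat the power still available to a rack as the smallest unused capacity along its supply path. The sketch below assumes a simple single-feed path (branch circuit, PDU, UPS); the class, field names, and figures are hypothetical, and real systems may add derating rules or dual-feed logic that are not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    """A point in the power path with a rated capacity and a present load."""
    name: str
    capacity_kw: float
    load_kw: float

    @property
    def headroom_kw(self) -> float:
        # Unused capacity; never reported as negative.
        return max(self.capacity_kw - self.load_kw, 0.0)


def available_rack_power_kw(branch: Source, upstream: List[Source]) -> float:
    """Unused power at the rack: branch headroom limited by every upstream source."""
    return min([branch.headroom_kw] + [s.headroom_kw for s in upstream])


# Hypothetical example: the branch circuit has room, but the PDU is nearly full.
branch = Source("branch circuit", capacity_kw=5.0, load_kw=3.2)
pdu = Source("PDU", capacity_kw=60.0, load_kw=59.0)
ups = Source("UPS", capacity_kw=200.0, load_kw=120.0)
print(available_rack_power_kw(branch, [pdu, ups]))
```

In this example the PDU, not the branch circuit, is the limiting element, which is the kind of constraint a capacity management system is expected to diagnose.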
Determining the actual cooling supply at a rack is typically more complex than determining the power supply,
and is highly dependent on the air distribution architecture. Unlike the power architecture, where the flow of
power is constrained by wires, airflow is typically delivered to an approximate group of racks, where it
spreads among the racks based on the draw of the fans in the IT equipment. This makes the computation of
available air capacity more complex, and sophisticated computer models are required. In cases where the
supply or return air is directly ducted to racks, the cooling supply at a rack is better defined and therefore
can be computed with improved accuracy.
Figure 3 – Quantifying demand and supply at the rack level
Use CAPACITY MANAGEMENT data to reduce TCO and increase efficiency:
• OPTIMIZE: On the SUPPLY side, reduce the gap between actual and design max, i.e., get the best to-the-rack delivery from installed power/cooling infrastructure
• RIGHT-SIZE: Reduce the gap between design max SUPPLY and design max DEMAND, i.e., match power/cooling to load, to increase efficiency and reduce waste
System-level Capacities
The demand on power and cooling is established at the rack. The supply, as described in the previous
section, must also be understood and quantified at the rack. However, the power and cooling supply system
is not established rack-by-rack but is hierarchical, with supply devices such as UPSs, PDUs, and air
conditioners supplying groups of racks. Bulk supply devices such as the power service entrance and cooling
towers also represent sources of capacity supply that must be sufficient for the demand. Therefore, in
addition to quantifying power and cooling supply capacity at the rack, it must also be quantified at the
aggregate levels aligned with the supply devices.
Figure 4 – Source of demand vs source of supply
[Figure labels: demand originates at the RACK; supply is provided by a system-wide hierarchy of infrastructure; bulk infrastructure (service entrance, chiller plant, cooling tower) supplies the ROOM]
Supply must always be greater than or equal to demand to prevent the data center from experiencing a
failure. This must be true at each rack, and it must also be true for each supply device supplying groups of
racks. Therefore, at any given time, there is always excess capacity. For purposes of capacity management,
excess capacity comes in four different forms:
• Spare capacity
• Idle capacity
• Safety margin capacity
• Stranded capacity
Each of these types of excess capacity is explained in the following sections.
Spare capacity
Spare capacity is the current actual excess capacity that can be utilized “right now” for new IT equipment.
Carrying spare capacity has significant capital and operating costs related to the purchase and maintenance
of the power and cooling equipment. Furthermore, spare capacity always brings down the operating
efficiency of a data center and increases its electrical consumption.
In an effective capacity management architecture for a growing and changing data center, certain types of
spare capacity, such as spare utility connection capacity, are cost effective. However, power and cooling
equipment should ideally be installed only when and where needed to meet growing demand.
An effective capacity management system must comprehend and quantify growth plans. For more
information on quantifying growth plans see APC White Paper #143, “Data Center Projects: Growth Model.”
Idle capacity
Idle capacity is the current actual excess capacity that is held available to meet the as-configured maximum
potential power or cooling demand. The existing IT equipment might need this capacity under peak load
conditions, so this idle capacity cannot be used to supply new IT equipment deployments.
Idle capacity is a growing problem caused by power management functions within IT equipment. The idle
capacity must be maintained for the times when power-managed IT equipment switches to high power
modes.
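The relationship between idle and spare capacity can be illustrated with simple arithmetic. The sketch below is an assumption-laden reading of the definitions above: idle capacity is taken as the gap between as-configured maximum demand and current actual demand, and spare capacity as whatever supply remains after maximum demand and a reserved safety margin. The function name, the treatment of the safety margin as a fixed kW figure, and the sample numbers are hypothetical; stranded capacity is not modeled.

```python
from typing import Dict


def excess_capacity_kw(
    actual_supply_kw: float,
    max_demand_kw: float,
    actual_demand_kw: float,
    safety_margin_kw: float = 0.0,
) -> Dict[str, float]:
    """Split excess capacity at one rack or supply device into idle and spare parts."""
    if not (actual_demand_kw <= max_demand_kw <= actual_supply_kw):
        raise ValueError("expected actual demand <= maximum demand <= supply")
    # Held back for existing equipment switching to high power modes.
    idle = max_demand_kw - actual_demand_kw
    # Usable right now for new equipment, after the reserved safety margin.
    spare = max(actual_supply_kw - max_demand_kw - safety_margin_kw, 0.0)
    return {"idle_kw": idle, "spare_kw": spare}


# Hypothetical rack: 10 kW supply, 6 kW maximum demand, 4 kW drawn now, 1 kW margin.
print(excess_capacity_kw(10.0, 6.0, 4.0, safety_margin_kw=1.0))
```

In this example 2 kW is idle and 3 kW is spare, so only the 3 kW can be offered to new IT equipment even though 6 kW is not being drawn at the moment.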
Stranded capacity
Stranded capacity is capacity that cannot be utilized by IT loads due to the design or configuration of the
system. The presence of stranded capacity indicates an imbalance between two or more of the following
capacities:
A specific IT device requires sufficient capacity of all five of the above elements. Yet these elements are
almost never available in an exact balance of capacity to match a specific IT load. Invariably, there are
locations with rack space but without available cooling, or spaces with available power but with no available
rack space. Capacity of one type that cannot be used because one of the other four capacities listed above
has been used to its maximum capacity is called stranded capacity. Stranded capacity is undesirable and
can seriously limit the performance of a data center. Unfortunately, most data centers have significant
stranded capacity issues, including the following common examples:
• An air conditioner has sufficient capacity but inadequate air distribution to the IT load
• A PDU has sufficient capacity but no available breaker positions
• Floor space is available but there is no remaining power
• Air conditioners are in the wrong location
• Some PDUs are overloaded while others are lightly loaded
• Some areas are overheated while others are cold
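The imbalance behind stranded capacity can be sketched numerically. The example below assumes that the remaining capacity of each resource at a location has already been expressed in a common unit (kW of IT load it could support); that normalization, the resource names, and the figures are hypothetical simplifications.

```python
from typing import Dict


def stranded_capacity_kw(remaining_kw: Dict[str, float]) -> Dict[str, float]:
    """Per-resource capacity that cannot be used because another resource runs out first."""
    if not remaining_kw:
        raise ValueError("at least one resource is required")
    # The scarcest resource limits how much new IT load the location can accept.
    usable = min(remaining_kw.values())
    return {name: kw - usable for name, kw in remaining_kw.items()}


# Hypothetical location: plenty of power left, but little cooling.
remaining = {"power": 8.0, "cooling": 2.0, "rack_space": 5.0}
print(stranded_capacity_kw(remaining))
```

Here cooling is the limiting resource, so 6 kW of power and 3 kW worth of rack space are stranded at this location until the cooling constraint is relieved.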
Depending on the situation and the architecture of the power and cooling system, it might be impossible to
utilize stranded capacity or it might be that only minor investments are needed to free stranded capacity so
that it can be effectively used. By definition, utilizing stranded capacity comes at a cost. It is often necessary
to take down part of the installation or install new power and cooling components.
Stranded capacity is a very frustrating capacity management problem for data center operators because it is
very hard to explain to users or management that a data center with 1 MW of installed power and cooling
capacity can’t cool the new blade servers when it is only operating at 200 kW of total load.
An effective capacity management system not only identifies and highlights stranded capacity, but also helps
customers avoid creating it in the first place.
Managing Capacity
The previous sections have established the framework for quantifying power and cooling supply and
demand. The ITIL framework specifies the functions which take place within the capacity management
process, including:
• Performance monitoring
• Workload monitoring
• Supply forecasting
• Demand forecasting
• Modeling
The above functions serve the ITIL-defined capacity management goals of providing accurate capacity
forecasts and providing appropriate capacity to meet business needs.
A power and cooling capacity management system based on measurement by technicians combined with
paper calculations could be envisioned, and in fact this method is used in crude form in some data centers.
However, with the advent of server virtualization and IT equipment that changes its own power and cooling
demand dynamically, the use of networked power and cooling instrumentation combined with power and
cooling capacity management software is the only practical and feasible solution. From a user’s perspective,
such a system would provide the following functionality:
• Room level: The bulk level supply and demand as well as the various capacities for the entire room.
Typically focuses on facility level UPS, generator, chiller, cooling tower, and service entrance
equipment.
• Row level: Power and cooling supply and demand associated with a row or other logical zone within
the data center. Often associated with cooling or power distribution equipment that is row-oriented,
such as PDUs, or row-oriented cooling systems. Particularly valuable for planning purposes when
rack-level details about configuration of specific racks are not yet known.
• Rack level: Power and cooling supply and demand associated with a specific rack or cabinet.
Information at this level is required to diagnose problems or to assess the impact of specific IT
device deployments. May be associated with rack level distribution circuits or rack-oriented cooling
systems.
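The room, row, and rack levels described above lend themselves to a simple roll-up of rack-level demand. The sketch below is illustrative only: the data layout (a mapping keyed by row and rack identifiers) and the sample values are hypothetical. Supply is deliberately not summed, since it is set by the supply devices serving each row or room rather than by the racks.

```python
from collections import defaultdict
from typing import Dict, Tuple


def roll_up_demand(
    rack_demand_kw: Dict[Tuple[str, str], float],
) -> Tuple[Dict[str, float], float]:
    """Aggregate rack-level demand, keyed by (row, rack), into row totals and a room total."""
    rows: Dict[str, float] = defaultdict(float)
    for (row, _rack), kw in rack_demand_kw.items():
        rows[row] += kw
    return dict(rows), sum(rows.values())


# Hypothetical demand figures for three racks in two rows.
demand = {("A", "A1"): 3.0, ("A", "A2"): 4.5, ("B", "B1"): 2.0}
row_totals, room_total = roll_up_demand(demand)
print(row_totals, room_total)
```

Each row total can then be compared with the capacity of the row-oriented PDU or cooling unit, and the room total with the facility-level UPS or chiller capacity.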
An effective capacity management system will provide a display of the above types of information in a
hierarchical drill-down model, including a graphical representation of the layout of the data center. Figure 6
illustrates the room-level view and Figure 7 illustrates the rack-level view.
[Figure 6, room-level view. Callouts: view accurate representation of data center floor layout; ability to drill in to row or rack level (as in Figure 7); visibility to average and peak power usage by measuring actual consumption]
Figure 7 – Example rack level view using APC Capacity Manager
ITIL specifically focuses on the issue of not just ensuring sufficient capacity, but ensuring appropriate
capacity. Too often the focus is on assuring sufficient capacity without regard for right-sizing to the actual IT
needs. The common result is oversizing with the associated waste of capital expenditures, energy, service
contracts, and water consumption.
Data center design tools help establish capacity plans and therefore should integrate into the capacity
management system. An example of such a suite of software tools is the APC InfraStruXure Designer data
center design tool, the APC InfraStruXure Central management platform, and the APC Capacity Manager.
Alerting on violations of the capacity plan
Capacity related alerts should be triggered when actual conditions are outside the boundaries of the capacity
management plan. These warnings can take the form of local, visual, or audible alerts, or can escalate via
the management system as pages, e-mails, etc.
• Increase of power consumption of installed equipment in a rack beyond the peak specified in the
capacity management plan for a rack, a row, or the room
• Reduction in available cooling or power capacity at the row, rack, or room level due to loss or
degradation of a power or cooling sub-system
• Cooling or power systems entering a state where they are not able to provide the redundancy
specified in the capacity management plan
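The three alert conditions above can be expressed as comparisons between a capacity plan and current measurements. The sketch below is a simplified assumption of how such checks might look at the rack level; the class names, fields, and alert wording are hypothetical and do not describe any specific product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RackPlan:
    peak_power_kw: float   # peak demand allowed by the capacity plan
    min_supply_kw: float   # power or cooling supply the plan relies on
    redundant: bool        # whether the plan specifies redundancy


@dataclass
class RackStatus:
    power_kw: float        # measured demand
    supply_kw: float       # currently available supply
    redundant: bool        # whether redundancy is currently intact


def capacity_alerts(plan: RackPlan, status: RackStatus) -> List[str]:
    """Return the capacity-plan violations found in the current status."""
    alerts = []
    if status.power_kw > plan.peak_power_kw:
        alerts.append("demand exceeds planned peak")
    if status.supply_kw < plan.min_supply_kw:
        alerts.append("available supply below plan")
    if plan.redundant and not status.redundant:
        alerts.append("planned redundancy lost")
    return alerts


# Hypothetical rack: demand has crept above plan and redundancy has been lost.
plan = RackPlan(peak_power_kw=4.0, min_supply_kw=5.0, redundant=True)
status = RackStatus(power_kw=4.3, supply_kw=5.0, redundant=False)
print(capacity_alerts(plan, status))
```

None of these checks requires a hardware fault to have occurred, which is why they sit alongside, rather than replace, fault monitoring.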
For many of these events, no actual hardware fault has occurred and hence no events would be triggered by
traditional monitoring systems. In fact, most alerts provided by a capacity management system are predictive
in nature. Note that in an actual data center, the capacity management system complements other
monitoring tools such as real time fault, security, water leak, and temperature monitoring. An example of a
monitoring system that provides both real time alerts as well as capacity management alerts is the APC
InfraStruXure Central (Figure 8).
Modeling proposed changes
In addition to the determination of current conditions, an effective capacity management system must
provide the ability to analyze the capacity in historical and hypothetical situations. These scenarios may
include:
• Simulating fault conditions, such as loss of one or more power or cooling devices
• Analyzing planned growth versus actual capacity usage
• Proposed equipment additions, removals, and relocations
• Trending based on historic data
The capacity management system should allow these scenarios to be evaluated against the current capacity
management plan. An effective model would guide the user to select the best scenario from options, for
example to maximize electrical efficiency or minimize floor space consumption.
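As an illustration of the first scenario type, simulating a fault condition, the sketch below checks whether a group of supply devices can still carry the demand after any single unit is lost. It assumes the units form a shared pool whose capacities simply add, which is a strong simplification (particularly for cooling, where air distribution matters); the function name and figures are hypothetical.

```python
from typing import Sequence


def survives_single_failure(unit_capacities_kw: Sequence[float], demand_kw: float) -> bool:
    """True if pooled capacity still covers demand after losing any one unit.

    Losing the largest unit is the worst case, so only that case is checked.
    """
    if not unit_capacities_kw:
        return False
    remaining = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining >= demand_kw


# Hypothetical pool of three 40 kW units.
units = [40.0, 40.0, 40.0]
print(survives_single_failure(units, demand_kw=70.0))  # 80 kW remain after one loss
print(survives_single_failure(units, demand_kw=90.0))  # 80 kW is no longer enough
```

The same check can be run against a proposed equipment addition by raising the demand figure before the change is made.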
To effectively utilize knowledge gained from detailed inventory management, the data must be understood
by a capacity management system.
In general, most small to medium data centers do not have the process maturity and staffing needed to
maintain rack-related IT equipment installation inventories and change history. Therefore, a capacity
management system cannot depend on the presence of this information, but should be able to take
advantage of it when available. As organizations mature, they can migrate from simplified capacity
management to a more detailed solution that incorporates change and inventory management. The
interaction between change management and capacity management is bi-directional as change
management is highly dependent on capacity management information to predict the impact of proposed
changes.
[Figure: capacity management process. ITIL functions: performance monitoring, workload monitoring, supply forecasting, demand forecasting, modeling. Monitoring: monitor infrastructure, monitor IT workload. System functions: set capacity plan, present capacity data, alert on violations of capacity plan, model proposed changes.]
Conclusion
Capacity management is an essential part of the efficient planning and operation of data centers. The need
for capacity management grows with the density, size, and complexity of the data center. A methodology for
capacity management has been described. It has been shown that capacity management is not dependent
on detailed information about the IT devices at the rack level and requires less effort to implement and
maintain than traditional detailed inventory management systems, while still providing most of the
key benefits. If capacity management is implemented as described in this white paper, it can provide critical
information about the state of the data center which is not provided by traditional monitoring systems.
About the Author:
Neil Rasmussen is the Senior VP of Innovation for APC, which is the IT Business Unit of Schneider Electric.
He establishes the technology direction for the world’s largest R&D budget devoted to power, cooling, and
rack infrastructure for critical networks.
Neil holds 14 patents related to high-efficiency and high-density data center power and cooling
infrastructure, and has published over 50 white papers related to power and cooling systems, many published in
more than 10 languages, most recently with a focus on the improvement of energy efficiency. He is an
internationally recognized keynote speaker on the subject of high-efficiency data centers. Neil is currently
working to advance the science of high-efficiency, high-density, scalable data center infrastructure solutions
and is a principal architect of the APC InfraStruXure system.
Prior to founding APC in 1981, Neil received his bachelor’s and master’s degrees from MIT in electrical
engineering, where he did his thesis on the analysis of a 200MW power supply for a tokamak fusion reactor.
From 1979 to 1981 he worked at MIT Lincoln Laboratories on flywheel energy storage systems and solar
electric power systems.