Thread-Level Speculation on a CMP Can Be Energy Efficient ∗
Jose Renau†, Karin Strauss, Luis Ceze, Wei Liu, Smruti Sarangi, James Tuck, Josep Torrellas
†Dept. of Computer Engineering, University of California Santa Cruz (http://masc.soe.ucsc.edu)
Dept. of Computer Science, University of Illinois at Urbana-Champaign (http://iacoma.cs.uiuc.edu)
ABSTRACT
Chip Multiprocessors (CMP) with Thread-Level Speculation (TLS)
have become the subject of intense research. However, TLS is suspected of being too energy inefficient to compete against conventional
processors. In this paper, we refute this claim. To do so, we first
identify the main sources of dynamic energy consumption in TLS.
Then, we present simple energy-saving optimizations that cut the energy cost of TLS by over 60% on average with minimal performance
impact. The resulting TLS CMP, populated with four 3-issue cores,
speeds up full SPECint 2000 codes by a factor of 1.27 on average, while keeping the fraction of the chip’s energy consumption due to TLS at only
20%. Compared to a 6-issue superscalar at the same frequency, the
TLS CMP is on average faster, while consuming only 85% of its total
on-chip power.
1 Introduction
Substantial research effort is currently being devoted to speeding
up hard-to-parallelize non-numerical applications such as SPECint
codes. Designers build sophisticated out-of-order processors, with
carefully-tuned execution engines and memory subsystems. Unfortunately, these systems tend to combine high design complexity with
diminishing performance returns, motivating the search for design alternatives.
One such alternative is Thread-Level Speculation (TLS) on a Chip
Multiprocessor (CMP) [9, 10, 13, 21, 22, 25, 26]. Under TLS, these
hard-to-analyze applications are partitioned into tasks, which are then
optimistically executed in parallel, hoping that no data or control dependence will be violated. Special hardware support monitors the
tasks’ control flow and data accesses, and detects violations at run
time. Should one occur, the hardware transparently rolls back the
incorrect tasks and, after repairing the state, restarts them.
Published results show that TLS CMPs can speed up difficult nonnumerical applications. This is significant because CMPs are attractive platforms; they provide a low-complexity, energy-efficient architecture, and have a natural advantage for explicitly-parallel codes.
However, TLS is suspected of being too energy inefficient to seriously challenge conventional processors. The rationale is that aggressive speculative execution is not the best course at a time when
processors are primarily constrained by energy and power issues. Our
initial experiments, shown in Figure 1, appear to agree: assuming
constant frequency, a high-performance TLS CMP with four 3-issue cores is on average faster than a 6-issue superscalar for SPECint codes, but consumes on average 15% more on-chip power.

∗This work was supported in part by the National Science Foundation under grants EIA-0072102, EIA-0103610, CHE-0121357, and CCR-0325603; DARPA under grant NBCH30390004; DOE under grant B347886; and gifts from IBM and Intel.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS’05, June 20-22, Boston, MA, USA. Copyright © 2005, ACM 1-59593-167-8/06/2005...$5.00
In this paper, we refute the claim that TLS is energy inefficient. This paper is the first one to show that, perhaps contrary to
commonly-held views, a TLS CMP can be a very desirable design
for high-performance, power-constrained processors, even under the
very challenging SPECint codes.
Fundamentally, the energy cost of TLS can be kept modest by using a lean TLS CMP microarchitecture and by minimizing wasted TLS work. Then, such a TLS CMP provides a better energy-performance trade-off than a wider-issue superscalar simply because, as the size of the processor structures increases, energy scales superlinearly and performance sublinearly.
This paper offers three contributions. The first one is to identify
and quantify the main sources of energy consumption in TLS. These
sources are task squashing, hardware structures in the cache hierarchy for data versioning and dependence checking, additional traffic
in the memory subsystem due to the same two effects, and additional
instructions induced by TLS.
The second contribution is to present and evaluate simple energy-saving optimizations for TLS. They are based on reducing the number of checks, reducing the cost of individual checks, and eliminating
work with low performance returns. These optimizations cut the energy cost of TLS by over 60% on average, with minimal performance
impact.
The third contribution is to show that the resulting TLS CMP
can provide a very desirable energy-performance trade-off, even for
SPECint codes. Specifically, a TLS CMP with four 3-issue cores
speeds up full SPECint 2000 codes by a factor of 1.27 on average, while keeping the fraction of the chip’s energy consumption due to TLS at only
20%. Moreover, compared to a 6-issue superscalar at the same frequency, the TLS CMP is on average faster, while consuming only
85% of its total on-chip power. Finally, we expect better results for
floating point, multimedia, or more parallel codes.
This paper is organized as follows: Section 2 provides a background; Section 3 examines why TLS consumes more energy; Section 4 outlines our TLS architecture and compiler; Section 5 describes
simple optimizations to save energy in TLS; Sections 6 and 7 present
our methodology and evaluation; and Section 8 lists related work.
2 Thread-Level Speculation (TLS)
Overview. In TLS, a sequential program is divided into tasks, which
are then executed in parallel, hoping not to violate sequential semantics. The sequential code imposes a task order and, therefore, we
use the terms predecessor and successor tasks. The safe (or nonspeculative) task precedes all speculative tasks. As tasks execute,
special hardware support checks that no cross-task dependence is violated. If any is, the incorrect tasks are squashed, any polluted state
is repaired, and the tasks are re-executed.
Cross-Task Dependence Violations. Data dependences are typically
monitored by tracking, for each individual task, the data written and
the data read with exposed reads. An exposed read is a read that is
not preceded by a write to the same location within the same task. A
data dependence violation occurs when a task writes a location that
has been read by a successor task with an exposed read. A control dependence violation occurs when a task is spawned in a mispredicted
branch path. Dependence violations lead to task squashes, which involve discarding the work produced by the task.
State Buffering. Stores issued by a speculative task generate speculative state that cannot be merged with the safe state of the program
because it may be incorrect. Such state is stored separately, typically in the cache of the processor running the task. If a violation is
detected, the state is discarded. Otherwise, when the task becomes
non-speculative, the state is allowed to propagate to memory. When
a non-speculative task finishes execution, it commits. Committing
informs the rest of the system that the state generated by the task is
now part of the safe program state.
Data Versioning. A task has at most a single version of any given
variable. However, different speculative tasks that run concurrently in
the machine may write to the same variable and, as a result, produce
different versions of the variable. Such versions must be buffered
separately. Moreover, readers must be provided the correct versions.
Finally, as tasks commit in order, data versions need to be merged
with the safe memory state also in order.
Multi-Versioned Caches. A cache that can hold state from multiple
tasks is called multi-versioned [5, 8, 22]. There are two performance
reasons why multi-versioned caches are desirable: they avoid processor stalls when tasks are imbalanced, and they enable lazy commit.
If tasks have load imbalance, a processor may finish a task while the task is still speculative. If the cache can only hold state for a single
task, the processor has to stall until the task becomes safe [6]. An
alternative is to move the task’s state to some other buffer, but this
complicates the design. Instead, the cache can retain the state from
the old task and allow the processor to execute another task. If so, the
cache has to be multi-versioned.
Lazy commit [17] is an approach where, when a task commits, it
does not eagerly merge its cache state with main memory through
ownership requests [22] or write backs [10]. Instead, the task simply passes the commit token to its successor. Its state remains in the
cache and is lazily merged with main memory later, usually as a result
of cache line replacements. This approach improves performance because it speeds up the commit operation. However, it requires multi-versioned caches.
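A sketch of the contrast, under the simplifying assumption that commit cost is dominated by the per-line actions (Python; all names are ours):

    class Mem:
        def write_back(self, line):
            pass   # stand-in for an ownership request or write back

    class Task:
        def __init__(self):
            self.dirty_lines = []
            self.committed = False
            self.successor = None
            self.has_commit_token = False

    def eager_commit(task, memory):
        # Commit cost grows with the number of dirty lines.
        for line in task.dirty_lines:
            memory.write_back(line)
        task.committed = True

    def lazy_commit(task):
        # O(1): just mark the task and pass the commit token along; its
        # dirty lines drain to memory later, as they are replaced.
        task.committed = True
        task.successor.has_commit_token = True

The lazy path is constant-time at commit, which is exactly why it needs multi-versioned caches: committed lines linger in the cache alongside the lines of the tasks that run next.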
Tagging Multi-Versioned Caches. Multi-versioned caches typically
require that we tag each cache line with a version ID, which records
what task the line belongs to. Intuitively, such version ID could be
the long global ID of the task. However, to save space, it is best to
translate global task IDs into some arbitrary Local IDs (LIDs) that are
much shorter [22]. These LIDs are used only locally in the cache, to
tag cache lines. Their translations into global IDs are kept in a small,
per-cache table that we call LID Table. Each cache has a different
LID Table.
Architecture and Environment Considered. While TLS can be
supported in different ways, we use a CMP because it is a low-complexity, energy-efficient platform. Moreover, to maximize the
use of commodity hardware, our CMP has no special hardware support for inter-processor register communication. Processors can only
communicate via the memory system. In addition, to gain usability,
our speculative tasks are generated automatically by a TLS compiler.
Finally, we concentrate on SPECint 2000 applications because they
are very challenging to speed up.
3 Sources of Energy Consumption in TLS
Enhancing a W-wide superscalar into a CMP with several W-wide
cores and TLS support causes the energy consumption to increase.
We loosely call the increase the energy cost of TLS (∆E_TLS). In practice, a portion of ∆E_TLS simply comes from having multiple cores and caches on chip, and from inefficiencies of parallel execution. However, most of ∆E_TLS is due to TLS-specific sources. We are interested in the latter.
We propose to classify TLS-specific sources of energy consumption into four main groups: (1) task squashing, (2) hardware structures in the cache hierarchy needed to support data versioning and
dependence checking, (3) additional traffic in the memory system due
to these same two effects, and (4) additional dynamic instructions induced by TLS. These sources are detailed in Table 1.
TLS-Specific Source                                          Optimization
Task squashing
  Work of the tasks that get squashed                        StallSq, TaskOpt
  Task squash operations
Hardware structures in the cache hierarchy for
data versioning and dependence checking
  Storage & logic for data version IDs and access bits       Indirect
  Tag-group operations                                       NoWalk
Traffic due to data versioning and dependence checking
  Evictions and misses due to higher cache pressure          —
  Selection & combination of multiple versions               TrafRed
  Fine-grain data dependence tracking                        TrafRed
Additional dynamic instructions induced by TLS
  Side-effects of breaking the code into tasks               TaskOpt
  TLS-specific instructions                                  TaskOpt

Table 1: Main TLS-specific sources of energy consumption.
3.1 Task Squashing
An obvious TLS source of energy consumption is the work of tasks
that ultimately get squashed. In the TLS CMP that we evaluate in
Section 7, 22.6% of all graduated instructions belong to such tasks.
Note, however, that not all such work is wasted: a squashed task may
bring useful data into the caches or train the branch predictor.
The actual squash operation also consumes energy: a squash signal is sent to the target processor, and a hardware finite-state machine
(FSM) is activated to repair the state. In our system, such repair only
involves restoring the program counter and stack pointer, and setting
the task’s LID Table entry to invalid. In our system, the frequency of
squashes is only 1 per 3211 instructions on average. Consequently,
the total energy consumed by the actual squash operations is negligible.
3.2 Hardware Structures for Data Versioning and Dependence Checking
The two most characteristic operations of TLS systems are maintaining data versioning and performing dependence checking. These operations are largely supported in the cache hierarchy. Data versioning
is needed when the cache hierarchy can hold multiple versions of
the same datum. Such versions appear when speculative tasks have
WAW or WAR dependences with predecessor tasks. The version created by the speculative task is buffered, typically in the processor’s
cache. If multiple speculative tasks co-exist in the same processor, a
cache may have to hold multiple versions of the same datum. In such
cases, data versions are identified by tagging the cache lines with a
version ID — in our case an LID (Section 2).
To perform dependence checking, caches record how each datum
was accessed. Typically, this is supported by augmenting each cached
datum with two access bits: an exposed-read and a write bit. They
are set on an exposed read and a write, respectively.
The LID and access bits are read or updated in hardware in a variety of cache access operations. For example, on an external access to
a cache, the (translated) LID of an address-matching line in the cache
is compared to the ID of the incoming message. From the comparison and the value of the access bits, the cache may conclude that a
violation occurred, or can instead supply the data normally.
A distinct use of these TLS structures is in tag-group operations. They involve changing the tag state of groups of cache lines.
There are three main cases. First, when a task is squashed, all its
cache lines need to be eventually invalidated. Second, in eager-commit systems [6], when a task commits, all its dirty cache lines
are merged with main memory through write backs [10] or ownership requests [22]. Finally, in lazy-commit systems, when a cache
has no free LIDs left, it needs to recycle one. This is typically done
by selecting a long-committed task and writing back all its dirty cache
lines to memory. Then, that task’s LID becomes free and can be reassigned.
These TLS tag-group operations often induce significant energy
consumption. Specifically, for certain operations, some schemes use
a hardware FSM that, periodically and in the background, repeatedly
walks the tags of the cache. For example, to recycle LIDs in [17], a
FSM periodically selects the LID of a committed task from the LID
Table, walks the cache tags writing back to memory the dirty lines
of that task, and finally frees up the LID. The FSM operates in the
background eagerly, using free cache cycles. Other schemes perform similar hardware walks of tags while stalling the processor to
avoid causing races. For example, to commit a task in [22], a special hardware module sequentially requests ownership for a group of
cache lines whose addresses are stored in a buffer. Since the processor stalls, execution takes longer and, therefore, consumes more
energy. Finally, some schemes use “one-shot” hardware signals that
can change the tag state of a large group of lines in a handful of cycles. For example, this is done to invalidate the lines of a squashed
task. Such hardware is reasonable when the cache can hold data for
only a single or very few speculative tasks [5, 9, 22]. However, in
caches with many versions, it is likely to adversely affect the cache
access time. For example, in our system, we use 6-bit LIDs per cache
line. A “one-shot” clear of the valid bit of all the lines belonging to
a given task would require keeping, for each line tag, 6 NXOR gates
that feed into one (possibly cascaded) AND gate. Having such logic
per tag entry is likely to slow down the common case of a plain cache
access, and result in longer, more energy-consuming executions.
3.3 Additional Traffic for Data Versioning and Dependence Checking
A TLS CMP system generates more traffic than a superscalar. The increase is 460% in our system (Section 7). While some of the increase
is the result of parallel execution, there are three main TLS-specific
sources of additional traffic (Table 1).
One reason is that caches do not work as well. Caches often have
to retain lines from older tasks that ran on the processor and are
still speculative. Only when such tasks become safe can the lines
be evicted. As a result, there is less space in the cache for data that
may be useful to the task currently running locally. This higher cache
pressure increases evictions of useful lines and subsequent misses.
The presence of multiple versions of the same line in the system
also causes additional messages. Specifically, when a processor requests a line, multiple versions of it may be provided, and the coherence protocol then selects what version to use. Similarly, when a
committed version of a line is to be evicted to L2, the protocol first
invalidates all the other cached versions of the line that are older —
they cannot remain cached anymore.
Finally, it is desirable that the speculative cache coherence protocol track dependences at a fine grain, which creates additional traffic. To see why, recall that these protocols typically track dependences by using the write and exposed-read bits. If this access information is kept per line, lines that exhibit false sharing may appear
to be involved in data dependence violations and, as a result, cause
squashes [5]. For this reason, many TLS proposals keep some access information at a finer grain, such as per word. Unfortunately,
per-word dependence tracking may induce higher traffic: a distinct
message (such as an invalidation) may need to be sent for every word
of the line.
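A small example of why the tracking grain matters (Python; per-line tracking collapses all words of a line into one pair of access bits, while per-word tracking keeps one pair per word):

    WORDS_PER_LINE = 8

    def violates(write_word, exposed_words, per_word=True):
        # exposed_words: word indices of the successor's exposed reads
        # within the same cache line as the predecessor's write.
        if per_word:
            return write_word in exposed_words   # only a true dependence
        return len(exposed_words) > 0            # any read in the line
                                                 # looks like a conflict

    # False sharing: the successor exposed-read word 0, the predecessor
    # later writes word 5 of the same line.
    print(violates(5, {0}, per_word=True))    # False: no squash
    print(violates(5, {0}, per_word=False))   # True: a needless squash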
3.4 Additional Dynamic Instructions Due to TLS
TLS systems with compiler-generated tasks such as ours often execute more dynamic instructions than non-TLS systems. This is the
case even counting only tasks that are not squashed. In our system,
the increase is 12.5%. These additional instructions come from two
sources (Table 1): side-effects of breaking the code into tasks and,
less importantly, TLS-specific instructions.
The first source dominates. It accounts for 88.3% of the increase.
One reason for this source is that conventional compiler optimizations are not very effective at optimizing code across task boundaries.
Therefore, TLS code quality is lower than non-TLS code. In addition,
in CMPs where processors communicate only through memory, the
compiler must spill registers across task boundaries.
TLS-specific instructions are the other source. They include task
spawn and commit instructions. The spawn instruction sends some
state from one processor to another. Task commit in lazy implementations sends the commit token between processors [17]. These instructions contribute 11.7% of the instruction increase.
4 TLS Architecture and Compiler
Before we examine ways to reduce TLS energy consumption, it is
helpful to outline the high-performance TLS CMP architecture and
compiler that we use as baseline in our work. More details can be
found in [18, 19, 27].
4.1 TLS CMP Architecture
The CMP connects four modest-issue processors in a virtual ring.
Each processor has a private, multi-versioned L1. The ring is also
connected to a small, multi-versioned victim cache. Finally, there
is a plain, shared L2 that only holds safe data (Figure 2-(a)). We
use a ring interconnect to minimize races in the coherence protocol.
The victim cache is included to avoid the more expensive alternative
of designing a multi-versioned L2. Figures 2-(b) and (c) show the
extensions required by TLS to the processors, L1s, and victim cache.
Each structure shows its fields in the form bit count:field name.
Each processor has an array of TaskHolders, which are hardware
structures that hold some state for the tasks that are currently loaded
on-chip (Figure 2-(b)). Each TaskHolder contains the task’s LID, its
spawn address (PC), its stack pointer (SP), and a few additional bits
that will be discussed later. The register state is kept on the stack.
A copy of the LID for the task currently going through rename is
[Figure 2 (diagram not reproduced). (a) Chip: four CPUs, each with a private L1 and controller, on a ring that also connects a victim cache and the shared L2; all other hardware is unmodified. (b) Processor modifications: a TaskHolder array (fields per entry: 6-bit LID; 1-bit Restarted, Stalled, Safe, and Finished flags; 32-bit PC; 32-bit SP; 3-bit NextFree) plus a HeadFreeTaskHolder pointer, and a 6-bit CurrentID register that tags entries in the load-store queue (each entry holds Addr plus a 6-bit LID). (c) L1 and victim cache modifications: each line tag gains a 6-bit LID, 8 Write bits, and 8 Exposed-Read bits (one per word); each cache gains an LID Table (fields per entry: 16-bit global task ID; 1-bit Committed; 1-bit Killed; an x-bit line count, where x = log2(#cache lines); 5-bit NextFree; 4-5 bit TaskHolder pointer) plus a HeadFreeLID pointer.]
Figure 2: Proposed architecture of a high-performance TLS CMP.
kept in the CurrentID register of the load-store queue (Figure 2-(b)).
This register is used to tag loads and stores as they are inserted in the
load-store queue. With this support, when a load or store is sent to the
L1, it includes the task’s LID. Note that a processor can have in-flight
instructions from multiple tasks.
In the L1s and the victim cache, each line tag is augmented with an
LID and, for each word in the line, with one Write and one ExposedRead bit (Figure 2-(c)). As per Section 2, each cache keeps its own
LID Table to translate LIDs to global task IDs (Figure 2-(c)). The
LID Table is direct mapped. Each entry has information for one LID.
A novel feature of this architecture is that each LID Table entry
also contains a Killed bit and a Committed bit for the corresponding
task, and a counter of the number of lines in the cache with that LID.
These fields are used to speed up some of the tag-group operations
of Section 3.2, as we will see in Section 4.2. Each entry also has a
pointer to the corresponding TaskHolder.
The architecture does not include special hardware for register communication between cores. All dependences are enforced
through memory. The reason is to minimize changes to off-the-shelf
cores.
4.2 Example: Use of the LID Table
To show that our baseline TLS architecture is efficient, we give as
an example the novel use of the LID Table for tag-group operations.
Each LID Table entry is extended with summary-use information:
the number of lines that the corresponding task still has in the cache,
and whether the task has been killed or committed. With this extra
information, all the tag-group operations of Section 3.2 are performed
efficiently.
Specifically, when a task receives a squash or commit signal, its
LID Table entry is updated by setting the Killed bit (Figure 3-(a)) or
the Committed bit, respectively. No tag walking is performed.
[Figure 3 (diagram not reproduced): in (a), a task kill simply sets the Killed bit in the task’s LID Table entry (e.g., LID1: Killed=1, Committed=0, line count 6), with no tag walk. In (b), on a cache line replacement, the LID Table is indexed with the LIDs of the candidate lines in the set (e.g., LID1: Killed=1, count 6; LID2: Committed=1, count 34).]
Figure 3: Using the LID Table on a task kill (a) and on a
cache line replacement (b).
At any time, when a processor issues a load, if the load’s address
and the LID match one of the L1 tag entries, a hit occurs. In this
case, the LID Table is not accessed. In all other cases, a miss occurs
and the LID Table is accessed. We index the LID Table with the
request’s LID, obtain the corresponding global task ID, and include it
in a request issued to the ring. Moreover, to decide which line to evict
from the L1, we also index the LID Table with the LIDs of the lines
that are currently using the cache set where space is needed (LID1
and LID2 in Figure 3-(b)). These are not time-critical accesses. If
we find entries that have the Killed bit set (LID1 in the example), the
count of cached lines is decremented, and the corresponding line in
the cache is either chosen as the replacement victim or invalidated.
Otherwise, if we find an entry with the Committed bit set (LID2 in
the example), we use the line as the victim (possibly after a write back
to memory), and decrement the count. If any one of these counters
reaches zero, that LID is recycled.
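The flow just described fits in a few lines. The following sketch (Python; this is an illustrative model, not the hardware, and the method names are ours) shows how the Killed and Committed bits plus the per-LID line counts stand in for tag walks on squash, commit, replacement, and LID recycling:

    class LIDEntry:
        def __init__(self, global_id):
            self.global_id = global_id
            self.killed = False
            self.committed = False
            self.line_count = 0          # cache lines tagged with this LID

    class LIDTable:
        def __init__(self, num_lids):
            self.entry = {}              # LID -> LIDEntry
            self.free = list(range(num_lids))

        def allocate(self, global_id):
            lid = self.free.pop()
            self.entry[lid] = LIDEntry(global_id)
            return lid

        def squash(self, lid):
            self.entry[lid].killed = True      # no tag walk needed

        def commit(self, lid):
            self.entry[lid].committed = True   # no tag walk needed

        def line_gone(self, lid):
            # Called when a line tagged with lid is invalidated or evicted.
            e = self.entry[lid]
            e.line_count -= 1
            if e.line_count == 0:              # "natural" LID recycling
                del self.entry[lid]
                self.free.append(lid)

        def pick_victim(self, set_lids):
            # Prefer lines of killed tasks (drop, no write back), then lines
            # of committed tasks (write back first); else the default policy.
            for lid in set_lids:
                if self.entry[lid].killed:
                    return lid
            for lid in set_lids:
                if self.entry[lid].committed:
                    return lid
            return set_lids[0]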
4.3 TLS Compiler
We have built a TLS compiler [27] that adds several passes to a development branch of gcc 3.5. The branch uses a static single-assignment
tree as the high-level intermediate representation [7]. With this approach, we leverage a complete compiler infrastructure. Our TLS
passes generate tasks out of loop iterations and the code that follows
(i.e., the continuation of) subroutines. The compiler first marks the
tasks, and then tries to place spawn statements for each of these tasks
as early in the code as it can. Only spawns that have moved up the
code significantly are retained.
Before running the compiler, we run SGI’s source-to-source optimizer (copt from MIPSPro), which performs PRE, loop unrolling,
inlining, and other optimizations. As a result, the non-TLS code has
a quality comparable to the MIPSPro SGI compiler for integer codes
at O3. Code quality when TLS is enabled is not as good, as explained
in Section 3.4.
The compilation process includes a simple profiler. The profiler
takes the initial TLS executable, runs a tiny data set, and identifies
those tasks that should be eliminated because they are unlikely to be
of much benefit (Section 5.3.2). Then, the compiler re-generates the
executable by eliminating these tasks.
5 Energy-Saving TLS Optimizations
To reduce the energy consumed by the TLS sources of Table 1, we can
use many performance-oriented TLS optimizations proposed elsewhere. Examples are improvements to the cache hierarchy to minimize conflicts [5] or enhancements to the coherence protocol to reduce communication latency [23]. While these optimizations improve performance, they typically also reduce the energy consumed
by a program.
In this paper, we are not interested in these optimizations. If they
are cost-effective, they should already be included in any baseline
TLS design. Instead, we are interested in energy-centric optimizations. These are optimizations that do not increase performance noticeably; in fact, they may even slightly reduce it. However, they reduce energy consumption significantly. They would not necessarily
be included in a performance-centric TLS design.

We propose three guidelines to identify energy-centric optimizations: (1) reduce the number of checks, (2) reduce the cost of individual checks, and (3) eliminate work with low performance returns. As examples, we propose simple, yet effective techniques. They are shown in the last column of Table 1.
5.1 Reducing the Number of Checks

5.1.1 Avoid Eagerly “Walking” the Cache Tags in the Background (NoWalk)
With the LID Table design described in Section 4.2, tag-group operations are very efficient. A task squash or commit only involves setting
the Killed or Committed bit, respectively, in the LID Table. As lines
belonging to squashed or committed tasks are eliminated from the
cache due to replacements, the corresponding counts in the LID Table are decremented. When a count reaches zero, its associated LID
can be recycled. Consequently, LID recycling is also very fast.
However, always waiting for LIDs to get “naturally” recycled in
this way may hurt performance because we may run out of LIDs.
Consequently, our baseline TLS architecture recycles LIDs in a more aggressive manner, inspired by previous work [17]. Specifically, it has a hardware FSM that periodically walks the tags of the cache in the background when the cache is idle. It uses the LID Table to identify tasks with the Killed bit set and then walks the cache tags to invalidate their lines and free up their LIDs. It also performs a similar operation to eagerly free up the LIDs of long-committed tasks [18]. With this approach, performance is highest because we never run out of LIDs. However, we spend energy on many checks.
The energy optimization that we propose is to avoid in most cases
any eager walk of the cache tags. Instead, we rely on the “natural”
LID recycling when the associated count reaches zero, as described
above. We only activate a background walk of the tags like in the
baseline TLS architecture in one case: when there is only one free
LID left. With this optimization, we occasionally may have to stall
due to temporary lack of LIDs. However, we eliminate many tag
checks.
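In policy form, and reusing the LIDTable sketch from Section 4.2, NoWalk reduces to one trigger condition (the walker stands for the baseline's background FSM and is a hypothetical stand-in object here):

    def after_any_lid_allocate_or_recycle(lid_table, walker):
        # NoWalk: rely on natural recycling; only fall back to an eager
        # background tag walk when down to the last free LID.
        if len(lid_table.free) <= 1:
            walker.start()   # invalidate/write back lines of killed or
                             # long-committed tasks, then free their LIDs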
5.1.2 Lower Traffic to Check Version-IDs (TrafRed)
In TLS, many messages are sent to check version IDs. For example,
when a processor writes to a non-exclusive line, all the caches with a
version of the requested line are typically checked, to see if there is
an exposed read to the line from a more speculative task. Such a task will be squashed. Similarly, on displacement of a committed line to
L2, those same caches are checked, to invalidate older versions of the
line. Such versions cannot remain cached anymore.
We propose a simple optimization to reduce the number of checks
needed and, therefore, the traffic. Cache lines are extended with a
Newest and an Oldest bit. Every time that a line is loaded into a
cache, we set the Newest and/or Oldest bit if the line contains the
latest and/or the earliest cached version, respectively, of the corresponding address. As execution proceeds, Newest may be reset on
an access by another task. With this support, if a processor writes on
a Newest line cached locally, there is no need to check other caches
for exposed reads. Similarly, if a processor displaces a committed
line with the Oldest bit set, there is no need to check other caches for
older versions. This optimization applies to two rows in Table 1.
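A sketch of the resulting filters (Python; the names are ours, and per-line granularity is used for brevity). The bits are set when a line is filled, and Newest may later be reset by an access from another task:

    class Line:
        def __init__(self):
            self.newest = False   # holds the latest cached version
            self.oldest = False   # holds the earliest cached version

    def write_needs_remote_check(line):
        # Local write: per the Newest rule above, if this is already the
        # newest cached version, skip the cross-cache exposed-read check.
        return not line.newest

    def eviction_needs_invalidations(line):
        # Displacing a committed line to L2: older cached versions must be
        # invalidated only if an older version can exist elsewhere.
        return not line.oldest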
5.2 Reducing the Cost of Individual Checks
A simple example is to tag cache lines with short LIDs rather
than global task IDs. This approach of using indirection is well
known [22]. Consequently, we already use it in the baseline TLS
CMP and do not evaluate its impact. We call it Indirect in Table 1.
5.3 Eliminating Low-Return Work
5.3.1 Stall a Task After Two Squashes (StallSq)
A simple technique is to limit the number of times that a task is allowed to restart after a squash. After a task has been squashed N
times, it is not given a CPU again until it becomes non-speculative.
We performed experiments always restarting tasks immediately after they are squashed. We found that 73.0% of the tasks are never
squashed, 20.6% are squashed once, 4.1% twice, 1.4% three times,
and 0.9% four times or more. Restarting a task after its first squash
can be beneficial, as the L2 cache and branch predictor have been
warmed up. Restarting after further squashes delivers low performance returns while steadily consuming more energy. Consequently,
we reset and stall a task after its second squash. This is accomplished
with two bits per TaskHolder entry (Figure 2): the Restarted bit is set
after the task has been squashed and restarted once; the Stalled bit is
set after the second squash.
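The policy needs only those two TaskHolder bits. A sketch (Python; restart() is a stand-in for the hardware's restart action, and the names are ours):

    class TaskHolder:
        def __init__(self, lid):
            self.lid = lid
            self.restarted = False
            self.stalled = False

    def restart(th):
        print("restarting task", th.lid)   # stand-in for the hardware restart

    def on_squash(th):
        if not th.restarted:
            th.restarted = True   # first squash: restart right away, since
            restart(th)           # the caches and predictor are now warm
        else:
            th.stalled = True     # second squash: reset the task and hold it

    def on_becomes_nonspeculative(th):
        if th.stalled:
            th.stalled = False
            restart(th)           # re-execution can no longer be squashed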
5.3.2 Energy-Aware Task Pruning (TaskOpt)
The profiler in our compilation pass includes a simple model that
identifies tasks that should be eliminated because they are unlikely to
be beneficial. The main focus is on tasks that cause squashes. For
the baseline TLS architecture, the model minimizes the duration of
the program. The energy-centric optimization is to use a model that
minimizes the product Energy × Delay^2 for the program.
Our compiler generates a binary with task spawn instructions (Figure 4-(a)). The profiler runs the binary sequentially, using the Train
data set for SPECint codes. As the profiler executes a task, it records
the variables written. When it executes tasks that would be spawned
earlier, it compares the addresses read against those written by predecessor tasks. With this, it can detect potential run-time violations.
The profiler also models a simple cache to estimate the number of
misses in the machine’s L2. For performance, cache timing is not
modeled. On average, the profiler takes around 5 minutes to run on a
3 GHz Pentium 4.
The profiler estimates if a task squash will occur and, if so, the
number of instructions squashed, I_squashed (Figure 4-(b)), and the final instruction overlap after re-execution, I_overlap (Figure 4-(c)). In addition, the profiler estimates the number of L2 misses, M_squashed,
in the squashed instructions. These misses will have a prefetching
effect that will speed up the re-execution of T2.
[Figure 4 (diagram not reproduced): (a) code with spawn: T1 executes and spawns T2. (b) Estimated squash: T2 runs I_squashed instructions, incurring M_squashed L2 misses, before being squashed. (c) Estimated re-execution: the re-executed T2 overlaps with T1 for I_overlap instructions.]
Figure 4: Modeling task squash and restart. T1 and T2 are tasks.
Assuming that each instruction takes T_i cycles to execute, and an L2 miss stalls the processor for T_m cycles, the estimated execution time reduction (T_red) is I_overlap × T_i + M_squashed × T_m. Assuming that the energy consumed by each instruction is E_i, the approximate increase in energy (E_inc) is I_squashed × E_i.
Our profiler focuses on tasks that have a rate of squashes per commit higher than R_squash. In the baseline architecture, it eliminates a task if T_red is less than a threshold T_perf. In our energy-optimized architecture, it eliminates a task if, after subtracting T_red from the program time and adding E_inc to the program energy, the program’s E × D^2 product increases. In this case, voltage-frequency scaling could (ideally) do better.
The values of the thresholds and parameters used are listed in Table 2. This optimization has significant impact: on average, the profiler eliminates 39.9% of the static tasks in performance mode, and
49.2% in energy mode.
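As a worked example of the E × D^2 test, the sketch below plugs in the profiling parameters of Table 2 (T_i = 1 cycle, T_m = 200 cycles, E_i = 8 pJ); the program-level figures and the per-task counts are invented for illustration, and the function names are ours:

    T_I, T_M, E_I = 1, 200, 8e-12      # Table 2: cycles, cycles, joules

    def t_red(i_overlap, m_squashed):
        # Estimated execution-time reduction from keeping the task (cycles).
        return i_overlap * T_I + m_squashed * T_M

    def e_inc(i_squashed):
        # Estimated energy added by the task's squashed work (joules).
        return i_squashed * E_I

    def keep_task(prog_cycles, prog_joules, i_overlap, m_squashed, i_squashed):
        # Keep the task only if adding E_inc and subtracting T_red does not
        # increase the program's E × D^2 product.
        base = prog_joules * prog_cycles ** 2
        opt = ((prog_joules + e_inc(i_squashed))
               * (prog_cycles - t_red(i_overlap, m_squashed)) ** 2)
        return opt <= base

    # A task overlapping 500 instructions and prefetching 2 L2 misses, at
    # the cost of 3000 squashed instructions, in a 50M-cycle, 0.5 J program:
    print(keep_task(50e6, 0.5, 500, 2, 3000))   # True: keep it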
TLS CMP with four 3-issue cores (TLS4-3i)

  Processor (per core):
    Fetch/issue/commit width: 6/3/3; I-window/ROB size: 68/126;
    Int/FP registers: 90/68; LdSt/Int/FP units: 1/2/1;
    Ld/St queue entries: 48/42; frequency: 5.0 GHz @ 70 nm;
    branch penalty: 13 cyc (min); RAS: 32 entries;
    BTB: 2K entries, 2-way assoc.; branch predictor (spec. update):
    bimodal size: 16K entries, gshare-11 size: 16K entries;
    TaskHolders/processor: 8; TaskHolder access time, energy: 1 cyc, 0.25 nJ;
    latency from spawn to new thread: 14 cyc

  Cache:            D-L1     VC       L2
    Size:           16KB     4KB      1MB
    RT:             3 cyc    8 cyc    10 cyc
    Assoc:          4-way    4-way    8-way
    Line size:      64B      64B      64B
    Ports:          1        1        1
    Pend ld/st:     16       64       64

  LID Table:        D-L1             VC
    Entries/ports:  64/2             32/1
    Acc time/energy: 1 cyc/0.11 nJ   1 cyc/0.07 nJ

6-issue superscalar chip (Uni-6i)

  Processor:
    Fetch/issue/commit width: 6/6/6; I-window/ROB size: 104/204;
    Int/FP registers: 132/104; LdSt/Int/FP units: 2/4/2;
    Ld/St queue entries: 66/54; frequency: 5.0 GHz @ 70 nm;
    branch penalty: 13 cyc (min); RAS: 32 entries;
    BTB: 2K entries, 2-way assoc.; branch predictor (spec. update):
    bimodal size: 16K entries, gshare-11 size: 16K entries

  Cache:            D-L1     L2
    Size:           16KB     1MB
    RT:             2 cyc    10 cyc
    Assoc:          4-way    8-way
    Line size:      64B      64B
    Ports:          2        1
    Pend ld/st:     16       64

Common to both chips

  I-L1: size: 16KB; RT: 2 cyc; assoc: 2-way; line size: 64B; ports: 1
  Bus & memory: FSB frequency: 533 MHz; FSB width: 128 bit; memory: DDR-2;
    DRAM bandwidth: 8.528 GB/s; memory RT: 98 ns
  Profiling parameters: R_squash: 0.8; T_i: 1; T_m: 200 cyc; E_i: 8 pJ;
    T_perf: 90 cyc; Size_energy: 45; Size_perf: 30; Hoist_energy: 120;
    Hoist_perf: 110
Table 2: TLS CMP with four 3-issue cores (TLS4-3i) and 6-issue superscalar chip (Uni-6i) modeled. In the table, RAS, FSB,
RT, and VC stand for Return Address Stack, Front-Side Bus, minimum Round-Trip time from the processor, and Victim Cache,
respectively. Cycle counts refer to processor cycles.
5.3.3 Eliminate Low-Return Tasks (TaskOpt)
Another energy-centric optimization is for the compilation pass to
aggressively remove tasks whose size is small or whose spawn point
has not been moved up in the code much. We use a threshold size Size_energy and threshold spawn hoist distance Hoist_energy that are more aggressive than their performance-centric counterparts (Size_perf and Hoist_perf). These optimizations reduce task boundaries and code bloat. They eliminate 36.1% of the static tasks in energy mode compared to 34.7% in performance mode. For ease of presentation, we combine this technique and the previous one into TaskOpt in our evaluation, since they are closely related.
5.4 Summary
We place the optimizations in the corresponding row of Table 1. As
indicated in Section 5.2, Indirect is not evaluated.
6 Evaluation Setup
To assess the energy efficiency of TLS, we compare a TLS CMP to
a non-TLS chip that has a single processor of the same or wider issue
width. We use execution-driven simulations, with detailed models
of out-of-order superscalars and advanced memory hierarchies, enhanced with models of dynamic and leakage energy from Wattch [3],
Orion [28], and HotLeakage [29].
6.1 Architectures Evaluated
The TLS CMP that we propose has four 3-issue cores, the microarchitecture of Section 4, and the energy-centric TLS optimizations of
Section 5. We call the chip TLS4-3i. The non-TLS chips have a single
superscalar with a conventional L1 and L2 on-chip cache hierarchy.
We consider two: one is a 6-issue superscalar (Uni-6i) and the other a
3-issue superscalar (Uni-3i). We choose to compare the TLS4-3i and
Uni-6i designs because both chips have approximately the same area,
as can be estimated from [11, 20].
Table 2 shows the parameters for TLS4-3i and Uni-6i. As we move
from 3-issue to 6-issue cores, we scale all the processor structures
(e.g., ports, FUs, etc) according to the issue width of the core. We
try to create a balanced processor as much as possible, by scaling the processor resources. This is the same approach used in IBM’s Power4.
In our comparison, we favor Uni-6i. We assume that Uni-6i has
the same frequency and the same pipeline depth as the cores in TLS4-3i. This helps Uni-6i because, in practice, a 6-issue core would not
cycle as fast as a 3-issue core with the same pipeline. For example, according to CACTI [20], the access time of the register file and
the instruction window in Uni-6i would be at least 1.25 times higher
and 1.35 times higher, respectively, than in the TLS4-3i cores. Moreover, extrapolating results from [15], the bypass network would have
2.6 times longer latency in Uni-6i (assuming a fully-connected bypass network). In our simulations, however, we assume the same
frequency for Uni-6i and TLS4-3i.
Since both processors have the same pipeline depth and branch
misprediction penalty, we feel that it is fair to also give them the
same branch predictor. In addition, both processors have an integer
and an FP cluster. Since we run integer codes in the evaluation, the
FP cluster is clock-gated almost all the time.
The tag array in TLS4-3i’s L1 caches is extended with the LID, and
the Write and Exposed-Read bits (Figure 2). At worst, the presence
of these bits increases the access time of the L1 only slightly. To see
why, note that the LID bits can simply be considered part of the line
address tag, as a hit requires address and LID match. Moreover, in
our protocol, the Write and Exposed-Read bits are not checked before
providing the data to the processor; they may be updated after that.
However, to be conservative, we increase the L1 access latency in
TLS4-3i one cycle over Uni-6i, to 3 cycles.
Uni-3i is like Uni-6i except that the core is 3-issue, like those in
TLS4-3i, and the L1 cache only has 1 port. For completeness, we also
evaluate one additional chip: TLS2-3i. TLS2-3i is a TLS CMP like
TLS4-3i, but with only two cores.
To maximize the use of commodity hardware, the TLS CMP has
no special hardware support for inter-processor register communication. [18, 19] have more details on the architecture evaluated.
6.2 Energy Considerations
We estimate and aggregate the dynamic and leakage energy consumed in all chip structures, including processors, cache hierarchies, and on-chip interconnect. For the dynamic energy, we use the
Wattch [3] and Orion [28] models. We apply aggressive clock gating to processor structures in all cores. In addition, unused cores in
the TLS CMP are also clock gated. Activating and deactivating corewide clock gating takes 100 cycles each time. Clock-gated structures
are set to consume 5% of their original dynamic energy, which is one
of the options in Wattch. We extend the Wattch models to support
our deeper pipelines and to take into account the area when computing the clock energy. The chip area is estimated using data from [11]
and CACTI [20].
Leakage energy is estimated with HotLeakage [29], which models
both sub-threshold and gated leakage currents. We use an iterative approach suggested by Su et al. [24]: the temperature is estimated based
on the current total power, the leakage power is estimated based on
the current temperature, and the leakage power is added to the total
power. This is continued until convergence. The maximum temperature at the junction for any application is not allowed to go beyond
85°C, as recommended by the SIA Roadmap [1].
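The iteration is a simple fixed point. A sketch (Python; the thermal and leakage models below are invented stand-ins, not HotLeakage's):

    def total_power(p_dyn, temp_of_power, leak_of_temp, tol=1e-3):
        # Iterate power -> temperature -> leakage until the leakage
        # estimate stops changing.
        p_leak = 0.0
        while True:
            temp = temp_of_power(p_dyn + p_leak)
            new_leak = leak_of_temp(temp)
            if abs(new_leak - p_leak) < tol:
                return p_dyn + new_leak, temp
            p_leak = new_leak

    # Toy models: linear package thermal resistance, and leakage that
    # doubles every 20 degrees C (both invented for the example).
    power, temp = total_power(
        p_dyn=40.0,
        temp_of_power=lambda p: 45.0 + 0.5 * p,
        leak_of_temp=lambda t: 5.0 * 2.0 ** ((t - 45.0) / 20.0))
    print(round(power, 1), round(temp, 1))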
From our calculations, the average power consumed by the Uni-3i and Uni-6i chips for the SPECint 2000 applications is 32 and 60
W, respectively (more data will be shown later). Of this power, leakage accounts for 38% and 32%, respectively. The majority of the
dynamic power increase from Uni-3i to Uni-6i is due to five structures that more than double their contribution, largely because they
double the number of ports. These are the rename table, register file,
I-window, L1 data cache, and data TLB. In addition, the data forwarding network also increases its dynamic contribution by 70%.
6.3 Applications Evaluated
We measure full SPECint 2000 applications with the Reference data
set except eon, which is in C++, and gcc and perlbmk, which our
compiler infrastructure does not compile. By full applications, we
mean that we include all the code in the measurement, not just the
more parallel sections such as loops. Uni-3i and Uni-6i run the binaries compiled with our TLS passes disabled. Such binaries have a
code quality comparable to integer codes generated by the MIPSPro
SGI compiler with O3 (Section 4.3).
TLS and non-TLS binaries are very different. Therefore, we cannot compare the execution of a fixed number of instructions. Instead,
we insert “simulation markers” in the code and simulate for a given
number of markers. After skipping the initialization (several billion
instructions), we execute up to a certain number of markers so that
Uni-6i graduates from 750 million to 1.5 billion instructions.
7 Evaluation
In our evaluation, we first characterize the TLS CMP architecturally,
with and without the energy optimizations. Then, we examine the
energy cost of TLS and the savings of the optimizations. Finally, we
compare the energy, power, and performance of the different chips.
In the following, NoOpt is TLS4-3i without the optimizations.
7.1 Architectural Characterization of the TLS CMP
We measure architectural characteristics of the 4-core TLS CMP that
are related to Table 1’s sources of TLS energy consumption and optimizations. The data are shown in Table 3. In the table, we compare
the chip before optimization (NoOpt), to the chip with one optimization at a time (StallSq, TaskOpt, NoWalk, or TrafRed).

           Squashed               Busy CPUs              Pruned    Task Size        ED^2      Tag Accesses   Traffic        Add'l Instr. in
           Instructions (%)                              Tasks(%)  (Instructions)   Reduc.(%) (TLS/Uni-3i)   (TLS/Uni-3i)   Non-Squashed Dyn. Tasks (%)
Apps       NoOpt StallSq TaskOpt  NoOpt StallSq TaskOpt  TaskOpt   NoOpt   TaskOpt  TaskOpt   NoOpt NoWalk   NoOpt TrafRed  NoOpt TaskOpt
bzip2       9.9    7.5    9.9     1.40   1.35   1.41      21.8     743.4    751.7    -0.3      3.2   1.3     14.0    2.5     5.6    5.6
crafty     26.2   25.4   18.9     1.97   1.95   1.70       6.2     932.0   1064.0     6.8      2.9   2.0      7.1    3.6     5.6    5.6
gap        35.2   31.6   35.1     2.07   1.94   2.06      14.6    1270.3   1280.4    -0.8      3.6   2.2     16.7    8.4     3.8    3.8
gzip       14.0   14.0   11.9     1.49   1.48   1.49       7.5     626.6    634.2     0.3      3.5   1.9     12.3    4.0     6.5    6.5
mcf        28.8   28.7   28.8     2.38   2.38   2.38       0.4      47.9     47.9     0.0      3.8   2.7     42.0   11.5    31.9   31.9
parser     39.3   29.9   13.8     2.03   1.85   1.25      18.9     167.3    261.6    26.4      3.6   3.2      9.8    7.1    20.8   18.0
twolf       4.4    4.4    4.4     1.62   1.62   1.62       0.4     409.4    409.4     0.0      3.3   1.6     55.9    3.2     6.5    6.5
vortex     15.7   15.4    7.7     1.82   1.81   1.49       9.2     488.3    881.5    16.0      2.9   1.9      7.4    3.9     7.5    7.4
vpr        29.5   29.2   27.9     3.14   3.13   2.61      17.0     212.9    389.1    10.4      3.2   3.1     10.9    6.4    23.9   21.2
Avg        22.6   20.7   17.6     2.00   1.95   1.78      10.7     544.2    635.5     6.5      3.3   2.2     19.6    5.6    12.5   11.9

Table 3: Architectural characteristics of the 4-core TLS CMP related to sources of energy consumption and their optimization.
The first TLS source of energy consumption in Table 1 is task
squashing. Column 2 of Table 3 shows that, on average, NoOpt loses 22.6% of the dynamic instructions executed to task squashes. This is
a significant waste. With our optimizations, we reduce the number of
such instructions. Specifically, the average fraction becomes 20.7%
with StallSq (Column 3) and 17.6% with TaskOpt (Column 4). Although not shown in the table, the fraction becomes 16.9% with both
optimizations combined.
The next few columns of Table 3 provide more information on the
impact of StallSq and TaskOpt. Under NoOpt, the average number
of busy CPUs is 2.00 (Column 5). Since StallSq stalls tasks that are
likely to be squashed and TaskOpt removes them, they both reduce
CPU utilization. Specifically, the average number of busy CPUs is
1.95 and 1.78 with StallSq and TaskOpt, respectively (Columns 6 and
7). With both optimizations, the average can be shown to be 1.75.
TaskOpt has a significant impact on the tasks. Recall from Sections 5.3.2 and 5.3.3 that, on average, NoOpt already prunes 74.6%
of the static tasks using performance-only models. On top of that,
TaskOpt prunes an additional 10.7% of the static tasks (Column 8).
As a result, TaskOpt increases the average task size from 544 instructions in NoOpt (Column 9) to 635 (Column 10). Moreover, the average E × D^2 of the applications, a metric for time and energy efficiency of computation [14], decreases by 6.5% (Column 11).
The second TLS source of energy in Table 1 is dominated by accesses to L1 cache tags. Such accesses in TLS are both more expensive (since tags have version IDs) and more frequent (e.g., due to
tag-group operations). Column 12 of Table 3 shows that, on average,
NoOpt performs 3.3 times as many tag checks as Uni-3i. However,
with our NoWalk optimization, we eliminate many of these checks.
Specifically, Column 13 shows that, with NoWalk, TLS only has 2.2
times as many tag checks as Uni-3i. Note that these figures include
the contribution of squashed tasks.
The third TLS source of energy is additional traffic. Column 14
of Table 3 shows that, on average, NoOpt has 19.6 times the traffic
of Uni-3i. To compute the traffic, we add up all the bytes of data
or control passed between caches. This traffic increase is caused by
the factors described in Section 3.3. However, after we apply our
TrafRed optimization, the traffic reduces considerably. On average,
with TrafRed, TLS only has 5.6 times the traffic of Uni-3i (Column
15).
The fourth TLS source of energy is additional instructions. Column 16 shows that NoOpt executes on average 12.5% more dynamic
instructions in non-squashed tasks than Uni-3i. The TaskOpt optimization, by eliminating small and inefficient tasks, reduces the additional instructions to 11.9% on average (Column 17).
7.2 The Energy Cost of TLS (∆E_TLS)

In Section 3, we defined the energy cost of TLS (∆E_TLS) as the difference between the energy consumed by our TLS CMPs and Uni-3i. Figure 5 characterizes ∆E_TLS for our 4-core TLS CMP. The
figure shows six bars for each application. They correspond to the total energy consumed by the chip without any optimization (NoOpt),
with individual optimizations enabled (StallSq, TaskOpt, NoWalk,
and TrafRed), and with all optimizations applied (TLS4-3i). For each
application, the bars are normalized to the energy consumed by Uni-3i. Consequently, the difference between the top of the bars and 1.00 is ∆E_TLS.

Each bar in Figure 5 is broken into the contributions of the TLS-specific sources of energy consumption listed in Table 1. These include task squashing (∆E_Squash), additional dynamic instructions in non-squashed tasks (∆E_Inst), hardware for data versioning and dependence checking (∆E_Version), and additional traffic (∆E_Traffic). The rest of the bar (Non-TLS) is energy that we do not attribute to TLS.

Ideally, ∆E_TLS should be roughly equal to the addition of the four TLS-specific sources of energy consumption and, therefore, Non-TLS should equal 1. In practice, this is not the case because a given program runs on the TLS CMP and on Uni-3i at different speeds and temperatures. As a result, the “non-TLS” dynamic and leakage energy varies across runs, causing Non-TLS to deviate from 1. In fact, since for all applications the TLS CMP is faster than Uni-3i (i.e., the TLS bars in Figure 1-(a) are over 1), Non-TLS is less than 1: non-TLS hardware structures have less time to leak or to spend dynamic energy cycling idly.

If we consider the NoOpt bars, we see that the energy cost of unoptimized TLS (∆E_TLS) is significant. On average, unoptimized TLS adds 70.4% to the energy consumed by Uni-3i. We also see that all four of our TLS sources of energy consumption contribute noticeably. Of them, task squashing consumes the most energy, while additional instructions consume the least.

[Figure 5 (charts not reproduced): for each application, six bars (A: NoOpt, B: StallSq, C: TaskOpt, D: NoWalk, E: TrafRed, F: TLS4-3i), each broken into ∆E_Squash, ∆E_Inst, ∆E_Version, ∆E_Traffic, and Non-TLS. The decreases in ∆E_TLS above the average bars are 0%, 4%, 19%, 24%, 20%, and 64%, respectively.]
Figure 5: Energy cost of TLS (∆E_TLS) for our 4-core TLS CMP chip with and without energy-centric optimizations. The percentages listed above the average bars are the decrease in ∆E_TLS when the optimizations are enabled.
7.3 The Impact of Energy-Centric Optimizations
The rest of the bars in Figure 5 show the impact of our energy-centric
optimizations on the TLS energy sources. From the figure, we see
that each optimization effectively reduces the TLS energy sources
that it is expected to minimize from Table 1. This is best seen from
the average bars.
Consider TaskOpt first. In Figure 5, TaskOpt reduces ∆E_Squash and ∆E_Inst — its targets in Table 1. This is consistent with Table 3, where TaskOpt reduces the fraction of squashed instructions from 22.6% to 17.6%, and decreases the additional dynamic instructions in non-squashed tasks from 12.5% to 11.9%.
Consider now NoWalk. In Figure 5, NoWalk mostly reduces ∆E_Version — its target in Table 1. This was expected from Table 3,
where NoWalk reduces the number of tag accesses relative to Uni-3i
from 3.3 times to 2.2 times. In addition, since it reduces the temperature, it also reduces the leakage component in Non-TLS slightly.
If we consider TrafRed in Figure 5, we see that it mostly reduces ∆E_Traffic — its target in Table 1. Again, this is consistent with
Table 3, where TrafRed reduces the traffic relative to Uni-3i from
19.6 times to 5.6 times on average.
Finally, StallSq only addresses ∆E_Squash, which is its target in
Table 1. As expected from the modest numbers in Table 3, where it
reduces squashed instructions from 22.6% to 20.7%, it has a small
impact in Figure 5.
This analysis shows that each of TaskOpt, NoWalk, and TrafRed
effectively reduces a different energy source, and that the three techniques combined cover all sources considered. Consequently, when
we combine all four optimizations in TLS4-3i, all TLS sources of
consumption decrease substantially. The resulting TLS4-3i bar shows
the true energy cost of TLS. If we measure the section of the bar over
1.00, we see that this cost is on average only 25.4%. We feel that this
is a remarkably low energy overhead for TLS.
With our simple optimizations, we have been able to eliminate on
average 64% of ∆E_TLS. Compared to the overall on-chip energy
consumed by NoOpt, this is a very respectable energy reduction of
26.5%. Moreover, as we will see later, the applications have only
been slowed down on average by less than 2%.
Finally, an analysis of individual applications reveals many interesting facts. Unfortunately, space limitations prevent any deep discussion. We only note that mcf has a negative ∆ET LS in some cases.
The reason is that, without TLS, the L2 suffers frequent misses; with
TLS, tasks prefetch data for other tasks, removing misses and speeding up the execution significantly (Section 7.5). The result is that the
TLS CMP has less time to leak and to spend dynamic energy cycling,
hence Non-TLS is very small.
7.4 Comparing Energy Consumption Across Chips
Figure 6 compares the energy consumed by our optimized TLS4-3i
chip and Uni-3i, Uni-6i and, for completeness, TLS2-3i. Each bar
is normalized to Uni-3i and broken down into dynamic energy consumed by the clock, core, and memory subsystem, and leakage energy. The memory category includes caches, TLBs, and interconnect.
[Figure 6 (charts not reproduced): for each application and the mean, four bars (A: Uni-3i, B: TLS2-3i, C: TLS4-3i, D: Uni-6i), each broken into Clock, Core, and Mem dynamic energy plus Leakage, normalized to Uni-3i.]
Figure 6: Comparing the energy consumption after TLS CMP
optimization. All bars are normalized to Uni-3i.
Consider first Uni-6i. Its core, clock, and memory categories are
larger than in Uni-3i because of the bigger structures in the wide processor. Specifically, the rename table, register file, I-window, L1 data
cache, and data TLB have twice the number of ports. This roughly
doubles the energy per access [20]. Furthermore, all these structures
but the cache and TLB also have more entries. Finally, the forwarding
network also increases its complexity and, therefore, its consumption.
The figure also shows that leakage has increased. The reason is that,
while Uni-6i is faster than Uni-3i, it consumes the highest average
power (Section 7.5) and, therefore, has a higher temperature. Temperature has an exponential impact on leakage.
Compared to Uni-6i, TLS4-3i has smaller core and clock energies because it has the simpler hardware structures of Uni-3i. Its leakage is also smaller because its average power (Section 7.5) and, therefore, its temperature are lower than Uni-6i’s. Its memory category, however, is slightly higher than Uni-6i’s. The reason is TLS4-3i’s higher consumption in the structures and traffic that support data versioning and dependence checking.

[Figure 7 (charts not reproduced): per-application bars for Uni-3i, TLS2-3i, TLS4-3i, and Uni-6i. In (a), the mean speedups over Uni-3i are 1.19, 1.27, and 1.23 for TLS2-3i, TLS4-3i, and Uni-6i, respectively; in (b), the mean powers are 32, 42, 51, and 60 W for Uni-3i, TLS2-3i, TLS4-3i, and Uni-6i, respectively.]
Figure 7: Execution speedup relative to Uni-3i (a), and average power consumption (b) for different chips. Note that the mean used for speedups is the geometric one.

7.5 Comparing Performance/Power Across Chips
Finally, we take the optimized TLS4-3i and compare its performance and average power to those of the other chips. Figure 7-(a) shows application speedups relative to execution on Uni-3i, while Figure 7-(b) shows the average power consumed during execution. As usual, TLS2-3i is also shown. As a reference, the arithmetic mean of the applications' average IPC on TLS4-3i is 1.38.
Figure 7-(a) shows that, on average, TLS4-3i delivers a speedup of 1.27 over Uni-3i, which shows that our TLS compiler successfully extracts good tasks from these irregular codes. This speedup is slightly lower than the 1.29 shown in Figure 1-(a) because our energy-centric optimizations reduce performance slightly.
Figure 7-(a) also shows that TLS4-3i is on average faster than Uni-6i: the speculative parallelism that TLS4-3i exploits in these hard-to-parallelize codes is more effective than doubling the issue width. This is a good result, especially because it conservatively assumes the same frequency for both chips. In practice, designing the wider-issue processor at this high frequency is likely to be more challenging.
Note that while the TLS4-3i speedup for most codes ranges from 1.10 to 1.35, mcf exhibits a higher speedup. As indicated in Section 7.3, mcf benefits from constructive data prefetching into the L2 by TLS tasks. Excluding mcf, the geometric mean of TLS4-3i's speedup is 1.18, which is still comparable to Uni-6i's, always assuming the same frequency.
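As a reminder of how these means are computed, the sketch below contrasts the geometric mean with and without an outlier. The per-application speedups are invented, not our measured ones.

```python
# Geometric-mean speedup with and without an outlier (invented speedups).
import math

speedup = {"bzip2": 1.15, "crafty": 1.20, "gap": 1.30, "gzip": 1.12,
           "mcf": 2.10, "parser": 1.10, "twolf": 1.25, "vortex": 1.18,
           "vpr": 1.22}

def geomean(values):
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

with_mcf = geomean(speedup.values())
without_mcf = geomean(v for app, v in speedup.items() if app != "mcf")
print(f"geomean with mcf: {with_mcf:.2f}, without mcf: {without_mcf:.2f}")
# A single large outlier can noticeably lift the mean, which is why we
# also report the geometric mean excluding mcf.
```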
On the other hand, Figure 7-(b) shows that the average on-chip power consumed by TLS4-3i is typically lower than Uni-6i's. On average, it is 15% lower. Moreover, it never reaches the high values that Uni-6i dissipates in some applications. If we compare Figure 7-(b) to Figure 1-(b), we see the effectiveness of our optimizations at reducing the power consumed by the four-core TLS CMP.
We also compare the average E × D² of TLS4-3i and Uni-6i. Unfortunately, due to lack of space, we cannot show the complete set of data. On average, TLS4-3i's E × D² is 7.6% lower than Uni-6i's. We conclude, therefore, that TLS4-3i is more energy-efficient than Uni-6i.
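For reference, E × D² is simply the product of a run's energy and the square of its delay (see [14]). The sketch below compares two hypothetical runs; the energy and delay values are invented, not our measurements.

```python
# E x D^2 comparison of two hypothetical runs (invented numbers).
# Lower is better; under ideal voltage-frequency scaling the metric is
# roughly independent of the operating point, which makes it a fair way
# to compare differently clocked designs (see [14]).

def e_d2(energy_j, delay_s):
    return energy_j * delay_s**2

ed2_tls4 = e_d2(energy_j=40.0, delay_s=0.95)  # hypothetical TLS4-3i run
ed2_uni6 = e_d2(energy_j=42.0, delay_s=0.97)  # hypothetical Uni-6i run
print(f"TLS4-3i E*D^2 is {(1 - ed2_tls4 / ed2_uni6) * 100:.1f}% lower")
```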
We can get further insight if we analytically apply ideal voltage-frequency scaling. We assume that performance is linearly proportional to frequency, and we scale frequency and voltage proportionally. We also assume that average dynamic power is proportional to the cube of the frequency, and that average leakage power is linearly proportional to the voltage [4]. Then, for each chip, we can derive a curve that relates average power consumption to performance:
relates the average power consumption with performance as:
«3
«
„
„
Speedupnew
Speedupnew
leak
total
dyn
+ Porig
×
Pnew
= Porig
×
Speeduporig
Speeduporig
Figure 8 shows the resulting curves for TLS4-3i and Uni-6i. Each curve follows possible speedup-power working points for one chip. The lower a curve is, the more energy-efficient the architecture is. Each curve shows one data point, which corresponds to the actual working conditions in our experiments.
[Figure 8: Ideal relation between speedup and average power. Curves for TLS2-3i, TLS4-3i, and Uni-6i; speedup (0.5 to 1.75) on the x-axis and average power in watts (0 to 80) on the y-axis.]
We can see that Uni-6i is less energy-efficient than TLS4-3i. If we scale down TLS4-3i's frequency until TLS4-3i's performance is equal to Uni-6i's, TLS4-3i consumes 20% less power than Uni-6i (horizontal arrow). Alternatively, if we scale down Uni-6i's frequency until Uni-6i's power is equal to TLS4-3i's, TLS4-3i is 13% faster than Uni-6i (vertical arrow).
Finally, Figure 8 also shows a curve for TLS2-3i. The data shows
that TLS2-3i has a very small efficiency advantage over TLS4-3i.
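To show how the horizontal and vertical arrows can be read off the model, the sketch below evaluates the scaling equation for two chips. The power and speedup values are invented placeholders, not the measurements behind Figure 8.

```python
# Ideal voltage-frequency scaling curves (invented operating points).
# P_total(s) = P_dyn * (s/s0)**3 + P_leak * (s/s0), where s0 is the
# measured speedup, and P_dyn, P_leak are the dynamic and leakage power
# at the original frequency.

def scaled_power(p_dyn, p_leak, s0, s_new):
    r = s_new / s0
    return p_dyn * r**3 + p_leak * r

# Hypothetical chips (placeholder numbers, not the paper's measurements):
tls4 = dict(p_dyn=46.0, p_leak=9.0, s0=1.27)   # a TLS4-3i-like point
uni6 = dict(p_dyn=52.0, p_leak=12.0, s0=1.23)  # a Uni-6i-like point

# Horizontal arrow: scale TLS4-3i down to Uni-6i's performance and
# compare power at equal speedup.
p_iso_perf = scaled_power(**tls4, s_new=uni6["s0"])
p_uni6 = uni6["p_dyn"] + uni6["p_leak"]
print(f"at equal performance, TLS4-3i uses {p_iso_perf / p_uni6:.2f}x the power")

# Vertical arrow: scale Uni-6i down to TLS4-3i's power and compare
# speedups at equal power; solve with bisection on the monotone curve.
def speedup_at_power(chip, target_p, lo=0.1, hi=4.0):
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if scaled_power(**chip, s_new=mid) < target_p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p_tls4 = tls4["p_dyn"] + tls4["p_leak"]
s_iso_power = speedup_at_power(uni6, p_tls4)
print(f"at equal power, TLS4-3i is {tls4['s0'] / s_iso_power:.2f}x faster")
```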
7.6 Summary
The fundamental reason why TLS CMPs can be more energy-efficient than wider-issue superscalars is that energy scales superlinearly, and performance sublinearly, with the size of processor structures. Consequently, multiple simple TLS cores can be more efficient than a single wide core, as long as (i) TLS's hardware overheads and (ii) TLS's wasted work are kept to a minimum.
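For intuition only, the sketch below plugs in assumed scaling exponents; neither exponent is measured in this paper.

```python
# Why several simple cores can win (assumed exponents, for intuition only).
# Suppose power grows roughly quadratically with issue width, while
# single-thread performance grows roughly with sqrt(width).

def perf(width):   # assumed sublinear performance scaling
    return width ** 0.5

def power(width):  # assumed superlinear power scaling
    return width ** 2

eff_3i = perf(3) / power(3)  # performance per watt of a simple core
eff_6i = perf(6) / power(6)  # performance per watt of a wide core
print(f"a 3-issue core delivers {eff_3i / eff_6i:.1f}x the performance/watt")
# Under these assumptions the simple core is ~2.8x more efficient, which
# leaves room to pay for TLS overheads and wasted work and still beat
# the wide core.
```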
We have addressed these issues with a lean TLS CMP microarchitecture and a set of energy-centric optimizations. The efficient operation of the final TLS4-3i and TLS2-3i designs is shown in Table 4. Specifically, on average 1.40 and 1.75 cores in TLS2-3i and TLS4-3i, respectively, are busy (Columns 2 and 3). Moreover, while busy, these cores execute instructions from squashed tasks for only 10.3% and 16.9% of the cycles, respectively (Columns 4 and 5). This is in contrast to NoOpt: on average, 2.00 cores are busy (Column 5 of Table 3), and 22.6% of the executed instructions belong to squashed tasks (Column 2 of Table 3).

              Busy CPUs             Squashed Instr. (%)
  Apps     TLS2-3i  TLS4-3i         TLS2-3i  TLS4-3i
  bzip2      1.17     1.36             4.6      7.5
  crafty     1.46     1.68            17.0     18.2
  gap        1.56     1.93            21.7     31.4
  gzip       1.40     1.48            14.5     13.2
  mcf        1.68     2.38             7.1     28.7
  parser     1.10     1.25             6.4     13.9
  twolf      1.29     1.62             1.7      4.4
  vortex     1.31     1.49             6.5      7.8
  vpr        1.58     2.58            12.9     27.2
  Avg        1.40     1.75            10.3     16.9

Table 4: Characterizing the optimized TLS4-3i and TLS2-3i chips.
8 Related Work
Past work on TLS CMP architectures has focused on performance rather than energy (e.g., [9, 10, 13, 21, 22, 25, 26]). There has also been work on reducing the energy consumed in the pipeline by instruction-level speculation past branch predictions [2, 12]. However, the issues addressed there are very different from ours.
Concurrently with our work, Petric and Roth [16] developed an infrastructure for selecting pre-execution (prefetching) threads in an SMT processor. To select threads, they use models that minimize execution time, energy consumption, or E × D². While their models are somewhat similar to those used in our TaskOpt optimization, the environments are very different. Our models focus on trading off performance and energy in the event of a task squash; such an event does not exist in their models. Unlike our tasks, their threads never get squashed, do not offload computation, are only spawned from the main thread, and are used in an SMT processor.
9 Conclusions
This paper refutes the claim that TLS consumes excessive energy and
power. Its thesis is based on three contributions. The first one is identifying the main sources of energy consumption in TLS: task squashing, structures for data versioning and dependence checking, additional traffic due to these two effects, and additional instructions. The
second contribution is proposing simple energy-saving optimizations
to mitigate these sources. These optimizations cut the energy cost of
TLS by over 60% on average, with minimal performance impact.
The third contribution is showing that the resulting TLS CMP offers a very desirable energy-performance trade-off, even for SPECint codes. A TLS CMP with four 3-issue cores delivers an average speedup of 1.27 over a 3-issue superscalar on full SPECint 2000 codes, while consuming only 25% more energy. Moreover, compared to a 6-issue superscalar at the same frequency, the TLS CMP is on average faster, while consuming only 85% of its total on-chip power and yielding a 7.6% lower E × D².
We hope that this work helps propel TLS into mainstream microprocessors. CMPs are attractive because they are more energy-efficient, more scalable, and less complex than wide-issue superscalars. Moreover, they have an advantage for explicitly-parallel codes. In this paper, we showed that TLS CMPs can also speed up these most challenging SPECint codes, with lower power and energy consumption than wide superscalars. We expect better results for more parallel codes.
REFERENCES
[1] International Technology Roadmap for Semiconductors. Semiconductor
Industry Association, 2002.
[2] J. L. Aragon, J. Gonzalez, and A. Gonzalez. Power-Aware Control
Speculation Through Selective Throttling. In International Symposium
on High-Performance Computer Architecture, pages 103–112, February
2003.
[3] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a Framework for
Architectural-Level Power Analysis and Optimizations. In International
Symposium on Computer Architecture, pages 83–94, June 2000.
[4] J. A. Butts and G. S. Sohi. A Static Power Model for Architects. In International Symposium on Microarchitecture, pages 191–201, December
2000.
[5] M. Cintra, J. F. Martínez, and J. Torrellas. Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors. In International Symposium on Computer Architecture, pages 13–24, June 2000.
[6] M. J. Garzarán, M. Prvulovic, J. M. Llabería, V. Viñals, L. Rauchwerger, and J. Torrellas. Tradeoffs in Buffering Memory State for Thread-Level Speculation in Multiprocessors. In International Symposium on High-Performance Computer Architecture, pages 191–202, February 2003.
[7] SSA for Trees - GNU Project, May 2003. http://www.gccsummit.org/2003/view_abstract.php?talk=2.
[8] S. Gopal, T. Vijaykumar, J. Smith, and G. Sohi. Speculative Versioning
Cache. In International Symposium on High-Performance Computer
Architecture, pages 195–205, February 1998.
[9] L. Hammond, M. Willey, and K. Olukotun. Data Speculation Support
for a Chip Multiprocessor. In International Conference on Architectural
Support for Programming Languages and Operating Systems, pages 58–
69, October 1998.
[10] V. Krishnan and J. Torrellas. A Chip-Multiprocessor Architecture with
Speculative Multithreading. IEEE Trans. on Computers, pages 866–880,
September 1999.
[11] R. Kumar, K. Farkas, N. Jouppi, P. Ranganathan, and D. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In International Symposium on Microarchitecture,
December 2003.
[12] S. Manne, A. Klauser, and D. Grunwald. Pipeline Gating: Speculation
Control for Energy Reduction. In International Symposium on Computer Architecture, pages 132–141, July 1998.
[13] P. Marcuello and A. Gonzalez. Clustered Speculative Multithreaded
Processors. In International Conference on Supercomputing, pages 365–
372, June 1999.
[14] A. J. Martin, M. Nystroem, and P. Penzes. ET2: A Metric for Time and
Energy Efficiency of Computation. Technical Report CSTR:2001.007,
California Institute of Technology, December 2001.
[15] S. Palacharla, N. P. Jouppi, and J. E. Smith. Complexity-Effective Superscalar Processors. In International Symposium on Computer Architecture, June 1997.
[16] V. Petric and A. Roth. Energy-Effectiveness of Pre-Execution and
Energy-Aware P-Thread Selection. Technical Report MS-CIS-03-34,
University of Pennsylvania, November 2003.
[17] M. Prvulovic, M. J. Garzarán, L. Rauchwerger, and J. Torrellas. Removing Architectural Bottlenecks to the Scalability of Speculative Parallelization. In International Symposium on Computer Architecture, pages
204–215, June 2001.
[18] J. Renau. Chip Multiprocessors with Speculative Multithreading: Design for Performance and Energy Efficiency. PhD thesis, University of
Illinois at Urbana-Champaign, 2004.
[19] J. Renau, J. Tuck, W. Liu, L. Ceze, K. Strauss, and J. Torrellas. Tasking
with Out-of-Order Spawn in TLS Chip Multiprocessors: Microarchitecture and Compilation. In International Conference on Supercomputing,
June 2005.
[20] P. Shivakumar and N. Jouppi. CACTI 3.0: An Integrated Cache Timing,
Power and Area Model. Technical Report 2001/2, Compaq Computer
Corporation, August 2001.
[21] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar Processors.
In International Symposium on Computer Architecture, pages 414–425,
June 1995.
[22] J. Steffan, C. Colohan, A. Zhai, and T. Mowry. A Scalable Approach
to Thread-Level Speculation. In International Symposium on Computer
Architecture, pages 1–12, June 2000.
[23] J. Steffan, C. Colohan, A. Zhai, and T. Mowry. Improving Value Communication for Thread-Level Speculation. In International Symposium
on High-Performance Computer Architecture, February 2002.
[24] H. Su, F. Liu, A. Devgan, E. Acar, and S. Nassif. Full Chip Leakage
Estimation Considering Power Supply and Temperature Variations. In
International Symposium on Low Power Electronics and Design, August
2003.
[25] M. Tremblay. MAJC: Microprocessor Architecture for Java Computing.
Hot Chips, August 1999.
[26] J. Tsai, J. Huang, C. Amlo, D. Lilja, and P. Yew. The Superthreaded
Processor Architecture. IEEE Trans. on Computers, 48(9):881–902,
September 1999.
[27] J. Tuck. A Novel Compiler Framework for a Chip-Multiprocessor Architecture with Thread-Level Speculation. Master’s thesis, University of
Illinois at Urbana-Champaign, 2004.
[28] H. S. Wang, X. P. Zhu, L. S. Peh, and S. Malik. Orion: A Power-Performance Simulator for Interconnection Networks. In International Symposium on Microarchitecture, December 2002.
[29] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan.
HotLeakage: A Temperature-Aware Model of Subthreshold and Gate
Leakage for Architects. Technical Report CS-2003-05, University of
Virginia, Department of Computer Science, March 2003.