Hyper-Threading Technology
Shaik Mastanvali (06951A0541)
ABSTRACT:
Hyper-Threading Technology makes a single physical processor appear as two logical
processors. The physical execution resources are shared and the architecture state is
duplicated for the two logical processors. From a software or architecture perspective,
this means operating systems and user programs can schedule processes or threads to
logical processors as they would on multiple physical processors. From a
microarchitecture perspective, this means that instructions from both logical processors
will persist and execute simultaneously on shared execution resources.
• PROCESSOR MICROARCHITECTURE
• HYPER THREADING TECHNOLOGY ARCHITECTURE
• APPLICATIONS
• PERFORMANCE
• COMPATIBILITY
• CONCLUSION
INTRODUCTION
The amazing growth of the Internet and telecommunications is powered by ever-
faster systems demanding increasingly higher levels of processor performance. To keep
up with this demand, we cannot rely entirely on traditional approaches to processor
design. The microarchitecture techniques used to achieve past processor performance
improvements (super-pipelining, branch prediction, super-scalar execution, out-of-order
execution, and caches) have made microprocessors increasingly complex, with more
transistors and higher power consumption. In fact, transistor counts and power are increasing
at rates greater than processor performance. Processor architects are therefore looking for
ways to improve performance at a greater rate than transistor counts and power
dissipation. Intel’s Hyper-Threading Technology is one solution.
Processor Microarchitecture
Traditional approaches to processor design have focused on higher clock speeds,
instruction-level parallelism (ILP), and caches. Techniques to achieve higher clock
speeds involve pipelining the microarchitecture to finer granularities, also called super-
pipelining. Higher clock frequencies can greatly improve performance by increasing the
number of instructions that can be executed each second. Because there will be far more
instructions in-flight in a superpipelined microarchitecture, handling of events that
disrupt the pipeline, e.g., cache misses, interrupts and branch mispredictions, can be
costly.
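To make the cost concrete, the short sketch below (purely hypothetical pipeline depths and
fetch widths, not figures for any particular Intel processor) estimates how the work lost on
a pipeline flush grows with pipeline depth:

#include <cstdio>

int main() {
    // Purely hypothetical numbers, for illustration only.
    const int fetch_width = 3;            // instructions entering the pipe per cycle
    const int depths[] = {5, 10, 20};     // stages before a mispredicted branch resolves
    for (int d : depths) {
        int in_flight = fetch_width * d;  // wrong-path work that a flush discards
        std::printf("depth %2d: about %2d cycles of refill delay, up to %2d instructions flushed\n",
                    d, d, in_flight);
    }
    return 0;
}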
Figure 1 shows the relative increase in performance and the costs, such as die size and
power, over the last ten years on Intel processors. In order to isolate the
microarchitecture impact, this comparison assumes that the four generations of
processors are on the same silicon process technology and that the speed-ups are
normalized to the performance of an Intel486 processor. Although we use Intel’s
processor history in this example, other high-performance processor manufacturers
during this time period show similar trends. Due to microarchitecture advances alone,
Intel’s integer performance has improved five- or six-fold.
Most integer applications have limited ILP and the instruction flow can be hard to
predict.
Over the same period, the relative die size has gone up fifteen-fold, a three-times-
higher rate than the gains in integer performance. Fortunately, advances in silicon process
technology allow more transistors to be packed into a given amount of die area so that the
actual measured die size of each generation microarchitecture has not increased
significantly.
Thread-Level Parallelism
In both the high-end and mid-range server markets, multiprocessors have been
commonly used to get more performance from the system. By adding more processors,
applications potentially get substantial performance improvement by executing multiple
threads on multiple processors at the same time. These threads might be from the same
application, from different applications running simultaneously, from operating system
services, or from operating system threads doing background maintenance.
Multiprocessor systems have been used for many years, and high-end programmers are
familiar with the techniques to exploit multiprocessors for higher performance levels.
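As a minimal illustration of thread-level parallelism in software (ordinary C++ with
std::thread, not tied to any particular processor), the sketch below splits a simple
summation across several threads and lets the operating system place those threads on
whatever processors are available:

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1 << 20, 1);
    const unsigned nthreads = std::max(2u, std::thread::hardware_concurrency());
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> workers;

    const std::size_t chunk = data.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == nthreads) ? data.size() : begin + chunk;
        // Each thread sums its own slice; the OS may run the threads on
        // different processors (or logical processors) at the same time.
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        });
    }
    for (auto& w : workers) w.join();

    long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    std::printf("sum = %lld using %u threads\n", total, nthreads);
    return 0;
}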
In recent years a number of other techniques to further exploit thread-level parallelism (TLP) have been
discussed and some products have been announced. One of these techniques is chip
multiprocessing (CMP), where two processors are put on a single die. The two processors
each have a full set of execution and architectural resources. The processors may or may
not share a large on-chip cache. CMP is largely orthogonal to conventional
multiprocessor systems, as you can have multiple CMP processors in a multiprocessor
configuration. Recently announced processors incorporate two processors on each die.
However, a CMP chip is significantly larger than a single-core chip and therefore more
expensive to manufacture; moreover, it does not begin to address the die
size and power considerations.
HYPER-THREADING TECHNOLOGY ARCHITECTURE
Each logical processor maintains a complete set of the architecture state. The architecture
state consists of registers including the general-purpose registers, the control registers, the
advanced programmable interrupt controller (APIC) registers, and some machine state
registers. From a software perspective, once the architecture state is duplicated, the
processor appears to be two processors. The number of transistors to store the
architecture state is an extremely small fraction of the total. Logical processors share
nearly all other resources on the physical processor, such as caches, execution units,
branch predictors, control logic, and buses.
Each logical processor has its own interrupt controller or APIC. Interrupts sent to
a specific logical processor are handled only by that logical processor.
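Because the operating system simply sees additional processors, ordinary scheduling and
affinity interfaces apply to logical processors as well. The sketch below assumes a Linux
system and uses sched_setaffinity to pin the calling thread to logical CPU 0 (an arbitrary
choice); it is illustrative only:

#include <cstdio>
#include <sched.h>      // sched_setaffinity, CPU_SET (Linux-specific)
#include <thread>

int main() {
    // Each logical processor shows up as an ordinary CPU to the OS.
    unsigned logical_cpus = std::thread::hardware_concurrency();
    std::printf("OS reports %u logical processors\n", logical_cpus);

    // Pin the calling thread to logical CPU 0 (arbitrary choice).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("thread pinned to logical CPU 0\n");
    return 0;
}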
The first design goal was to minimize the die area cost of implementing Hyper-Threading
Technology. A second goal was to ensure that when one logical processor is stalled, the
other logical processor can continue to make forward progress. A third goal was to allow
a processor running only one active software thread to
run at the same speed on a processor with Hyper-Threading Technology as on a processor
without this capability. This means that partitioned resources should be recombined when
only one software thread is active. A high-level view of the microarchitecture pipeline is
shown in Figure 4. As shown, buffering queues separate major pipeline logic blocks. The
buffering queues are either partitioned or duplicated to ensure independent forward
progress through each logic block.
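The following toy model (hypothetical sizes, not the real hardware structure) sketches the
idea of a buffering queue that is partitioned between two logical processors and recombined
when only one software thread is active:

#include <cstdio>

// Toy model of a buffering queue that is split in half while two logical
// processors are active and recombined when only one is active.
struct PartitionedQueue {
    static const int kEntries = 8;   // hypothetical size, for illustration only
    int used[2] = {0, 0};            // entries currently held by each logical processor
    bool two_threads_active = true;

    int per_lp_limit() const {
        // Half the entries each in multi-thread mode; the whole queue otherwise.
        return two_threads_active ? kEntries / 2 : kEntries;
    }
    bool try_allocate(int lp) {
        if (used[lp] >= per_lp_limit()) return false;  // this LP must wait
        ++used[lp];
        return true;
    }
    void release(int lp) { if (used[lp] > 0) --used[lp]; }
};

int main() {
    PartitionedQueue q;
    for (int i = 0; i < 5; ++i)
        std::printf("MT mode, LP0 entry %d: %s\n", i, q.try_allocate(0) ? "allocated" : "stalled");
    q.two_threads_active = false;    // the other logical processor went idle
    std::printf("ST mode, LP0 extra entry: %s\n", q.try_allocate(0) ? "allocated" : "stalled");
    return 0;
}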
In the following sections we will walk through the pipeline, discuss the implementation
of major functions, and detail several ways resources are shared or replicated.
FRONT END
The front end of the pipeline is responsible for delivering instructions to the later pipe
stages. As shown in Figure 5a, instructions generally come from the Execution Trace
Cache (TC), which is the primary or Level 1 (L1) instruction cache. Figure 5b shows that
only when there is a TC miss does the machine fetch and decode instructions from the
integrated Level 2 (L2) cache. Near the TC is the Microcode ROM, which stores decoded
instructions for the longer and more complex IA-32 instructions.
Figure 5: Front-end detailed pipeline (a) Trace Cache Hit (b) Trace Cache Miss
ITLB and Branch Prediction
If there is a TC miss, then instruction bytes need to be fetched from the L2 cache
and decoded into uops to be placed in the TC. The Instruction Translation Lookaside
Buffer (ITLB) receives the request from the TC to deliver new instructions, and it
translates the next-instruction pointer address to a physical address. A request is sent to
the L2 cache, and instruction bytes are returned. These bytes are placed into streaming
buffers, which hold the bytes until they can be decoded.
The ITLBs are duplicated. Each logical processor has its own ITLB and its own
set of instruction pointers to track the progress of instruction fetch for the two logical
processors. The instruction fetch logic in charge of sending requests to the L2 cache
arbitrates on a first-come first-served basis, while always reserving at least one request
slot for each logical processor. In this way, both logical processors can have fetches
pending simultaneously.
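A toy arbiter along these lines (the slot count is made up for illustration) might look as
follows: requests are serviced first-come first-served, but one request slot is always kept
available for a logical processor that has nothing outstanding:

#include <cstdio>

// Toy arbiter: requests are serviced first-come first-served, but one
// request slot is always held back for each logical processor so that
// neither can be starved by the other.
struct FetchArbiter {
    static const int kSlots = 4;          // hypothetical number of outstanding L2 requests
    int pending[2] = {0, 0};

    bool try_issue(int lp) {
        int other = 1 - lp;
        int total = pending[0] + pending[1];
        // Leave one slot free for the other LP if it has none outstanding.
        int reserved_for_other = (pending[other] == 0) ? 1 : 0;
        if (total + reserved_for_other >= kSlots) return false;
        ++pending[lp];
        return true;
    }
    void complete(int lp) { if (pending[lp] > 0) --pending[lp]; }
};

int main() {
    FetchArbiter arb;
    // LP0 tries to fill every slot; the last slot is kept free for LP1.
    for (int i = 0; i < 5; ++i)
        std::printf("LP0 request %d: %s\n", i, arb.try_issue(0) ? "issued" : "held");
    std::printf("LP1 request 0: %s\n", arb.try_issue(1) ? "issued" : "held");
    return 0;
}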
Each logical processor has its own set of two 64-byte streaming buffers to hold
instruction bytes in preparation for the instruction decode stage. The ITLBs and the
streaming buffers are small structures, so the die size cost of duplicating these structures
is very low.
The branch prediction structures are either duplicated or shared. The return stack
buffer, which predicts the target of return instructions, is duplicated because it is a very
small structure and the call/return pairs are better predicted for software threads
independently. The branch history buffer used to look up the global history array is also
tracked independently for each logical processor. However, the large global history array
is a shared structure with entries that are tagged with a logical processor ID.
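A toy sketch of this arrangement (simplified indexing and sizes, not the real predictor)
keeps one return stack per logical processor and a single shared global array whose entries
carry a logical-processor ID:

#include <cstdint>
#include <cstdio>
#include <vector>

// Toy model: per-logical-processor return stacks, plus one shared global
// prediction array whose entries carry a logical-processor ID in the tag.
struct BranchPredictors {
    std::vector<uint32_t> return_stack[2];               // duplicated per logical processor
    struct Entry { uint32_t tag; int lp; bool taken; };
    std::vector<Entry> global_array = std::vector<Entry>(1024);   // shared

    void push_call(int lp, uint32_t return_addr) { return_stack[lp].push_back(return_addr); }
    uint32_t predict_return(int lp) {
        if (return_stack[lp].empty()) return 0;
        uint32_t target = return_stack[lp].back();
        return_stack[lp].pop_back();
        return target;
    }
    bool predict_branch(int lp, uint32_t pc, uint32_t history) {
        const Entry& e = global_array[(pc ^ history) % global_array.size()];
        // The entry only applies if both the tag and the LP id match.
        return (e.tag == pc && e.lp == lp) ? e.taken : false;
    }
    void update_branch(int lp, uint32_t pc, uint32_t history, bool taken) {
        global_array[(pc ^ history) % global_array.size()] = Entry{pc, lp, taken};
    }
};

int main() {
    BranchPredictors bp;
    bp.push_call(0, 0x4000);
    bp.update_branch(1, 0x5000, 0x3, true);
    std::printf("LP0 return -> 0x%x\n", bp.predict_return(0));
    std::printf("LP1 branch predicted taken? %d\n", bp.predict_branch(1, 0x5000, 0x3));
    return 0;
}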
The decode logic takes instruction bytes from the streaming buffers and decodes
them into uops. When both threads are decoding instructions simultaneously, the
streaming buffers alternate between threads so that both threads share the same decoder
logic. The decode logic has to keep two copies of all the state needed to decode IA-32
instructions for the two logical processors even though it only decodes instructions for
one logical processor at a time. In general, several instructions are decoded for one
logical processor before switching to the other logical processor. The decision to do a
coarser level of granularity in switching between logical processors was made in the
interest of die size and to reduce complexity. Of course, if only one logical processor
needs the decode logic, the full decode bandwidth is dedicated to that logical processor.
The decoded instructions are written into the TC and forwarded to the uop queue.
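The coarse-grained switching policy can be sketched as follows (the batch size and
instruction counts are made up for illustration):

#include <cstdio>

// Toy model of coarse-grained switching: decode a batch of instructions for
// one logical processor before handing the decoder to the other.
int main() {
    const int kBatch = 4;                 // hypothetical batch size per turn
    int remaining[2] = {10, 3};           // instructions waiting per logical processor
    int lp = 0;
    while (remaining[0] > 0 || remaining[1] > 0) {
        if (remaining[lp] == 0) { lp = 1 - lp; continue; }   // an idle LP gives up its turn
        int n = remaining[lp] < kBatch ? remaining[lp] : kBatch;
        remaining[lp] -= n;
        std::printf("decoded %d instructions for LP%d\n", n, lp);
        lp = 1 - lp;                      // switch after a whole batch, not per instruction
    }
    return 0;
}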
Uop Queue
After uops are fetched from the trace cache or the Microcode ROM, or forwarded
from the instruction decode logic, they are placed in a “uop queue.” This queue decouples
the Front End from the Out-of-order Execution Engine in the pipeline flow. The uop
queue is partitioned such that each logical processor has half the entries. This partitioning
allows both logical processors to make independent forward progress regardless of front-
end stalls (e.g., TC miss) or execution stalls.
Allocator
The out-of-order execution engine has several buffers to perform its re-ordering,
tracing, and sequencing operations. The allocator logic takes uops from the uop queue
and allocates many of the key machine buffers needed to execute each uop, including the
126 re-order buffer entries, 128 integer and 128 floating-point physical registers, 48 load
and 24 store buffer entries. Some of these key buffers are partitioned such that each
logical processor can use at most half the entries.
Figure 6: Out-of-order execution engine detailed pipeline
By limiting the maximum resource usage of key buffers, the machine helps
enforce fairness and prevents deadlocks.
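A toy allocator that enforces the per-logical-processor limits described above (using the
buffer sizes quoted in the text, but otherwise greatly simplified) could look like this:

#include <cstdio>

// Toy allocator: each logical processor may use at most half the entries of
// each partitioned buffer (sizes taken from the text above).
struct Allocator {
    struct Buffer { int cap; int used[2]; };
    Buffer rob{126, {0, 0}};
    Buffer load{48, {0, 0}};
    Buffer store{24, {0, 0}};

    static bool take(Buffer& b, int lp, int n) {
        if (b.used[lp] + n > b.cap / 2) return false;   // per-LP limit: half the entries
        b.used[lp] += n;
        return true;
    }
    // Allocate one uop that needs a ROB entry and (optionally) a load/store entry.
    // A real allocator would roll back earlier allocations on failure; omitted here.
    bool allocate(int lp, bool is_load, bool is_store) {
        if (!take(rob, lp, 1)) return false;
        if (is_load && !take(load, lp, 1)) return false;
        if (is_store && !take(store, lp, 1)) return false;
        return true;
    }
};

int main() {
    Allocator a;
    int ok = 0;
    while (a.allocate(0, /*is_load=*/false, /*is_store=*/false)) ++ok;
    std::printf("LP0 allocated %d ROB entries before stalling (limit = 63)\n", ok);
    return 0;
}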
Register Rename
The register rename logic renames the architectural IA- 32 registers onto the
machine’s physical registers. This allows the 8 general-use IA-32 integer registers to be
dynamically expanded to use the available 128 physical registers. The renaming logic
uses a Register Alias Table (RAT) to track the latest version of each architectural register
to tell the next instruction(s) where to get their input operands.
Since each logical processor must maintain and track its own complete
architecture state, there are two RATs, one for each logical processor. The register
renaming process is done in parallel to the allocator logic described above, so the register
rename logic works on the same uops to which the allocator is assigning resources.
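A toy renamer along these lines (greatly simplified; real renaming also handles flags,
partial registers, and misprediction recovery) keeps one RAT per logical processor over a
single shared pool of physical registers:

#include <cstdio>
#include <vector>

// Toy register renamer: one Register Alias Table per logical processor,
// both drawing physical registers from a single shared free list.
struct Renamer {
    static const int kArchRegs = 8, kPhysRegs = 128;
    int rat[2][kArchRegs];              // latest physical register per architectural register
    std::vector<int> free_list;

    Renamer() {
        for (int lp = 0; lp < 2; ++lp)
            for (int r = 0; r < kArchRegs; ++r)
                rat[lp][r] = lp * kArchRegs + r;          // arbitrary initial mapping
        for (int p = 2 * kArchRegs; p < kPhysRegs; ++p)
            free_list.push_back(p);
    }
    // Rename "dest = f(src)" for one logical processor.
    bool rename(int lp, int dest, int src, int* phys_src, int* phys_dest) {
        if (free_list.empty()) return false;              // no physical register free
        *phys_src = rat[lp][src];                         // read the latest mapping
        *phys_dest = free_list.back();
        free_list.pop_back();
        rat[lp][dest] = *phys_dest;                       // update this LP's RAT only
        return true;
    }
};

int main() {
    Renamer r;
    int ps = 0, pd = 0;
    r.rename(0, /*dest=*/1, /*src=*/2, &ps, &pd);
    std::printf("LP0: arch r1 now maps to physical %d (src r2 was physical %d)\n", pd, ps);
    r.rename(1, 1, 2, &ps, &pd);
    std::printf("LP1: arch r1 now maps to physical %d (its own RAT, untouched by LP0)\n", pd);
    return 0;
}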
Once uops have completed the allocation and register rename processes, they are placed
into two sets of queues, one for memory operations (loads and stores) and another for all
other operations. The two sets of queues are called the memory instruction queue and the
general instruction queue, respectively. The two sets of queues are also partitioned such
that uops from each logical processor can use at most half the entries.
Instruction Scheduling
The schedulers are at the heart of the out-of-order execution engine. Five uop
schedulers are used to schedule different types of uops for the various execution units.
Collectively, they can dispatch up to six uops each clock cycle. The schedulers determine
when uops are ready to execute based on the readiness of their dependent input register
operands and the availability of the execution unit resources.
The memory instruction queue and general instruction queues send uops to the
five scheduler queues as fast as they can, alternating between uops for the two logical
processors every clock cycle, as needed.
Each scheduler has its own scheduler queue of eight to twelve entries from which
it selects uops to send to the execution units. The schedulers choose uops regardless of
whether they belong to one logical processor or the other. The schedulers are effectively
oblivious to logical processor distinctions. The uops are simply evaluated based on
dependent inputs and availability of execution resources. For example, the schedulers
could dispatch two uops from one logical processor and two uops from the other logical
processor in the same clock cycle. To avoid deadlock and ensure fairness, there is a limit
on the number of active entries that a logical processor can have in each scheduler’s
queue. This limit is dependent on the size of the scheduler queue.
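A toy scheduler queue that captures these two rules, readiness-based dispatch that ignores
logical processors plus a per-logical-processor occupancy limit, might be sketched as
follows (all sizes are illustrative):

#include <cstdio>
#include <vector>

// Toy scheduler queue: a uop dispatches when its inputs are ready, regardless
// of which logical processor it belongs to, but each LP may occupy only a
// limited number of the queue's entries.
struct Scheduler {
    struct Uop { int lp; bool inputs_ready; };
    static const int kQueueSize = 8, kPerLpLimit = 5;   // limit depends on queue size
    std::vector<Uop> queue;

    bool insert(const Uop& u) {
        int count = 0;
        for (const Uop& q : queue) if (q.lp == u.lp) ++count;
        if ((int)queue.size() >= kQueueSize || count >= kPerLpLimit) return false;
        queue.push_back(u);
        return true;
    }
    // Dispatch every ready uop this cycle, up to an execution-port limit.
    int dispatch(int max_per_cycle) {
        int sent = 0;
        for (std::size_t i = 0; i < queue.size() && sent < max_per_cycle; ) {
            if (queue[i].inputs_ready) { queue.erase(queue.begin() + i); ++sent; }
            else ++i;
        }
        return sent;
    }
};

int main() {
    Scheduler s;
    s.insert({0, true});      // ready uop from LP0
    s.insert({1, true});      // ready uop from LP1
    s.insert({0, false});     // LP0 uop still waiting on an operand
    std::printf("dispatched %d uops this cycle\n", s.dispatch(/*max_per_cycle=*/6));
    return 0;
}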
Execution Units
The execution core and memory hierarchy are also largely oblivious to logical
processors. Since the source and destination registers were renamed earlier to physical
registers in a shared physical register pool, uops merely access the physical register file to
get their destinations, and they write results back to the physical register file. Comparing
physical register numbers enables the forwarding logic to forward results to other
executing uops without having to understand logical processors.
After execution, the uops are placed in the re-order buffer. The re-order buffer
decouples the execution stage from the retirement stage. The re-order buffer is partitioned
such that each logical processor can use half the entries.
Retirement
Uop retirement logic commits the architecture state in program order. The
retirement logic tracks when uops from the two logical processors are ready to be retired,
then retires the uops in program order for each logical processor, alternating between the
two logical processors. If one logical processor is not ready to retire any uops, all
retirement bandwidth is dedicated to the other logical processor.
Once stores have retired, the store data needs to be written into the level-one data
cache. Selection logic alternates between the two logical processors to commit store data
to the cache.
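The alternating retirement policy can be sketched as a small toy model (the uop identifiers
are arbitrary):

#include <cstdio>
#include <deque>

// Toy retirement logic: alternate between the two logical processors,
// retiring each LP's uops strictly in its own program order; if one LP has
// nothing ready, the other gets the full retirement bandwidth.
int main() {
    std::deque<int> ready[2] = { {101, 102, 103}, {201} };   // ready uop ids per LP
    int turn = 0;
    while (!ready[0].empty() || !ready[1].empty()) {
        if (ready[turn].empty()) turn = 1 - turn;            // give bandwidth to the other LP
        std::printf("retire uop %d from LP%d\n", ready[turn].front(), turn);
        ready[turn].pop_front();
        turn = 1 - turn;                                     // alternate back and forth
    }
    return 0;
}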
MEMORY SUBSYSTEM
The memory subsystem includes the DTLB, the low-latency Level 1 (L1) data
cache, the Level 2 (L2) unified cache, and the Level 3 (L3) unified cache (the L3 cache is
only available on the Intel Xeon processor MP). Access to the memory subsystem is
also largely oblivious to logical processors. The schedulers send load or store uops
without regard to logical processors and the memory subsystem handles them as they
come.
DTLB
The DTLB translates data addresses to physical addresses. It is a shared structure, but
each of its entries includes a logical processor ID tag, so the translations of the two
logical processors are kept distinct.
L1 Data Cache, L2 Cache, and L3 Cache
The L1 data cache is 4-way set associative with 64-byte lines. It is a write-through cache,
meaning that writes are always copied to the L2 cache. The L1 data cache is virtually
addressed and physically tagged. The L2 and L3 caches are 8-way set associative with
128-byte lines. The L2 and L3 caches are physically addressed. Both logical processors
can share all entries in all three levels of cache, regardless of which logical processor's
uops initially brought the data into the cache.
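A toy write-through cache model (a map standing in for real tag and data arrays)
illustrates the two properties described above: stores are always copied to the next level,
and cache lines carry no logical-processor tag, so both logical processors share every
entry:

#include <cstdint>
#include <cstdio>
#include <map>

// Toy write-through cache: a store updates the line in this level and is
// always forwarded to the next level as well. Lines are not tagged with a
// logical processor ID, so both logical processors share every entry.
struct Cache {
    std::map<uint64_t, int> lines;   // line address -> data (stand-in for a real line)
    Cache* next_level;

    void store(uint64_t addr, int data) {
        lines[addr / 64] = data;                        // 64-byte lines (write-allocate here
                                                        // only to keep the toy model short)
        if (next_level) next_level->store(addr, data);  // write-through: always copy onward
    }
    bool load(uint64_t addr, int* data) {
        auto it = lines.find(addr / 64);
        if (it == lines.end()) return false;            // miss: would go to the next level
        *data = it->second;
        return true;
    }
};

int main() {
    Cache l2{{}, nullptr};
    Cache l1{{}, &l2};
    l1.store(0x1000, 42);   // a store from either logical processor
    int v = 0;
    std::printf("L1 hit: %d, value %d; copy also in L2: %d\n",
                (int)l1.load(0x1000, &v), v, (int)(l2.lines.count(0x1000 / 64) > 0));
    return 0;
}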
Because logical processors can share data in the cache, there is the potential for
cache conflicts, which can result in lower observed performance. However, there is also
the possibility for sharing data in the cache. For example, one logical processor may
prefetch instructions or data, needed by the other, into the cache; this is common in server
application code. In a producer-consumer usage model, one logical processor may
produce data that the other logical processor wants to use. In such cases, there is the
potential for good performance benefits.
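The producer-consumer case can be illustrated with two ordinary software threads
(standard C++; whether they land on the two logical processors of one physical package is
up to the operating system):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Toy producer-consumer pair: one thread fills a shared buffer, the other
// consumes it. If the two threads run on the two logical processors of one
// physical package, the data they exchange can stay in the shared caches.
int main() {
    std::vector<int> buffer(1024, 0);
    std::atomic<bool> ready(false);

    std::thread producer([&] {
        for (std::size_t i = 0; i < buffer.size(); ++i) buffer[i] = (int)i;
        ready.store(true, std::memory_order_release);      // publish the data
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {}  // spin until published
        long long sum = 0;
        for (int v : buffer) sum += v;
        std::printf("consumer saw sum = %lld\n", sum);
    });

    producer.join();
    consumer.join();
    return 0;
}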
BUS
Logical processor memory requests not satisfied by the cache hierarchy are
serviced by the bus logic. The bus logic includes the local APIC interrupt controller, as
well as off-chip system memory and I/O space. Bus logic also deals with cacheable
address coherency (snooping) of requests originated by other external bus agents, plus
incoming interrupt request delivery via the local APICs.
From a service perspective, requests from the logical processors are treated on a
first-come basis, with queue and buffering space appearing shared. Priority is not given to
one logical processor above the other. Distinctions between requests from the logical
processors are reliably maintained in the bus queues nonetheless. Requests to the local
APIC and interrupt delivery resources are unique and separate per logical processor. Bus
logic also carries out portions of barrier fence and memory ordering operations, which are
applied to the bus request queues on a per logical processor basis.
PERFORMANCE
The Intel® Xeon™ processor family delivers the highest server system
performance of any IA-32 Intel architecture processor introduced to date. Initial
benchmark tests show up to a 65% performance increase on high-end server applications
when compared to the previous-generation Pentium® III Xeon™ processor on 4-way
server platforms. A significant portion of those gains can be attributed to Hyper-
Threading Technology.
Figure 8: Performance increases from Hyper-Threading Technology on an OLTP workload
All the performance results quoted above are normalized to ensure that readers
focus on the relative performance and not the absolute performance.
Performance tests and ratings are measured using specific computer systems
and/or components and reflect the approximate performance of Intel products as
measured by those tests. Any difference in system hardware or software design or
configuration may affect actual performance. Buyers should consult other sources of
information to evaluate the performance of systems or components they are considering
purchasing.
CONCLUSION