SMT and CMP Architectures

The key takeaways are that there are different techniques like instruction level parallelism, thread level parallelism and simultaneous multithreading to exploit parallelism in modern processors. Different techniques like superscalar, multithreading and chip multiprocessing are used to achieve parallelism.

The different types of parallelism discussed are instruction level parallelism, thread level parallelism, fine grained multithreading, coarse grained multithreading and simultaneous multithreading.

Fine-grained multithreading switches threads on every clock cycle, which hides the latency of both short and long stalls. Coarse-grained multithreading switches threads only on costly stalls such as L2 cache misses, which avoids slowing down ready threads but limits its ability to hide shorter stalls.

INTRODUCTION

DINESH

Contemporary forms of parallelism

Instruction-level parallelism (ILP)

Wide-issue superscalar processors (SS)

4 or more instructions per cycle
Execute a single program or thread
Attempt to find multiple instructions to issue each cycle
Out-of-order execution: instructions are sent to execution units based on instruction dependencies rather than program order
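A minimal sketch of dependency-driven issue, to illustrate the out-of-order point above. The function name `schedule`, the instruction names, and the one-cycle latency are assumptions made for illustration; this is a toy model, not how any particular processor is built.

```python
def schedule(instrs, width=4):
    """Toy out-of-order issue model (illustrative assumption, not a real design).

    instrs: list of (name, set_of_dependency_names) in program order.
    Returns a list of cycles; each cycle is the list of names issued in it.
    Every instruction is assumed to complete one cycle after it issues.
    """
    done = set()
    pending = list(instrs)
    cycles = []
    while pending:
        # An instruction is ready once all of its producers have completed.
        ready = [name for name, deps in pending if deps <= done][:width]
        if not ready:
            raise ValueError("circular or unsatisfied dependency")
        cycles.append(ready)
        done.update(ready)
        pending = [(n, d) for n, d in pending if n not in done]
    return cycles


program = [
    ("i1", set()),
    ("i2", {"i1"}),
    ("i3", set()),
    ("i4", {"i3"}),
    ("i5", {"i2", "i4"}),
]
print(schedule(program, width=4))
# [['i1', 'i3'], ['i2', 'i4'], ['i5']]
```

Note that i3 issues before i2 even though it comes later in program order, and that only two of the four issue slots are used per cycle because the program has limited ILP.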

Thread-level parallelism (TLP)

Fine-grained multithreaded superscalars (FGMS)

Contain hardware state for several threads
Execute multiple threads
On any given cycle, the processor executes instructions from one of the threads

Multiprocessor (MP)

Performance is improved by adding more CPUs

Simultaneous Multithreading

Key idea
Issue multiple instructions from multiple threads each cycle

Features

Fully exploits thread-level parallelism and instruction-level parallelism.

Multiple functional units
Modern processors have more functional units available than a single thread can utilize.

Register renaming and dynamic scheduling
Multiple instructions from independent threads can co-exist and co-execute.

Summary: Multithreaded Categories

[Figure: issue-slot usage over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, idle slot.]

The horizontal dimension represents the instruction issue capability in each clock cycle.
The vertical dimension represents a sequence of clock cycles.
Empty slots indicate that the corresponding issue slots are unused in that clock cycle.

Superscalar processor with no multithreading:

Only one thread is processed in a given clock cycle.
Use of issue slots is limited by a lack of ILP.
A stall such as an instruction cache miss leaves the entire processor idle.

Fine-grained multithreading:
Switches threads on every clock cycle.
Pro: hides the latency of both short and long stalls.
Con: slows down execution of individual threads that are ready to go.
Only one thread issues instructions in a given clock cycle.
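The fine-grained policy can be sketched as a round-robin thread selector. The function `fine_grained`, the `ready` matrix, and the rule of skipping stalled threads are assumptions made for this illustration.

```python
def fine_grained(ready, cycles):
    """Toy fine-grained multithreading model (illustrative assumption).

    ready[t][c] is True if thread t can issue in cycle c.
    Each cycle, the next ready thread in round-robin order issues.
    Returns the thread chosen in each cycle (None means an idle cycle).
    """
    n = len(ready)
    last = -1
    trace = []
    for c in range(cycles):
        chosen = None
        # Start from the thread after the one that issued last.
        for k in range(1, n + 1):
            t = (last + k) % n
            if ready[t][c]:
                chosen = t
                break
        if chosen is not None:
            last = chosen
        trace.append(chosen)
    return trace


ready = [
    [True, True, True, True, True, True],    # thread 0 never stalls
    [True, True, False, False, True, True],  # thread 1 stalls in cycles 2-3
]
print(fine_grained(ready, 6))
# [0, 1, 0, 0, 1, 0]
```

Thread 0 gets only every other cycle while thread 1 is ready, which is the slowdown for ready threads noted above. Thread 1's stall in cycles 2-3 is hidden because thread 0 fills those cycles.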
Coarse-grained multithreading:
Switches threads only on costly stalls (e.g., L2 cache misses).
Pros: no switching on every clock cycle, so ready-to-go threads are not slowed down; reduces the number of completely idle clock cycles.
Con: limited ability to hide shorter stalls.
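A small sketch of the coarse-grained policy, for comparison. The function `coarse_grained` and the "short"/"long" stall labels are assumptions for illustration, and the model ignores the pipeline refill cost of a switch.

```python
def coarse_grained(stall, cycles):
    """Toy coarse-grained multithreading model (illustrative assumption).

    stall[t][c] is None (thread t can issue in cycle c), "short", or "long".
    The current thread keeps the processor until it hits a long stall.
    Returns the thread that issues in each cycle (None means an idle cycle).
    """
    n = len(stall)
    current = 0
    trace = []
    for c in range(cycles):
        if stall[current][c] == "long":
            # Costly stall: hand over to another thread not in a long stall.
            for k in range(1, n):
                t = (current + k) % n
                if stall[t][c] != "long":
                    current = t
                    break
        # A short stall does not trigger a switch, so the cycle is lost.
        trace.append(current if stall[current][c] is None else None)
    return trace


stall = [
    [None, "short", None, "long", "long", None],  # thread 0
    [None, None, None, None, None, None],         # thread 1
]
print(coarse_grained(stall, 6))
# [0, None, 0, 1, 1, 1]
```

The short stall in cycle 1 costs an idle cycle, while the long stall starting in cycle 3 is covered by switching to thread 1.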

Simultaneous multithreading:

Exploits TLP at the same time as ILP, with multiple threads using the issue slots in a single clock cycle.
Use of issue slots is limited by the following factors:

Imbalances in the resource needs.
Resource availability over multiple threads.
Number of active threads considered.
Finite limitations of buffers.
Ability to fetch enough instructions from multiple threads.
Practical limitations on what instruction combinations can issue from one thread and from multiple threads.
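A rough sketch of how SMT fills the issue slots of one cycle from several threads. The function `smt_issue` and its round-robin slot assignment are assumptions for illustration; real issue logic also depends on the resource limits listed above.

```python
def smt_issue(ready_counts, width):
    """Toy SMT issue model for a single cycle (illustrative assumption).

    ready_counts[t] is the number of instructions thread t could issue now.
    Slots are handed out one per thread in round-robin order until the
    issue width is used up or no thread has a ready instruction left.
    Returns the number of slots granted to each thread.
    """
    granted = [0] * len(ready_counts)
    remaining = list(ready_counts)
    free = width
    while free > 0 and any(remaining):
        for t in range(len(remaining)):
            if free == 0:
                break
            if remaining[t] > 0:
                granted[t] += 1
                remaining[t] -= 1
                free -= 1
    return granted


print(smt_issue([2], 4))        # [2]: one thread leaves two slots idle
print(smt_issue([2, 1, 3], 4))  # [2, 1, 1]: three threads fill all four slots
```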

Performance Implications of SMT

Single-thread performance is likely to go down, because caches, branch predictors, registers, etc. are shared. This effect can be mitigated by prioritizing one thread.
While fetching instructions, thread priority can dramatically influence total throughput. A widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources.
With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4.
Alpha 21464 and Intel Pentium 4 are examples of SMT.
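The ICOUNT idea can be sketched as a fetch-priority function: threads with the fewest instructions waiting in the front end are fetched first, which pushes the threads toward an equal share of resources. The function `icount_pick`, the `in_flight` counts, and the tie-break by thread id are assumptions for illustration.

```python
def icount_pick(in_flight, n_fetch=2):
    """Toy ICOUNT fetch policy (illustrative assumption).

    in_flight maps thread id -> number of that thread's instructions
    currently waiting in the front end (decode, rename, issue queues).
    Returns the n_fetch threads with the fewest such instructions;
    ties are broken by the lower thread id.
    """
    order = sorted(in_flight, key=lambda t: (in_flight[t], t))
    return order[:n_fetch]


print(icount_pick({0: 12, 1: 3, 2: 7, 3: 3}, n_fetch=2))
# [1, 3]
```

Thread 0, which already occupies the most front-end entries, is the last to receive fetch bandwidth.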

Effectively Using Parallelism on an SMT Processor

Parallel workload

Threads   SS    MP2   MP4   FGMT  SMT
1         3.3   2.4   1.5   3.3   3.3
2         --    4.3   2.6   4.1   4.7
4         --    --    4.2   4.2   5.6
8         --    --    --    3.5   6.1

Instruction throughput executing a parallel workload
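As a worked reading of the throughput figures above, the sketch below takes the best value reached by each architecture and compares it with the superscalar baseline. It assumes the values are listed row by row under the columns SS, MP2, MP4, FGMT, SMT; the variable names are hypothetical.

```python
# Throughput values from the slide, one list per architecture, ordered by
# increasing thread count. None marks an entry shown as "--".
throughput = {
    "SS":   [3.3, None, None, None],
    "MP2":  [2.4, 4.3, None, None],
    "MP4":  [1.5, 2.6, 4.2, None],
    "FGMT": [3.3, 4.1, 4.2, 3.5],
    "SMT":  [3.3, 4.7, 5.6, 6.1],
}

# Best throughput reached by each architecture at any thread count.
peak = {arch: max(v for v in row if v is not None)
        for arch, row in throughput.items()}

# Peak throughput relative to the single-threaded superscalar.
speedup = {arch: round(p / peak["SS"], 2) for arch, p in peak.items()}
print(speedup)
# {'SS': 1.0, 'MP2': 1.3, 'MP4': 1.27, 'FGMT': 1.27, 'SMT': 1.85}
```

On these numbers SMT reaches about 1.85 times the superscalar throughput, while the multiprocessor and fine-grained configurations stay near 1.3 times.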

Comparison of SMT vs. Superscalar
SMT processors are compared to base superscalar processors on several key measures:
Utilization of functional units.
Utilization of fetch units.
Accuracy of branch predictor.
Hit rates of primary caches.
Hit rates of secondary caches.
Performance improvement:

Issue slots.
Functional units.
Renaming registers.

CMP Architecture

Chip-level multiprocessing (CMP or multicore):

Integrates two or more independent cores (normally CPUs) into a single package composed of a single integrated circuit (IC), called a die, or of several dies packaged together, with each core executing threads independently.
Every functional unit of a processor is duplicated.
Multiple processors, each with a full set of architectural resources, reside on the same die.
Processors may share an on-chip cache, or each can have its own cache.
Examples: HP Mako, IBM Power4
Challenges: power, die area (cost)

Single core computer

[Figure: single-core CPU chip containing a single core.]

Multi-core CPU chip

[Figure: chip containing Core 1, Core 2, Core 3, and Core 4.]

Chip Multithreading
Chip Multithreading (CMT) = Chip Multiprocessing + Hardware Multithreading.

Chip multithreading is the capability of a processor to process multiple software threads simultaneously using multiple hardware threads of execution.

CMT is achieved by multiple cores on a single chip or multiple threads on a single core.

CMP processors are especially suited to server workloads, which generally have high levels of thread-level parallelism (TLP).
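The composition of cores and hardware threads can be written down as simple arithmetic. The `Chip` class, its field names, and the example configurations are hypothetical and only illustrate how the two dimensions multiply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Hypothetical chip description used only for this illustration."""
    cores: int
    threads_per_core: int

    @property
    def hardware_threads(self):
        # Software threads the chip can run at once, one per hardware thread.
        return self.cores * self.threads_per_core

    @property
    def organization(self):
        if self.cores > 1 and self.threads_per_core > 1:
            return "multicore with multithreaded cores"
        if self.cores > 1:
            return "multicore"
        if self.threads_per_core > 1:
            return "single multithreaded core"
        return "single-threaded core"


for chip in (Chip(1, 1), Chip(4, 1), Chip(1, 2), Chip(4, 2)):
    print(chip.organization, "->", chip.hardware_threads, "hardware threads")
```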

CMPs Performance
CMPs are now the only way to build high-performance microprocessors, for a variety of reasons:
o Large uniprocessors are no longer scaling in performance, because it is only possible to extract a limited amount of parallelism from a typical instruction stream.
o The clock speed on today's processors cannot simply be ratcheted up, or the power dissipation will become prohibitive.
o CMT processors support many hardware strands through efficient sharing of on-chip resources such as pipelines, caches, and predictors.
o CMT processors are a good match for server workloads, which have high levels of TLP and relatively low levels of ILP.
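The power argument can be illustrated with the standard dynamic power relation, P = activity x C x V^2 x f. The sketch below compares doubling the clock of one core with adding a second core at the original clock. The normalized units and the assumption that supply voltage must rise in proportion to frequency are rough simplifications made for illustration, not figures from the slides.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """Dynamic switching power: activity * C * V^2 * f (normalized units)."""
    return activity * capacitance * voltage ** 2 * frequency


base = dynamic_power(1.0, 1.0, 1.0)

# One core at twice the clock, assuming voltage also has to double.
fast_core = dynamic_power(1.0, 2.0, 2.0)

# Two cores at the original clock and voltage.
two_cores = 2 * dynamic_power(1.0, 1.0, 1.0)

print(fast_core / base, two_cores / base)
# 8.0 2.0
```

Under these assumptions the faster core costs about eight times the power, while the second core costs about twice the power for the same nominal peak throughput on a workload with enough threads.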

SMT and CMP

The performance race between SMT and CMP is not yet decided.
CMP is easier to implement, but only SMT has the ability to hide latencies.
Functional partitioning is not fully achieved within an SMT processor because of the centralized instruction issue.
o A separation of the thread queues is a possible solution, although it does not remove the central instruction issue.
o A combination of simultaneous multithreading with CMP may be superior.
Research: combine an SMT or CMP organization with the ability to create threads from a single thread, either with compiler support or fully dynamically.
o Thread-level speculation
o Close to multiscalar

Multiprocessor vs. SMT

[Figure: issue-slot usage over time (processor cycles) for a multiprocessor (MP2) and an SMT processor. Legend: unutilized, Thread 1, Thread 2.]

THANK U GUYS
