Lec 01
Trends in Chip Industry
• Long history since 1971
– Introduction of the Intel 4004
– http://www.intel4004.com/
• Today we talk about more than one billion transistors on a die
– Intel’s 4-core Sandy Bridge and Ivy Bridge have more than a billion transistors; the 8-core Sandy Bridge has more than two billion transistors
– AMD’s Valencia and Interlagos families of Opteron and IBM’s POWER7 have more than a billion transistors
– Minimum feature size has shrunk from 10 microns in 1971 to 0.014 micron (14 nm) today
Agenda
• Unpipelined microprocessors
• Pipelining: simplest form of ILP
• Out-of-order execution: more ILP
• Multiple issue: drink more ILP
• Scaling issues and Moore’s Law
• Why multi-core
– TLP and decentralized design
• Tiled CMP and shared cache
• Implications on software
Unpipelined Microprocessors
• Typically an instruction enjoys five phases in its life
– Instruction fetch from memory
– Instruction decode and operand register read
– Execute
– Data memory access
– Register write
• Unpipelined execution would take either one long cycle or multiple short cycles per instruction
– Only one instruction is inside the processor at any point in time
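To make the five phases concrete, here is a minimal single-cycle interpreter loop in C; the instruction encoding, opcodes, and memory sizes are illustrative assumptions, not a real ISA:

#include <stdint.h>

#define MEMSZ 1024

/* Illustrative machine state; the encoding below is made up. */
static uint32_t imem[MEMSZ], dmem[MEMSZ], regs[32], pc;

enum { OP_ADD, OP_LOAD, OP_STORE, OP_HALT };

/* One call = one instruction, passing through all five phases in order. */
void step(void) {
    uint32_t inst = imem[pc++];                    /* 1. instruction fetch */
    uint32_t op = inst >> 24, rd = (inst >> 16) & 0xff;
    uint32_t a = regs[(inst >> 8) & 0xff];         /* 2. decode + reg read */
    uint32_t b = regs[inst & 0xff];
    uint32_t alu = a + b;                          /* 3. execute           */
    uint32_t mem = (op == OP_LOAD) ? dmem[alu % MEMSZ] : 0;  /* 4. memory  */
    if (op == OP_STORE) dmem[alu % MEMSZ] = regs[rd];
    if (op == OP_ADD)  regs[rd] = alu;             /* 5. register write    */
    if (op == OP_LOAD) regs[rd] = mem;
}

All five phases complete inside one call to step(), so exactly one instruction is in flight at a time: the unpipelined property above.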
Pipelining
• One simple observation
– Exactly one piece of hardware is active at any point in time
• Why not fetch a new instruction every cycle?
– Five instructions in five different phases
– Throughput increases five times (ideally)
• Bottom-line is
– If consecutive instructions are independent, they can be processed in parallel
– The first form of instruction-level parallelism (ILP)
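A quick back-of-the-envelope check of the five-times claim, assuming an n-instruction stream through a k-stage pipeline with no hazards:

#include <stdio.h>

int main(void) {
    long n = 1000000;                 /* instructions; any large count works */
    int  k = 5;                       /* pipeline stages                     */
    long unpipelined = n * k;         /* k short cycles per instruction      */
    long pipelined   = k + (n - 1);   /* fill the pipe once, then 1/cycle    */
    printf("speedup = %.3f\n", (double)unpipelined / pipelined);
    return 0;
}

For large n the speedup approaches k, the ideal figure quoted above.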
Pipelining Hazards
• Instruction dependence limits achievable parallelism
– Control and data dependence (aka hazards)
• Finite amount of hardware limits achievable parallelism
– Structural hazards
• Control dependence
– On average, every fifth instruction is a branch (coming from if-else, for, do-while, …)
– Branches execute in the third phase
• Introduces bubbles unless you are smart
Control Dependence
Branch      IF  ID  EX  MEM  WB
Instr. X        IF  ID  EX   MEM  WB
Instr. Y            IF  ID   EX   …
Target                  IF   ID   …

(The branch comes from code like if (cond) {…} else {…}; it resolves in EX, so the target fetch cannot start before cycle 4.)

What do you fetch in the X and Y slots?
Options: nothing, fall-through, or learn past history and predict (today the best predictors achieve very high accuracy, often more than 95%)
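One way to "learn past history and predict" is a table of 2-bit saturating counters indexed by the branch PC, the classic bimodal predictor. A minimal sketch, with the table size chosen arbitrarily:

#include <stdint.h>
#include <stdbool.h>

#define TABLE_BITS 12
static uint8_t counters[1 << TABLE_BITS];   /* 2-bit states 0..3 */

/* Predict taken when the counter is in the upper half (2 or 3). */
bool predict(uint32_t pc) {
    return counters[(pc >> 2) & ((1 << TABLE_BITS) - 1)] >= 2;
}

/* Train with the actual outcome once the branch resolves in EX. */
void train(uint32_t pc, bool taken) {
    uint8_t *c = &counters[(pc >> 2) & ((1 << TABLE_BITS) - 1)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}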
Data Dependence
add r1, r2, r3   IF  ID  EX  MEM  WB
xor r5, r1, r2       IF  ID  EX   MEM  WB

Hardware bypass: the result of add is forwarded from its EX stage directly into the EX stage of xor, so the dependent xor does not stall.
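The bypass decision itself is just a comparison of register indices between pipeline latches; a sketch under assumed latch names (ex_mem and id_ex are illustrative, not from the lecture):

#include <stdint.h>

/* Hypothetical pipeline latches; field names are illustrative. */
struct ex_mem { int reg_write, rd; uint32_t alu_out; };  /* older instr (add) */
struct id_ex  { int rs1, rs2; uint32_t a, b; };          /* younger (xor)     */

/* Forward the just-computed result into the EX inputs when indices match. */
void bypass(const struct ex_mem *em, struct id_ex *ie) {
    if (em->reg_write && em->rd != 0) {
        if (em->rd == ie->rs1) ie->a = em->alu_out;
        if (em->rd == ie->rs2) ie->b = em->alu_out;
    }
}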
Out-of-order Execution
load r2, addr    Cache miss
While the load waits for memory, younger instructions that do not depend on r2 can issue and execute out of program order, exposing more ILP.
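A minimal sketch of the idea: scan an instruction window and issue anything whose operands are ready, instead of stalling behind the oldest instruction. All structures and fields here are illustrative:

#include <stdbool.h>

#define WINDOW 8
#define NREGS  32

struct entry { bool valid; int src1, src2, dst; };
static struct entry win[WINDOW];   /* pending, not-yet-issued instructions */
static bool ready[NREGS];          /* false while a producer is in flight  */

/* Issue every instruction whose sources are ready, even if an older one
 * (say, a load that missed in the cache) is still waiting. */
void issue_ready(void) {
    for (int i = 0; i < WINDOW; i++) {
        struct entry *e = &win[i];
        if (e->valid && ready[e->src1] && ready[e->src2]) {
            ready[e->dst] = false;   /* its result is now in flight      */
            e->valid = false;        /* handed off to an execution unit  */
        }
    }
}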
Tiled CMP (Hypothetical Floor-plan)
Non-uniform access L2 (NUCA)
Quad-core Sandy Bridge
Communication in Multi-core
• Ideal for shared address space
– Fast on-chip hardwired communication through cache (no OS intervention)
– Two types of architectures
• Tiled CMP: each core has its private cache hierarchy (no cache sharing); Intel Pentium D, Dual-Core Opteron, Intel Montecito, Sun UltraSPARC IV, IBM Cell (more specialized)
• Shared cache CMP: outermost level of the cache hierarchy is shared among cores; Intel Nehalem (4 and 8 cores), Intel Dunnington (6 cores), Intel Woodcrest (server-grade Core Duo), Intel Conroe (Core 2 Duo for desktop), Sun Niagara, IBM POWER4, IBM POWER5
Shared vs. Private in CMPs
• Shared caches are often very large in CMPs
– They are banked to avoid worst-case wire delay
– The banks are usually distributed across the floor of the chip on an interconnect
[Floor-plan sketch: cache banks (B) distributed around the cores (P) on the on-chip interconnect]
Shared vs. Private in CMPs
• In shared caches, getting a block from a remote bank takes time proportional to the physical distance between the requester and the bank
– Non-uniform cache access (NUCA)
• The same holds for private caches if the data resides in a remote cache
• A shared cache may have a higher average hit latency than private caches
– Hopefully most hits in the latter are local
• Shared caches are likely to suffer fewer misses than private caches
– Private caches replicate data and statically partition the aggregate space
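A toy calculation makes the trade-off concrete; every number below is made up purely for illustration:

#include <stdio.h>

int main(void) {
    /* Shared NUCA: hits spread uniformly over 8 banks 10..24 cycles away. */
    double shared_hit = 0, shared_miss_rate = 0.02;
    for (int b = 0; b < 8; b++) shared_hit += (10 + 2 * b) / 8.0;
    /* Private cache: every hit is local (10 cycles), but more misses. */
    double private_hit = 10, private_miss_rate = 0.04, miss_penalty = 300;
    printf("shared : %.1f cycles avg\n",
           shared_hit + shared_miss_rate * miss_penalty);
    printf("private: %.1f cycles avg\n",
           private_hit + private_miss_rate * miss_penalty);
    return 0;
}

With these particular numbers the two organizations land within a cycle of each other, which is why shared versus private is a genuine design trade-off.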
Implications on Software
• A tall memory hierarchy
– Each core could run multiple threads
• Each core in Niagara runs four threads
– Within a core, threads communicate through the private cache (fastest)
– Across cores, communication happens through the shared L2 or the coherence controller (if tiled)
– Multiple such chips can be connected over a scalable network
• Adds one more level to the memory hierarchy
• A very non-uniform access stack
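For instance, two threads in a shared address space communicate just by reading and writing the same locations; on a multi-core chip the hand-off is serviced through the cache hierarchy with no OS involvement on the fast path. A minimal pthreads sketch (flag-based hand-off, illustrative only):

#include <pthread.h>
#include <stdio.h>

static volatile int data = 0;
static volatile int ready = 0;

static void *producer(void *arg) {
    data = 42;
    __sync_synchronize();   /* order the data write before the flag */
    ready = 1;
    return NULL;
}

static void *consumer(void *arg) {
    while (!ready) ;        /* spin: the value arrives via coherence */
    __sync_synchronize();
    printf("got %d\n", data);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}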
Research Directions
• Puzzles that keep my brain occupied
– Running single-threaded programs efficiently on this sea of cores
– Managing the energy envelope efficiently
– Allocating shared cache capacity efficiently
– Allocating shared off-chip bandwidth
– Utilizing DRAM efficiently
– Making parallel programming easy
• Transactional memory
• Speculative parallelization
– Verification of hardware and parallel software
Research Directions
• Puzzles that keep my brain occupied
– Tolerating faults
– Automatic and semi-automatic parallelization
– Programming model
• Message passing? (a la 48-core SCC or 80-core Polaris)
• Shared memory? (current state of the art)
• Hybrid? Hardware-software trade-off?
– Architecting massively parallel accelerators
– Security