CPU vs. GPU Architectures
Kayvon Fatahalian
15-462 (Fall 2011)
Today
Part 1:
The graphics pipeline
(an abstraction)
Vertex processing
Vertices are transformed into “screen space”.
[Figure: input vertices v0–v5 before and after the transform. Each vertex is transformed independently.]
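To see what “transformed independently” buys us, here is a minimal C++ sketch (toScreenSpace is a made-up stand-in for the real model-view-projection transform): no loop iteration reads another iteration’s result, so the iterations could run in any order, or all at once.

#include <array>
#include <cstdio>

struct Vec3 { float x, y, z; };

// Hypothetical transform: a scale + offset stands in for the full
// model-view-projection matrix multiply.
Vec3 toScreenSpace(const Vec3& v) {
    return { v.x * 0.5f + 0.5f, v.y * 0.5f + 0.5f, v.z };
}

int main() {
    std::array<Vec3, 6> verts = {{ {-1,-1,0}, {1,-1,0}, {-1,1,0},
                                   {1,1,0},  {-1,0,1}, {1,0,1} }};
    // No iteration depends on another: this independence is what lets
    // hardware transform all vertices concurrently.
    for (auto& v : verts) v = toScreenSpace(v);
    for (const auto& v : verts) std::printf("(%g, %g, %g)\n", v.x, v.y, v.z);
}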
Primitive processing
The transformed vertices are then organized into primitives (triangles), which are clipped and culled.
[Figure: vertices v0–v5 grouped into triangle primitives.]
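Culling is easy to make concrete. A minimal C++ sketch of one such test, back-face culling, under an assumed convention (counter-clockwise screen-space winding is front-facing): the sign of the triangle’s area decides whether the primitive survives.

#include <cstdio>

struct P { float x, y; };

// Twice the signed area of triangle (a, b, c); the sign encodes winding.
float signedArea2(P a, P b, P c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

int main() {
    P ccw[3] = {{0,0}, {1,0}, {0,1}};  // counter-clockwise: front-facing, kept
    P cw[3]  = {{0,0}, {0,1}, {1,0}};  // clockwise: back-facing, culled
    std::printf("ccw kept: %d\n", signedArea2(ccw[0], ccw[1], ccw[2]) > 0);
    std::printf("cw  kept: %d\n", signedArea2(cw[0], cw[1], cw[2]) > 0);
}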
Rasterization
Primitives are rasterized into “pixel fragments”.
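To make “rasterized into pixel fragments” concrete, here is a minimal C++ sketch of the standard edge-function test (the vertex coordinates and the 10×9 grid are invented for illustration): a pixel whose center lies inside all three edges yields a fragment.

#include <cstdio>

struct P { float x, y; };

// Edge function: positive when c lies to the left of directed edge a->b.
float edge(P a, P b, P c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

int main() {
    P v0{1, 1}, v1{8, 2}, v2{4, 7};    // counter-clockwise triangle
    for (int y = 0; y < 9; ++y, std::puts(""))
        for (int x = 0; x < 10; ++x) {
            P p{x + 0.5f, y + 0.5f};   // sample at the pixel center
            bool inside = edge(v0, v1, p) >= 0 && edge(v1, v2, p) >= 0
                       && edge(v2, v0, p) >= 0;
            std::putchar(inside ? '#' : '.');  // '#' = a generated fragment
        }
}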
Fragment processing
Fragments are shaded to compute a color at each pixel (fragments → shaded fragments → pixels).
Pipeline entities
[Figure: the same scene at each stage: vertices v0–v5, then primitives, then fragments, then pixels.]

Stages are connected by streams of entities:
… → Primitive Generation → primitive stream → Primitive Processing (reads textures) → primitive stream → Fragment Generation → fragment stream → Fragment Processing (reads textures) → fragment stream → …
Independent
• What’s so important about “independent” computations?
Silicon Graphics RealityEngine (1993)
A “graphics supercomputer” implementing the whole pipeline in hardware:
Vertex Generation → Vertex Processing → Primitive Generation → Primitive Processing → Fragment Generation → Fragment Processing → Pixel Operations
Pre-1999 PC 3D graphics accelerator
CPU: Vertex Generation, Vertex Processing, Primitive Generation, Primitive Processing
GPU (3dfx Voodoo, NVIDIA RIVA TNT): clip/cull/rasterize (Fragment Generation), texture units, Pixel Operations
GPU* circa 1999 (NVIDIA GeForce 256)
CPU: Vertex Generation
GPU: Vertex Processing, Primitive Processing, Fragment Generation, Fragment Processing, Pixel Operations
Direct3D 9 programmability: 2002 (ATI Radeon 9700)
Vertex Generation → Primitive Generation → Primitive Processing → Fragment Generation → Fragment Processing → Pixel Operations
[Figure: fragment processing now runs on banks of programmable fragment units (“Frag”), each paired with texture units (“Tex”).]
Direct3D 10 programmability: 2006
Vertex Generation → Vertex Processing → Fragment Generation → …
[Figure: the programmable stages now run on a shared pool of cores, alongside fixed-function pixel-operation units.]
GPUs are fast
[Figure: throughput comparison. The CPU numbers are obtainable if you code your program to use 4 threads and SSE vector instructions; the GPU numbers are obtainable if you write OpenGL programs like you’ve done in this class.]
A diffuse reflectance shader

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

Shader programming model: fragments are processed independently, but there is no explicit parallel programming.
Independent logical sequence of control per fragment. ***
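For readers more comfortable in C++, a hedged transliteration of the shader above (float3, float4, and sampleTexture are hypothetical stand-ins for the HLSL types and myTex.Sample): note that nothing in the function couples one fragment to another.

#include <algorithm>
#include <cstdio>

struct float3 { float x, y, z; };
struct float4 { float x, y, z, w; };

// Fake texel fetch standing in for myTex.Sample(mySamp, uv).
float3 sampleTexture(float u, float v) { return {u, v, 0.5f}; }

float4 diffuseShader(float3 norm, float u, float v, float3 lightDir) {
    float3 kd = sampleTexture(u, v);                  // kd = myTex.Sample(...)
    float ndotl = lightDir.x * norm.x + lightDir.y * norm.y + lightDir.z * norm.z;
    ndotl = std::clamp(ndotl, 0.0f, 1.0f);            // clamp(dot(...), 0, 1)
    return {kd.x * ndotl, kd.y * ndotl, kd.z * ndotl, 1.0f};
}

int main() {
    // One invocation per fragment; the runtime, not the programmer,
    // supplies the parallelism across millions of such invocations.
    float4 c = diffuseShader({0, 0, 1}, 0.25f, 0.75f, {0, 0, 1});
    std::printf("%g %g %g %g\n", c.x, c.y, c.z, c.w);
}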
Big Guy, lookin’ diffuse
Compile shader
1 unshaded fragment input record

Source (HLSL):
sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

Compiled pseudo-assembly:
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
Execute shader
[Figure: a minimal core (Fetch/Decode, ALU (Execute), Execution Context) stepping through the compiled shader, one instruction at a time.]

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
“CPU-style” cores
[Figure: one core: Fetch/Decode, ALU (Execute), a big data cache, and a memory pre-fetcher.]
Slimming down
Idea #1: remove the components that help a single instruction stream run fast.
[Figure: the slimmed core keeps only Fetch/Decode, ALU (Execute), and an Execution Context.]
Two cores (two fragments in parallel)
[Figure: two slimmed cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context, each running <diffuseShader> on its own fragment.]
Four cores (four fragments in parallel)
[Figure: four slimmed cores.]
Sixteen cores (sixteen fragments in parallel)
[Figure: sixteen slimmed cores.]
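A toy C++ illustration of the idea (std::thread stands in for a hardware core, and shadeFragment for the compiled shader): sixteen independent fragments, sixteen workers, and no coordination needed beyond the final join.

#include <cstdio>
#include <thread>
#include <vector>

// Placeholder shading function; each call depends only on its own input.
float shadeFragment(int i) { return 0.25f * static_cast<float>(i); }

int main() {
    std::vector<float> out(16);
    std::vector<std::thread> cores;
    // Each "core" shades one fragment; writes go to distinct slots,
    // so there is no data race and no ordering requirement.
    for (int i = 0; i < 16; ++i)
        cores.emplace_back([&out, i] { out[i] = shadeFragment(i); });
    for (auto& t : cores) t.join();
    for (float c : out) std::printf("%g ", c);
}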
Add ALUs
Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs (SIMD processing).
[Figure: one Fetch/Decode unit now feeds ALUs 1–8, each with its own slice of context (Ctx), all executing <diffuseShader> in lockstep.]
[Figure: the same core, now fed a vectorized instruction stream:]

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul  vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul  vec_o0, vec_r0, vec_r3
VEC8_mul  vec_o1, vec_r1, vec_r3
VEC8_mul  vec_o2, vec_r2, vec_r3
VEC8_mov  o3, l(1.0)

One instruction stream now shades eight fragments at once; the same trick applies to vertices, primitives, and fragments.
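The VEC8 transformation can be emulated in portable C++: one "instruction" (a function call) produces eight results. The Vec8 type and the vec8_* names are illustrative, not a real ISA.

#include <cstdio>

constexpr int LANES = 8;
using Vec8 = float[LANES];

// One logical instruction, eight results: the loop body is the shared
// instruction stream, the index i is the ALU lane.
void vec8_mul(Vec8 out, const Vec8 a, const Vec8 b) {
    for (int i = 0; i < LANES; ++i) out[i] = a[i] * b[i];
}

void vec8_madd(Vec8 out, const Vec8 a, float b, const Vec8 c) {
    for (int i = 0; i < LANES; ++i) out[i] = a[i] * b + c[i];
}

int main() {
    Vec8 r3, v0 = {1,2,3,4,5,6,7,8}, v1 = {8,7,6,5,4,3,2,1};
    vec8_mul(r3, v0, v1);         // cf. VEC8_mul  vec_r3, vec_v0, vec_v1
    vec8_madd(r3, v1, 0.5f, r3);  // cf. VEC8_madd vec_r3, vec_v1, l(0.5), vec_r3
    for (float f : r3) std::printf("%g ", f);
}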
But what about branches?
[Figure: time (clocks) running down the page; ALUs 1–8 across. At the branch, the per-lane outcomes are T T F T F F F F.]

<unconditional shader code>
if (x > 0) {
  y = pow(x, exp);
  y *= Ks;
  refl = y + Ka;
} else {
  x = 0;
  refl = Ka;
}
<resume unconditional shader code>

Not all ALUs do useful work! While the “true” lanes run the if-branch, the “false” lanes idle, and vice versa.
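A scalar C++ sketch of how lockstep SIMD execution handles this branch (the x values and the Ks/Ka/exp constants are invented): both sides run on all eight lanes, and a per-lane mask picks which result each lane keeps. The thrown-away work is exactly why divergent lanes waste ALU cycles.

#include <cmath>
#include <cstdio>

constexpr int LANES = 8;

int main() {
    float x[LANES] = {0.9f, 0.5f, -1.f, 0.2f, -2.f, -3.f, -0.1f, -0.5f};
    const float Ks = 2.0f, Ka = 0.1f, expn = 3.0f;

    bool mask[LANES];
    for (int i = 0; i < LANES; ++i) mask[i] = (x[i] > 0);  // T T F T F F F F

    // The "then" path runs on ALL lanes; masked-off lanes discard the result.
    float thenVal[LANES], elseVal[LANES];
    for (int i = 0; i < LANES; ++i)
        thenVal[i] = std::pow(x[i], expn) * Ks + Ka;
    // The "else" path also runs on all lanes.
    for (int i = 0; i < LANES; ++i)
        elseVal[i] = Ka;

    // Only at the select does the mask matter: each lane keeps one result.
    float refl[LANES];
    for (int i = 0; i < LANES; ++i)
        refl[i] = mask[i] ? thenVal[i] : elseVal[i];

    for (float r : refl) std::printf("%g ", r);
}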
Terminology
▪ “Coherent” execution*** (admittedly a fuzzy definition): processing of different entities is similar, and thus can share resources for efficient execution
  - Instruction stream coherence: different fragments follow the same sequence of logic
  - Memory access coherence:
    – Different fragments access similar data (avoiding memory transactions by reusing data in the cache)
    – Different fragments simultaneously access contiguous data (enabling efficient, bulk-granularity memory transactions)
*** Do not confuse this use of the term “coherence” with cache coherence protocols.
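A small C++ experiment that hints at the second kind of coherence (the array size and the 16-element stride are arbitrary, and absolute timings vary by machine and compiler): the unit-stride pass touches each cache line once, while the strided pass touches a new line per element.

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 20;
    std::vector<float> data(static_cast<size_t>(N) * 16, 1.0f);  // 64 MB
    float sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) sum += data[i];                  // contiguous
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        sum += data[static_cast<size_t>(i) * 16];                // strided
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("sum=%g  contiguous: %lld us  strided: %lld us\n", sum,
        (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<us>(t2 - t1).count());
}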
GPUs share instruction streams across many fragments

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);   // texture access: latency of 100s of cycles
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}
Recall: CPU-style core
[Figure: branch predictor, Fetch/Decode, ALU, Execution Context, and a big data cache (several MB).]

CPU-style memory hierarchy
[Figure: a processing core (several cores per chip) backed by the cache hierarchy.]
We’ve removed the fancy caches and logic that help avoid stalls.
But we have LOTS of independent fragments (way more fragments to process than ALUs).

Idea #3: interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations.
Hiding shader stalls
[Figure: time (clocks) running down the page; four groups of fragments (Frag 1–8, 9–16, 17–24, 25–32) share one core. When group 1 stalls on a long-latency operation, the core starts group 2, then 3, then 4; by the time group 4 stalls, group 1 is runnable again.]

Throughput!
Each individual group takes longer to complete (it must wait its turn after a stall), but the core is never idle, so many more groups finish per unit time.
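A toy C++ scheduler capturing the interleaving idea (the groups, step counts, and printed trace are all invented, and real hardware switches contexts every clock at essentially no cost): when the running group would stall, it is parked and another runnable group gets the core.

#include <cstdio>
#include <queue>

struct Group { int id; int stepsLeft; };

int main() {
    std::queue<Group> runnable;
    for (int id = 1; id <= 4; ++id) runnable.push({id, 2});

    int clock = 0;
    while (!runnable.empty()) {
        Group g = runnable.front(); runnable.pop();
        std::printf("clock %d: run group %d\n", clock++, g.id);
        if (--g.stepsLeft > 0) {
            // The group hits a long-latency op; requeue it instead of
            // letting the core sit idle waiting for memory.
            std::printf("         group %d stalls, switch to next group\n", g.id);
            runnable.push(g);
        } else {
            std::printf("         group %d done\n", g.id);
        }
    }
}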
My chip!
16 cores
16 simultaneous
instruction streams
64 concurrent (but interleaved)
instruction streams
Summary: three key ideas
1. Use many slimmed-down cores to run fragments in parallel (Idea #1)
2. Pack cores full of ALUs by sharing instruction stream overhead across groups of fragments (Idea #2)
   – Option 1: explicit SIMD vector instructions
   – Option 2: implicit sharing managed by hardware
3. Interleave the processing of many fragment groups on each core to hide latency stalls (Idea #3)
NVIDIA GeForce GTX 480 “core”
▪ In generic speak: 15 cores, with 2 groups of 16 SIMD functional units per core
[Figure: Fetch/Decode units, 32 functional units, execution contexts (128 KB), and “shared” scratchpad memory (16+48 KB).]
• The core contains 32 functional units
• Two groups are selected each clock (decode, fetch, and execute two instruction streams in parallel)
NVIDIA GeForce GTX 480 “SM”
• Each functional unit is a CUDA core (1 MUL-ADD per clock)
• The SM contains 32 CUDA cores
• Two warps are selected each clock (decode, fetch, and execute two warps in parallel)
• Up to 48 warps are interleaved, totaling 1536 CUDA threads

Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G
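As a quick sanity check on the slide’s numbers, the 1536-thread figure is just warps × warp width; a trivial C++ calculation:

#include <cstdio>

int main() {
    const int threadsPerWarp = 32;  // warp width on Fermi
    const int warpsPerSM     = 48;  // interleaved warps per SM
    std::printf("%d CUDA threads per SM\n",
                warpsPerSM * threadsPerWarp);  // prints 1536
}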
NVIDIA GeForce GTX 480
Credit: NVIDIA
Looking ahead
▪ Ray tracing: for accurate reflections and shadows, and for more efficient scaling to many lights (1000 lights, [Andersson 09])
▪ Simulation
▪ Cinematic scene complexity