EE292A Lecture 2.ML - Hardware
Patrick Groeneveld
AMD
[email protected]
[Annotated die photo: Apple A17 SoC, showing HP/LP CPU clusters, GPU cores, NPU, and L2/L3 caches]
A17: iPhone 15 Pro (2023). Sources: TechInsights / AnandTech / Angstronomics
A Neural Network with 7 Layers
[Figure: a 28x28 input image (MNIST database) flows through filter, down-sample, filter, down-sample stages and two fully connected networks, ending in 10 outputs for the digits 0-9]
Convolutional Neural Network for Self Driving
[Figure: camera images feeding the convolutional network]
Tesla FSD Hardware 4 (2023):
• 20X CPU (ARM A72)
• 16X GPU (ARM Mali)
• 3X TPU
• 7nm Samsung process??
Source: https://twitter.com/greentheonly/status/1625905234076741641?s=20
The Dense (Fully Connected) Layer
[Figure: a fully connected layer, every input node connected to every output node through a weighted edge]
• The weight matrix is generally quite sparse.
• These weights are 'trainable'.

$$out_j = \sum_i in_i \cdot w(i,j)$$

$$OUT = W \cdot IN, \qquad W = \begin{pmatrix} w(1,1) & w(2,1) & w(3,1) & w(4,1) & w(5,1) \\ w(1,2) & w(2,2) & w(3,2) & w(4,2) & w(5,2) \\ w(1,3) & w(2,3) & w(3,3) & w(4,3) & w(5,3) \\ w(1,4) & w(2,4) & w(3,4) & w(4,4) & w(5,4) \end{pmatrix}$$
// out[i] = sum over j of w[i][j] * in[j]: one Multiply and one Add (a MAC) per weight
vector<float> dense_layer(const vector<float> &in, const vector<vector<float>> &w) {
    vector<float> out(w.size(), 0.0f);
    for (size_t i = 0; i < out.size(); i++) {
        for (size_t j = 0; j < in.size(); j++) {
            out[i] += w[i][j] * in[j];
        }
    }
    return out;
}
Convolutional Layer: Code Sketch
tensor rgb_convolution_layer(const tensor &input,
                             const convFilter &weights)
{
    // One output channel per filter; output spatial size follows the input.
    tensor output(weights.size(), input.width(), input.height());
    // ... per filter, per pixel: a 5x5x3 multiply-accumulate loop nest ...
    return output;
}
Typical smartphone picture: 3 x 3840 x 2160 pixels (RGB), with 32 filters of 5x5 pixels each.
So this convolution layer would require 3 x 3840 x 2160 x 32 x 5 x 5 = 19,906,560,000 multiply-accumulates,
i.e. ~20B floating-point multiply-adds per layer.
With 1 MAC per clock cycle @ 2 GHz, that would take ~10 seconds per layer…
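To sanity-check that count, a few lines of C++ reproduce the arithmetic (the 2 GHz clock and 1 MAC/cycle are the slide's assumptions):

```cpp
#include <cstdio>

int main() {
    // 3 color channels, 3840x2160 pixels, 32 filters of 5x5 each
    const long long macs = 3LL * 3840 * 2160 * 32 * 5 * 5;
    const double seconds = (double)macs / 2e9;  // 1 MAC per cycle at 2 GHz
    std::printf("%lld MACs -> %.1f s per layer\n", macs, seconds);
}
```

This prints 19906560000 MACs and about 10 seconds for this single layer.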
Convolution layer = Matrix Multiplication
• Example on right: One 3x3 RGB input image (so 3x3x3 values)
• Two sets of 2x2 convolution weights for feature extraction over the 3 colors
• So 2x2x2x3 filter weights
• No 0-padding, 4 outputs per convolution filter
• Construct 4x12 Image Data matrix D:
• Repeats each data element
• Construct 12x2 Filter weight matrix F:
• Stores all weights
• Multiply matrices: DxF = O
• O = 4x2
$$D \times F = O$$

[Figure: the 4x12 data matrix D times the 12x2 filter matrix F gives the 4x2 output matrix O, one column per filter]
Need hardware that is good at this
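As a minimal sketch of the D-matrix construction above (types and layout invented for this 3x3x3 input, 2x2 filter example):

```cpp
#include <vector>

using Mat = std::vector<std::vector<float>>;

// Build the 4x12 data matrix D from a 3x3 RGB image (im2col, no 0-padding).
// in[c][y][x] holds the 3x3x3 input; each 2x2x3 filter window becomes a row.
Mat im2col(const float in[3][3][3]) {
    Mat D;
    for (int y = 0; y < 2; ++y)              // 2x2 = 4 output positions
        for (int x = 0; x < 2; ++x) {
            std::vector<float> row;
            for (int c = 0; c < 3; ++c)      // 3 color channels
                for (int dy = 0; dy < 2; ++dy)
                    for (int dx = 0; dx < 2; ++dx)
                        row.push_back(in[c][y + dy][x + dx]);
            D.push_back(row);                // 12 (repeated) data elements per row
        }
    return D;                                // then multiply: O = D (4x12) x F (12x2)
}
```

Note how each data element is repeated across rows of D, trading memory for a single large matrix multiplication.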
IEEE754 Floating Point Format Standard
16-bit half-precision format:

S | E E E E E | M M M M M M M M M M
(1 sign bit, 5 exponent bits with a bias offset of 15, 10 mantissa bits with weights 1/2, 1/4, 1/8, …, 1/1024)

$$X = (-1)^S \cdot (1.M)_2 \cdot 2^{E - 15} \qquad \text{(subtract the bias of 15)}$$

Example mantissa: $1.0110000000_2 = 1.375_{10}$

NaN (Not a Number): exponent all 1s and mantissa not all zeroes.
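A small C++ sketch of this decoding rule (normal numbers only; subnormals and NaN are handled on a later slide):

```cpp
#include <cstdint>
#include <cstdio>
#include <cmath>

// Decode an IEEE754 half-precision bit pattern (normal numbers only).
double fp16_to_double(uint16_t bits) {
    int s = (bits >> 15) & 0x1;      // sign bit
    int e = (bits >> 10) & 0x1F;     // 5-bit exponent field, bias 15
    int m = bits & 0x3FF;            // 10-bit mantissa field
    double mant = 1.0 + m / 1024.0;  // implicit leading 1
    return (s ? -1.0 : 1.0) * mant * std::pow(2.0, e - 15);
}

int main() {
    // 0 01111 0110000000: (1.0110000000)_2 x 2^0 = 1.375
    std::printf("%g\n", fp16_to_double(0x3D80));
}
```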
Binary Integer Multiplication

[Diagram: an 8-bit binary multiplication (0 0 0 1 1 1 0 1 x …) worked out as shifted partial products]

Floating-point multiplier datapath:
• Sign: XOR the two sign bits.
• Exponent: add the two 5-bit exponent fields, then subtract the fixed exponent bias of 15.
• Mantissa: prepend the implicit leading 1 to each 10-bit mantissa and feed the two 11-bit values into an 11x11-bit binary multiplier, producing a 22-bit product.
• Rounding logic and a normalization mux (shift one position and add 1 to the exponent when the product overflows past 2) reduce the product back to a 10-bit mantissa.

Result: IEEE754 16-bit, $val = S \cdot M_{mantissa} \cdot 2^{E-\mathrm{bias}}$
Floating Point Multiplication Hardware

Worked example: 1.0 x (-4) = -4
• A = 0 01111 0000000000 (= 1.0): exponent field 15, 11-bit mantissa with hidden bit = 1024
• B = 1 10001 0000000000 (= -4): exponent field 17, 11-bit mantissa with hidden bit = 1024
• Sign: 0 XOR 1 = 1. Exponent: 15 + 17 - 15 = 17. Mantissa product: 1024 x 1024 = 1,048,576 (22-bit).
• The product is already normalized, so no shift or rounding is needed.
• Result: 1 10001 0000000000 = -4
Floating Point Multiplication Hardware

Worked example: 1.5 x 1.5 = 2.25
• A = B = 0 01111 1000000000 (= 1.5): exponent field 15, 11-bit mantissa with hidden bit = 1536
• Sign: 0 XOR 0 = 0. Exponent: 15 + 15 - 15 = 15. Mantissa product: 1536 x 1536 = 2,359,296 (22-bit).
• The product (binary 1001000000000000000000) has overflowed past 2, so normalize: shift right one position and increment the exponent to 16.
• Result: 0 10000 0010000000 = 1 x 1.125 x 2^{16-15} = 2.25
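This datapath can be sketched directly in C++ (an illustration under simplifying assumptions: normal inputs only, truncation instead of round-to-nearest, no overflow or NaN handling):

```cpp
#include <cstdio>
#include <cstdint>

// Minimal FP16 multiply sketch following the slide's datapath.
uint16_t fp16_mul(uint16_t a, uint16_t b) {
    uint16_t sign = (a ^ b) & 0x8000;                             // XOR of sign bits
    int      exp  = ((a >> 10) & 0x1F) + ((b >> 10) & 0x1F) - 15; // add, subtract bias 15
    uint32_t ma   = 0x400 | (a & 0x3FF);    // 11-bit mantissa with hidden 1
    uint32_t mb   = 0x400 | (b & 0x3FF);
    uint32_t prod = ma * mb;                // 22-bit product, value in [1, 4)
    if (prod & 0x200000) {                  // overflowed past 2: normalize
        prod >>= 1;
        exp += 1;
    }
    uint16_t mant = (prod >> 10) & 0x3FF;   // drop hidden 1, truncate low bits
    return sign | ((uint16_t)exp << 10) | mant;
}

int main() {
    std::printf("%04X\n", fp16_mul(0x3E00, 0x3E00));  // 1.5 * 1.5
}
```

Running it on the slide's example, fp16_mul(0x3E00, 0x3E00) returns 0x4080, the bit pattern for 2.25.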
Speeding up Multiply-Add:
Selecting a Floating-Point Format

FP64  Floating Point, 64 bit         s, e11, m52   IEEE 754 double            Range ≈ [2.2E-308, 1.8E+308]
FP32  Floating Point, 32 bit         s, e8,  m23   IEEE 754 float             Range ≈ [1.2E-38, 3.4E+38]
TF32  Tensor Floating Point, 19 bit  s, e8,  m10   Nvidia Hopper, Blackwell   Range ≈ [1.2E-38, 3.4E+38]
BF16  Brain Floating Point, 16 bit   s, e8,  m7    Google, nVidia, others     Range ≈ [1.2E-38, 3.4E+38]
CF16  Floating Point, 16 bit         s, e6,  m9    Cerebras                   Range ≈ [9.31E-10, 4.29E+9]
FP16  Floating Point, 16 bit         s, e5,  m10   IEEE 754 half              Range ≈ [6.10E-5, 6.5504E+4]
FP16, Floating Point 16 bit (s, e5, m10), range [6.10E-5, 6.5504E+4]:
• As the IEEE 754 half-precision standard, it is popular in ML.
• But it sacrifices range and resolution:
• The upper end of the range is just 65504.
• The smallest 'normal' number is just 0.000061035.
• And small numbers are not rare in ML.

CF16, Floating Point 16 bit (s, e6, m9), range [9.31E-10, 4.29E+9]: less resolution, more range.
BF16, Brain Floating Point 16 bit (s, e8, m7), range ≈ [1.2E-38, 3.4E+38]: a lot less resolution, much more range.
Subnormal ('denormal') numbers: when the exponent field is all zeros, the hidden bit becomes 0 instead of 1, so values below the smallest normal number can still be represented.

$$0.1111111111_2 = 0.99902344_{10}$$
$$X = 1 \cdot 0.99902344 \cdot 2^{-14} \approx 0.0000609755 \qquad \text{(the largest subnormal)}$$

[Block diagram: the pipelined 11x11-bit multiplier datapath extended with normal/denormal control: an 'input normal?' mux selects a hidden bit of 1 or 0, and extra normalization muxes handle denormal results]

Example result: 0 00000 0000000010 ≈ 0.000000119

Schwarz, E.M.; Schmookler, M.; Son Dao Trong: "Hardware Implementations of Denormalized Numbers".
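Extending the earlier decoder to the subnormal case takes one extra branch (again a sketch, not production code):

```cpp
#include <cstdint>
#include <cmath>

// Decode IEEE754 half precision, including subnormals (exponent field 0).
double fp16_decode(uint16_t bits) {
    int s = (bits >> 15) & 0x1;
    int e = (bits >> 10) & 0x1F;
    int m = bits & 0x3FF;
    double mant = (e == 0) ? m / 1024.0         // subnormal: hidden bit is 0
                           : 1.0 + m / 1024.0;  // normal: hidden bit is 1
    int exp = (e == 0) ? -14 : e - 15;          // subnormals all use exponent -14
    return (s ? -1.0 : 1.0) * mant * std::pow(2.0, exp);
}
```

For the slide's example, fp16_decode(0x0002) gives 2^-23 ≈ 0.000000119.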
ML Implementation:
Software, Hardware, or a Hybrid of Both?

Trade-offs along the spectrum from custom hardware to pure software:
• Custom hardware: custom precision, the best performance, fast / low-cost / low-power, and it exploits memory locality; but higher design effort and a single fixed integer/FP precision.
• Pure software: floating point, easy compilation, universal; but memory overhead and ~1000X slower.
Machine Learning Hardware: Different Philosophies
Bringing Memory and Processing Closer
[Diagram: memory/processing organizations, from a single processor with weights and data in off-chip RAM, to arrays of processors with weights held in nearby 40G RAMs and data streamed in, to a mesh in which each processing element (p) stores its own weights (w) locally]
Note that Tesla’s FSD Chip and Smartphone SoCs
deploy a hybrid of all options
[Annotated die photos: Tesla FSD chip (12X CPU, 2X TPU, GPU) and a smartphone SoC (16X GPU, LP/HP CPU clusters, NPU, caches)]

GPU die (announced March 28, 2022): 80B transistors, 50MB L2 cache, 4 nanometer TSMC process, ~400W, ~814 mm² (about 2.8 cm on a side).
Mounted on a silicon interposer with 40GB of HBM3 memory.
A datacenter board with 8 of these retails at ~$200K.
NVidia Blackwell B200 (2024)
• 8 Graphics Processing Clusters
• 7 Texture Processing Clusters
• 14 Volta Streaming Multiprocessors
• 144 Streaming Multiprocessors
• 50 MB L2 Cache
• 64 FP32 cores, 64 INT32 cores, 32 FP64 cores, and 8 Tensor Cores per Streaming Multiprocessor
• 6 memory controllers
• 30 TFLOPS
https://devblogs.nvidia.com/inside-volta/ , http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
GPU Structure: the 'Streaming Multiprocessor'
[Diagram: Ampere Streaming Multiprocessor block structure]
Claims: double throughput, half memory.
Source: https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
NVidia GPU: programmed in CUDA with
cuDNN (Deep Neural Network Library)
• Nvidia CUDA: a C++ programming API abstraction for GPUs
• Single-precision multiply-add: aX + Y
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
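For scale, the aX + Y operation (SAXPY) in plain C++ is just a loop of single-precision multiply-adds; on a GPU, each iteration would run as an independent thread (a sketch, not Nvidia's library code):

```cpp
#include <cstddef>

// SAXPY: y = a*x + y elementwise, one single-precision multiply-add
// per element, with no dependencies between iterations.
void saxpy(std::size_t n, float a, const float* x, float* y) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```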
An Old New Idea: Systolic Arrays
Definition: A systolic array is a network of processors that
rhythmically computes and passes data through the system
$$\begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix} \times \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{pmatrix} = \begin{pmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \end{pmatrix}$$

[Diagram: the weights W sit stationary in the processor array; the columns of X enter skewed in time (X11 first, then X21 and X12, then X31, X22, X13, …) and the Y results accumulate as the data passes through]

H. T. Kung, C. E. Leiserson: Algorithms for VLSI Processor Arrays. In: C. Mead, L. Conway (eds.): Introduction to VLSI Systems; Addison-Wesley, 1979.
An Old New Idea: Systolic Arrays
[Successive animation frames: each clock cycle, the x values advance one PE; every PE multiplies the x passing through it by its stationary w and adds the product to the flowing partial sum (after the first step, for example, Y12 = w11 * x12); completed y values drain out of the array]
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
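To make the dataflow concrete, here is a cycle-by-cycle C++ simulation of a small weight-stationary systolic array computing Y = W x X for the 2x3 / 3x3 example above (a sketch with invented register names; a real array adds pipelining and I/O detail):

```cpp
#include <cstdio>

int main() {
    const int M = 2, K = 3, N = 3;              // Y = W*X: W is MxK, X is KxN
    double w[2][3] = {{1, 2, 3}, {4, 5, 6}};
    double x[3][3] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};  // identity, so Y == W
    double y[2][3] = {{0}};

    // PE(k,i) holds the stationary weight w[i][k].
    double xr[3][2] = {{0}};  // x register in PE(k,i): activation moving right
    double ps[3][2] = {{0}};  // partial sum leaving PE(k,i), moving down

    for (int t = 0; t < K + M + N; ++t) {       // run until the array drains
        for (int k = K - 1; k >= 0; --k)        // reverse order so each PE reads
            for (int i = M - 1; i >= 0; --i) {  // last cycle's register values
                // Row k receives x[k][t-k] at its left edge (skewed entry).
                double xin = (i == 0)
                    ? ((t - k >= 0 && t - k < N) ? x[k][t - k] : 0.0)
                    : xr[k][i - 1];
                double psin = (k == 0) ? 0.0 : ps[k - 1][i];
                ps[k][i] = psin + w[i][k] * xin;  // the multiply-accumulate
                xr[k][i] = xin;                   // pass x to the right neighbor
                if (k == K - 1) {                 // bottom row: a finished y
                    int j = t - (K - 1) - i;
                    if (j >= 0 && j < N) y[i][j] = ps[k][i];
                }
            }
    }
    for (int i = 0; i < M; ++i)
        std::printf("y[%d] = %g %g %g\n", i, y[i][0], y[i][1], y[i][2]);
}
```

With X the identity, the printed Y equals W, and y[i][j] drains out exactly at cycle j + K - 1 + i, matching the skewed schedule in the frames.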
Bringing Memory and Processing Closer
[Diagram: four memory/processing organizations side by side: a mesh processor (each PE p holds its weights w locally, data streaming through), a TPU (weight-stationary array fed by streaming data), a GPU (weights in 40G RAMs beside arrays of processors), and a CPU (weights and data in a single RAM)]
Layer-pipelined vs. layer-sequential (batched)
Cerebras Wafer Scale Engine: an AI supercomputer for layer-pipelined computing

CS-1 Hardware: the box
Wavelets Flowing on the Network-on-Chip
• Transmits one 32-bit wavelet per clock cycle in each direction
• Wavelet buffers, with a forwarding table per buffer
[Diagram: a mesh of PEs exchanging wavelets with their neighbors]
A Dense Layer Kernel in Hardware
• The ML layer implementation on CS-1 is massively parallel multiply-accumulate.
• The data operand is efficiently streamed in at a high rate.
• The weight operand is stationary in 40Gb of high-speed memory.
• The result (sum-of-products) is streamed out to the next layer.
• 850,000 (1020x830) processors.
[Diagram: a grid of PEs, each multiplying (X) streaming data d by stationary weights w and accumulating (+); the per-PE detail shows inputs in[0..7] through FIFO queues, 48K SRAM, and outputs out[0..5], computing Result = [D] x [W]]
Layer-pipelined execution
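A per-PE view of this dataflow, as a loose C++ sketch (the names and granularity are invented for illustration; the real CS-1 kernel is far more elaborate):

```cpp
#include <vector>

// One PE holds a stationary column of weights; data elements d stream past,
// one per cycle, and each is multiplied into every partial sum the PE owns.
struct PE {
    std::vector<float> w;    // stationary weights, resident in local SRAM
    std::vector<float> acc;  // partial sums, one per output

    explicit PE(std::vector<float> weights)
        : w(std::move(weights)), acc(w.size(), 0.0f) {}

    void consume(float d) {              // one streamed data element arrives
        for (std::size_t i = 0; i < w.size(); ++i)
            acc[i] += w[i] * d;          // multiply-accumulate
    }
    // When the stream ends, acc holds this PE's slice of [D] x [W],
    // which is streamed onward to the next layer's PEs.
};
```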
Xilinx: Systolic Array of VLIW RISC Processors
[Figure: a 5-channel LTE20 wireless design mapped onto the array (Xilinx.com white paper)]
Summary