Future of Computer Architecture
Future of Computer Architecture
Future of Computer Architecture
Future of
Computer
Architecture
David A. Patterson
Pardee Professor of Computer Science, U.C. Berkeley
President, Association for Computing Machinery
February, 2006
2
High Level Message
Everything is changing; Old conventional
wisdom is out
We DESPERATELY need a new architectural
solution for microprocessors based on
parallelism
Need to create a watering hole to bring
everyone together to quickly find that
solution
architects, language designers, application experts, numerical
analysts, algorithm designers, programmers,
3
Outline
Part I: A New Agenda for Computer
Architecture
Old Conventional Wisdom vs. New Conventional Wisdom
Part II: A Watering Hole for Parallel
Systems
Research Accelerator for Multiple Processors
Conclusion
4
Old CW: Chips reliable internally, errors at pins
New CW: 65 nm high soft & hard error rates
Old CW: Demonstrate new ideas by building chips
New CW: Mask costs, ECAD costs, GHz clock rates
researchers cant build believable prototypes
Old CW: Innovate via compiler optimizations +
architecture
New: Takes > 10 years before new optimization at
leading conference gets into production compilers
Old: Hardware is hard to change, SW is flexible
New: Hardware is flexible, SW is hard to change
Conventional Wisdom (CW)
in Computer Architecture
5
Old CW: Power is free, Transistors expensive
New CW: Power wall Power expensive, Xtors free
(Can put more on chip than can afford to turn on)
Old: Multiplies are slow, Memory access is fast
New: Memory wall Memory slow, multiplies fast
(200 clocks to DRAM memory, 4 clocks for FP multiply)
Old : Increasing Instruction Level Parallelism via
compilers, innovation (Out-of-order, speculation, VLIW, )
New CW: ILP wall diminishing returns on more ILP
New: Power Wall + Memory Wall + ILP Wall = Brick Wall
Old CW: Uniprocessor performance 2X / 1.5 yrs
New CW: Uniprocessor performance only 2X / 5 yrs?
Conventional Wisdom (CW)
in Computer Architecture
6
1
10
100
1000
10000
1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006
P
e
r
f
o
r
m
a
n
c
e
(
v
s
.
V
A
X
-
1
1
/
7
8
0
)
25%/year
52%/year
??%/year
Uniprocessor Performance (SPECint)
VAX : 25%/year 1978 to 1986
RISC + x86: 52%/year 1986 to 2002
RISC + x86: ??%/year 2002 to present
From Hennessy and Patterson,
Computer Architecture: A Quantitative
Approach, 4th edition, 2006
Sea change in chip
design: multiple cores or
processors per chip
3X
7
Sea Change in Chip Design
Intel 4004 (1971): 4-bit processor,
2312 transistors, 0.4 MHz,
10 micron PMOS, 11 mm
2
chip
Processor is the new transistor?
RISC II (1983): 32-bit, 5 stage
pipeline, 40,760 transistors, 3 MHz,
3 micron NMOS, 60 mm
2
chip
125 mm
2
chip, 0.065 micron CMOS
= 2312 RISC II+FPU+Icache+Dcache
RISC II shrinks to 0.02 mm
2
at 65 nm
Caches via DRAM or 1 transistor SRAM (www.t-ram.com) ?
Proximity Communication via capacitive coupling at > 1 TB/s ?
(Ivan Sutherland @ Sun / Berkeley)
8
Dj vu all over again?
todays processors are nearing an impasse as
technologies approach the speed of light..
David Mitchell, The Transputer: The Time Is Now (1989)
Transputer had bad timing (Uniprocessor performance)
Procrastination rewarded: 2X seq. perf. / 1.5 years
We are dedicating all of our future product development to
multicore designs. This is a sea change in computing
Paul Otellini, President, Intel (2005)
All microprocessor companies switch to MP (2X CPUs / 2 yrs)
Procrastination penalized: 2X sequential perf. / 5 yrs
Manufacturer/Year
AMD/05 Intel/06 IBM/04 Sun/05
Processors/chip
2 2 2 8
Threads/Processor
1 2 2 4
Threads/chip
2 4 4 32
9
Old CW: Since cannot know future programs,
find set of old programs to evaluate designs
of computers for the future
E.g., SPEC2006
What about parallel codes?
Few available, tied to old models, languages, architectures,
New approach: Design computers of future
for numerical methods important in future
Claim: key methods for next decade are 7
dwarves (+ a few), so design for them!
Representative codes may vary over time, but these numerical
methods will be important for > 10 years
21
st
Century Computer Architecture
10
High-end simulation in the physical
sciences = 7 numerical methods:
1. Structured Grids (including
locally structured grids, e.g.
Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
Well-defined targets from algorithmic,
software, and architecture standpoint
Phillip Colellas Seven dwarfs
If add 4 for embedded,
covers all 41 EEMBC
benchmarks
8. Search/Sort
9. Filter
10. Combinational logic
11. Finite State Machine
Note: Data sizes (8 bit
to 32 bit) and types
(integer, character)
differ, but algorithms
the same
Slide from Defining Software
Requirements for Scientific
Computing, Phillip Colella, 2004
11
SPECfp
8 Structured grid
3 using Adaptive Mesh Refinement
2 Sparse linear algebra
2 Particle methods
5 TBD: Ray tracer, Speech Recognition, Quantum
Chemistry, Lattice Quantum Chromodynamics
(many kernels inside each benchmark?)
SPECint
8 Finite State Machine
2 Sorting/Searching
2 Dense linear algebra (data type differs from dwarf)
1 TBD: 1 C compiler (many kernels?)
6/11 Dwarves Covers 24/30 SPEC
12
21
st
Century Code Generation
Old CW: Takes a decade for compilers to
introduce an architecture innovation
New approach: Auto-tuners 1st run
variations of program on computer to find
best combinations of optimizations (blocking,
padding, ) and algorithms, then produce C
code to be compiled for that computer
E.g., PHiPAC (BLAS), Atlas (BLAS),
Sparsity (Sparse linear algebra), Spiral (DSP), FFT-W
Can achieve 10X over conventional compiler
One Auto-tuner per dwarf?
Exist for Dense Linear Algebra, Sparse Linear Algebra, Spectral
13
Reference
Best: 4x2
Mflop/s
Mflop/s
Sparse Matrix Search for Blocking
for finite element problem [Im, Yelick, Vuduc, 2005]
14
Best Sparse Blocking for 8 Computers
All possible column block sizes selected for 8
computers; How could compiler know?
Intel
Pentium
M
Sun Ultra 2,
Sun Ultra 3,
AMD Opteron
IBM Power 4,
Intel/HP
Itanium
Intel/HP
Itanium 2
IBM
Power 3
8
4
2
1
1 2 4 8
r
o
w
b
l
o
c
k
s
i
z
e
(
r
)
column block size (c)
15
21
st
Century Measures of Success
Old CW: Dont waste resources on accuracy,
reliability
Speed kills competition
Blame Microsoft for crashes
New CW: SPUR is critical for future of IT
Security
Privacy
Usability (cost of ownership)
Reliability
Success not limited to performance/cost
20th century vs. 21st century C&C: the SPUR manifesto,
Communications of the ACM , 48:3, 2005.
16
Style of Parallelism
Explicitly Parallel
Data Level Parallel
( SIMD)
Thread Level Parallel
( MIMD)
Streaming
(time is one
dimension)
General
DLP
No
Coupling
TLP
Barrier
Synch.
TLP
Tight
Coupling
TLP
Programmer wants code to run on as many
parallel architectures as possible so (if possible)
Architect wants to run as many different types
of parallel programs as possible so
Less HW Control,
Simpler Prog. model
More Flexible
17
Parallel Framework Apps (so far)
Original 7 dwarves: 6 data parallel, 1 no coupling TLP
Bonus 4 dwarves: 2 data parallel, 2 no coupling TLP
EEMBC (Embedded): Stream 10, DLP 19, Barrier TLP 2
SPEC (Desktop): 14 DLP, 2 no coupling TLP
E
E
M
B
C
E
E
M
B
C
S
P
E
C
S
P
E
C
D
w
a
r
f
S
D
W
A
R
F
S
Streaming DLP DLP No coupling TLP Barrier TLP Tight TLP
Most New
Architectures
Most
Important
Apps?
18
Outline
Part I: A New Agenda for Computer
Architecture
Old Conventional Wisdom vs. New Conventional Wisdom
Part II: A Watering Hole for Parallel
Systems
Research Accelerator for Multiple Processors
Conclusion
19
1. Algorithms, Programming Languages, Compilers,
Operating Systems, Architectures, Libraries, not
ready for 1000 CPUs / chip
2. Only companies can build HW, and it takes years
3. Software people dont start working hard until
hardware arrives
3 months after HW arrives, SW people list everything that must be
fixed, then we all wait 4 years for next iteration of HW/SW
4. How get 1000 CPU systems in hands of researchers
to innovate in timely fashion on in algorithms,
compilers, languages, OS, architectures, ?
5. Avoid waiting years between HW/SW iterations?
Problems with Sea Change
20
Build Academic MPP from FPGAs
As 25 CPUs will fit in Field Programmable Gate
Array (FPGA), 1000-CPU system from 40 FPGAs?
16 32-bit simple soft core RISC at 150MHz in 2004 (Virtex-II)
FPGA generations every 1.5 yrs; 2X CPUs, 1.2X clock rate
HW research community does logic design (gate
shareware) to create out-of-the-box, MPP
E.g., 1000 processor, standard ISA binary-compatible, 64-bit, cache-
coherent supercomputer @ 100 MHz/CPU in 2007
RAMPants: Arvind (MIT), Krste Asanovc (MIT), Derek Chiou (Texas),
James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu
(Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI),
Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley, PI)
Research Accelerator for Multiple Processors
21
Why RAMP Good for Research MPP?
SMP Cluster Simulate RAMP
Scalability (1k CPUs) C A A A
Cost (1k CPUs) F ($40M) C ($2-3M) A+ ($0M) A ($0.1-0.2M)
Cost of ownership A D A A
Power/Space
(kilowatts, racks)
D (120 kw,
12 racks)
D (120 kw,
12 racks)
A+ (.1 kw,
0.1 racks)
A (1.5 kw,
0.3 racks)
Community D A A A
Observability D C A+ A+
Reproducibility B D A+ A+
Reconfigurability D C A+ A+
Credibility A+ A+ F B+/A-
Perform. (clock) A (2 GHz) A (3 GHz) F (0 GHz) C (0.1-.2 GHz)
GPA C B- B A-
22
Completed Dec. 2004 (14x17 inch 22-layer PCB)
Board:
5 Virtex II FPGAs, 18
banks DDR2-400
memory,
20 10GigE conn.
RAMP 1 Hardware
BEE2: Berkeley Emulation Engine 2
By John Wawrzynek and Bob Brodersen with
students Chen Chang and Pierre Droz
1.5W / computer,
5 cu. in. /computer,
$100 / computer
1000 CPUs :
1.5 KW,
rack,
$100,000
Box:
8 compute modules in
8U rack mount chassis
23
RAMP Milestones
Name Goal Target CPUs Details
Red
(Stanf
ord)
Get
Started
1H06 8 PowerPC
32b hard cores
Transactional
memory SMP
Blue
(Cal)
Scale 2H06 1000 32b soft
(Microblaze)
Cluster, MPI
White
(All)
Full
Features
1H07? 128? soft 64b,
Multiple
commercial
ISAs
CC-NUMA,
shared address,
deterministic,
debug/monitor
2.0 3
rd
party
sells it
2H07? 4X CPUs of 04
FPGA
New 06 FPGA,
new board
24
Can RAMP keep up?
FGPA generations: 2X CPUs / 18 months
2X CPUs / 24 months for desktop microprocessors
1.1X to 1.3X performance / 18 months
1.2X? / year per CPU on desktop?
However, goal for RAMP is accurate system
emulation, not to be the real system
Goal is accurate target performance, parameterized
reconfiguration, extensive monitoring, reproducibility,
cheap (like a simulator) while being credible and fast
enough to emulate 1000s of OS and apps in parallel
(like hardware)
25
Multiprocessing Watering Hole
Killer app: All CS Research, Advanced Development
RAMP attracts many communities to shared artifact
Cross-disciplinary interactions
Ramp up innovation in multiprocessing
RAMP as next Standard Research/AD Platform?
(e.g., VAX/BSD Unix in 1980s, Linux/x86 in 1990s)
Parallel file system
Flight Data Recorder Transactional Memory
Fault insertion to check dependability
Data center in a box
Internet in a box
Dataflow language/computer
Security enhancements
Router design Compile to FPGA
Parallel languages
RAMP
128-bit Floating Point Libraries
26
Supporters and Participants
Gordon Bell (Microsoft)
Ivo Bolsens (Xilinx CTO)
Jan Gray (Microsoft)
Norm Jouppi (HP Labs)
Bill Kramer (NERSC/LBL)
Konrad Lai (Intel)
Craig Mundie (MS CTO)
Jaime Moreno (IBM)
G. Papadopoulos (Sun CTO)
Jim Peek (Sun)
Justin Rattner (Intel CTO)
Michael Rosenfield (IBM)
Tanaz Sowdagar (IBM)
Ivan Sutherland (Sun Fellow)
Chuck Thacker (Microsoft)
Kees Vissers (Xilinx)
Jeff Welser (IBM)
David Yen (Sun EVP)
Doug Burger (Texas)
Bill Dally (Stanford)
Susan Eggers (Washington)
Kathy Yelick (Berkeley)
RAMP Participants: Arvind (MIT), Krste Asanovc (MIT),
Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford),
Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley,
Co-PI), Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley, PI)
27
Conclusion [1/2]
Parallel Revolution has occurred;
Long live the revolution!
Aim for Security, Privacy, Usability, Reliability
as well as performance and cost of purchase
Use Applications of Future to design
Computers, Languages, of the Future
7 + 5? Dwarves as candidates for programs of future
Although most architects focusing toward
right, most dwarves are toward left
Streaming DLP DLP No coupling TLP Barrier TLP Tight TLP
28
Research Accelerator for Multiple Processors
Carpe Diem: Researchers need it ASAP
FPGAs ready, and getting better
Stand on shoulders vs. toes: standardize on Berkeley
FPGA platforms (BEE, BEE2) by Wawrzynek et al
Architects aid colleagues via gateware
RAMP accelerates HW/SW generations
System emulation + good accounting vs. FPGA computer
Emulate, Trace, Reproduce anything; Tape out every day
Multiprocessor Research Watering Hole
ramp up research in multiprocessing via common
research platform innovate across fields hasten
sea change from sequential to parallel computing
Conclusions [2 / 2]
29
Acknowledgments
Material comes from discussions on new directions for
architecture with:
Professors Krste Asanovc (MIT), Raz Bodik, Jim Demmel,
Kurt Keutzer, John Wawrzynek, and Kathy Yelick
LBNL:Parry Husbands, Bill Kramer, Lenny Oliker, John Shalf
UCB Grad students Joe Gebis and Sam Williams
RAMP based on work of RAMP Developers:
Arvind (MIT), Krste Asanovc (MIT), Derek Chiou (Texas),
James Hoe (CMU), Christos Kozyrakis (Stanford),
Shih-Lien Lu (Intel), Mark Oskin (Washington),
David Patterson (Berkeley, Co-PI), Jan Rabaey (Berkeley),
and John Wawrzynek (Berkeley, PI)
30
Backup Slides
31
Operand Size and Type
Programmer should be able to specify data
size, type independent of algorithm
1 bit (Boolean*)
8 bits (Integer, ASCII)
16 bits (Integer, DSP fixed pt, Unicode*)
32 bits (Integer, SP Fl. Pt., Unicode*)
64 bits (Integer, DP Fl. Pt.)
128 bits (Integer*, Quad Precision Fl. Pt.*)
1024 bits (Crypto*)
* Not supported well in most programming
languages and optimizing compilers
32
Amount of Explicit Parallelism
1
10
100
1000
P
a
r
a
l
l
e
l
i
s
m
1 4 16 64 256 1024
Data - Streaming
Data - General
TLP - No Coupling
TLP - Barrier
TLP - Tightly
Operand Size
Crypto Boolean
Given natural operand size and level of
parallelism, how parallel is computer or how must
parallelism available in application?
Proposed Parallel Framework
33
Parallel Framework - Architecture
1
10
100
1000
P
a
r
a
l
l
e
l
i
s
m
1 4 16 64 256 1024
Data - Streaming
Data - General
TLP - No Coupling
TLP - Barrier
TLP - Tightly
Operand Size
Crypto Boolean
Examples of good architectural matches to each
style
MMX
T
C
C
C
M
5
C
L
U
S
T
E
R
Vec-
tor
I
M
A
G
I
N
E
34
Parallel Framework Apps (so far)
1
10
100
1000
P
a
r
a
l
l
e
l
i
s
m
1 4 16 64 256 1024
Data - Streaming
Data - General
TLP - No Coupling
TLP - Barrier
TLP - Tightly
Operand Size
Crypto Boolean
Original 7: 6 data parallel, 1 no coupling TLP
Bonus 4: 2 data parallel, 2 no coupling TLP
EEMBC (Embedded): Stream 10, DLP 19, Barrier TLP 2
SPEC (Desktop): 14 DLP, 2 no coupling TLP
E
E
M
B
C
E
E
M
B
C
S
P
E
C
S
P
E
C
D
W
A
R
F
S
D
W
A
R
F
S
35
RAMP FAQ
Q: How can many researchers get RAMPs?
A1: RAMP 2.0 to be available for purchase
at low margin from 3
rd
party vendor
A2: Single board RAMP 2.0 still interesting
as FPGA 2X CPUs/18 months
RAMP 2.0 FPGA two generations later than RAMP 1.0, so
256? simple CPUs per board vs. 64?
36
Parallel FAQ
Q: Wont the circuit or processing guys solve
CPU performance problem for us?
A1: No. More transistors, but cant help with
ILP wall, and power wall is close to
fundamental problem
Memory wall could be lowered some, but hasnt
happened yet commercially
A2: One time jump. IBM using strained
silicon on Silicon On Insulator to increase
electron mobility (Intel doesnt have SOI)
clock rate or leakage power
Continue making rapid semiconductor investment?
37
Gateware Design Framework
Design composed of units
that send messages over
channels via ports
Units (10,000 + gates)
CPU + L1 cache, DRAM controller.
Channels ( FIFO)
Lossless, point-to-point,
unidirectional, in-order message
delivery
Channel Receiving Unit Sending Unit
Port
Port
Sending Unit
Channel
Port DataOut
DataOut
__DataOut_READY
__DataOut_WRITE
Receiving Unit
Port DataIn
DataIn
__DataIn_READ
__DataIn_READY
38
Quick Sanity Check
BEE2 uses old FPGAs (Virtex II), 4 banks DDR2-400/cpu
16 32-bit Microblazes per Virtex II FPGA,
0.75 MB memory for caches
32 KB direct mapped Icache, 16 KB direct mapped Dcache
Assume 150 MHz, CPI is 1.5 (4-stage pipe)
I$ Miss rate is 0.5% for SPECint2000
D$ Miss rate is 2.8% for SPECint2000, 40% Loads/stores
BW need/CPU = 150/1.5*4B*(0.5% + 40%*2.8%)
= 6.4 MB/sec
BW need/FPGA = 16*6.4 = 100 MB/s
Memory BW/FPGA = 4*200 MHz*2*8B = 12,800 MB/s
Plenty of BW for tracing,
39
Handicapping ISAs
Got it: Power 405 (32b), SPARC v8 (32b),
Xilinx Microblaze (32b)
Very Likely: SPARC v9 (64b)
Likely: IBM Power 64b
Probably (havent asked): MIPS32, MIPS64
No: x86, x86-64
But Derek Chiou of UT looking at x86 binary translation
Well sue: ARM
But pretty simple ISA & MIT has good lawyers
40
Related Approaches (1)
Quickturn, Axis, IKOS, Thara:
FPGA- or special-processor based gate-level hardware emulators
Synthesizable HDL is mapped to array for cycle and bit-accurate netlist
emulation
RAMPs emphasis is on emulating high-level architecture behaviors
Hardware and supporting software provides architecture-level
abstractions for modeling and analysis
Targets architecture and software research
Provides a spectrum of tradeoffs between speed and
accuracy/precision of emulation
RPM at USC in early 1990s:
Up to only 8 processors
Only the memory controller implemented with configurable logic
41
Related Approaches (2)
Software Simulators
Clusters (standard microprocessors)
PlanetLab (distributed environment)
Wisconsin Wind Tunnel (used CM-5 to simulate
shared memory)
All suffer from some combination of:
Slowness, inaccuracy, scalability, unbalanced
computation/communication, target inflexibility
42
1
10
100
1000
10000
100000
P
a
r
a
l
l
e
l
i
s
m
1 4 16 64 256 1024
Data
TLP - No coupling
TLP - Stream
TLP - Barrier
TLP - Tightly
Data flow
Operand Size
Parallel Framework - Benchmarks
Crypto Boolean
7 Dwarfs: Use simplest parallel model that works
Monte Carlo
Dense
Structured
Unstructured
Sparse
FFT
Particle
43
1
10
100
1000
P
a
r
a
l
l
e
l
i
s
m
1 4 16 64 256 1024
Data
TLP - No coupling
TLP - Stream
TLP - Barrier
TLP - Tightly
Data flow
Operand Size
Parallel Framework - Benchmarks
Crypto Boolean
Additional 4 Dwarfs (not including FSM, Ray tracing)
Comb.
Logic
Searching / Sorting
crypto Filter
44
1
10
100
1000
P
a
r
a
l
l
e
l
i
s
m
1 4 16 64 256 1024
Data
TLP - No coupling
TLP - Stream
TLP - Barrier
TLP - Tightly
Data flow
Operand Size
Crypto Boolean
Angle to Time
CAN Remote
Bit
Manipulation
Basic Int
Cache Buster
Number EEMBC
kernels
Parallelism Style Operand
14 1000 Data 8 - 32 bit
5 100 Data 8 - 32 bit
10 10 Stream 8 - 32 bit
2 10 Tightly Coupled 8 - 32 bit
Parallel Framework EEMBC Benchmarks
45
SPECintCPU: 32-bit integer
FSM: perlbench, bzip2, minimum cost flow
(MCF), Hidden Markov Models (hmm), video
(h264avc), Network discrete event
simulation, 2D path finding library (astar),
XML Transformation (xalancbmk)
Sorting/Searching: go (gobmk), chess
(sjeng),
Dense linear algebra: quantum computer
(libquantum), video (h264avc)
TBD: compiler (gcc)
46
SPECfpCPU: 64-bit Fl. Pt.
Structured grid: Magnetohydrodynamics (zeusmp),
General relativity (cactusADM), Finite element code
(calculix), Maxwell's E&M eqns solver (GemsFDTD),
Fluid dynamics (lbm; leslie3d-AMR), Finite element
solver (dealII-AMR), Weather modeling (wrf-AMR)
Sparse linear algebra: Fluid dynamics (bwaves),
Linear program solver (soplex),
Particle methods: Molecular dynamics (namd, 64-
bit; gromacs, 32-bit),
TBD: Quantum chromodynamics (milc), Quantum
chemistry (gamess), Ray tracer (povray), Quantum
crystallography (tonto), Speech recognition
(sphinx3)