Exclusive Squashing for Thread-Level Speculation
Álvaro García-Yágüez, Diego R. Llanos, and Arturo González-Escribano
Universidad de Valladolid, Spain
Introduction
Speculative parallelization aims to extract loop- and task-level parallelism when a compile-time dependence
analysis cannot guarantee that a given sequential code is safely parallelizable. Speculative parallelization
optimistically assumes that the code can be executed in parallel, relying on a runtime monitor to ensure
that no dependence violation goes unnoticed.
If the runtime monitor detects a dependence violation, it must decide what to do with the parallel
execution:
• Restart serially. Discard the parallel work done so far and restart the loop serially [1].
• Inclusive Squashing (IS). Restart the offending thread and all its successors [2, 3].
• Exclusive Squashing (ES, our proposal). Restart only the offending threads and, recursively, the
successors that have consumed any value generated by them (see the sketch after this list).
• Perfect Squashing. Restart only the offending threads and, recursively, the successors that have
consumed wrong values generated by them.
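To make the contrast concrete, the following minimal C sketch (ours, not the authors' implementation)
computes the set of window slots to restart under IS and ES. It assumes a consumer_list matrix, introduced
in the exclusive-squash example below, where consumer_list[i][j] != 0 records that slot i consumed a value
produced by slot j; W and the function names are illustrative.

    #define W 4  /* illustrative window size */

    /* Inclusive squashing: mark the offending slot and every successor. */
    static void squash_set_is(int offender, int squash[W])
    {
        for (int t = 0; t < W; t++)
            squash[t] = (t >= offender);
    }

    /* Exclusive squashing: squash[] initially marks only the offending
     * slots; recursively add every successor that consumed a value from
     * an already-marked slot. Consumers are always successors of their
     * producers, so a single forward pass over the window yields the
     * transitive closure. */
    static void squash_set_es(int squash[W], int consumer_list[W][W])
    {
        for (int t = 1; t < W; t++)
            for (int p = 0; p < t; p++)
                if (squash[p] && consumer_list[t][p])
                    squash[t] = 1;
    }

Perfect squashing would further restrict the closure to consumers of values that actually changed, at the
cost of tracking the data itself.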
Speculative data structures
• Ref. The original user data structure.
• Window. Holds the state of the W slots where blocks of iterations are executed (FREE, DONE, RUNNING,
SQUASHED).
• Version. Stores W copies of the Ref data, one local version per slot.
In the execution examples that follow, Dx marks an element that has been speculatively loaded (ExpLd
state) or speculatively written (Update state); non_spec and most_spec point to the non-speculative and
most-speculative slots of the window.
[Figures omitted: two execution examples over these structures, inclusive squash [3] on one side and our
proposed exclusive squash on the other.]
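A minimal sketch of how these structures could be laid out in C; spec_engine_t and all field and constant
names are hypothetical, chosen only to match the poster's terminology:

    #define N 1024  /* illustrative number of elements in Ref */
    #define W 4     /* illustrative number of window slots    */

    typedef enum { FREE, DONE, RUNNING, SQUASHED } slot_state_t;
    typedef enum { NOT_ACCESSED, EXP_LD, UPDATE } elem_state_t;  /* ExpLd / Update */

    typedef struct {
        double       ref[N];               /* Ref: original user data structure     */
        slot_state_t window[W];            /* Window: state of each of the W slots  */
        double       version[W][N];        /* Version: W local copies of Ref        */
        elem_state_t elem_state[W][N];     /* per-slot access state of each element */
        int          consumer_list[W][W];  /* consumer_list[i][j]: slot i consumed
                                              a value produced by slot j            */
        int          non_spec, most_spec;  /* window pointers                       */
    } spec_engine_t;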
Execution example: inclusive squash [3]
A. Thread 2 speculatively loads element D3 from the speculative structure.
B. Thread 1 speculatively writes element D3.
C. Since a dependence violation appears, Thread 2 and all its successors are squashed.
D. The most-speculative pointer is modified.
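A minimal sketch of how the two policies could update the window, reusing the hypothetical spec_engine_t
above: under IS the most-speculative pointer is rolled back (step D above), while under ES it is left
untouched, producing bubbles of SQUASHED slots (see the exclusive-squash example that follows):

    /* Inclusive squashing: squash the offender and all successors, then
     * roll the most-speculative pointer back to the offender. */
    void apply_is_squash(spec_engine_t *e, int offender)
    {
        for (int t = offender; t <= e->most_spec; t++)
            e->window[t] = SQUASHED;
        e->most_spec = offender;  /* pointer is modified */
    }

    /* Exclusive squashing: squash only the slots selected by
     * squash_set_es(). most_spec is not modified, so bubbles of
     * SQUASHED slots appear between running ones. */
    void apply_es_squash(spec_engine_t *e, const int squash[W])
    {
        for (int t = 0; t < W; t++)
            if (squash[t])
                e->window[t] = SQUASHED;
    }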
Execution example: exclusive squash (our proposal)
A. Thread W speculatively loads element D3 from the speculative structure.
B. Thread 2 loads the same element D3, forwarding it from the reference value.
C. Thread 2 speculatively writes element D1 to the speculative structure; no dependence violation is found.
D. Thread 3 speculatively loads element D1. Since Thread 2 holds the value, Thread 3 writes
consumer_list[3][2] to mark that it will consume a value from Thread 2.
E. Thread 3 forwards datum D1 from Thread 2.
F. Thread 1 speculatively writes element D3.
G. A squash operation takes place: the threads that have incorrectly consumed the value of D3 are squashed.
H. The consumer list is checked in search of threads that have consumed any datum from the squashed
threads. In our example, Thread 3 is also squashed, and its consumer-list column is checked in turn.
Note that the most-speculative pointer is not modified and bubbles are generated.
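Steps B, D, and E show how forwarding fills the consumer list. The following sketch of a speculative load,
again over the hypothetical spec_engine_t, illustrates that bookkeeping; it is an assumption of ours, not
the authors' code:

    /* Speculatively load element idx on behalf of window slot `slot`:
     * forward the most recent predecessor version if one exists
     * (recording the dependence in consumer_list), else forward the
     * reference value. */
    double spec_load(spec_engine_t *e, int slot, int idx)
    {
        if (e->elem_state[slot][idx] != NOT_ACCESSED)
            return e->version[slot][idx];  /* already loaded or written */

        /* Search predecessors from most recent to least recent. */
        for (int p = slot - 1; p >= e->non_spec; p--) {
            if (e->elem_state[p][idx] == UPDATE) {          /* p wrote it      */
                e->consumer_list[slot][p] = 1;              /* record consumer */
                e->version[slot][idx] = e->version[p][idx]; /* forward value   */
                e->elem_state[slot][idx] = EXP_LD;
                return e->version[slot][idx];
            }
        }
        /* No speculative producer: forward from the reference copy. */
        e->version[slot][idx] = e->ref[idx];
        e->elem_state[slot][idx] = EXP_LD;
        return e->version[slot][idx];
    }

Recording only true producers (slots in Update state) is what lets the exclusive squash of step H follow
real forwarding chains instead of squashing every successor.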
Results
[Figures omitted: speedup and number of violations/squashes for up to 16 processors, comparing inclusive
squashing (IS) and exclusive squashing (ES):
• 2D Convex Hull (2D-Hull, Disc input, block=2500): Speedup, Violations, and Squashes.
• 2D Delaunay Triangulation (2DT) (Delaunay-BLOCK20-WINDOW02): Speedup, Violations, and Squashes.
• 2DT with overloaded computation (DelaunayOverload1, 2, and 3, BLOCK20-WINDOW02): Speedup.
• Applications without dependences (TREE-WINDOW02-BLOCK02 and MDG-WINDOW02-BLOCK02): Speedup.]
Conclusions
• Exclusive squashing reduces the number of squashes by 10% for 4 threads and by up to 85% for 16 threads.
• Its usefulness in terms of speedup heavily depends on the cost of discarding potentially valid work in
each application.
• The computational load of the two applications with dependences considered is not high enough: adding
an artificial load to 2DT improves the speedup with respect to the inclusive squashing policy.
This research is partly supported by the Ministerio de Educación y Ciencia, Spain (TIN2007-62302) and Junta de
Castilla y León, Spain (VA094A08).
References
[1] Rauchwerger, L., and Padua, D. The LRPD test: Speculative run-time parallelization of loops with
privatization and reduction parallelization. In Proceedings of the ACM SIGPLAN 1995 Conference on
Programming Language Design and Implementation (La Jolla, California, USA, 1995), ACM, pp. 218–232.
[2] Dang, F., Yu, H., and Rauchwerger, L. The R-LRPD test: Speculative parallelization of partially
parallel loops. In Proceedings of the International Parallel and Distributed Processing Symposium (2002),
IEEE, pp. 20–29.
[3] Cintra, M., and Llanos, D. R. Toward efficient and robust software speculative parallelization on
multiprocessors. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming (San Diego, California, USA, 2003), ACM, pp. 13–24.