Exclusive squashing for thread-level speculation

2011, Proceedings of the 20th international symposium on High performance distributed computing

AI-generated Abstract

This paper presents Exclusive Squashing (ES), an approach to thread-level speculation that improves speculative parallelization by optimizing the squashing process. ES restarts only the offending threads and, recursively, the successors that have consumed values generated by them. This reduces the number of squashes compared with Inclusive Squashing (IS), which restarts the offending thread and all its successors. The results show that ES reduces the number of squashes by between 10% for 4 threads and 85% for 16 threads. The resulting benefit in speedup depends heavily on the cost of discarding potentially valid work in each application.

Alvaro García-Yágüez, Diego R. Llanos, and Arturo González-Escribano
Universidad de Valladolid, Spain
[email protected], [email protected], [email protected]

Introduction

Speculative parallelization aims to extract loop-level and task-level parallelism when a compile-time dependence analysis cannot guarantee that a given sequential code is safely parallelizable. Speculative parallelization optimistically assumes that the code can be executed in parallel, and relies on a runtime monitor to ensure that no dependence violation is produced. If the runtime monitor detects a dependence violation, it must decide what to do with the parallel execution.

[Figure: two panels, "Our proposal: Exclusive squash" and "Inclusive squash [3]". Each panel shows the window (from non_spec to most_spec) with the state of threads 1 to W, the reference copy (ref) holding elements D1 ... DM, the local versions of each thread, and, for exclusive squash, the consumer list. Legend: Dx has been speculatively loaded (ExpLd state); Dx has been speculatively written (Update state).]

Execution example (exclusive squash)

A. Thread W speculatively loads element D3 from the speculative structure.
B. Thread 2 loads the same element D3, forwarding it from the reference value.
C. Thread 2 speculatively writes element D1 to the speculative structure; no dependence violations are found.
D. Thread 3 speculatively loads element D1. Since thread 2 has the value, thread 3 writes in consumer list[3][2] to mark that it will consume a value from thread 2.
E. Thread 3 forwards datum D1 from thread 2.
F. Thread 1 speculatively writes element D3.
G. A squash operation takes place. Threads that have incorrectly consumed the value D3 are squashed.
H. The consumer list is searched for threads that have consumed any datum from squashed threads. In our example, thread 3 is also squashed, and its consumer list column is also checked.
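The consumer-list check in steps G and H amounts to a transitive closure over the consumer list. The Python sketch below is illustrative only and is not the authors' runtime: the function names `exclusive_squash` and `inclusive_squash`, the 1-based thread numbering, and the boolean-matrix representation are assumptions made for this example. As in step D, `consumer_list[c][p]` is true when thread c has consumed a value from thread p.

```python
from collections import deque


def exclusive_squash(offenders, consumer_list):
    """Return the set of threads to restart under exclusive squashing.

    offenders: thread ids that directly violated a dependence (step G).
    consumer_list[c][p]: True if thread c consumed a value from thread p.
    (Hypothetical representation; index 0 is unused so ids are 1-based.)
    """
    squashed = set(offenders)
    pending = deque(squashed)
    while pending:
        producer = pending.popleft()
        # Step H: scan the column of the squashed producer for consumers.
        for consumer, row in enumerate(consumer_list):
            if row[producer] and consumer not in squashed:
                squashed.add(consumer)
                pending.append(consumer)  # its own column is checked too
    return squashed


def inclusive_squash(offenders, num_threads):
    """Return the earliest offender and all its successors (for comparison)."""
    first = min(offenders)
    return set(range(first, num_threads + 1))


# Hypothetical scenario: thread 2 is the offender, thread 4 has consumed
# a value from thread 2, and thread 3 is independent of both.
W = 4
consumer_list = [[False] * (W + 1) for _ in range(W + 1)]
consumer_list[4][2] = True

print(sorted(exclusive_squash({2}, consumer_list)))  # [2, 4]
print(sorted(inclusive_squash({2}, W)))              # [2, 3, 4]
```

In this made-up scenario the exclusive policy restarts threads 2 and 4 and leaves thread 3 running, whereas the inclusive policy restarts threads 2, 3, and 4. The sketch models only which threads are selected; it does not model the window, the version copies, or the actual restart.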
Note that, with exclusive squashing, the most-speculative pointer is not modified and bubbles are generated.

Data structures shown in the figures:
Ref. Original user data structure.
Window. Holds the state of the W slots where blocks of iterations are executed (FREE, DONE, RUNNING, SQUASHED).
Version. Stores W copies of the Ref data.

Execution example (inclusive squash)

A. Thread 2 speculatively loads element D3 from the speculative structure.
B. Thread 1 speculatively writes element D3.
C. Since a dependence violation appears, thread 2 and all its successors are squashed.
D. The most-speculative pointer is modified.

Options for handling a dependence violation:

• Restart serially. Discard the parallel work done so far and restart the loop serially [1].
• Inclusive Squashing (IS). Restart the offending thread and all its successors [2, 3].
• Exclusive Squashing (ES, our proposal). Only the offending threads and, recursively, the successors that have consumed any value generated by them are restarted.
• Perfect Squashing. Only the offending threads and, recursively, the successors that have consumed wrong values generated by them are restarted.

Results

[Figures: plots comparing IS and ES against the number of processors (0 to 16).]

Conclusions

• Exclusive squashing reduces the number of squashes by between 10% for 4 threads and 85% for 16 threads.
• Its usefulness in terms of speedup depends heavily on the cost of discarding potentially valid work in each application.
• Computational load is not high enough for the two applications with dependences considered: adding an artificial load to 2DT improves the speedup in comparison to the inclusive squashing policy.

This research is partly supported by the Ministerio de Educación y Ciencia, Spain (TIN2007-62302) and Junta de Castilla y León, Spain (VA094A08).

[Figures: IS versus ES, plotted against the number of processors (0 to 16).
Applications without dependences: Speedup: MDG-WINDOW02-BLOCK02; Speedup: TREE-WINDOW02-BLOCK02.
2D Convex Hull: Speedup, 2D-Hull, Disc, block=2500; Squashes, 2D-Hull, Disc, block=2500; Violations, 2D-Hull, Disc, block=2500.
2D Delaunay Triangulation (2DT): Speedup: Delaunay-BLOCK20-WINDOW02; Squashes: Delaunay-BLOCK20-WINDOW02; Violations: Delaunay-BLOCK20-WINDOW02.
2DT with overloaded computation: Speedup: DelaunayOverload1-BLOCK20-WINDOW02; Speedup: DelaunayOverload2-BLOCK20-WINDOW02; Speedup: DelaunayOverload3-BLOCK20-WINDOW02.]

References

[1] Rauchwerger, L., and Padua, D. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. In Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation (La Jolla, California, United States, 1995), ACM, pp. 218–232.
[2] Dang, F., Yu, H., and Rauchwerger, L. The R-LRPD test: Speculative parallelization of partially parallel loops. In Proceedings of the International Parallel and Distributed Processing Symposium (2002), IEEE, pp. 20–29.
[3] Cintra, M., and Llanos, D. R. Toward efficient and robust software speculative parallelization on multiprocessors. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (San Diego, California, USA, 2003), ACM, pp. 13–24.