
Hierarchical multithreading: programming model and system software

Parallel and Distributed …


Hierarchical Multithreading: Programming Model and System Software

Guang R. Gao (1), Thomas Sterling (2,3), Rick Stevens (4), Mark Hereld (4), Weirong Zhu (1)

(1) Department of Electrical and Computer Engineering, University of Delaware, {ggao,weirong}@capsl.udel.edu
(2) Center for Advanced Computing Research, California Institute of Technology, [email protected]
(3) Department of Computer Science, Louisiana State University, [email protected]
(4) Mathematics and Computer Science Division, Argonne National Laboratory, {stevens,hereld}@mcs.anl.gov

1-4244-0054-6/06/$20.00 ©2006 IEEE

Abstract

This paper addresses the underlying sources of performance degradation (e.g., latency, overhead, and starvation) and the obstacles to programmer productivity (e.g., explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barriers) in order to dramatically enhance the broad effectiveness of parallel processing for high-end computing (HEC). We are developing a hierarchical threaded virtual machine (HTVM) that defines a dynamic, multithreaded execution model and programming model, providing an architecture abstraction for HEC system software and tools development. We are working on a prototype language, LITL-X (pronounced "little-X"), for Latency Intrinsic-Tolerant Language, which provides application programmers with a powerful set of semantic constructs for organizing parallel computations in a way that hides/manages latency and limits the effects of overhead. This is quite different from locality management, although both strategies aim to minimize the effect of latency on the efficiency of computation. We will work on a dynamic compilation and runtime model to achieve efficient LITL-X program execution. Several adaptive optimizations will be studied, as will a methodology for incorporating domain-specific knowledge in program optimization. Finally, we plan to implement our method in an experimental testbed for an HEC architecture and perform a qualitative and quantitative evaluation on selected applications.

1 Introduction

With the rapid increase in both the scale and complexity of scientific and engineering problems, computational demands grow accordingly. Breakthrough-quality scientific discoveries and optimal engineering designs often rely on large-scale simulations on High-End Computing (HEC) systems, with performance requirements reaching petaflops and beyond. However, current HEC systems lack system software and tools optimized for the advanced scientific and engineering work of interest, and are extremely difficult to program and to port applications to. Consequently, applications rarely achieve an acceptable fraction of the peak capability of the system. To radically improve this situation, the following key features are expected to be supported in future HEC systems: (1) architecture support for coarse- and/or fine-grain multithreading at enormous scale (up to millions of threads); (2) architecture support for runtime thread migration; and (3) architecture support for a large shared address space across nodes. These features can be observed in the IBM BlueGene/L [6] and Cyclops architectures [5], Processor-In-Memory-based architectures [17], and fine-grain multithreaded architectures such as HTMT [10] and CARE [14]. In this paper, we propose a hierarchical threaded virtual machine (HTVM) that defines a dynamic, multithreaded execution model, which provides an architecture abstraction for HEC system software and tools development. A corresponding programming model will let users efficiently exploit the capabilities of the execution model. We will perform research on programming model and language issues, and on the continuous compilation and runtime software that are critical to enabling the dynamic adaptation of the HEC system.
We propose a method for incorporating and exploiting domain-expert knowledge, and a runtime performance monitoring mechanism to support this continuous compilation. Finally, we report the current status of the implementation, performance analysis, and evaluation of the proposed methods on an experimental HEC system software testbed.

2 An Overview of the Hierarchical Threaded Virtual Machine Model

This section gives our overall vision of the HEC system software/tools. A major challenge is to accommodate dynamic adaptivity in the design, due to the complex and dynamic nature of ultra-large scale HEC applications and machines. In a real HEC program execution scenario, millions of threads at various levels of the thread hierarchy may be generated and executed at different times and places in the machine. Each thread should be mapped to a suitable physical thread unit when resources become available and dependences are resolved. We identify four classes of adaptivity critical to the performance of the system:

• Loop parallelism adaptation. Scientific applications tend to have computation-intensive kernels consisting of loop nests. The exploitable parallelism in a loop nest, and the grain size of that parallelism, depend at runtime on machine resource availability and data locality, which change more drastically in a highly threaded environment with a deep memory hierarchy.

• Dynamic load adaptation. The computation load may become unbalanced, and a large number of threads may need to migrate to balance the load of the machine.

• Locality adaptation. Data objects may need to migrate, and copies may need to be generated and moved through the memory hierarchy to achieve high locality, while copy consistency is preserved.

• Latency adaptation.
The deep memory hierarchy usually found in an HEC machine makes memory access latencies vary drastically during execution, depending on the locality of references, the number of concurrent accesses, and the available memory bandwidth. The system needs to adapt dynamically to such variations.

A main task of our research is to study the key system software technologies that support the above dynamic adaptiveness of the HEC system. Fig. 1 shows our overall system software architecture. At its core is a Hierarchical Threaded Virtual Machine (HTVM) execution model that features dynamic, multi-level multithreaded execution. HTVM includes three components: a thread model, a memory model, and a synchronization model. This design focuses on adaptivity features, as discussed in detail in Section 3. The functionality of HTVM will be supported and exposed through the HTVM parallel programming language (called LITL-X), the compiler, and the runtime software. The compiler has two parts: a static part and a dynamic part. As shown in Fig. 1, the dynamic compiler is responsible for the adaptation of loop parallelism, dynamic load, locality, and latency. Since the dynamic compiler closely interacts with the runtime system and is invoked during the execution of HEC applications, its functionality extends smoothly into the runtime system as well, as indicated by the boxes that span the dynamic compiler and the runtime system. To exploit the adaptivity features of HTVM more effectively, a domain-expert knowledge base is provided. Domain-specific knowledge is expressed as scripts, which attach specific annotations to the source of the HEC applications to guide the compilation process of the static compiler.
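As a concrete illustration of the loop parallelism and dynamic load adaptation listed above, the sketch below implements a generic guided self-scheduling policy, in which each dispatched chunk is a fraction of the remaining iterations. This is our own illustrative sketch, not a description of the HTVM scheduler; the function name and the shrink factor are assumptions:

```python
def adaptive_chunks(total_iters, n_workers, min_chunk=1):
    """Guided self-scheduling sketch: each dispatched chunk covers a fixed
    fraction of the *remaining* iterations, so early chunks are large
    (low dispatch overhead) and late chunks are small (good load balance)."""
    chunks, start, remaining = [], 0, total_iters
    while remaining > 0:
        size = max(min_chunk, remaining // (2 * n_workers))
        chunks.append((start, start + size))
        start += size
        remaining -= size
    return chunks
```

With 100 iterations and 4 workers, the first chunk spans 12 iterations and chunk sizes shrink monotonically toward `min_chunk`, which is the adaptive behavior a dynamic compiler/runtime would tune at execution time.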
To assist adaptivity, a system of structured hints guides the dynamic compiler in selecting and completing the partial schedules generated by the static compiler, and in selecting runtime algorithms, based on dynamic facts, such as memory access patterns, found by a runtime performance monitor during the execution of the HEC applications. The flow of the mapping process of an HEC application under our proposed research software and tools is indicated by the large shaded arrows in Fig. 1. The components of the software infrastructure are annotated with the corresponding section numbers. The HTVM compilation and execution process is iterative, assisted by a feedback process, as shown in the figure.

[Figure 1. An Overview of the Proposed HEC Software/Tools]

3 Hierarchical Multithreading: Programming Model and System Software

3.1 A Hierarchical Threaded Virtual Machine Model

One of our primary objectives is to define the hierarchical threaded virtual machine (HTVM). We first outline our research on the HTVM execution model, which consists of a thread model, a memory model, and a synchronization model. We then outline research issues and tasks for the HTVM programming model.

3.1.1 HTVM Execution Model

A novel aspect of our HTVM model is that it provides a smooth, integrated abstraction that directly represents these thread levels in a single thread hierarchy. We target future HEC architectures, which provide rich hardware support for a hierarchy of threads at different grain levels, as discussed earlier. Intuitively, the following levels of threads are defined under HTVM.

• Large-Grain Threads (LGTs) under HTVM. Large-grain threads are a universally supported feature of many HEC architectures. These threads normally perform a substantial computation task, building up state of considerable "weight" during the course of their execution.
There is usually considerable cost associated with invoking and managing such a coarse thread, even with architectural support. Examples of LGTs are the heavyweight threads under the Cascade architecture [4], coarse-grain threads under the PERCS architecture [1], and the threads under the Cyclops-64 TiNy Threads model [7].

• Small-Grain Threads (SGTs) under HTVM. Small-grain threads are another feature of certain HEC architectures of interest to this proposal. These threads normally perform a much smaller computation task, building some state but with substantially less "weight". Therefore, the cost of their invocation and management is much lower compared with large-grain threads. Examples of SGTs are the threaded function calls under Cilk [9] and EARTH [19], parcels under HTMT [10] and Cascade [4], and the asynchronous calls being considered under PERCS [1].

• Tiny-Grain Threads (TGTs) under HTVM. Threads much lighter in weight than SGTs will be supported in some future HEC architectures. The partitioning of TGTs and their resource usage (e.g., registers) are handled by automatic thread partitioning [18]. Examples of TGTs include fibers under EARTH [19] and strands under CARE [14].

An important research task is to provide a solid definition and specification of the three levels of threads under a unified thread hierarchy. The specification needs to be general enough to capture the features of a family of future HEC architectures to ensure portability, while simple enough for the compiler and/or programmers to generate efficient code and to facilitate runtime optimization.

[Figure 2. A Case Study of the Hierarchical Thread Execution Model: Large-Scale Simulation of Brain Neuron Networks]

Our current plan is as follows: an LGT has its own private memory space, and all LGTs share a global address space. A group of SGTs invoked from an LGT sees the private memory of that LGT. An SGT invocation has its own private frame storage, where its local state is stored.
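The memory-visibility plan just stated can be made concrete with a toy object model. This is purely illustrative: the class names, the dictionary-based address spaces, and the lookup order are our own assumptions, not part of any HTVM specification:

```python
class LGT:
    """Large-grain thread: owns private memory; all LGTs see a global space."""
    global_space = {}            # shared address space across all LGTs

    def __init__(self):
        self.private = {}        # visible to this LGT and to its SGTs


class SGT:
    """Small-grain thread invocation: private frame, plus the parent's memory."""
    def __init__(self, parent_lgt):
        self.parent = parent_lgt
        self.frame = {}          # private frame storage for local state

    def load(self, key):
        # lookup order: own frame, enclosing LGT's private memory, global space
        for space in (self.frame, self.parent.private, LGT.global_space):
            if key in space:
                return space[key]
        raise KeyError(key)
```

A group of `SGT` objects sharing one `parent_lgt` models SGTs invoked from the same LGT: they see that LGT's private memory and the global space, but each keeps its own frame.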
The TGTs within an SGT share the frame storage of the enclosing SGT invocation, but may communicate efficiently through registers under compiler control. To illustrate these ideas, Figure 2 shows a mapping of the computation of a multi-level neural-system simulation onto our HTVM hierarchy of threads. We hope the figure is largely self-explanatory.

3.2 LITL-X and Parallel Programming Model

In developing the parallel programming model for HTVM, we leverage our own experience participating in recent ongoing research on new parallel programming models and languages, such as the language proposal X10 under the IBM PERCS project [2] and Chapel under the Cray Cascade project [4]. All are seeking a potential alternative programming model that is more aggressive in addressing the combined challenges of latency and overhead. To be more concrete, we are working on a prototype language, LITL-X (pronounced "little-X"), for Latency Intrinsic-Tolerant Language, which provides application programmers with a powerful set of semantic constructs for organizing parallel computations in a way that hides/manages latency and limits the effects of overhead. This is quite different from locality management, although both strategies aim to minimize the effect of latency on the efficiency of computation. Locality management attempts to avoid latency events by aggregating data for local computation and reducing large message communications; latency management attempts to hide latency by overlapping communication with computation. LITL-X will incorporate the following classes of parallel constructs for latency tolerance and overhead reduction:

• Coarse-grain multithreading, with thread context-switching built into the application's instruction stream (rather than into the operating system) to keep the processors busy in the presence of remote requests. This corresponds to the LGT level under HTVM.
• Parcel (intelligent-message)-driven split-transaction computation [17], to reduce communication and to enable moving the work to the data (when it makes sense). This corresponds to the SGT level under HTVM.

• Futures [11] for eager producer-consumer computing, with efficient localized buffering of requests at the site of the needed values. This corresponds to the TGT level under HTVM.

• Percolation [12] of program instruction blocks and data to the site of the intended computation, to eliminate waiting for remote accesses, determined at run time prior to actual block execution.

• Synchronization constructs for dataflow-style operations, as well as atomic blocks of memory operations.

3.3 System Software: Compiler and Runtime Solutions

In this section, we describe how the compiler and runtime software address the challenges of efficient execution under the HTVM model. Our solution moves from static analysis and optimization toward a hybrid scheme combining static compilation and runtime adaptation. The compiler and the runtime system software are intimately connected under our adaptive/continuous compilation strategy, where some key functions of the runtime system software can also be viewed as an extension of the compiler. As mentioned in Section 2, the HTVM system software addresses four types of runtime adaptation: loop parallelism adaptation, dynamic load adaptation, locality adaptation, and latency adaptation. In this paper, we take loop multithreading and parallelism adaptation as an example to illustrate how the system software should be designed. Scientific applications rely heavily on loop nests to compute their results; often, more than 90% of the execution time is spent in computation-intensive kernels composed of loop nests. It is therefore extremely important to schedule these loops effectively to improve the overall performance of the application.
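Of the LITL-X constructs listed in Section 3.2, the future is the easiest to sketch in isolation. The fragment below is a minimal, generic future built on standard threading primitives; it illustrates only the eager producer-consumer pattern, and none of the names are LITL-X syntax:

```python
import threading

class Future:
    """Minimal future: the producer runs eagerly; a consumer blocks only
    at the point where the value is actually needed."""
    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def set(self, value):
        self._value = value
        self._ready.set()

    def get(self):
        self._ready.wait()       # latency is paid only here, if at all
        return self._value


def spawn(fn, *args):
    """Start fn eagerly in its own thread and return a Future for its result."""
    fut = Future()
    threading.Thread(target=lambda: fut.set(fn(*args)), daemon=True).start()
    return fut
```

The consumer overlaps its own work with the producer's; if the value arrives before `get()` is called, the latency of the remote computation is fully hidden.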
Loop scheduling on a parallel distributed system can be broadly divided into two classes: static and dynamic scheduling. Static scheduling tends to cause load imbalance, since the exploitable parallelism, and the grain size of that parallelism, vary with machine resource availability, data distribution, and memory access latency, especially in the context of highly dynamic, threaded HEC machines. Consequently, dynamic scheduling has been developed and has shown promising performance improvements. Dynamic loop scheduling methods, however, target only Thread-Level Parallelism (TLP). In contrast, another important technique, software pipelining, aims to exploit Instruction-Level Parallelism (ILP) from loops. Software pipelining is one of the most widely and successfully used loop parallelization techniques for existing microprocessor architectures (e.g., VLIW or superscalar architectures) [13]. Traditionally, software pipelining is applied mainly to the innermost loop of a given loop nest. Recently we introduced a new approach, called Single-dimension Software Pipelining (SSP) [16], to software-pipeline a loop nest at an arbitrary loop level with desirable optimization objectives such as data locality and/or parallelism. The SSP method has been successfully tested on a uniprocessor architecture (the Intel IA-64) and shows significant performance improvement. In this research, we will further extend SSP from single-processor, single-thread environments to multiprocessor, multithreaded environments, by combining the strengths of software pipelining (a static scheduling technique) and dynamic scheduling. The basic approach is as follows: first, choose the most profitable loop level [16], which may contain its own inner loops and therefore be a loop nest itself. This loop level is software pipelined first.
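The effect of software pipelining the chosen loop level can be visualized with a toy modulo schedule: with initiation interval II, stage s of iteration i issues at cycle i*II + s, so successive iterations overlap in time. This is a generic illustration of software pipelining, not the SSP algorithm of [16] itself:

```python
def modulo_schedule(n_iters, n_stages, ii=1):
    """Map each (iteration, stage) pair to its issue cycle: cycle = i*II + s.
    At steady state, one new iteration starts every II cycles while earlier
    iterations execute their later stages in parallel."""
    schedule = {}                       # cycle -> [(iteration, stage), ...]
    for i in range(n_iters):
        for s in range(n_stages):
            schedule.setdefault(i * ii + s, []).append((i, s))
    return schedule
```

For four iterations of a three-stage body with II = 1, cycle 2 runs stage 2 of iteration 0, stage 1 of iteration 1, and stage 0 of iteration 2 simultaneously, which is exactly the overlap that exposes ILP across iterations.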
After that, the software-pipelined code is partitioned into threads, each thread composed of several iterations of the selected loop level. The approach is unique in that it exploits instruction-level and thread-level parallelism simultaneously. There are several issues we need to study: (1) What is the performance and cost model for such a partitioning of the software-pipelined code into threads? (2) How do we integrate this approach with runtime optimization? Software pipelining uses a machine resource model, including the memory access latencies, to schedule the loop; the available resources and actual memory access latencies, however, are runtime-dependent in an HEC machine, as explained before. (3) What semantic constructs can be provided in LITL-X specifically for SSP and multithreading? For example, a pragma may be provided to indicate the most beneficial loop level, or to indicate scheduling strategies. The static compiler acts according to the pragma and generates (partial) schedules, storing the pragma as a structured hint in an appropriate format if it depends on runtime statistics.

4 Efficient Interaction between Applications and System Software

In this section, we outline solution strategies for efficient interaction between applications and system software. There are two important aspects: the first is developing the methods, models, and tools to facilitate mapping complex domain application codes to the HTVM model; the second is developing a monitoring methodology and interface that will help the adaptive compiler and runtime system optimize execution and resource utilization on the fly.

4.1 Domain-Specific Knowledge Input to System Software

Today, it is a widely accepted belief that some ultra-large scale scientific applications targeted by HEC machines are so complex that efficiently mapping them to the architecture will require application scientists/programmers who are domain experts.
The gap between expressions of domain-specific computations and expressions tailored to efficient execution on a given system architecture is widening. Domain experts are, and will continue to be, challenged to write high-performance codes.

HTVM embodies a model that we believe effectively expresses the low-level idioms and interfaces required by future ultra-large scale computing platforms. But programming at the HTVM level will require expertise well outside the domain of typical application specialists. To bridge this gap with as little compromise as possible, we present a layered architecture that we will implement manually first, with an eye toward later automation as relevant technologies mature. Fig. 3 illustrates the basic idea as it has evolved in the context of the EARTH project [19] and a particular application: a simulation of electrical activity in the neocortex. With the assistance of the domain-specific knowledge embedded in the neocortex simulation, a test model of the PGENESIS neocortex is designed, and the code mapping and optimization on the EARTH base programming model and runtime system software are guided by the domain experts' knowledge.

[Figure 3. Mapping pNeocortex to system software]

Fig. 3 also shows the progression from a domain-specific script-based description of a simulation to HTVM code. The domain expert's knowledge is built into the script language and/or idiomatic modules that augment the script or other programming language. Pseudo-code distills the simulation down to its key structural and computational components, and includes hints to be used to guide optimization. This code is then translated to run on the HTVM. The resulting code is ready for compilation and execution.

Guidance from the application programmer, and more generally from the domain-specific idioms and algorithms used explicitly or implicitly by the application programmer, must be passed to the adaptive compiler, runtime system, and monitoring system to enable them to optimize the execution of the code efficiently. We plan to define and implement a system of structured hints to capture and apply the combined expertise of the domain specialist and the compiler. Our notion of structured hints embodies the idea that the compiler and the domain expert can collaborate to reduce the number of possible optimization strategies to a modest and manageable set of options most likely to produce high-performance code in the context of a complex and adaptive system architecture.

• The compiler will identify points in the code that present the potential for optimization, but for which it has insufficient information to proceed on its own.

• The domain expert, guided by the structured list of opportunities generated by the compiler, will add priorities and rules to this list that will help the compiler and runtime streamline code execution.

The resulting organized and expertly culled guide to optimization, the structured hints, includes data structures, dependencies, weights, and rules. In addition to focusing the compiler's attempts at optimization, the structured hints will be an integrated part of our Program/Execution Knowledge Database, providing the runtime system with an informed and tailored set of options around which to make its choices. Each hint can be expressly targeted at some part of the execution model: the adaptive compiler, the runtime system, or the monitoring system. For example, informed choices about which pieces of the code to instrument, and how, will become part of the metric suite used by the adaptive compilation and runtime system to adjust resource allocation and compilation strategy during execution. As another example, the domain expert can identify critical parameters to be adjusted by the compiler for its adaptive optimizations, thereby narrowing the parameter space to be searched.

Without reference to the underlying hardware architecture, or even to the HTVM software architecture, the hints must address, in a general way, issues of: 1) data locality, 2) monitoring priorities, 3) data access patterns, and 4) computation patterns. These will be mapped directly to specific actions, weighting schemes, and optimization strategies in the HTVM system software.

4.2 Monitoring of Application Execution

The adaptive compilation and runtime system will require feedback derived from execution and resource-allocation monitoring. The hints discussed in the previous subsection will drive both static and dynamic optimizations of the program execution. In the dynamic context, they will provide the system with guidance on the degrees of freedom most likely to affect performance, on likely bottlenecks in the code, and on unpredictable aspects of data locality and computational work patterns, so that monitoring resources can be steered to develop heuristic models. Our plan is to implement a hint schema fed by the application programming and domain-expert interactions. It will be used by various stages of the code translation process, the HTVM system, and the runtime monitoring system.

5 Infrastructure and Experimentation Plan and Status

In this section, we report the current status of the implementation and experimentation of the system software and tools.

5.1 The Infrastructure of an Experimental Testbed

For the implementation and experimental study of the HTVM system software, we continue to develop and refine our software infrastructures. As the starting point, we chose the system software infrastructure for the IBM Cyclops-64 cellular architecture, a petaflops chip-multithreaded supercomputing architecture under development at the IBM Research Laboratory, featuring 160 thread units and a number of memory modules interconnected by a high-speed on-chip interconnection network [8]. We will leverage a system software infrastructure and tool-chain for this architecture being developed jointly by ETI and CAPSL at the University of Delaware. The system software infrastructure includes a runtime system for a threaded virtual machine at the LGT level, a thread communication and synchronization library, an OpenMP compiler and runtime system, a function-accurate simulator, a cycle-accurate simulator, a GCC-based compiler, the binary utilities (assembler, linker, etc.), and the libraries (e.g., libc/libm). We also continue to develop the EARTH and CARE software infrastructures at the SGT and TGT levels. We have re-targeted the Open64 compilation infrastructure from the 64-bit Intel IA-64 architecture to the 32-bit Intel XScale embedded architecture [3], and recently we have successfully implemented SSP scheduling [16], register allocation, and code generation [15] in this compiler. Based on the above infrastructures, we are constructing an experimental testbed in the following way. First, we are modifying the current virtual machine for large-grain threads under the C64 software infrastructure to implement the HTVM small-grain and tiny-grain threads. Second, we are implementing runtime system software support for both, leveraging our experience with the EARTH and CARE software infrastructures. Third, we are extending and modifying the function-accurate simulator to include support for the relevant architecture features. We are also extending the above runtime system, compiler, and simulator to implement the algorithms developed during this research for continuous compilation and runtime optimization, as introduced in Section 3.3.

5.2 Experimentation Plan and Status

The primary goal of the experimentation is to validate the proposed HEC system software and tools in addressing the needs of selected HEC applications.
Have we created a practical methodology, ultimately amenable to automation, that enables efficient application programming by domain experts while producing codes that perform well? We have selected two codes for our study: a computational neuroscience code, which simulates large networks of biological neurons, and a fine-grain molecular dynamics code, which simulates relatively modest-sized molecules, such as a single protein or protein complex in water with multiple ion species. Both codes are representative HEC applications and are therefore important for the research and development of the HTVM execution model, programming model, and system software. Our proposed experimental methodology comprises the five major tasks enumerated below. We will use the neuroscience code to blaze the trail, and follow the task list for the molecular dynamics code with a delay; in this way, we will begin the development and testing of the process and tool set with one code, and use the second code to validate the process and provide opportunity for refinement.

• Instrument and characterize the application codes on existing machines to establish baseline performance properties.

• Develop performance models for each code in terms of the proposed HTVM model.

• Develop a new implementation of each code under the proposed HTVM model, using the application mapping methodology.

• Validate on the simulation testbed.

• Project the performance and impact of the proposed new HEC software and tools.

6 Acknowledgments

We acknowledge the support of the National Science Foundation (CNS-0509332), Laboratory Director Research and Development funding, IBM, ETI, and other government sponsors. We acknowledge Hongbo Rong, who has been instrumental in the development and documentation of some key ideas in this paper. We would also like to acknowledge other members of the CAPSL group, who provide a stimulating environment for scientific discussions and collaborations, in particular Ziang Hu, Juan del Cuvillo, and Ge Gan.
References

[1] DARPA: High Productivity Computing Systems (HPCS).
[2] IBM: PERCS (Productive, Easy-to-use, Reliable Computing System).
[3] Kylin C Compiler.
[4] D. Callahan, B. L. Chamberlain, and H. P. Zima. The Cascade high productivity language. In Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments, Santa Fe, New Mexico, April 2004.
[5] C. Cascaval, J. Castanos, L. Ceze, M. Denneau, M. Gupta, D. Lieber, J. Moreira, K. Strauss, and H. S. Warren. Evaluation of a multithreaded architecture for cellular computing. In Proceedings of the 8th International Symposium on High Performance Computer Architecture (HPCA), Boston, Massachusetts.
[6] K. Davis, A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, S. Pakin, and F. Petrini. A performance and scalability analysis of the BlueGene/L architecture. In Proceedings of SC2004: High Performance Networking and Computing, Pittsburgh, PA, November 2004. ACM SIGARCH and IEEE Computer Society.
[7] J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao. TiNy Threads: A thread virtual machine for the Cyclops64 cellular architecture. In Fifth Workshop on Massively Parallel Processing, in conjunction with the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), page 265, Denver, Colorado, USA, April 2005.
[8] J. B. del Cuvillo, Z. Hu, W. Zhu, F. Chen, and G. R. Gao. Toward a software infrastructure for the Cyclops64 cellular architecture. CAPSL Technical Memo 55, April 2004.
[9] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 212–223, 1998.
[10] G. Gao, K. Theobald, A. Marquez, and T. Sterling. The HTMT program execution model. CAPSL Technical Memo 09, July 1997.
[11] R. H. Halstead, Jr. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501–538, October 1985.
[12] A. Jacquet, V. Janot, R. Govindarajan, C. Leung, G. Gao, and T. Sterling. Executable performance model and evaluation of high performance architectures with percolation. Technical Report 43, Newark, DE, November 2002.
[13] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318–328, Atlanta, Georgia, June 1988. SIGPLAN Notices, 23(7), July 1988.
[14] A. Marquez and G. R. Gao. CARE: Overview of an adaptive multithreaded architecture. In Fifth International Symposium on High Performance Computing (ISHPC-V), Tokyo, Japan, October 2003.
[15] H. Rong, A. Douillet, R. Govindarajan, and G. R. Gao. Code generation for single-dimension software pipelining of multi-dimensional loops. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO), pages 175–186, Palo Alto, California, March 2004.
[16] H. Rong, Z. Tang, R. Govindarajan, A. Douillet, and G. R. Gao. Single-dimension software pipelining for multi-dimensional loops. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO), pages 163–174, Palo Alto, California, March 2004. Best Paper Award.
[17] T. Sterling. An introduction to the Gilgamesh PIM architecture. Lecture Notes in Computer Science, 2150, 2001.
[18] X. Tang and G. R. Gao. Automatically partitioning threads based on remote paths. Technical Report 23, Newark, DE, July 1998.
[19] K. B. Theobald. EARTH: An Efficient Architecture for Running Threads. PhD thesis, McGill University, May 1999.