Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06)
We have been developing a large-scale PC cluster named PACS-CS (Parallel Array Computer System for Computational Sciences) at the Center for Computational Sciences, University of Tsukuba, for a wide variety of computational science applications such as computational physics, computational materials science, and computational biology. We consider the most important issue on a computation node to be memory access bandwidth, so each node is equipped with a single CPU, unlike ordinary high-end PC clusters. The interconnection network for parallel processing is configured as a multi-dimensional Hyper-Crossbar Network based on trunking of Gigabit Ethernet links, to support large-scale scientific computation with physical space modeling. Based on this concept, we are developing an original motherboard that configures a single-CPU node with 8 ports of Gigabit Ethernet and fits in half the width of a 19-inch rack-mountable 1U platform. In a preliminary performance evaluation, we confirmed that the computation part of a practical Lattice QCD code will be able to achieve 30% of peak performance, and that up to 600 Mbyte/s of bandwidth will be achieved for single-direction neighboring communication. PACS-CS will start operation in July 2006 with 2560 CPUs and 14.3 Tflops of peak performance.
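As a rough sanity check on the figures quoted in this abstract, the per-node peak and the expected sustained Lattice QCD rate follow directly from the stated numbers. This is only a back-of-the-envelope sketch using the 2560 CPUs, 14.3 Tflops peak, and 30% efficiency given above; no other machine parameters are assumed.

```python
# Back-of-the-envelope check of the PACS-CS figures quoted in the abstract.
# All inputs come from the abstract; the derived numbers are approximations.

total_peak_tflops = 14.3      # system peak performance
num_cpus = 2560               # one CPU per node
qcd_efficiency = 0.30         # ~30% of peak expected for the Lattice QCD kernel

per_node_peak_gflops = total_peak_tflops * 1e3 / num_cpus
sustained_total_tflops = total_peak_tflops * qcd_efficiency

print(f"peak per node : {per_node_peak_gflops:.2f} Gflops")    # ~5.59 Gflops
print(f"sustained QCD : {sustained_total_tflops:.2f} Tflops")  # ~4.29 Tflops
```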
International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems
The performance gap between processor and memory is a very serious problem in high-performance computing because effective performance is limited by memory capability. In order to overcome this problem, it is indispensable to make good use of the wide on-chip memory bandwidth. For this purpose, architecture and compiler co-optimization is a promising approach, because most data accesses in high-performance computing are regular and/or predictable. Thus, we propose a new VLSI architecture called SCIMA as a platform for this co-optimization. SCIMA integrates software-controllable memory (SCM) into the processor chip in addition to an ordinary data cache. SCM and cache can be reconfigured by software during computation; hence, the memory hierarchy itself becomes a target of compiler optimization. In this sense, architecture and compiler co-optimization is realized in SCIMA. Toward this co-optimization, we have developed a directive-based compiler and an algorithm for SCM usage that inserts directives automatically. In this paper, we present the directives and an outline of the algorithm for automatic optimization.
Proceedings of ASP-DAC '97: Asia and South Pacific Design Automation Conference, 1997
In order to design advanced processors in a short time, designers must simulate their designs and reflect the results back into the designs at the very early stages. However, conventional hardware description languages (HDLs) do not have enough ability to describe designs easily and accurately at these stages. We have therefore proposed a new hardware description language, AIDL. In this paper, in order to evaluate the effectiveness of AIDL, we describe and compare three processors in AIDL and VHDL descriptions.
2012 41st International Conference on Parallel Processing Workshops, 2012
In this paper, we propose a solution framework to enable work sharing of parallel processing by the coordination of CPUs and GPUs on hybrid PC clusters, based on the high-level parallel language XcalableMP-dev. Basic XcalableMP enables high-level parallel programming using directives on sequential code that support data distribution and loop/task distribution among multiple nodes of a PC cluster. XcalableMP-dev is an extension of XcalableMP for hybrid PC clusters, where each node is equipped with accelerated computing devices such as GPUs or many-core processors. Our new framework, named XcalableMP-dev/StarPU, enables the distribution of data and loop execution among multiple GPUs and multiple CPU cores on each node. We employ the StarPU run-time system for task management with dynamic load balancing. Because of the large performance gap between CPUs and GPUs, the key issue for work sharing among CPU and GPU resources is controlling the task size assigned to the different devices. Since the compiler of the new system is still under construction, we evaluated the performance of hybrid work sharing on four nodes of a GPU cluster and confirmed that the performance gain over the traditional XcalableMP-dev system on NVIDIA CUDA, which uses GPU-only execution, is up to a factor of 1.4.
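The central idea above, splitting a parallel loop between CPU cores and GPUs according to their relative speed, can be illustrated with a minimal sketch. The split ratio, function names, and static partitioning below are illustrative assumptions only; the real system uses XcalableMP-dev directives and StarPU's dynamic scheduler rather than anything shown here.

```python
# Illustrative sketch of CPU/GPU work sharing by task-size control.
# The split ratio and worker functions are hypothetical placeholders.

def split_iterations(n, gpu_ratio):
    """Split a loop of n iterations: the GPU side gets gpu_ratio of them."""
    n_gpu = int(n * gpu_ratio)
    return range(0, n_gpu), range(n_gpu, n)

def run_on_gpu(iters):   # placeholder for a GPU kernel launch
    return sum(i * i for i in iters)

def run_on_cpu(iters):   # placeholder for multi-threaded CPU work
    return sum(i * i for i in iters)

# A GPU roughly 4x faster than the CPU cores suggests a ~0.8 ratio; in
# practice the ratio would be tuned or balanced dynamically at run time.
gpu_part, cpu_part = split_iterations(1_000_000, gpu_ratio=0.8)
total = run_on_gpu(gpu_part) + run_on_cpu(cpu_part)
print(total)
```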
Proceedings High Performance Computing on the Information Superhighway. HPC Asia '97
CP-PACS (Computational Physics by Parallel Array Computer System) is a massively parallel processing system with 2048 node processors for large-scale scientific calculations. Each node processor of CP-PACS has a special hardware feature called PVP-SW (Pseudo Vector Processor based on Slide Window), which realizes efficient vector processing on a superscalar processor without depending on the cache. The ...
The CP-PACS is a massively parallel MIMD computer with a theoretical peak speed of 614 GFLOPS, developed for computational physics applications at the University of Tsukuba, Japan. We report on the performance of the CP-PACS computer measured during recent production runs using our Quantum Chromodynamics code for the simulation of quarks and gluons in particle physics. With the full 2048 processing nodes, our code shows a sustained speed of 237.5 GFLOPS for the heat-bath update of gluon variables, 264.6 GFLOPS for the over-relaxation update, and 325.3 GFLOPS for quark matrix inversion with an even-odd preconditioned minimal residual algorithm.
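For context, the sustained figures above correspond to the following fractions of the 614 GFLOPS theoretical peak. This is simple arithmetic on the numbers quoted in the abstract, nothing more.

```python
# Efficiency of the reported QCD kernels relative to the 614 GFLOPS peak.
peak_gflops = 614.0
sustained = {
    "heat-bath gluon update": 237.5,
    "over-relaxation update": 264.6,
    "quark matrix inversion (MR)": 325.3,
}
for kernel, gflops in sustained.items():
    print(f"{kernel}: {gflops / peak_gflops:.1%} of peak")
# roughly 38.7%, 43.1%, and 53.0% respectively
```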
In this paper, we propose a high-performance processing unit for multiprocessor systems for scientific calculations. This processing unit is called IMPULSE. IMPULSE is equipped with a hardware process control mechanism, and a powerful floating-point processor and its controller. The process control method is based on the concurrent process model called the NC model. In the NC model, the processes and their communication channels are static, and it is relatively easy to implement the interprocess communication server and process scheduler in hardware according to this model. To enhance system performance, IMPULSE is composed of three parts: the TASK ENGINE, the IPC ENGINE, and the FPP ENGINE. From the results of simulations, it appears that the IPC ENGINE provides efficient process control even if the granularity of the processes is very fine.
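A minimal sketch of the property the abstract relies on: because every process and every channel is fixed in advance, the communication topology reduces to a static table, which is what makes a hardware IPC server and scheduler tractable. The process names and table layout below are illustrative assumptions, not the actual NC model specification or IMPULSE hardware interface.

```python
# Illustrative static process/channel table in the spirit of the NC model:
# all processes and channels are known up front, so routing and scheduling
# need no dynamic allocation. Names and structure here are hypothetical.

PROCESSES = ("producer", "filter", "consumer")

# Each channel is a fixed (sender, receiver) pair; nothing is created at run time.
CHANNELS = {
    "c0": ("producer", "filter"),
    "c1": ("filter", "consumer"),
}

def route(channel_id):
    """A hardware IPC server could resolve this with a simple table lookup."""
    sender, receiver = CHANNELS[channel_id]
    return sender, receiver

for cid in CHANNELS:
    print(cid, "->", route(cid))
```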
8th International Symposium on Parallel Architectures,Algorithms and Networks (ISPAN'05)
In the present paper, we propose a practical and feasible method to construct a VLAN-based fat-tree network that provides wide bisection bandwidth on a tree network built with inexpensive Layer-2 Gigabit Ethernet switches for high-performance PC clusters. ...
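To see why a fat tree widens bisection bandwidth compared with a plain tree, the following sketch compares the two for an assumed small configuration. The switch counts, uplink counts, and the idea of adding uplinks via VLAN multipath are assumptions for illustration; the abstract only states the goal of wide bisection bandwidth over commodity Gigabit Ethernet switches.

```python
# Illustration of bisection bandwidth: plain tree vs. fat tree.
# Switch and uplink counts are assumed for this example only.

link_gbps = 1.0            # Gigabit Ethernet link
edge_switches = 8          # assumed number of edge switches

uplinks_plain = 1          # plain tree: one uplink per edge switch
uplinks_fat = 4            # fat tree: multiple uplinks (e.g. VLAN-based multipath)

plain_bisection = edge_switches / 2 * uplinks_plain * link_gbps
fat_bisection = edge_switches / 2 * uplinks_fat * link_gbps

print(f"plain tree bisection : {plain_bisection:.0f} Gbps")  # 4 Gbps
print(f"fat tree bisection   : {fat_bisection:.0f} Gbps")    # 16 Gbps
```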
2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing, 2010
We have proposed a power-aware, high-performance, dependable communication link using PCI Express as a direct communication device, referred to as PEARL, for application in a wide range of parallel processing systems, from high-end embedded systems to small-scale high-performance clusters. In the present study, we describe the structure and function of a communicator chip, referred to as the PEACH chip, for realizing PEARL. The PEACH chip connects four ports of PCI Express Gen 2 with four lanes, and uses an M32R processor with four cores and several DMACs. We also developed the PEACH board as the network interface card implementing the PEACH chip. The PEACH board provides a power-aware, dependable communication link with a theoretical peak bandwidth of 2 Gbytes/s per link.
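The quoted 2 Gbytes/s per link is consistent with a x4 PCI Express Gen 2 port. The sketch below reconstructs that figure from the standard Gen 2 signaling rate and 8b/10b encoding; the assumption that the four lanes belong to a single port is implied by the 2 Gbytes/s figure but not stated explicitly in the abstract.

```python
# Theoretical peak of a x4 PCI Express Gen 2 link, matching the 2 Gbytes/s
# quoted in the abstract. Assumes the four lanes form a single port.

gen2_gtps_per_lane = 5.0      # 5 GT/s per lane for PCIe Gen 2
encoding_efficiency = 8 / 10  # 8b/10b encoding overhead
lanes = 4

# GT/s * efficiency gives Gbit/s of payload per lane; divide by 8 for Gbytes/s.
gbytes_per_s = gen2_gtps_per_lane * encoding_efficiency * lanes / 8
print(f"{gbytes_per_s:.1f} Gbytes/s per direction")   # 2.0 Gbytes/s
```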
Evolving OpenMP in an Age of Extreme Parallelism, 2009
Recently, the use of embedded systems with complicated functions, such as digital home appliances and car navigation systems, has become widespread. These systems require increasingly higher performance as the functionality of the user interface grows and ...
... Laboratory, Livermore, California 94550; Toshio Kawai, Department of Physics, Keio University, Yagami Campus, Yokohama 223, Japan; Brad Lee Holian ... In general, individual thermostat temperatures and relaxation times can be imposed on selected subsets of the ...