Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06)
We have been developing a large-scale PC cluster named PACS-CS (Parallel Array Computer System for Computational Sciences) at the Center for Computational Sciences, University of Tsukuba, for a wide variety of computational science applications such as computational physics, computational materials science, and computational biology. We consider the most important issue on a computation node to be memory access bandwidth, so each node is equipped with a single CPU, unlike ordinary high-end PC clusters. The interconnection network for parallel processing is configured as a multi-dimensional Hyper-Crossbar Network based on trunking of Gigabit Ethernet links, to support large-scale scientific computation with physical space modeling. Based on this concept, we are developing an original motherboard that configures a single-CPU node with 8 ports of Gigabit Ethernet and fits in half the width of a 19-inch rack-mountable 1U platform. In a preliminary performance evaluation, we confirmed that the computation part of a practical Lattice QCD code will be able to achieve 30% of peak performance, and that up to 600 Mbyte/s of bandwidth will be achieved for single-direction neighboring communication. PACS-CS will start operation in July 2006 with 2560 CPUs and 14.3 Tflops of peak performance.
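As a rough sanity check on the figures quoted in this abstract, the per-node peak and the expected sustained Lattice QCD rate follow directly from the stated numbers. This is only a back-of-the-envelope sketch using the 2560 CPUs, 14.3 Tflops peak, and 30% efficiency given above; no other machine parameters are assumed.

```python
# Back-of-the-envelope check of the PACS-CS figures quoted in the abstract.
# All inputs come from the abstract; the derived numbers are approximations.

total_peak_tflops = 14.3      # system peak performance
num_cpus = 2560               # one CPU per node
qcd_efficiency = 0.30         # ~30% of peak expected for the Lattice QCD kernel

per_node_peak_gflops = total_peak_tflops * 1e3 / num_cpus
sustained_total_tflops = total_peak_tflops * qcd_efficiency

print(f"peak per node : {per_node_peak_gflops:.2f} Gflops")    # ~5.59 Gflops
print(f"sustained QCD : {sustained_total_tflops:.2f} Tflops")  # ~4.29 Tflops
```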
International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems
The performance gap between processor and memory is a very serious problem in high-performance computing because effective performance is limited by memory capability. In order to overcome this problem, it is indispensable to make good use of the wide on-chip memory bandwidth. For this purpose, architecture and compiler co-optimization is a promising approach, because most data accesses in high-performance computing are regular and/or predictable. Thus, we propose a new VLSI architecture called SCIMA as a platform for this co-optimization. SCIMA integrates software-controllable memory (SCM) into the processor chip in addition to an ordinary data cache. SCM and cache can be reconfigured by software during computation; hence, the memory hierarchy itself becomes a target of compiler optimization. In this sense, architecture and compiler co-optimization is realized in SCIMA. Toward this co-optimization, we have developed a directive-based compiler and an algorithm for SCM usage that inserts directives automatically. In this paper, we present the directives and an outline of the algorithm for automatic optimization.
Proceedings of ASP-DAC '97: Asia and South Pacific Design Automation Conference, 1997
In order to design advanced processors in a short time, designers must simulate their designs and reflect the results back into the designs at the very early stages. However, conventional hardware description languages (HDLs) do not have enough ability to describe designs easily and accurately at these stages. We have therefore proposed a new hardware description language, AIDL. In this paper, in order to evaluate the effectiveness of AIDL, we describe and compare three processors in AIDL and VHDL descriptions.
2012 41st International Conference on Parallel Processing Workshops, 2012
In this paper, we propose a solution framework to enable work sharing of parallel processing by the coordination of CPUs and GPUs on hybrid PC clusters, based on the high-level parallel language XcalableMP-dev. Basic XcalableMP enables high-level parallel programming using directives on sequential code that support data distribution and loop/task distribution among multiple nodes of a PC cluster. XcalableMP-dev is an extension of XcalableMP for hybrid PC clusters, where each node is equipped with accelerated computing devices such as GPUs or many-core processors. Our new framework, named XcalableMP-dev/StarPU, enables the distribution of data and loop execution among multiple GPUs and multiple CPU cores on each node. We employ the StarPU run-time system for task management with dynamic load balancing. Because of the large performance gap between CPUs and GPUs, the key issue for work sharing among CPU and GPU resources is controlling the task size assigned to the different devices. Since the compiler of the new system is still under construction, we evaluated the performance of hybrid work sharing on four nodes of a GPU cluster and confirmed that the performance gain over the traditional XcalableMP-dev system on NVIDIA CUDA, which uses GPU-only execution, is up to a factor of 1.4.
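The central idea above, splitting a parallel loop between CPU cores and GPUs according to their relative speed, can be illustrated with a minimal sketch. The split ratio, function names, and static partitioning below are illustrative assumptions only; the real system uses XcalableMP-dev directives and StarPU's dynamic scheduler rather than anything shown here.

```python
# Illustrative sketch of CPU/GPU work sharing by task-size control.
# The split ratio and worker functions are hypothetical placeholders.

def split_iterations(n, gpu_ratio):
    """Split a loop of n iterations: the GPU side gets gpu_ratio of them."""
    n_gpu = int(n * gpu_ratio)
    return range(0, n_gpu), range(n_gpu, n)

def run_on_gpu(iters):   # placeholder for a GPU kernel launch
    return sum(i * i for i in iters)

def run_on_cpu(iters):   # placeholder for multi-threaded CPU work
    return sum(i * i for i in iters)

# A GPU roughly 4x faster than the CPU cores suggests a ~0.8 ratio; in
# practice the ratio would be tuned or balanced dynamically at run time.
gpu_part, cpu_part = split_iterations(1_000_000, gpu_ratio=0.8)
total = run_on_gpu(gpu_part) + run_on_cpu(cpu_part)
print(total)
```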
Proceedings High Performance Computing on the Information Superhighway. HPC Asia '97
CP-PACS (Computational Physics by Parallel Array Computer System) is a massively parallel processing system with 2048 node processors for large-scale scientific calculations. Each node processor of CP-PACS has a special hardware feature called PVP-SW (Pseudo Vector Processor based on Slide Window), which realizes efficient vector processing on a superscalar processor without depending on the cache. The ...
The CP-PACS is a massively parallel MIMD computer with a theoretical peak speed of 614 GFLOPS, developed for computational physics applications at the University of Tsukuba, Japan. We report on the performance of the CP-PACS computer measured during recent production runs using our Quantum Chromodynamics code for the simulation of quarks and gluons in particle physics. With the full 2048 processing nodes, our code shows a sustained speed of 237.5 GFLOPS for the heat-bath update of gluon variables, 264.6 GFLOPS for the over-relaxation update, and 325.3 GFLOPS for quark matrix inversion with an even-odd preconditioned minimal residual algorithm.
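For context, the sustained figures above correspond to the following fractions of the 614 GFLOPS theoretical peak. This is simple arithmetic on the numbers quoted in the abstract, nothing more.

```python
# Efficiency of the reported QCD kernels relative to the 614 GFLOPS peak.
peak_gflops = 614.0
sustained = {
    "heat-bath gluon update": 237.5,
    "over-relaxation update": 264.6,
    "quark matrix inversion (MR)": 325.3,
}
for kernel, gflops in sustained.items():
    print(f"{kernel}: {gflops / peak_gflops:.1%} of peak")
# roughly 38.7%, 43.1%, and 53.0% respectively
```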
In this paper, we propose a high-performance processing unit for multiprocessor systems for scientific calculations. This processing unit is called IMPULSE. IMPULSE is equipped with a hardware process control mechanism, and a powerful floating-point processor and its controller. The process control method is based on the concurrent process model called the NC model. In the NC model, the processes and their communication channels are static, and it is relatively easy to implement the interprocess communication server and process scheduler in hardware according to this model. To enhance system performance, IMPULSE is composed of three parts: the TASK ENGINE, the IPC ENGINE, and the FPP ENGINE. From the results of simulations, it appears that the IPC ENGINE provides efficient process control even if the granularity of the processes is very fine.
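A minimal sketch of the property the abstract relies on: because every process and every channel is fixed in advance, the communication topology reduces to a static table, which is what makes a hardware IPC server and scheduler tractable. The process names and table layout below are illustrative assumptions, not the actual NC model specification or IMPULSE hardware interface.

```python
# Illustrative static process/channel table in the spirit of the NC model:
# all processes and channels are known up front, so routing and scheduling
# need no dynamic allocation. Names and structure here are hypothetical.

PROCESSES = ("producer", "filter", "consumer")

# Each channel is a fixed (sender, receiver) pair; nothing is created at run time.
CHANNELS = {
    "c0": ("producer", "filter"),
    "c1": ("filter", "consumer"),
}

def route(channel_id):
    """A hardware IPC server could resolve this with a simple table lookup."""
    sender, receiver = CHANNELS[channel_id]
    return sender, receiver

for cid in CHANNELS:
    print(cid, "->", route(cid))
```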
8th International Symposium on Parallel Architectures,Algorithms and Networks (ISPAN'05)
In the present paper, we propose a practical and feasible method to construct a VLAN-based fat-tree network that provides wide bisection bandwidth on a tree network built with inexpensive Layer-2 Gigabit Ethernet switches for high-performance PC clusters. ...
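To see why a fat tree widens bisection bandwidth compared with a plain tree, the following sketch compares the two for an assumed small configuration. The switch counts, uplink counts, and the idea of adding uplinks via VLAN multipath are assumptions for illustration; the abstract only states the goal of wide bisection bandwidth over commodity Gigabit Ethernet switches.

```python
# Illustration of bisection bandwidth: plain tree vs. fat tree.
# Switch and uplink counts are assumed for this example only.

link_gbps = 1.0            # Gigabit Ethernet link
edge_switches = 8          # assumed number of edge switches

uplinks_plain = 1          # plain tree: one uplink per edge switch
uplinks_fat = 4            # fat tree: multiple uplinks (e.g. VLAN-based multipath)

plain_bisection = edge_switches / 2 * uplinks_plain * link_gbps
fat_bisection = edge_switches / 2 * uplinks_fat * link_gbps

print(f"plain tree bisection : {plain_bisection:.0f} Gbps")  # 4 Gbps
print(f"fat tree bisection   : {fat_bisection:.0f} Gbps")    # 16 Gbps
```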
2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing, 2010
We have proposed a power-aware, high-performance, dependable communication link using PCI Express as a direct communication device, referred to as PEARL, for application in a wide range of parallel processing systems, from high-end embedded systems to small-scale high-performance clusters. In the present study, we describe the structure and function of a communicator chip, referred to as the PEACH chip, for realizing PEARL. The PEACH chip connects four ports of PCI Express Gen 2 with four lanes, and uses an M32R processor with four cores and several DMACs. We also developed the PEACH board as the network interface card implementing the PEACH chip. The PEACH board provides a power-aware, dependable communication link with a theoretical peak bandwidth of 2 Gbytes/s per link.
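The quoted 2 Gbytes/s per link is consistent with a x4 PCI Express Gen 2 port. The sketch below reconstructs that figure from the standard Gen 2 signaling rate and 8b/10b encoding; the assumption that the four lanes belong to a single port is implied by the 2 Gbytes/s figure but not stated explicitly in the abstract.

```python
# Theoretical peak of a x4 PCI Express Gen 2 link, matching the 2 Gbytes/s
# quoted in the abstract. Assumes the four lanes form a single port.

gen2_gtps_per_lane = 5.0      # 5 GT/s per lane for PCIe Gen 2
encoding_efficiency = 8 / 10  # 8b/10b encoding overhead
lanes = 4

# GT/s * efficiency gives Gbit/s of payload per lane; divide by 8 for Gbytes/s.
gbytes_per_s = gen2_gtps_per_lane * encoding_efficiency * lanes / 8
print(f"{gbytes_per_s:.1f} Gbytes/s per direction")   # 2.0 Gbytes/s
```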
Evolving OpenMP in an Age of Extreme Parallelism, 2009
Recently, the use of embedded systems with complicated functions, such as digital home appliances and car navigation systems, has become widespread. These systems require increasingly higher performance as the functionality of the user interface grows and ...
... Laboratory, Livermore, California 94550; Toshio Kawai, Department of Physics, Keio University, Yagami Campus, Yokohama 223, Japan; Brad Lee Holian ... In general, individual thermostat temperatures and relaxation times can be imposed on selected subsets of the ...