Applied Parallel Computing. New Paradigms for HPC in Industry and Academia
In earlier papers ([2], [3], [6]) feedback guided loop scheduling algorithms have been shown to be very effective for certain loop scheduling problems which involve a sequential outer loop and a parallel inner loop, and for which the workload of the parallel loop changes only slowly from one execution to the next. In this paper the extension of these ideas to the case of nested parallel loops is investigated. We describe four feedback guided algorithms for scheduling nested loops and evaluate the performances of the algorithms on a set of synthetic benchmarks.
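The abstract does not include code, but the single-loop idea that these feedback guided schedulers build on can be illustrated with a short, hypothetical Python sketch: partition the iterations into contiguous blocks, time each block on one execution of the outer loop, and move the block boundaries so that the estimated per-iteration costs are equalised for the next execution. The function name, the NumPy usage and the piecewise-constant cost model are all assumptions made for illustration; this is not the algorithm from the paper itself.

```python
import numpy as np

def rebalance(boundaries, times):
    """Return new block boundaries for the next outer iteration, given the
    current boundaries (length p+1) and the measured time on each of the p
    processors, assuming cost is spread evenly within each block."""
    block_sizes = np.diff(boundaries)
    per_iter_cost = np.repeat(times / block_sizes, block_sizes)
    cum_cost = np.concatenate(([0.0], np.cumsum(per_iter_cost)))
    p = len(times)
    targets = np.linspace(0.0, cum_cost[-1], p + 1)
    # Boundary k becomes the iteration at which the cumulative cost first
    # reaches the k-th equal share of the total estimated cost.
    new_boundaries = np.searchsorted(cum_cost, targets)
    new_boundaries[0], new_boundaries[-1] = 0, boundaries[-1]
    return new_boundaries
```

For example, with 100 iterations on four processors, rebalance([0, 25, 50, 75, 100], np.array([1.0, 1.0, 2.0, 4.0])) shrinks the blocks whose measured cost was highest and grows the cheaper ones for the next pass over the outer loop.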
Weather and climate models are complex pieces of software which include many individual components, each of which is evolving under the pressure to exploit advances in computing to enhance some combination of a range of possible improvements (higher spatio-temporal resolution, increased fidelity in terms of resolved processes, more quantification of uncertainty, etc.). However, after many years of a relatively stable computing environment with little choice in processing architecture or programming paradigm (basically x86 processors using MPI for parallelism), the existing menu of processor choices includes significant diversity, and more is on the horizon. This computational diversity, coupled with ever-increasing software complexity, leads to the very real possibility that weather and climate modelling will arrive at a chasm which will separate scientific aspiration from our ability to develop and/or rapidly adapt codes to the available hardware.…
We present an approach which we call PSyKAl that is designed to achieve portable performance for parallel, finite-difference ocean models. In PSyKAl the code related to the underlying science is formally separated from code related to parallelisation and single-core optimisations. This separation of concerns allows scientists to code their science independently of the underlying hardware architecture, and allows optimisation specialists to tailor the code for a particular machine independently of the science code. We have taken the free-surface part of the NEMO ocean model and created a new, shallow-water model named NEMOLite2D. In doing this we have a code which is of a manageable size and yet incorporates elements of full ocean models (input/output, boundary conditions, etc.). We have then manually constructed a PSyKAl version of this code and investigated the transformations that must be applied to the middle/PSy layer in order to achieve good perf...
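As a rough illustration of the separation of concerns the abstract describes (and only that; this is not the NEMOLite2D code), a toy Python sketch follows. The kernel routine works on a single grid point and carries the science; the invoke routine stands in for the middle/PSy layer and is the only place where loop ordering, tiling or parallelisation would be changed. The function names and the stencil are invented for the example, and the arrays are assumed to be NumPy arrays of identical shape.

```python
# --- Kernel layer (science code): updates one grid point and knows nothing
#     about how the surrounding loops are executed or parallelised. ---------
def continuity_kernel(ssha, sshn, hu, hv, un, vn, dx, dy, dt, i, j):
    # Illustrative free-surface update; a stand-in stencil, not the actual
    # NEMOLite2D equations.
    flux_x = hu[i, j] * un[i, j] - hu[i - 1, j] * un[i - 1, j]
    flux_y = hv[i, j] * vn[i, j] - hv[i, j - 1] * vn[i, j - 1]
    ssha[i, j] = sshn[i, j] - dt * (flux_x / dx + flux_y / dy)

# --- PSy layer: owns the loops over the grid; any OpenMP/MPI directives or
#     loop transformations would be confined to this routine. ---------------
def invoke_continuity(ssha, sshn, hu, hv, un, vn, dx, dy, dt):
    ni, nj = ssha.shape
    for j in range(1, nj - 1):
        for i in range(1, ni - 1):
            continuity_kernel(ssha, sshn, hu, hv, un, vn, dx, dy, dt, i, j)
```

An outer Algorithm layer would simply call invoke_continuity once per time-step, so the science code and the calling code never need to change when the loops are re-optimised for a new machine.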
The CNC series of Application Analysis Reports are top down reviews of computationally intensive applications. By taking a top down approach we can study methods to efficiently implement them on massively parallel machines. This both aids the user and provides CNC with a study of the difficulties encountered in parallelising a real problem. This report is based on the weather prediction model developed and run by the European Centre for Medium-Range Weather Forecasts (ECMWF). As their name suggests, ECMWF are an ...
Welcome to the Euro-Par 2001 Topic 03 on Scheduling and Load Balancing. Scheduling and load balancing are key areas in the quest for performance in parallel and distributed applications. Relevant techniques can be provided either at the application level or at the system level, and both scenarios are of interest for this topic. Twenty papers were submitted to Topic 03, one of which was redirected to Topic 04. Out of the nineteen remaining papers, three were selected as regular papers and four as research notes. All papers were reviewed by at least three referees, and the vast majority received four reviews. The presentation of the seven papers is organized in two sessions. The first session contains three papers. In the first paper, On Minimising the Processor Requirements of LogP Schedules, the authors propose different clustering heuristics for task scheduling in the LogP model. These heuristics reduce the number of required processors without degrading the makespan. The second paper, Exploiting Unused Time Slots in List Scheduling Considering Communication Contention, presents (two versions of) a contention-aware scheduling strategy which is compared to two related methods. It outperforms these other methods, with similar or better complexity, apart from one case where high communication costs mean that a more sequential solution is most apt. The third paper, An Evaluation of Partitioners for Parallel SAMR Applications, presents a review of mesh-partitioning tools and techniques for structured meshes, and provides experimental results for the various tools on one selected application, with various numbers of processors, problem sizes and partition granularities. The second session contains four papers. The first paper, Load Balancing on Networks with Dynamically Changing Topology, presents a load balancing algorithm targeted at synchronous networks with dynamic topologies (e.g. due to link failures), and establishes a convergence result for nearest-neighbour load-balancing techniques. The second paper, A Fuzzy Load Balancing Service for Network Computing Based on Jini, addresses the problem of load balancing for servers executing independent tasks generated by clients in a distributed object computing environment implemented with Jini; the results show that the fuzzy algorithm achieves significantly better load balancing than random and round-robin algorithms. The third paper, Approximation Algorithms for Scheduling Independent Malleable Tasks, builds on the well-known continuous resource allocation case for scheduling independent non-preemptive tasks. Finally, the fourth paper, The Way to Produce the Quasi-workload in a Cluster, addresses the problem of generating synthetic workloads that can serve as input for the simulation of scheduling algorithms in cluster-based architectures.
This paper describes RoboBase, a system that provides immediate access to entire libraries of RoboCup logfiles. A centralised database stores the logfiles, allowing them to be viewed remotely. Instead of downloading a 2MB uncompressed logfile, the match is transferred and displayed in real-time. The system has been designed specifically to perform well in low bandwidth situations by using a domain-specific compression method. Dynamic frame-rates are also employed, providing uninterrupted viewing in fluctuating network conditions. The system conforms to an object-oriented methodology and is implemented in Java, allowing extension of the software by the user.
In recent years, Numerical Weather Prediction (NWP) has made increasing use of parallel machines with large numbers of processors. The physics portion of NWP is particularly amenable to large scale parallelism as, in most cases, each grid column can be computed independently. However, grid columns have varying amounts of work associated with them, so a simple partition of grid columns can leave some processors with much less work than others. As the number of processors grows, this effect becomes increasingly important. Such imbalance can be time invariant (such as computation based on orographic features), vary predictably in time (such as short wave radiation), or vary unpredictably in time (such as convection and precipitation). This paper presents a load balancing algorithm which is particularly aimed at the last of these cases. The feedback mechanism uses the time taken and the number of grid columns for each processor at the previous time-step to provide an improved partition for the current time-step. Simulation results are presented both for synthetic workloads and using data from ECMWF's IFS model.
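The feedback step described in the abstract can be sketched in a few lines of hypothetical Python: spread each processor's measured time evenly over the columns it held at the previous time-step to get per-column cost estimates, then reassign columns for the current time-step. The greedy "heaviest column to least-loaded processor" assignment used below is only one way the timing feedback could drive a new partition; it is not necessarily the partitioning method used in the paper, and all names are invented.

```python
import numpy as np

def estimate_column_costs(partition, times):
    """Spread each processor's measured time evenly over the grid columns it
    held at the previous time-step, giving a per-column cost estimate."""
    n_cols = sum(len(cols) for cols in partition)
    costs = np.empty(n_cols)
    for cols, t in zip(partition, times):
        costs[cols] = t / len(cols)
    return costs

def partition_columns(costs, n_procs):
    """Greedily give the next-heaviest column to the currently least-loaded
    processor so that estimated costs end up roughly equal."""
    order = np.argsort(costs)[::-1]          # heaviest columns first
    loads = np.zeros(n_procs)
    partition = [[] for _ in range(n_procs)]
    for c in order:
        p = int(np.argmin(loads))
        partition[p].append(int(c))
        loads[p] += costs[c]
    return partition
```

At each time-step the two functions would be called in sequence, using the partition and per-processor timings recorded at the step before.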
This paper describes user-level optimisations for virtual shared memory (VSM) systems and demonstrates performance improvements for three scientific kernel codes written in Fortran-S and running on a 30 node prototype distributed memory architecture. These optimisations can be applied to all consistency models and directory schemes, whether in hardware or software, which employ an invalidation based protocol. The semantics of these optimisations are carefully stated. Currently these optimisations are performed by the programmer, but there is much scope for automating this process within a compiler.
Introduction: This survey is aimed at identifying the important aspects of tools for efficient parallelisation of code on a parallel machine. It also reviews the most mature of these and discusses their relative merits. This survey is complemented by a tools overview [HEDA93] which reports on an American initiative to define what is available and what is lacking in software tools for parallel machines. Section 2.0 discusses parallelising compilers, section 3.0 discusses performance analysis tools and section 4.0 presents conclusions. Tools ...
This paper presents three algorithms for load balancing physics routines with dynamic load imbalance. Results are presented for the two most computationally demanding load-imbalanced physics routines (short wave radiation and convection) in the UKMO's forecast and climate models. Results show between 30% and 40% performance improvement in these routines running on the UKMO's Cray T3E.
In previous papers ([2], [3], [6]) feedback guided loop scheduling algorithms have been shown to be very effective for certain loop scheduling problems.
There has been a convergence in computer architecture toward a distributed memory implementation of a shared memory user model. This has brought the possibility of using a very large number of processors on NWP applications from the realm of research into reality. The simplicity of the shared memory model has meant that porting the very large Unified Model can be approached. However, although the shared memory model allows freedom of movement, the ability to achieve high performance can require very detailed analysis of many of the same issues faced by practitioners of distributed memory programming. This paper chronicles the process of porting the sequential version of the Unified Model to the KSR virtual shared memory supercomputer and outlines some of the issues which must be faced in obtaining a high performance implementation on the KSR. 1. Background: The formulation of the Unified Model (UM) differs in several respects from that of many other models currently in use. I...
Dynamic memory allocation is a useful feature for the UM as it allows the precompilation of a large amount of the model code. Production runs of differing resolutions can then use this pre-compiled object code, thus reducing their compilation time. Although free lists are effective for the UM when running sequentially on the KSR1, a shared free list for the memory allocator 'malloc' can seriously degrade performance in parallel. Thread-based free lists, or the separate allocation of memory from a thread's stack, alleviate this problem. 1 The Unified Model: The Unified Model (UM) is used operationally by the U.K. Meteorological Office (UKMO). It encompasses both data assimilation and prediction for the atmosphere and ocean. Run on a 16-processor Cray C90 system, the main applications are climate prediction and weather forecasting. A complete description is given in [1]. This work forms part of a parallelisation effort concentrating on the atmospheric prediction portion of the UM. 2 A Portabl...
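The thread-based free-list idea mentioned in the abstract can be illustrated with a toy Python class; the original work concerned 'malloc' in the UM on the KSR1, not Python, and the class name, its methods and the use of bytearray buffers are all invented for the sketch. Each thread reuses only buffers it has freed itself, so allocation and release never contend on a single shared list.

```python
import threading

class ThreadLocalPool:
    """Toy per-thread free list: avoids the contention caused by a single
    shared free list when many threads allocate and release concurrently."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self._local = threading.local()   # one free list per thread

    def _free_list(self):
        if not hasattr(self._local, "items"):
            self._local.items = []
        return self._local.items

    def allocate(self):
        free = self._free_list()
        if free:
            return free.pop()                 # reuse a buffer this thread freed
        return bytearray(self.buffer_size)    # otherwise allocate afresh

    def release(self, buf):
        self._free_list().append(buf)         # return buffer to this thread's list
```

Worker threads can then call allocate() and release() without serialising on a global lock, which is the behaviour a shared free list forces.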
This paper describes user-level optimisations for virtual shared memory (VSM) systems and demonstrates performance improvements for three scientific kernel codes written in Fortran-S and running on a 30 node prototype distributed memory architecture. These optimisations can be applied to all consistency models and directory schemes, whether in hardware or software, which employ an invalidation based protocol. The semantics of these optimisations are carefully stated. Currently these optimisations are performed by the programmer, but there is much scope for automating this process within a compiler.
Climate modelling in former times mostly covered the physical processes in the atmosphere. Nowadays, there is a general agreement that not only physical, but also chemical, biological and, in the near future, economic and sociological (the so-called anthropogenic) processes have to be taken into account on the way towards comprehensive Earth system models. Furthermore these models include the oceans, the land surfaces and, so far to a lesser extent, the Earth's mantle. Between all these components feedback processes have to be described and simulated. Today, a hierarchy of models exists for Earth system modelling. The spectrum reaches from conceptual models (back-of-the-envelope calculations), over box and process/column models, further to Earth system models of intermediate complexity, and finally to comprehensive global circulation models of high resolution in space and time. Since the underlying mathematical equations in most cases do not have an analytical solution, they have to be solved numerically. This is only possible by applying sophisticated software tools, which increase in complexity from the simple to the more comprehensive models. With this series of briefs on "Earth System Modelling" at hand we focus on Earth system models of high complexity. These models need to be designed, assembled, executed, evaluated, and described, both in the processes they depict as well as in the results the experiments carried out with them produce. These models are conceptually assembled in a hierarchy of sub-models, where process models are linked together to form one component of the Earth system (Atmosphere, Ocean, ...), and these components are then coupled together to Earth system models in different levels of completeness. The software packages of many process models comprise a few to many thousand lines of code, which results in a high complexity of the task to develop, optimise, maintain and apply these packages when assembled to more or less complete Earth system models. Running these models is an expensive business. Due to their complexity and the requirements w.r.t. the ratios of resolution versus extent in time and space, most of these models can only be executed on high performance computers, commonly called supercomputers. Even on today's supercomputers, typical model experiments…