
Load Balancing Physics Routines

This paper presents three algorithms for load balancing physics routines with dynamic load imbalance. Results are presented for the two most computationally demanding load-imbalanced physics routines (short wave radiation and convection) in the UKMO's forecast and climate models. Results show between 30% and 40% performance improvement in these routines running on the UKMO's Cray T3E.

Introduction

The physics phase of forecast and climate models is particularly amenable to large scale parallelism as, typically, each grid column can be independently computed. However, the existence of load imbalance in this phase is well known [6, 8] and has been widely acknowledged as being one of the major factors in loss of performance, particularly when using large numbers of processors, for example see [5, 9, 14, 15]. Interestingly, the same load imbalance problem is also observed in data assimilation [9, 10].

A domain decomposition approach is usually employed during the physics phase of parallel forecast and climate models, where equal sized regions of the computational grid are allocated to processors. In the U.K. Meteorological Office's (UKMO's) case, these regions are rectangular in latitude and longitude and there are an equal number of processors and regions. If no load balancing is performed, then all processors will compute an equal number of grid columns (ignoring differences due to integer division) and typically, for every time-step, a particular processor will compute the same grid columns. Unfortunately, the computational cost of grid columns can vary significantly, leading to load imbalance.

Improving load balance implies a re-distribution of grid columns. For a load balancing algorithm to be effective, the performance benefit due to improved load balance must outweigh any load balancing overheads. These overheads can be split into the following:

Communication. The total amount of communication used to re-distribute the grid columns should be minimised and overlapped with computation where possible.

Number of messages. The number of messages required to perform the communication should be minimised. If the amount of communication is fixed then this also maximises the size of messages. It also maximises the size of each set of grid columns computed at a time (a chunk), if each chunk associated with a message is computed separately.

Memory requirements. The increased memory requirements should be minimised.

Chunk size. The chunk size should be maximised. Some of the benefits of doing so are noted in [3]. On vector architectures, small chunks result in small vector lengths, thus reducing performance. On message passing architectures small chunks potentially lead to short messages, which can also be inefficient. If the model implementation dynamically allocates arrays (allowing the same executable to run with varying resolutions and processors without re-compilation) then there is also an overhead per chunk; this is significant in the UKMO's unified forecast and climate model (UM) [2]. If smaller chunks prove beneficial, for example due to better cache use, then the larger chunk can subsequently be blocked.

The relative importance of the overheads described above will vary depending on the characteristics of the model and machine that it is running on. Attempting to minimise all of these overheads is useful, as it maximises the chance of obtaining good performance across a number of current and next generation machines.

As the sophistication of the load balance algorithm increases, there is a corresponding improvement in load balance and decrease in the above overheads. However, more sophisticated algorithms take longer to compute and may use global communication to calculate the load imbalance. The philosophy of this paper is to develop more sophisticated algorithms, as their cost is small compared to the cost of load imbalance and the load balance overheads described above.

For our purposes load imbalance can be usefully split into static and dynamic imbalance, then further split into accurately predictable and un-predictable imbalance.

Static imbalance (predictable and un-predictable) where load may vary over grid columns but not over time-steps. Examples of this type of imbalance may be found in routines which perform computation based on orographic features.

Predictable dynamic imbalance where the load also varies over time-steps but in a manner such that the load for each grid column can be accurately estimated before the computation takes place. An example of this type of imbalance is short wave radiation in the UKMO's UM.

Un-predictable dynamic imbalance where the load varies in an un-predictable manner such that the load for each grid column cannot be accurately estimated a priori. This is typically found when work depends on the dynamic meteorological phenomena. Examples are convection and precipitation.

This paper presents solutions targeted at the latter two forms of load imbalance. Clearly, however, dynamic algorithms can also cope with static distributions. These algorithms are tested on the UKMO's two most computationally intensive load-imbalanced routines, short wave radiation and convection. Short wave radiation is used to test our predictable dynamic imbalance algorithm. This re-distributes grid columns with the minimum amount of communication and attempts to minimise the number of messages. Convection is used to test two un-predictable dynamic imbalance algorithms. The first enables under-loaded processors to steal chunks from over-loaded processors and the second uses the execution time from the previous time-step as an estimate of the load for the current time-step.

In Section 2 we describe how our work complements other work in this area. Sections 3 and 4 present the algorithms developed for predictable and unpredictable dynamic imbalance, respectively. Section 5 gives the results of our experiments for both the UKMO's operational forecast and climate models running on their Cray T3E. Finally, Section 6 draws some conclusions and points the way to future work in this area.

Related Work

A number of static grid column distributions have been suggested to solve (or reduce) the problem of (combined static and dynamic) load imbalance. Schättler and Krenzien [11] suggest decomposing the grid into more regions than processors. These are allocated such that processors perform computation from disparate parts of the globe. The hope is that load imbalance will be smoothed out. This method is simple to implement and is likely to improve load balance, especially if the imbalance is dominated by radiation. However, it is likely to vary depending on the input data and, in general, requires experiments with the model to find the optimum number of regions (which may change depending on the input data). Dynamic load balancing algorithms would be much more able to cope with changing loads and do not rely on a statistical smoothing over the globe. Note, this static distribution could subsequently use a dynamic algorithm to further improve load balance. In [20] we distribute grid columns cyclically across processors (the extreme version of the above distribution) for data assimilation, significantly improving the performance. Foster and Toonen [12] show that the load balance of the PCCM2 model [16] can be improved by swapping grid columns between processors allocated regions on the opposite side of the globe in longitude. Three ways of achieving this are presented: swapping every other grid column, swapping excess day-lit grid columns and swapping grid columns based on a pre-computed estimate of the cost of both lit and non-lit grid columns. The algorithms work well as (again) the radiation costs dominate. They note that there are other variances, but suggest the partitioning of PCCM2, which pairs symmetric north/south latitudes, compensates for this. This may not always be the case in general (for example in local area models) and is not the case in the UM. Our algorithm for short wave radiation can be considered as a global version of their method.
Our algorithm re-partitions data amongst all processors, guaranteeing load balance, whereas they load balance purely between pairs, which does not guarantee it. In their case symmetry ensures a good solution. Their methods require communication only between pairs of processors when pre-computing how many grid columns should be sent (whereas ours requires global communication). However, their method does not guarantee to minimise the number of grid columns sent (globally), and therefore the data transferred between processors, which is the dominant communication cost.

Dent and Mozdzynski [13] also employ a global re-partitioning algorithm for an irregular distribution of radiation grid columns in ECMWF's IFS [1]. As in our algorithm, they send grid columns from the most over-loaded processor to the most under-loaded processor until all processors are load balanced. They claim that this minimises the number of messages, however (to our knowledge), they do not try to match pairs, which can give a further reduction in the number of messages required.

If the (static or dynamic) computation costs of the grid columns can be modelled accurately by a set of weighting factors, typically factoring in whether certain computation within the physics routine takes place at a grid column, then the grid columns can be re-partitioned to give load balance. In [3] we used the top level of a set of gather masks to predict the computational cost of each grid column and improve load balance. Isaksen and Hamrud [10] have also suggested, and Saarinen, Vasiljevic and Järvinen [19] have used, weighting factors for a static imbalance (in data assimilation). Foster and Toonen [12] (as mentioned earlier) pre-compute the cost of lit and non-lit grid columns. However, there may be many weighting factors, their accurate calculation may be difficult and their values will vary across architectures and as the physics are improved. Again, a dynamic load balance algorithm could be used in conjunction with this technique.

In [3] and [4] we explored the idea of using the execution time from the previous time-step to balance the load on the current time-step, presenting algorithms for shared address space parallel architectures; this paper presents a distributed memory variant of this algorithm. This technique is based on the observation that the load distribution is slowly varying over time-steps, something that one might expect as slow changes are usually required for model stability. It effectively turns an un-predictable load imbalance problem into a predictable load imbalance problem. This method could also be used for the first few time-steps to load balance a static load imbalance problem. Elbern [17] uses execution time from the previous time-step for a dynamically varying workload, a two dimensional partition and a local shuffling of work between adjacent processors. Our algorithm moves grid columns from over-loaded processors to under-loaded processors irrespective of their location. We also treat the grid columns as a single vector, which allows us to load balance to the resolution of a single grid column rather than a line of grid columns.

If the execution time from the previous time-step is not a good measure of the current load, or the cost of the algorithm is prohibitive, then a fully dynamic algorithm such as Dynamic Chunk or Guided Self Scheduling (GSS) [7] can be used. We present results of a dynamic chunk implementation. This was chosen as it gives simple control over the chunk size.

Predictable dynamic imbalance: Short Wave Radiation

Short wave radiation in the UM is a good example of predictable load imbalance. For the load to be predictable, we must make some assumptions as to its characteristics. Examining the code, we observe that it is only grid columns which are lit that perform computation. In the UM these grid columns are gathered into smaller arrays. Non-lit grid columns can therefore be safely assumed to have zero cost. The next observation is that each lit grid column performs approximately the same amount of computation.

Our assumption, therefore, is that each lit grid column takes the same amount of time to compute. Earlier work has proven the accuracy of this assumption [3]. Given these assumptions, an equal distribution of lit grid columns across processors should result in an equal amount of time per processor and therefore load balance.

In a global model, half of the grid columns (G) will be lit and half will be in darkness. If there are P processors, a standard domain decomposition would give each processor G/P grid columns. As the number of lit grid columns is G/2, a load balanced solution would allocate G/(2P) lit grid columns to each processor. For a standard domain decomposition, where the grid columns are partitioned in latitude and longitude, for any reasonable number of processors, it is easy to see that some processors will have all of their grid columns lit. This suggests that we should observe up to 50% load imbalance in short wave radiation in a global model.
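The arithmetic behind this 50% figure can be checked in a few lines. The grid size below is illustrative only, not the UM's actual resolution:

```python
# Illustrative check of the up-to-50% imbalance claim above.
# G and P are example values, not taken from the UM.
G, P = 10368, 144          # total grid columns, processors (illustrative)
standard = G // P          # columns per processor under domain decomposition
balanced = G // (2 * P)    # lit columns per processor when load balanced
# A processor whose whole region is day-lit computes 'standard' columns
# of radiation work against an ideal of 'balanced': half is imbalance.
imbalance = 1 - balanced / standard
print(standard, balanced, imbalance)   # 72 36 0.5
```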

In a standard domain decomposition there is, therefore, a non-uniform distribution of lit grid columns to processors. These must be re-distributed equally amongst processors. The minimum amount of communication required to achieve load balance is relatively simple to calculate. If a processor has more than the average number of grid columns, it computes the average number of grid columns and sends the excess to processor(s) with less than the average. If a processor has less than the average number of grid columns, it computes all of its grid columns and receives the difference between its number of grid columns and the average. If the amount of communication required per grid column is C, G_i is the number of lit grid columns on processor i and G_av is the average number of lit grid columns per processor, G_av = (G_1 + ... + G_P)/P (which should be G/(2P) in the global case), then the minimum amount of communication is C multiplied by the sum of (G_i - G_av) over the over-loaded processors (those with G_i > G_av).
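This minimum volume can be sketched directly from the definition. The helper name `min_columns_to_move` is hypothetical, not UM code, and it assumes (as in the text) a constant communication cost C per grid column, so only the column count matters:

```python
def min_columns_to_move(lit_counts):
    """Minimum number of grid columns that must be communicated to
    balance 'lit_counts' (lit grid columns per processor).

    Only the excess above the average on over-loaded processors needs
    to move, so the minimum is the sum of (G_i - G_av) over processors
    with G_i > G_av. Multiplying by the per-column cost C gives the
    minimum communication volume."""
    g_av = sum(lit_counts) / len(lit_counts)
    return sum(g - g_av for g in lit_counts if g > g_av)

# Example: four processors holding 10, 6, 2 and 2 lit columns; the
# average is 5, so 5 + 1 = 6 columns must move in total.
```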

Achieving this re-distribution with the minimum number of messages (and therefore maximum message size) is much more difficult. We attempt to minimise the number of messages by sending grid columns from the most over-loaded processor to the most under-loaded processor until all processors are load balanced.

Our algorithm orders processors in terms of load, then sends grid columns from the processor at the head of the list to the processor at the tail of the list. The number of grid columns transferred is the amount required to balance the less imbalanced of the two processors. At least one processor is always guaranteed to be load balanced after this step. Both will be balanced if they are pairs, i.e. they are equally over- and under-loaded. If they are not pairs, the reduced imbalance on the remaining processor is checked against the list to see if it has a pair. If so, grid columns are transferred between them and they are removed from the list. If not, the processor is re-inserted in the list. This process continues until all processors are balanced; see [18] for a more detailed analysis of this algorithm.
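The head/tail scheme can be sketched as below. This is a simplified sketch: the explicit pair search described above (which can further reduce the message count) is omitted, `plan_transfers` is a hypothetical name, and the total load is assumed to divide evenly among processors:

```python
def plan_transfers(loads):
    """Greedy head/tail transfer plan: repeatedly send columns from the
    most over-loaded processor to the most under-loaded one, balancing
    the less imbalanced of the pair each step. 'loads' maps processor
    id -> lit grid columns; returns (sender, receiver, n_columns)
    messages. Assumes the total divides evenly by the processor count."""
    avg = sum(loads.values()) // len(loads)
    # Positive imbalance: over-loaded; negative: under-loaded.
    imb = {p: n - avg for p, n in loads.items() if n != avg}
    transfers = []
    while imb:
        head = max(imb, key=imb.get)        # most over-loaded
        tail = min(imb, key=imb.get)        # most under-loaded
        n = min(imb[head], -imb[tail])      # balance the lesser imbalance
        transfers.append((head, tail, n))
        imb[head] -= n
        imb[tail] += n
        for p in (head, tail):              # drop balanced processors
            if imb[p] == 0:
                del imb[p]
    return transfers
```

Each step balances at least one processor, so the loop terminates after at most P-1 messages, and the total number of columns moved equals the minimum communication volume derived earlier.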

The UM gathers the lit grid columns on each processor into contiguous dynamically allocated arrays. We increase the size of these dynamic arrays appropriately so that there is space to store any grid columns sent from remote processors. This technique allows remote grid columns to be computed in the same chunk as local grid columns, thus maximising chunk sizes.

Unpredictable dynamic imbalance

Convection in the UM is a good example of unpredictable dynamic imbalance. The complexity of the control paths in the code makes it difficult to obtain an accurate estimate of the computational cost of a grid column (although in [3] we show that some performance improvement is possible using this approach). In this section, we present two possible approaches to unpredictable dynamic imbalance, namely, dynamic chunk and dynamic feedback.

Dynamic Chunk

Dynamic chunk is one of a class of dynamic algorithms that split the local grid columns into a number of chunks. Chunks are then computed locally until all are completed. When complete, the processor looks for chunks on remote processors which have yet to be computed and steals them. Dynamic chunk uses a fixed chunk size, which allows easy control. Such algorithms do not require any application specific information and can be implemented efficiently on a Cray T3E. Note, this algorithm uses the T3E's global address space, one-sided messaging and atomic operations to implement a shared memory algorithm, allowing chunk stealing from a processor while it is computing; this is not possible to implement efficiently in message passing without a separate server thread. There is also a clear trade-off between the load balance, which improves as the number of chunks increases (as it is not possible to load balance to a finer resolution than a chunk), and the relative overheads associated with a chunk, which increase as the chunk size decreases.
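The stealing mechanism can be illustrated in shared memory terms. Below, Python threads and a lock-protected counter stand in for the T3E's one-sided atomic fetch-and-increment; the class and function names are illustrative, not UM code:

```python
import threading

class ChunkQueue:
    """Per-processor list of fixed-size chunks. The lock-protected
    index stands in for the T3E's atomic fetch-and-increment, so a
    remote processor can claim an unstarted chunk even while the
    owner is still computing."""
    def __init__(self, n_chunks):
        self.next_chunk = 0
        self.n_chunks = n_chunks
        self.lock = threading.Lock()

    def take(self):
        """Atomically claim the next unstarted chunk, or None if done."""
        with self.lock:
            if self.next_chunk < self.n_chunks:
                c = self.next_chunk
                self.next_chunk += 1
                return c
            return None

def worker(my_id, queues, done):
    # Compute all local chunks first...
    while (c := queues[my_id].take()) is not None:
        done.append((my_id, my_id, c))          # (worker, owner, chunk)
    # ...then scan the other processors' queues and steal what remains.
    for owner, q in enumerate(queues):
        while (c := q.take()) is not None:
            done.append((my_id, owner, c))

# Two 'processors', one with 5 chunks and one with 1: the lightly
# loaded worker finishes early and steals from the heavily loaded one.
queues = [ChunkQueue(5), ChunkQueue(1)]
done = []
threads = [threading.Thread(target=worker, args=(i, queues, done))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
# Every (owner, chunk) pair is computed exactly once, whoever did it.
assert sorted((o, c) for _, o, c in done) == \
    [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 0)]
```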

Dynamic Feedback

Our dynamic feedback algorithm uses the execution time from the previous time-step to estimate the load per grid column for the current time-step. The first time-step simply uses the time taken by a processor and the number of grid columns it computes to estimate the load of each grid column. The work is then re-distributed based on this estimate to load balance the code, giving each processor an average amount of time. The same algorithm discussed in Section 3 is used to minimise communication and hence the number of messages. However, this version is a little more complicated as we are balancing time, which can be split to any accuracy, by transferring grid columns, which can obviously only be transferred in integer numbers. This means that the pairs option discussed in Section 3 cannot be used and that under-loaded processors cannot be removed from the list; see [18] for more details. Subsequent time-steps use the time from locally and remotely computed chunks to estimate the load.
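The first-time-step estimate can be sketched as follows. This is an illustrative helper (the name `feedback_partition` and the proportional-share rounding are our assumptions, not the UM's code; the real algorithm transfers columns via the minimum-communication scheme of Section 3 rather than computing target counts directly):

```python
def feedback_partition(prev_times, prev_columns):
    """Estimate a per-column cost for each processor's region from the
    previous time-step, then share the columns out in proportion to
    measured speed so every processor is predicted to take roughly the
    average time. Rounding may not conserve the total exactly; this is
    the integer/time mismatch noted in the text."""
    cost = [t / c for t, c in zip(prev_times, prev_columns)]   # sec/column
    speed = [1.0 / x for x in cost]                            # columns/sec
    total_columns = sum(prev_columns)
    return [round(s / sum(speed) * total_columns) for s in speed]

# Processor 0 took 6s for 4 columns (1.5 s/col); processor 1 took 2s
# for 4 columns (0.5 s/col). Rebalancing gives 2 and 6 columns, a
# predicted 3s on each processor.
```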

Each local chunk and each remote chunk is computed and timed separately. This allows accurate timing of all chunks, minimises extra memory use and maximises the size of local chunks. However, although the algorithm also attempts to keep remote chunks as large as possible, these can be arbitrarily small (down to a single grid column). In the UM's convection routine, the cost of a grid column in a small chunk is significantly greater than the cost of the same grid column in a large chunk, see Section 5.2.

Results

Short Wave Radiation

Figure 1 shows the results of running the UM with its current static partition (labelled "asis") and with the new predictable dynamic load balance algorithm (labelled "lb"), described in Section 3. The example is a forecast model running on 144 processors (the number of processors used operationally). For both methods the maximum and average time is given. The maximum time shows how long the routine takes, whereas the average time shows how long the routine would take if it were perfectly balanced (the ideal). Our method load balances short wave radiation well, reducing the time to solution by nearly 40%. The difference between the average load balanced time and the average original time gives a measure of the overhead of the algorithm. The difference between the maximum load balanced time and the average load balanced time gives a measure of the inaccuracy in the assumptions mentioned in Section 3.

Figure 1: Predictable dynamic imbalance algorithm with 144 processors

Convection

Figure 2 shows the results of running the UM with its current static partition (labelled "asis") and the new dynamic chunk algorithm (labelled "lb"), described in Section 4.1, for three and five chunks. The example is the same forecast model run used in the previous section, again running on 144 processors (the number of processors used operationally). For both methods the maximum and average time is given, with the same interpretation as before. Our method improves the load balance of convection, reducing the time to solution by over 40%. The difference between the average load balanced times and the average original time is negligible; however, there is some load imbalance remaining. Five chunks per processor gives a better load balance solution than three chunks per processor.

Figure 2: Dynamic Chunk algorithm with 144 processors

The dynamic feedback algorithm (described in Section 4.2) and the dynamic chunk algorithm (described in Section 4.1) were tested with a climate model. The climate model was chosen as it presents a greater challenge to the load balancing algorithms due to the small size of the data set. The results of running the UM with the current static partition (labelled "asis") and with both the new dynamic chunk algorithm (labelled "lb chunk") and the dynamic feedback algorithm (labelled "lb feedback") are given in Figure 3. The example is running on 36 processors (the number of processors used operationally). The original code computes convection in 3 chunks on each processor; if only 1 chunk is used we see a corresponding increase in performance. Maximum and average (the ideal) times are given for these two runs; their difference shows the increased overhead of using smaller chunks. The dynamic chunk algorithm provides further improvement and the feedback algorithm gives the best results, reducing the time to solution by approximately 33%.

Figure 3: Dynamic Chunk and Feedback algorithms with 36 processors

We have mentioned the overhead associated with decreasing the size of chunks. In Figure 4 we quantify this overhead. In this example, a climate model was run sequentially for two time-steps and for a number of different chunk sizes. The results shown are for time-step 2; however, the time-step 1 results are nearly identical. The results presented are for the total time taken by convection and therefore show the relative cost of a grid column in different sized chunks. Small chunks are seen to have a significantly greater cost than larger chunks. A grid column allocated its own chunk will cost more than seven times that of a grid column in a large (>100 grid columns) chunk. Chunks larger than those presented gave no further performance improvement.

Figure 4: Total convection time with varying chunk size

Conclusions and Future Work

This paper has presented an effective algorithm for load balancing predictable dynamic load imbalance, which minimises communication between processors and attempts to minimise the number of messages. This results in nearly 40% performance increase for short wave radiation in the UKMO's operational forecast model. This algorithm will be integrated into the UM in the next release. It has also presented two effective algorithms for load balancing un-predictable dynamic load imbalance. These improve the performance of convection in the UKMO's operational forecast and climate models by between 30% and 40%. These algorithms are also expected to be applicable to other physics routines in the UM. The dynamic chunk algorithm is already being used operationally for convection.

For future work we intend to examine the overheads of the dynamic feedback algorithm in more detail, introducing modifications to increase the chunk size by merging remote chunks, if appropriate, and possibly attempting to reduce the dynamic memory allocation overhead which increases the costs of small chunk sizes. We also intend to implement the most appropriate algorithms in other physics routines and examine the limits of scalability of the model.