Parallel I/O Performance: From Events to Ensembles

Andrew Uselton†, Mark Howison†, Nicholas J. Wright†, David Skinner†, Noel Keen†, John Shalf†, Karen L. Karavanic⋆, Leonid Oliker†
† CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
⋆ Portland State University, Portland, OR 97207-0751

Abstract—Parallel I/O is fast becoming a bottleneck to the research agendas of many users of extreme scale parallel computers. The principal cause of this is the concurrency explosion of high-end computation, coupled with the complexity of providing parallel file systems that perform reliably at such scales. More than just being a bottleneck, parallel I/O performance at scale is notoriously variable, being influenced by numerous factors inside and outside the application, thus making it extremely difficult to isolate cause and effect for performance events. In this paper, we propose a statistical approach to understanding I/O performance that moves from the analysis of performance events to the exploration of performance ensembles. Using this methodology, we examine two I/O-intensive scientific computations from cosmology and climate science, and demonstrate that our approach can identify application and middleware performance deficiencies, resulting in more than 4× run time improvement for both examined applications.

I. INTRODUCTION

The era of petascale computing is one of unprecedented concurrency. This daunting level of parallelism poses enormous challenges for I/O systems because they must support efficient and scalable data movement between a relatively small number of disks and a large number of distributed memories on compute nodes. The root cause of an application's poor I/O performance may be found in the code itself, in a middleware library it relies upon, in the file system, or even in the configuration of the underlying machine running the application. Worse, there may be unexpected interplay between these possibilities, resulting in significant performance deterioration [13]. The performance of individual I/O events can vary by several orders of magnitude from run to run, making bottleneck isolation and optimization extremely challenging. It is therefore critical for the high performance computing (HPC) community to develop performance monitoring tools and methodologies that can help disambiguate the sources of I/O bottlenecks for supercomputing applications.

In this paper, we propose a statistical approach to understanding I/O behavior that transitions from the typical analysis of performance events to the exploration of performance ensembles. A key insight is that although the I/O rate an individual task observes may vary significantly from run to run, the statistical moments and modes of the performance distribution are reproducible. To efficiently collect parallel I/O statistics in a scalable fashion, we have extended an existing performance tool called IPM (Integrated Performance Monitoring) [19] to add I/O operation tracing (IPM-I/O). IPM is a scalable, portable, and lightweight framework for collecting, profiling, and aggregating HPC performance information. Using this tool, we evaluate the I/O behavior on large-scale Cray XT supercomputers of the Interleaved-Or-Random (IOR) micro-benchmark as well as two I/O-intensive numerical simulations from cosmology and climate modeling. For the MADbench application, which studies the cosmic microwave background, our approach helps isolate a subtle file system middleware problem, resulting in a performance improvement of 4×.
Additionally, our exploration into the 10,240-way I/O behavior of the global cloud system resolving model (GCRM) resulted in several successful optimizations of the application and its interaction with the underlying I/O library, yielding a net performance increase of over 4×. Overall, our work successfully demonstrates that the statistical analysis of ensembles can be used effectively to isolate complex sources of I/O bottlenecks on high-end computational systems.

A. Related Work

There are several performance tools that measure the I/O performance of scientific applications, including KOJAK [10], TAU [18], CrayPat [9], Vampir [16], and Jumpshot [8]. All are general purpose tools that include some I/O measurement and analysis capability. Across the variety of tools available there is a wide range in the scope of information collected, performance overhead, and impact on the application being studied. We chose to use IPM-I/O for this study because of its focus upon recording a limited set of metrics in a lightweight manner.

A number of published studies investigate I/O performance on high-end systems, as we do in this paper. One investigation [12] characterizes a large-scale Lustre installation, relating poor performance to default striping parameters. Another study of high-end file system performance [20] evaluated the I/O requirements and performance of several applications over 18 months. The often unexpected performance shifts as applications and systems changed over time are a strong argument for the use of scalable, application-centric I/O performance tools and methodologies, as presented in this paper.

Interpreting performance information using statistical techniques has been the subject of several previous works. For example, Ahn and Vetter used a variety of multivariate statistical techniques to analyse performance counter data [5]. This approach is similar in spirit to ours, in that it attempts to combine large amounts of performance information into a more compact representation; however, it does not focus on the specific challenges of understanding large-scale I/O behavior characteristics.

II. PLATFORMS AND TRACING TOOLS

In this section we briefly describe the features of our experimental platforms and the IPM-I/O trace tool.

A. Architectural Platforms

Most of the experiments conducted for this study used Franklin, the 9,660-node Cray XT4 supercomputer located at Lawrence Berkeley National Laboratory (LBNL). Each XT4 node contains a quad-core 2.1 GHz AMD Opteron processor, which is tightly integrated with the XT4 interconnect via a Cray SeaStar-2 ASIC through a 6.4 GB/s bidirectional HyperTransport interface. All the SeaStar routing chips are interconnected in a 3D torus topology, where each node has a direct link to its six nearest neighbors. Franklin employs the Lustre parallel file system for its temporary file systems, scratch and scratch2, each with 24 Object Storage Servers (OSSs) and 2 Object Storage Targets (OSTs) per OSS.

Additionally, several experiments were conducted on Jaguar, the Cray XT4/XT5 system located at Oak Ridge National Laboratory (ORNL), which also uses Lustre. The combined Jaguar system has 7,832 nodes in the XT4 portion and another 37,544 nodes in the XT5 portion. The results reported in Section IV use the XT4 portion, with 72 OSSs hosting 2 OSTs each for a total of 144 OSTs.

B. IPM-I/O Tracing

IPM has previously been used to understand the computation, communication, and scaling behavior of parallel codes [19], [21].
These studies focused on performance limitations related to the compute node resources (caches, memory, CPU), messaging and switch contention, and algorithmic limitations. I/O brings with it a new set of shared resources in which contention and performance variability may occur. Metadata locking, RAID subsystems, file striping, and other factors compound the complexity of understanding measured wall clock times for I/O. While contention for node and switch hardware resources does affect, for example, MPI performance, the impact of contention upon parallel I/O at scale is much more prevalent and significant, mostly because there are generally far fewer I/O resources than computational ones.

The work reported here employs newly developed I/O functionality for IPM to generate trace data in a lightweight, portable, and scalable manner. IPM-I/O works by intercepting an application's POSIX I/O calls into the libc library. To use IPM-I/O, an application is linked against the IPM-I/O library using the -wrap functionality of the GNU linker, which provides a mechanism for intercepting any library call; in this case it redirects POSIX I/O calls to IPM-I/O. During the application run, IPM-I/O collects timestamped trace entries containing the libc call, its arguments, and its duration. A look-up table of open file descriptors allows IPM-I/O to associate events interacting with the same file. By default IPM-I/O emits the entire trace. This approach has proved scalable for tracing the I/O of applications running with up to 10K MPI tasks without any significant slowdown being observed.
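To illustrate the interception mechanism the text refers to, the sketch below shows the general shape of a linker-level wrapper for write(). It is a minimal illustration of the standard GNU --wrap technique, not the IPM-I/O source; the trace record it prints is hypothetical. An application would be linked with something like cc app.o tracer.o -Wl,--wrap=write.

    /* Sketch of link-time interception: with -Wl,--wrap=write the linker
     * sends every call to write() into __wrap_write(), while the real
     * libc routine remains reachable as __real_write(). */
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    ssize_t __real_write(int fd, const void *buf, size_t count);

    ssize_t __wrap_write(int fd, const void *buf, size_t count)
    {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        ssize_t ret = __real_write(fd, buf, count);   /* forward to libc */
        gettimeofday(&t1, NULL);
        double dt = (t1.tv_sec - t0.tv_sec) + 1.0e-6 * (t1.tv_usec - t0.tv_usec);
        /* A real tracer such as IPM-I/O would append (timestamp, fd, count,
         * duration) to an in-memory trace buffer; printing is illustrative. */
        fprintf(stderr, "write fd=%d bytes=%zu time=%.6f s\n", fd, count, dt);
        return ret;
    }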
III. PERFORMANCE ENSEMBLES

Most parallel I/O in High Performance Computing (HPC) is distinct from the random access seen in transaction processing and database environments [11]. From profiling workloads at supercomputing centers [17] we have observed that HPC I/O in this environment frequently involves large-scale data movement, such as check-pointing the state of the running application. Furthermore, reviewing user requirements documents confirms this observation, as do complaints from HPC users.

In a production supercomputing environment, it is common for the observed details of I/O performance to change from one application run to the next. Factors affecting performance include the load from other jobs on the HPC system, task layout, and multiple potential levels of contention, among numerous others. Our goal is to determine robust ways of examining I/O performance that are stable under the changing conditions from one run to the next. To this end we examine the distribution of individual I/O rates observed during a test, and study the statistical properties of the distributions.

To present an example of this approach, we first examine results using the Interleaved-Or-Random (IOR) [14] code. IOR is a parametrized benchmark that performs I/O operations for a defined file size, transaction size, concurrency, I/O interface, etc. For our experiments, shown in Figure 1, IOR has been configured to run with 1024 tasks on 256 nodes of Franklin. Each task writes 512 MB to a unique offset within a shared file, and does so in a single write() call, followed by a barrier. This is then repeated five times. The IOR binary has been augmented with the IPM-I/O library to capture I/O events. In this context we refer to a particular choice of test parameters as an experiment and a specific instance of running that experiment simply as a run.

[Figure 1: IOR 512 MB transfers using 1024 processors. (a) I/O trace diagram: the y-axis represents the tasks (1-1024) and the x-axis represents time in seconds; blue indicates time spent in write() and white space indicates all other time. This diagram shows 5 phases of I/O. (b) Aggregate I/O rate: the aggregate data rate over all tasks (y-axis) plotted versus wall clock time (x-axis). (c) I/O time distribution: the two distributions each count (y-axis) events taking a given amount of time (x-axis), on the different Franklin parallel file systems, scratch and scratch2, for 512 MB transfers (R = 16 MB/s).]
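For concreteness, the per-task access pattern in this experiment is essentially the following. This is a hypothetical sketch of the configured pattern, not IOR's own source: the file name and offset layout are placeholders, pwrite() stands in for the seek-plus-write() pair, and error and short-write handling are omitted.

    /* Each of the 1024 MPI tasks writes 512 MB at its own offset in a
     * shared file, one write per phase, with a barrier after each phase. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <mpi.h>

    #define XFER   (512L * 1024 * 1024)   /* 512 MB per task per phase */
    #define PHASES 5

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(XFER);
        int fd = open("shared_file", O_CREAT | O_WRONLY, 0644);

        for (int phase = 0; phase < PHASES; phase++) {
            /* unique, non-overlapping region for this task in this phase
             * (one possible layout; IOR parameterizes this) */
            off_t offset = ((off_t)PHASES * rank + phase) * XFER;
            pwrite(fd, buf, XFER, offset);      /* single 512 MB transfer */
            MPI_Barrier(MPI_COMM_WORLD);        /* synchronous I/O phase  */
        }

        close(fd);
        free(buf);
        MPI_Finalize();
        return 0;
    }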
Figure 1(a) depicts the I/O traces for five runs of the experiment, all in a single job. Each task's time history is represented with a separate horizontal line, with task 0 at the top and task 1023 at the bottom. The x-axis is wall-clock time, and each trace proceeds from left to right showing the I/O pattern from the beginning to the end of the test. Each bar corresponds to the write (blue) of 512 MB, and its length gives the duration of the I/O. White space represents non-I/O activity, which is a barrier wait in these experiments. Since all of the write calls transfer the same amount, short bars represent fast I/O and longer bars slower I/O performance.

The trace in Figure 1(a) shows two phenomena common to HPC I/O at scale. The first is that the synchronous nature of many applications leads to vertically banded intervals during which parallel I/O occurs, i.e., the I/O happens in synchronous phases. As a consequence, the task that arrives last at the barrier defines the performance of the application for that phase. Thus a small number of events, or even a single event, can define the performance of an application. The second is that there is great variability in the performance of individual (theoretically identical) I/O events. This variability appears to be random in the sense that a given individual MPI task is not consistently slow or fast.

Figure 1(b) shows the instantaneous data rate, across the 1024 tasks taken together, over the life of the job. There does seem to be a consistent initial high plateau around 60 GB/s, followed by another brief plateau around 10 GB/s and a final long tail that is slower still. In this situation a statistical representation of the I/O events provides a clearer picture of the overall performance, as shown in Figure 1(c). This is a histogram of the distribution of completion times for the individual I/O events shown in Figure 1(a), which is from the scratch file system on Franklin. Figure 1(c) also shows the distribution from scratch2. We note that the statistical representations are almost identical, but the second run of the same experiment on the different file system produces a trace very different in its specific details, albeit with a similar overall run time.

Observe that each histogram has three prominent peaks corresponding to three distinct modes of behavior. The aggregate available data rate from either of the two file systems is limited to about 18 GB/s by the network infrastructure, and observed peak performance is somewhat short of that. Suppose each of the 1024 tasks got a fair share of this aggregate data rate. For a task to move 512 MB in 30 to 32 seconds, as the peak labeled "R" indicates, that task saw about 16 or 17 MB/s, close to the fair share of the peak available data rate. Note that the other two strong peaks are at the second and fourth harmonic of this rate, which implies that one task on the node (or two) took all the available I/O resources until it was done, with the other tasks waiting until it was complete. This implies a particular order to the processing in the Lustre parallel file system. Note further that these three peaks do not correspond to the three plateaus in Figure 1(b); those plateaus reflect filling local system buffers and then having the off-node communication throttle back the data rate. Overall, the modes in Figure 1(c) give a much more precise characterization of the I/O behavior, thus increasing the potential for appropriate diagnosis and remediation.

We conclude that while the performance characteristics of individual I/O events can behave erratically, the modes by which they occur are stable. It is this insight that will allow us to see past the seemingly random individual I/O performance measurements to address potential bottlenecks. This transition from the mechanistic analysis of isolated systems of events to the analysis of ensembles resembles the successful strategy of statistical physics, whereby large numbers of interacting systems can be described by the properties of their ensemble distributions such as moments, splittings, and line-widths.

[Figure 2: IOR 512 MB transfer using 1024 processors, shown as probability density functions f(t) of I/O time t (seconds), where the 512 MB is written via (a) two 256 MB write() calls, (b) four 128 MB calls, and (c) eight 64 MB calls. Note that the distributions become progressively narrower and more Gaussian.]

A. Statistical Analysis

In the upcoming discussions on the statistics of observed I/O times we allude to two commonplace observations about statistical ensembles. The first concerns order statistics; in particular, the Nth order statistic for a sequence of N observations is the largest value in the ensemble, and its distribution f_N(t) is given by

    f_N(t) = N F(t)^(N-1) f(t)                                    (1)

where f(t) is the probability density function for the I/O time of one observation, and F(t) is the corresponding cumulative probability distribution. f_N(t) gives the distribution for the longest observation given the underlying distribution. As N increases, the expression F(t)^(N-1) quickly converges to a step function picking out a point in the right-hand tail of the distribution f(t). The distributions in Figure 1(c), when normalized, give an approximation to the probability density function f(t) from Equation 1, and the cumulative probability distribution F(t) is the integral of f(t) (see Figure 5(a)).

The second observation concerns an application of the Law of Large Numbers. Let T_i, i ∈ {1 ... k}, be the times to completion for a sequence of k I/O operations governed by independent identical distributions with average µ, and let the completion time t_k for the sequence be the sum of the individual observed I/O times, t_k = T_1 + T_2 + ... + T_k. The expected value of t_k will converge to kµ as k increases. In other words, the more samples one takes from a distribution, the closer the sample average will be to the average of the underlying distribution.
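Under the same independence assumption, this narrowing can be made slightly more quantitative. Writing σ² for the variance of a single I/O time (a quantity not reported here, so the argument is generic):

    E[t_k]   = kµ
    Var(t_k) = kσ²

so the relative spread of a task's total I/O time falls off as

    sqrt(Var(t_k)) / E[t_k] = (σ/µ) / sqrt(k).

Because the run time is set by the slowest task, i.e., by the order statistic of Equation 1 applied to the distribution of t_k, a narrower per-task distribution pulls that worst case in toward the mean, which is the trend visible in Figure 2.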
Figure 2 shows three probability density functions for a sequence of experiments comparable to that illustrated in Figure 1. These are the distributions of t_k values measured over all the MPI tasks for three IOR experiments in which the 512 MB is sent to the file system in k = 2, 4, and 8 successive write() calls (using 256, 128, and 64 MB buffers respectively), with no barrier until all 512 MB has been written. The run time for an experiment, and therefore the reported data rate, is determined by the slowest I/O operation amongst all the tasks. In each case the slowest is the Nth order statistic mentioned in Equation 1 for the corresponding t_k. As the value of k increases, the slowest running task becomes a little faster. In the case of a single 512 MB write, the run time is approximately 45 seconds (see Figure 1(c)) and the reported data rate for the 512 MB experiments is around 11,610 MB/s. The reported data rate for the 256 MB experiments is 12,016 MB/s, or about 3% faster. More and smaller transfers continue the trend, with the 128 MB experiments getting 13,446 MB/s and the 64 MB experiments achieving 13,486 MB/s, a 16% speedup.

Since the underlying I/O activity in each of these experiments is the same, it is reasonable to think that dividing the I/O up into multiple write() calls would have little or no effect on the overall performance. In fact one might even expect a small penalty for the extra system call processing. However, this is not the case. The worst-case behavior improves as k increases because the distributions are getting narrower. That in turn is a consequence of the Law of Large Numbers. In other words, the more opportunities a task has to sample, the more likely it is to have average performance.

We now explore two scientific computations, MADbench and GCRM, and show how our statistical methodology can be used to identify bottlenecks and increase I/O performance.

IV. MADBENCH I/O ANALYSIS

We apply the foregoing insights to the analysis of two important HPC applications, starting with the Microwave Anisotropy Data-set Computational Analysis Package (MADCAP). MADbench is the second generation of an HPC benchmarking tool [7], [15] derived from MADCAP, an application that focuses on the analysis of massive cosmic microwave background (CMB) data sets. The CMB is the earliest possible image of the Universe, as it was only 400,000 years after the Big Bang. Extremely tiny variations in the CMB temperature and polarization encode a wealth of information about the nature of the universe, and a major effort continues to determine precisely their statistical properties. The challenge here is twofold: first, the anisotropies are extraordinarily faint, at the milli- and micro-K level on a 3K background; and second, it is necessary to measure their power on all angular scales, from the all-sky to the arc minute. As illustrated in Figure 3(a), obtaining sufficient signal-to-noise at high enough resolution requires the gathering — and then the analysis — of extremely large data sets. Therefore CMB analysis is often extremely I/O intensive.

[Figure 3: (a) Visualization of a high resolution cosmic microwave background sky map, used by MADCAP to compute the angular power spectrum [6]. (b) A pseudocolor plot of a wind velocity variable from a GCRM data set displayed using the VisIt visualization tool.]

MADbench is a lightweight benchmark derived from MADCAP that abstracts the I/O, communication, and computation characteristics to facilitate straightforward performance tuning. It is an out-of-core solver that has three phases of computation. During the first phase it generates a series of matrices and writes them to disk one by one. In the middle phase, MADbench reads each matrix back in, multiplies it by an inverse correlation matrix, and writes the result back out. Finally, MADbench reads the result matrices and calculates the trace of their product. In the experiments reported here all computation and communication has been effectively turned off, so we can focus exclusively on the I/O component.

Each transfer of a matrix to or from the file system consists of a single large write or read, which is about 300 MB per MPI task in the experiments reported here. The write or read is performed using an MPI-IO call (MPI_File_write and MPI_File_read). Each task manages and computes values for its portion of a sequence of such matrices — eight of them in these experiments — and performs I/O to an exclusive region within a shared file. All matrices for a task are sequentially ordered in a contiguous file region (modulo an alignment parameter, which is 1 MB in these experiments). Overall, the I/O pattern from each MPI task looks like this: 8× (write 300 MB), 8× (seek, read 300 MB, seek, write 300 MB), 8× (read 300 MB). This I/O pattern is atypical in several ways: it is sensitive to both read and write I/O rates; most of the available memory is already in use, circumventing file system caching efficiencies; and the seek, read, seek, write pattern in the middle phase of the computation is not a streaming I/O pattern. Previous work [7] shows that MADbench exhibits significantly different performance characteristics under various choices of operating mode and hardware platform.
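The per-task pattern just described corresponds roughly to the following MPI-IO skeleton. This is a hypothetical sketch, not MADbench source: the function and offset helper are invented for illustration, the result matrices are written back to an illustrative offset, and alignment padding, data generation, and error handling are omitted.

    /* Skeleton of the per-task MADbench-style pattern: eight writes, then
     * eight read/write pairs, then eight reads, each ~300 MB, into this
     * task's exclusive region of a shared file opened by the caller. */
    #include <mpi.h>

    #define NMAT  8
    #define MATSZ (300 * 1024 * 1024)             /* ~300 MB per matrix */

    static MPI_Offset mat_offset(int rank, int m)  /* contiguous per-task region */
    {
        return ((MPI_Offset)rank * NMAT + m) * MATSZ;
    }

    void madbench_like_io(MPI_File fh, int rank, char *buf)
    {
        MPI_Status st;
        for (int m = 0; m < NMAT; m++) {           /* phase 1: 8 writes */
            MPI_File_seek(fh, mat_offset(rank, m), MPI_SEEK_SET);
            MPI_File_write(fh, buf, MATSZ, MPI_BYTE, &st);
        }
        for (int m = 0; m < NMAT; m++) {           /* phase 2: seek, read, seek, write */
            MPI_File_seek(fh, mat_offset(rank, m), MPI_SEEK_SET);
            MPI_File_read(fh, buf, MATSZ, MPI_BYTE, &st);
            MPI_File_seek(fh, mat_offset(rank, m), MPI_SEEK_SET);
            MPI_File_write(fh, buf, MATSZ, MPI_BYTE, &st);
        }
        for (int m = 0; m < NMAT; m++) {           /* phase 3: 8 reads */
            MPI_File_seek(fh, mat_offset(rank, m), MPI_SEEK_SET);
            MPI_File_read(fh, buf, MATSZ, MPI_BYTE, &st);
        }
    }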
[Figure 4: MADbench 256-task experiment on Franklin (2200 seconds) and Jaguar (275 seconds), showing the trace data, aggregate I/O rate, and I/O histogram for each platform: (a) Franklin trace, (b) Franklin aggregate I/O rate, (c) Franklin histogram, (d) Jaguar trace, (e) Jaguar aggregate I/O rate, (f) Jaguar histogram. Franklin's slow reads are seen in the broad right shoulder of the read rate distribution in (c).]

A. Trace-Based Analysis

Figure 4 depicts the I/O traces for two MADbench single-file experiments at 256 tasks on the Franklin and Jaguar XT4 systems. (Traces at other concurrencies show qualitatively similar behavior.) The I/O traces in Figures 4(a) and 4(d) were generated via IPM-I/O as described in Sections II-B and III. In these figures each bar corresponds to the write (blue) or read (red) of a 300 MB matrix, and its length gives the duration of the I/O. (The attentive reader will note that the middle phase actually begins with two reads and ends with two writes; see [7] for details.) White space represents a barrier wait. Since all of the matrices are the same size, short bars represent fast I/O and longer bars slower I/O, and the overall performance is again dominated by the slowest individual performers.

The I/O hardware and software infrastructure is different enough between the two systems that a significant difference both in I/O pattern and in the aggregate time to run the application is apparent. Jaguar (Figure 4(d)) shows only modest variability in I/O rate from one task to the next, whereas Franklin (Figure 4(a)) shows a much larger variation in I/O performance from one task to the next. Also, the reads in the sequence (seek, read, seek, write) in the middle third of the computation on Franklin are sometimes slow, whereas those at the end, where the sequence is simply eight reads one after the other, show little variability. (We note here that this is not simply a quirk of a single run; it occurs at multiple concurrencies and is reproducible.)

B. Data-Rate Analysis

The slow reading tasks in Franklin's I/O trace, Figure 4(a), that cause the long delays stand out sharply, and those diagrams give an intuitive view of the behavior of the application. The difference between the performance on the two machines is also illustrated by comparing Figures 4(b) and 4(e), which show the instantaneous read and write rates on the two machines. On Franklin the overall duration of each read increases from the fourth read (read4) to the eighth read (read8), and each of these reads has a long tail that continues until the next write phase.

[Figure 5: MADbench 256-task Franklin experiment before and after the middleware update. (a) Read rate deterioration: the fraction of I/Os completed versus time deteriorates from read 4 to read 8, leading directly to the discovery of a subtle system software (Lustre) bug. (b) Read performance (histogram) before and after the bug correction. (c) Franklin trace after the update, showing removal of the catastrophic delays.]

Figures 4(c) and 4(f) show the histograms for Franklin and Jaguar respectively. In this case the histograms are presented as log-log plots so that the different modes, especially the slowest modes, stand out. The histograms for write() calls on Franklin and Jaguar in Figure 4 are similar, and both show four strong peaks on the left with less prominent features trailing to the right. Note that the two write (blue) distributions in Figures 4(c) and 4(f) display similar performance characteristics, while the read (red) distributions show a markedly different pattern from each other. For the Franklin experiment the slowest read() calls vary from 30 to 500 seconds. It is these expensive reads that stand out as anomalies in Figures 4(a) and 4(b). The reads in Figure 4(c) centered around the peak at 15 seconds do not show the usual rounded-peak shape expected for a mode with some variability. Instead, this is either several poorly resolved peaks next to each other, or some broad and flat mode unlike the rest.

C. Performance Resolution

The slow reads on Franklin in Figures 4(a) and 4(c) all occur in the fourth through eighth reads, as shown in Figure 4(b). In Figure 5(a) those reads are presented separately. Figure 5(a) presents the cumulative probability distribution F_p, p ∈ {4, 5, 6, 7, 8}, for the reads in these phases. That is, each curve in Figure 5(a) gives the progress of I/O during the phase versus time.
Not only are the slow reads confined to reads 4 through 8, but they get progressively worse. These two insights lead directly to the source of the bottleneck. The MADbench I/O pattern aligns each I/O operation to a 1 MB boundary, and that produces a small gap between the end of each I/O region and the next. This strided pattern is one that the Lustre parallel file system recognizes and takes into account. In particular, the strided I/O pattern is recognized by Lustre on its third appearance, and subsequent reads that match the stride (the fourth and after) get a larger read-ahead window. In the phase where reads alternate with writes, the client-side system buffers were all full, and Lustre issued one-page (4 kB) reads due to the lack of system memory resources. This large number of small reads led to the expensive delays. The later reads did not suffer this effect because system memory was not being filled with interleaved writes.

As a result of our investigation, a patch was created for the Lustre file system that avoids the erroneous window-size calculation, and that patch was installed on the Franklin system. The patch removed strided read-ahead detection entirely, and with it the associated expensive delays. This improved the overall performance by more than 4.2×. In Figure 5(b) the distribution for reads after the Lustre patch is applied is superimposed on the read distribution from Figure 4(c). It is clear that the problem has been resolved, as also seen in the trace file of Figure 5(c), where the job run time has been reduced from 2200 seconds to 520, and the trace is comparable to that obtained from Jaguar.

V. GCRM I/O ANALYSIS

The Global Cloud Resolving Model (GCRM, Figure 3(b)) is a climate simulation developed by a team of scientists at Colorado State University led by David Randall [2]. It runs at resolutions fine enough to accurately simulate cloud formation and dynamics and, in particular, resolves cirrus clouds, which strongly affect weather patterns, at finer than 4 km resolutions. Underlying the GCRM simulation is a geodesic-grid data structure containing nearly 10 billion total grid cells, an unprecedented scale that challenges existing I/O strategies.

[Figure 6: GCRM using 10,240 tasks writing to a shared file, showing the trace graph, aggregate I/O write rate, and histogram distribution (data versus metadata writes, normalized to MB/sec and sec/MB) for the baseline configuration and three progressive optimizations: (a–c) baseline configuration; (d–f) data written by only 80 tasks; (g–i) writes padded and aligned to 1 MB boundaries; (j–l) metadata writes aggregated into a few large writes.]
Researchers at Pacific Northwest National Lab [1] and LBNL [3] have developed a data model, I/O library, and visualization pipeline for these geodesic grids, as well as a GCRM I/O kernel for tuning I/O performance. Initially, the I/O library was able to achieve only around 1 GB/s, a fraction of the available write rate on Franklin. In order for I/O to consume less than 5% of the total GCRM simulation run time at 4 km resolution, the GCRM I/O library must sustain at least 2 GB/s, and preferably more to facilitate scaling to finer resolutions. Therefore we employed the diagnostic tools and methods discussed in earlier sections to investigate and improve performance. Based on this analysis, we worked with the Hierarchical Data Format (HDF) Group to optimize the GCRM I/O kernel.

Our baseline configuration uses 10,240 MPI tasks, each writing the same amount of data, representing different GCRM variable types. This leads to an I/O pattern with three writes of a single 1.6 MB record, each followed by a barrier, then three writes of six 1.6 MB records, followed by another barrier. All the data was written to a single shared file using H5Part [4], a simple data scheme and veneer API built on top of the HDF5 library.

Figures 6(a)–6(c) present the trace graph, write rates, and histogram for the baseline. Figure 6(a) shows the limited value of a trace graph at this scale; resolving in detail each of the 10,240 stacked horizontal lines is extremely difficult. In particular, it is not apparent that most of the graph is actually white space. HDF5 metadata operations account for the read activity seen in red in the trace graph. Figure 6(b) shows that the baseline achieved only a fraction of Franklin's available 16 GB/s aggregate write rate. The peak rate is barely half that, and most of the run time was spent at rates of less than 2 GB/s. Once again, the I/O pattern is governed by the worst-case behavior.

The statistical view is essential to understanding the behavior of the code. In the histogram (Figure 6(c)), we plot separate distributions for two buffer sizes, one corresponding to GCRM records (1.6 MB, blue) and the other to HDF5 metadata (<3 KB, red). Unlike the experiments in the previous sections, there are multiple transfer sizes plotted in the histograms, so we normalize the histograms to present MB/sec along the top and sec/MB along the bottom. Faster writes still appear on the left and slower ones on the right. Each of the 10,240 tasks should ideally access a fair share of approximately 1.6 MB/s, given the available 16 GB/s aggregate rate. Unfortunately, the baseline exhibits a distribution of per-task data rates with broad peaks well below 1 MB/s and extending to around 0.5 MB/s. The sustained write rate over the entire run time was only about 1 GB/s.

The first optimization follows directly from the insight gained in the experiments of Figure 2. Because each task is executing a small number of writes, we can benefit from a "collective buffering" scheme (similar to that of MPI-IO) in which the data is aggregated from all tasks to a smaller subset of I/O tasks using MPI communication (stage one) and then written to disk using only the I/O tasks (stage two). In previous IOR tests on Franklin (not reported here), we observed that as few as 80 tasks can saturate the I/O subsystem. Therefore, we tested a collective buffering scheme (stage two only) by running the I/O kernel with 80 tasks, each issuing 10,240/80 = 128× as many write calls. The number, size, and alignment of the write calls remained unchanged from the baseline, as did the total amount of data written.
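A minimal sketch of the two-stage aggregation idea follows. It is hypothetical code, not the GCRM I/O kernel or H5Part: the function name, file layout, and aggregator count are placeholders, it assumes the task count divides evenly among aggregators, and it uses a plain POSIX write for stage two.

    /* Collective buffering sketch: every task sends its ~1.6 MB record to an
     * aggregator; only the aggregators (e.g. 80 of them) touch the file. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <mpi.h>

    #define RECORD (1600 * 1024)                  /* ~1.6 MB per task */

    void aggregated_write(const char *path, const void *record,
                          int naggr, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int group = size / naggr;                 /* tasks per aggregator   */
        int color = rank / group;                 /* which aggregator I use */
        MPI_Comm sub;
        MPI_Comm_split(comm, color, rank, &sub);  /* one subcomm per group  */

        int subrank;
        MPI_Comm_rank(sub, &subrank);
        char *bundle = NULL;
        if (subrank == 0)
            bundle = malloc((size_t)group * RECORD);

        /* stage one: gather the group's records onto the aggregator */
        MPI_Gather((void *)record, RECORD, MPI_BYTE,
                   bundle, RECORD, MPI_BYTE, 0, sub);

        /* stage two: only aggregators write, each to its own file region */
        if (subrank == 0) {
            int fd = open(path, O_CREAT | O_WRONLY, 0644);
            pwrite(fd, bundle, (size_t)group * RECORD,
                   (off_t)color * group * RECORD);
            close(fd);
            free(bundle);
        }
        MPI_Comm_free(&sub);
    }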
Performance was improved due to the Law of Large Numbers advantage described in the IOR experiments in Section III. The results of this optimization are shown in Figures 6(d)–6(f); the total run time dropped from 310 seconds to 190 seconds, a 1.6× speedup. Figure 6(e) shows that the peak data rate did not improve, but the overall rate is more consistent, with less fall-off. The peak of the per-task rate distribution (Figure 6(f)) is 100 MB/s, which corresponds to an aggregate 8 GB/s for the 80 tasks. The worst-case per-task rate has improved: the 128 records that each task transfers prior to the barrier are more likely to average out in performance. In addition to exploiting the Law of Large Numbers advantage, this optimization also reduced the number of tasks communicating with the 48 I/O servers from 10,240 to 80, which likely reduces contention and improves I/O server queue depths and service times. However, even with this optimization, the I/O kernel obtained a peak data rate of only 5 GB/s and a sustained write rate of 1.8 GB/s. Thus, we continued our investigation to identify additional opportunities for optimization.

In previous IOR experiments on Franklin, we established that the Lustre file system prefers aligned offsets when writing to a shared file. The Lustre client transfers data to the I/O servers in 1 MB stripes, yet an examination of our trace data (not shown) revealed that the GCRM records were not aligned with these stripes. Using HDF5 library calls, we padded and aligned these writes to 1 MB boundaries. The results, shown in Figures 6(g)–6(i), give a run time of 150 seconds, less than half that of the baseline. Figure 6(h) shows that the peak write rate has improved, and the "bulge" in the Figure 6(f) distribution between 1 MB/s and 0.1 MB/s has disappeared, leaving the distribution more closely centered around its peak. Similarly, the worst-case per-task rate now lies at 1 MB/s rather than 0.5 MB/s. The metadata operations also benefited somewhat from alignment, with a peak now around 1 MB/s (Figure 6(i)).

From Figure 6(g), it is clear that the total run time was dominated by the serialized metadata operations on task 0. Our final optimization aggregates the metadata writes from many <3 KB writes into a single 1 MB write that is deferred until file close, rather than at the end of each run. The results of this optimization are shown in Figures 6(j)–6(l). The large gaps caused by serialized writing on task 0 have disappeared, and the total run time has decreased to 75 seconds, more than a 4× improvement over the baseline.
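Alignment of the kind described above is typically requested through HDF5 file access properties; the sketch below shows the general shape of such tuning. The property values are illustrative only, the function is hypothetical rather than the actual GCRM I/O library configuration, it assumes a parallel HDF5 build, and the deferral of metadata writes until file close discussed in the text was implemented with the HDF Group inside the library and is not reproduced here.

    /* Open a shared HDF5 file through MPI-IO with 1 MB alignment so raw-data
     * allocations start on Lustre stripe boundaries, and with a larger
     * metadata block size so small metadata allocations are coalesced. */
    #include <hdf5.h>
    #include <mpi.h>

    hid_t open_tuned_file(const char *path, MPI_Comm comm)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

        H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);   /* parallel HDF5 driver  */
        H5Pset_alignment(fapl, 0, 1024 * 1024);        /* align objects to 1 MB */
        H5Pset_meta_block_size(fapl, 1024 * 1024);     /* coalesce metadata     */

        hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        H5Pclose(fapl);
        return file;
    }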
VI. CONCLUSIONS

With the exponential growth of high-fidelity sensor (e.g., MADbench) and simulated (e.g., GCRM) data, the scientific community is increasingly reliant on ultrascale HPC resources to handle its data analysis requirements. To use such extreme computing power effectively, the I/O components must be designed in a balanced fashion, as any bottleneck will quickly render the platform intolerably inefficient. However, identifying the root cause of I/O performance deficiencies is an increasingly challenging task, as the source of degradation may be found in the application code, middleware library, file system, underlying architecture — or some combination thereof.

To address this concern, we have developed a statistical approach for understanding I/O performance that shifts the analysis from the examination of individual performance events to the study of performance ensembles. To collect trace data in a production environment, we added I/O tracing functionality to the IPM profiling tool. Results on large-scale HPC systems demonstrated that IPM-I/O allows lightweight, portable, and scalable tracing, effectively collecting I/O statistics for our largest 10,240-way simulation.

Statistical analysis of the trace data produced by IPM-I/O shows that the modes and moments revealed by the distribution of I/O times can contribute directly to understanding an application's I/O behavior and potential bottlenecks. An examination of the I/O performance statistics for the IOR benchmark revealed an interesting and surprising I/O boost obtained by taking advantage of the Law of Large Numbers. Next, we examined the MADbench cosmology application, which suffered anomalous performance behavior on the Franklin XT4 platform. Using IPM-I/O data collection and our performance histogram methodology allowed us to identify a Lustre file system bug which caused an erroneous read-ahead window. The ability to isolate this subtle I/O interaction between the application and the middleware layer highlights the efficacy of our ensemble approach, and resulted in a 4.2× MADbench speedup once the appropriate Lustre patch was installed. Finally, we explored the I/O behavior of a large-scale 10,240-way GCRM climate modeling code. Through our statistical I/O performance analysis, we discovered a series of application-level optimizations that dramatically reduced the overall run time from 310 to 75 seconds, an improvement of over 4×.

The three cases described in this paper illustrate the power of our statistics-based approach. We fully expect that in the future, as the number of components in HPC systems increases, such approaches will become essential so that performance measurements can still be tractably recorded and analysed. In fact, the reproducible nature of our performance ensembles suggests that in most cases it may not even be necessary to store the majority of the performance data, just enough to define the distribution. Future work will build this statistical approach directly into IPM-I/O, thus moving the data capture from an I/O tracing paradigm to an I/O profiling paradigm. This transition promises to improve the scalability of our method in precisely the same way that program counter profiling is more scalable than execution tracing. With the ability to recognize modes and moments of the performance distribution, the IPM-I/O framework will be expanded to detect an application's I/O patterns, thus providing key information to the underlying file system that can be leveraged to improve I/O behavior.

VII. ACKNOWLEDGMENTS

We would like to thank Quincey Koziol and John Mainzer of the HDF Group for their assistance with tuning the HDF5 library for the GCRM I/O pattern. We would also like to thank Kitrick Sheets of Cray for his support in testing the file system modifications described in Section IV. Portions of this work were completed while Dr. Karavanic was at the Performance Modeling and Characterization (PMaC) Laboratory, San Diego Supercomputer Center, La Jolla, CA 92092-0505.
This work was supported by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231; by the Director, Office of Advanced Scientific Computing Research, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 through the Scientific Discovery through Advanced Computing (SciDAC) program's Visualization and Analytics Center for Enabling Technologies (VACET); and by NSF contracts CNS-0325873 and OCI-0721397. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357; resources of the National Energy Research Scientific Computing Center, under Contract No. DE-AC02-05CH11231; and resources of the National Center for Computational Sciences, under Contract No. DE-AC05-00OR22725.

REFERENCES

[1] Community Access to Global Cloud Resolving Model and Data. http://climate.pnl.gov/.
[2] Design and Testing of a Global Cloud Resolving Model. http://kiwi.atmos.colostate.edu/gcrm/.
[3] Global Cloud Resolving Model Simulations. http://vis.lbl.gov/Vignettes/Incite19/.
[4] H5Part: A Portable High Performance Parallel Data Interface to HDF5. http://vis.lbl.gov/Research/AcceleratorSAPP/.
[5] D. H. Ahn and J. S. Vetter. Scalable analysis techniques for microprocessor performance counter metrics. In Supercomputing '02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1-16, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.
[6] J. Borrill. MADCAP: The Microwave Anisotropy Dataset Computational Analysis Package. In 5th European SGI/Cray MPP Workshop, Bologna, Italy, 1999.
[7] J. Borrill, L. Oliker, J. Shalf, H. Shan, and A. Uselton. HPC global file system performance analysis using a scientific-application derived benchmark. Parallel Computing, in press, 2009.
[8] A. Chan, W. Gropp, and E. Lusk. An efficient format for nearly constant-time access to arbitrary time intervals in large trace files. Scientific Programming, 16:155-165, 2008.
[9] L. DeRose, B. Homer, and D. Johnson. Detecting application load imbalance on high end massively parallel systems. In Proceedings of Euro-Par 2007, LNCS 4641, pages 150-159, 2007.
[10] M. Geimer, F. Wolf, B. J. Wylie, E. Ábrahám, D. Becker, and B. Mohr. The SCALASCA performance toolset architecture. In International Workshop on Scalable Tools for High-End Computing (STHEC), Kos, Greece, June 2008.
[11] G. Griem, L. Oliker, J. Shalf, and K. Yelick. Identifying performance bottlenecks on modern microarchitectures using an adaptable probe. In Proc. 3rd International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS), Santa Fe, New Mexico, Apr. 26-30, 2004.
[12] R. Hedges, B. Loewe, T. McLarty, and C. Morrone. Parallel file system testing for the lunatic fringe: the care and feeding of restless I/O power users. In IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies (MSST), 2005.
[13] I/O Tips. http://www.nccs.gov/computing-resources/jaguar/debugging-optimization/io-tips/.
[14] The ASCI I/O stress benchmark. http://sourceforge.net/projects/ior-sio/.
[15] MADbench2: A Scientific-Application Derived I/O Benchmark. http://outreach.scidac.gov/projects/madbench/.
[16] M. Mueller et al. Developing scalable applications with Vampir. http://www.vi-hps.org/datapool/page/18/mueller.pdf.
[17] H. Shan, K. Antypas, and J. Shalf. Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark. In Proc. SC2008: High Performance Computing, Networking, and Storage Conference, Austin, TX, Nov. 15-21, 2008.
[18] S. Shende, A. Malony, and D. Cronk. Observing parallel phase and I/O performance using TAU. In DoD HPCMP Users Group Conference 2008, July 14-17, 2008.
[19] D. Skinner. Integrated Performance Monitoring: A portable profiling infrastructure for parallel applications. In Proc. ISC2005: International Supercomputing Conference, Heidelberg, Germany, 2005.
[20] E. Smirni, R. Aydt, A. Chien, and D. Reed. I/O requirements of scientific applications: An evolutionary view. In H. Jin, T. Cortes, and R. Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, pages 576-594. IEEE Press, Piscataway, New Jersey, 2002.
[21] N. J. Wright, W. Pfeiffer, and A. Snavely. Characterizing parallel scaling of scientific applications using IPM. In The 10th LCI International Conference on High-Performance Clustered Computing, March 10-12, 2009.