
Owl: next generation system monitoring

2005, Conference On Computing Frontiers


Owl: Next Generation System Monitoring

Martin Schulz, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, [email protected]
Brian S. White and Sally A. McKee, Computer Systems Lab, Cornell University, {bwhite,sam}@csl.cornell.edu
Hsien-Hsin S. Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology, [email protected]
Jürgen Jeitner, Department of Informatics, Technische Universität München, [email protected]

ABSTRACT

As microarchitectural and system complexity grows, comprehending system behavior becomes increasingly difficult, and often requires obtaining and sifting through voluminous event traces or coordinating results from multiple, nonlocalized sources. Owl is a proposed framework that overcomes limitations faced by traditional performance counters and monitoring facilities in dealing with such complexity by pervasively deploying programmable monitoring elements throughout a system. The design exploits reconfigurable or programmable logic to realize hardware monitors located at event sources, such as memory buses. These monitors run and write back results autonomously with respect to the CPU, mitigating the system impact of interrupt-driven monitoring or the need to communicate irrelevant events to higher levels of the system. The monitors are designed to snoop any kind of system transaction, e.g., within the core, on a bus, across the wire, or within I/O devices.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Measurement Techniques
General Terms: Performance, Measurement
Keywords: Autonomous Performance Monitoring, Reconfiguration, Performance Analysis
CF'05, May 4–6, 2005, Ischia, Italy. Copyright 2005 ACM 1-59593-018-3/05/0005.

1. INTRODUCTION

As microarchitectural and system complexity grows, comprehending system behavior becomes increasingly difficult. Users nonetheless anticipate future autonomic systems that adapt dynamically to provide greater performance, avoid or repair transient faults, intercept adversarial attacks, and reduce system management costs. A substantial gap remains between what system designers can now provide—in terms of introspective data itself and the means of processing, analyzing, and visualizing it—and what users expect and will soon demand.

A simple example of application memory performance monitoring highlights the disparity between the current state of the art, where probes report only symptoms of poor performance, and the goal of a self-aware system that discovers the cause. Rather than placing the burden on a user to query hardware registers or instrument binaries, a continuously monitored [1] system might employ an unobtrusive background monitor that detects spikes in some metric, e.g., cache miss rate. The monitor then instantiates additional monitors on the L1-L2 and memory buses to capture individual L1 and L2 misses, which can be correlated. For example, thrashing between the caches for specific data indicates conflict misses. This more precise miss characterization helps in selecting an appropriate optimization, such as cache-conscious data placement [4].
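To make this scenario concrete, the sketch below outlines one possible escalation policy in C form; every identifier and threshold in it (owl_spawn, owl_read, owl_thrashing, the 20% trigger) is a hypothetical placeholder used for illustration only, not an existing Owl API.

    /* Illustrative sketch of the escalation policy described above; every    */
    /* identifier here (owl_spawn, owl_read, owl_thrashing, the threshold)    */
    /* is a hypothetical placeholder, not an existing API.                    */
    #include <stdbool.h>
    #include <stdio.h>

    #define MISS_RATE_SPIKE 0.20        /* assumed trigger: 20% L1 miss rate  */

    typedef struct owl_monitor owl_monitor;              /* opaque handle     */
    extern owl_monitor *owl_spawn(const char *where, const char *event);
    extern double       owl_read(owl_monitor *m);        /* current metric    */
    extern bool         owl_thrashing(owl_monitor *l1l2, owl_monitor *mem);

    void background_monitor_step(owl_monitor *l1_miss_rate)
    {
        if (owl_read(l1_miss_rate) < MISS_RATE_SPIKE)
            return;                                      /* nothing unusual   */

        /* Spike detected: escalate by instantiating monitors on the L1-L2    */
        /* and memory buses to capture and correlate individual misses.       */
        owl_monitor *l1l2 = owl_spawn("l1-l2-bus",  "miss_address");
        owl_monitor *mem  = owl_spawn("memory-bus", "miss_address");

        /* The same blocks ping-ponging between the caches indicates conflict */
        /* misses, suggesting cache-conscious data placement as a remedy.     */
        if (owl_thrashing(l1l2, mem))
            fprintf(stderr, "conflict misses: try cache-conscious placement\n");
    }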
Monitoring is most often applied for performance considerations, but future systems will also demand it for security, fault detection, and maintenance. Consider memory fault isolation, where loads, stores, and jumps are guarded to ensure the validity of the memory access or jump target. This may be done by manual instrumentation [30] or by a logically similar hardware method that “macro expands” instructions into guarded equivalents [6]. An active monitor can provide this functionality with no overhead. When an access check fails, the monitor signals the operating system, which might either terminate the process or decide to enlist more monitors to sandbox the potentially malicious code.

Current monitoring facilities are ill-equipped to handle the above scenarios. Traditional hardware monitors are restricted to a fixed set of events and cannot perform sophisticated, online analysis. Their deferral to software for processing comes at the expense of costly and frequent interrupts or of loss of accuracy due to sampling. Further, they rely on proper handling of software events, which is clearly not adequate for monitoring software and system faults. Software solutions [13, 17, 20, 35] rely on sampling, offline traces, or heavyweight instrumentation, resulting in high system overhead or loss in accuracy. In order to provide the introspection necessary to understand and efficiently use future architectures, we need new hardware solutions for system monitoring. In recent years, several academic projects have focused on such hardware (e.g., the solutions by Prvulovic and Torrellas [22] or Xu et al. [33]). However, each of these approaches provides specialized monitoring capabilities for specific problems or questions. No general framework has yet been proposed.

In this paper, we propose Owl, a generic and programmable monitoring framework. It consists of reconfigurable or programmable logic elements deployed throughout the system. The user can program these monitors to acquire performance data without system interference, perform application-specific data analysis, and write results directly into main memory without interrupting the processor. The latter is necessary to minimize system perturbation. Owl's programmability is flexible enough to encompass existing performance counters and extant or proposed monitoring techniques, as well as new, previously infeasible monitoring applications. The latter include monitoring addresses on memory buses, collecting a complete history of individual cache replacement decisions, snooping I/O bus activity, or checking assertions on all memory accesses.

As its key design principle, Owl differentiates between user-defined analysis modules, which perform monitoring and analysis such as data aggregation and compression, and monitor capsules, which provide a standard interface between the module and the hardware. The programmability and autonomy of the modules support event processing close to the source, domain-specific monitoring, and the ability to react or adapt to observed events or application phases. Modules are implemented in reconfigurable or programmable logic, such as programmable microengines or FPGAs, within capsules. This flexibility enables less invasive hardware implementations of existing software techniques and, ultimately, more sophisticated monitoring than previously possible.
The pervasive deployment of capsules throughout the system provides a user with alternate views of the same event (e.g., following a page miss through the caches to disk) or simultaneous views of correlated events (e.g., declining IPC and branch prediction accuracy). The capsules' standardization allows the same analysis algorithm to examine events throughout the system, despite potential dissimilarity among monitored devices. Flexibility in deploying capsules results from the simple assumption that they are only able to observe system events in the form of transactions. Thus, they may be attached to any transaction interface, including memory buses, I/O buses, or network interfaces.

In Section 2 we describe existing hardware approaches for system monitoring. In Section 3 we present requirements for a next-generation monitoring facility and describe the design of Owl, a framework leveraging autonomous, programmable monitors to achieve those requirements. In Section 4 we describe several sample applications for novel memory monitoring capabilities, including efficient memory access logging, memory access characterization, and dynamic pattern recognition. In Section 5 we present results discussing the small system perturbation caused by using Owl, and we show the hardware complexity of a sample Owl module.

2. RELATED WORK

Most current architectures include at least rudimentary hardware assists for system monitoring, usually in the form of counter registers. The UltraSPARC IIi [27] is an archetype of more advanced systems: it exports two performance counters that can monitor any of 20 predefined events. Counters may be programmed to raise an interrupt upon overflow. Counter-based techniques suffer common shortcomings [25]: too few counters, sampling delay, and lack of address profiling.

Modern systems try to address these deficiencies. For instance, the Pentium 4 [15, 26] comes with 18 performance counters. In addition, it tags µops when they trigger certain performance events. These events are not counted until the µops are retired, ensuring that spurious events from speculative instructions do not pollute samples. This microinstruction-targeted approach also overcomes sampling delay. However, these mechanisms can only be employed to collect aggregate statistics using sampling. It is not possible to react to single events or to collect additional data, e.g., the load target addresses of memory accesses. This prohibits a direct correlation of observed events with the data structures causing the events.

The Itanium Processor Family [14] and other newer systems overcome this deficiency and allow the detection of such events, e.g., memory accesses or branch mispredictions. The access mechanisms provide microarchitectural event data, but these data are delivered to the consuming software through an exception for each event. The process using the information experiences frequent interrupts, and system perturbation occurs at many levels. The overhead of these mechanisms, particularly with respect to time, limits the extent to which software can reasonably exploit them.

Many academic projects have focused on novel performance monitors for interconnection systems. Martonosi et al. [19] propose using the inherent coherence/communication system of the Stanford FLASH multiprocessor as a performance monitoring tool. FlashPoint is embedded in the software coherence protocol to monitor and report memory behavior in terms of cache miss counts and latencies, inducing a 10% execution time overhead.
In the SHRIMP project [18], performance monitoring boards allow each node to collect histograms of incoming packets from the network interface. A threshold-based interrupt is used to signal the application software and operating system to take proper action in response to a specified event. The SMiLE project [16] includes an SCI-based hardware monitor to detect memory layout problems in NUMA clusters; the monitored information helps to guide data layout and transparent data migration.

3. Owl: PROGRAMMABLE, SYSTEM-WIDE MONITORING

The research community's shift from using simple aggregate metrics characterizing entire applications, e.g., cache miss rate, to more advanced statistics such as reuse distance [34], periodicity [13, 24], and correlating multi-layer analysis [31], signals that traditional hardware monitors are becoming insufficient. Further, new goals such as security and system health maintenance require monitoring different classes of events and eliciting varied responses, including the recruitment of other monitors to focus on specific symptoms. Discovering trends in system behavior requires monitors to be long running, and therefore unobtrusive, as well as capable of correlating events. The monitors must produce results judiciously to avoid overwhelming the system with inordinate amounts of data. These considerations point to an autonomous monitoring facility that is flexible enough to handle varied and evolving system behavior, leading to the following requirements:

• Hardware assist for data aggregation: Data probes have the potential to collect copious data, especially when applied broadly to highly used resources, such as monitoring all L1 accesses. To manage such data, the monitoring system must be capable of performing at least preliminary analysis online in hardware. This can reduce data by using compression, aggregation, or statistical analysis. Data aggregation need not entail a loss of information, but rather can remove obfuscating data to bring a trend to light.

• Domain-specific monitoring through programmability: One type of monitor cannot address all possible scenarios: some kind of programmability or reconfiguration capabilities are thus required. These enable the system to retarget monitors to perform new tasks or to gear to a specific device, application, or application phase. For example, a programmer may wish to monitor accesses to a specific data structure to determine if it exhibits poor locality so that it may be targeted by future optimizations, such as data regrouping [11]. Such fine-grained monitoring would be greatly aided by compiler support [12].

• Data delivery without process interruption: To minimize system perturbation, monitors must deliver their output asynchronously and deposit it into main memory. This contrasts with most currently implemented schemes, which either require a process interruption and event handler invocation for each observable event or are limited to sampling.

• Coordinated, cooperative system-wide monitoring: Observation of a single data source provides a myopic view, which may be sufficient to establish the symptom but not the cause of a problem. For example, if a monitor detects declining IPC, it might spawn additional monitors to examine branch prediction accuracy and cache miss rates. Indication that the latter is a problem might lead to additional coordinated monitoring.

The amount of monitoring data injected into the system, and hence the amount of system perturbation, depends on the semantics of the analysis module. Designing module-embedded analytical techniques therefore requires an estimation of system perturbation. We study data injection rates to provide general overhead results intended to guide future designs and to allow early decisions regarding their feasibility.

3.1 Owl Architecture

The Owl monitoring framework is designed to fulfill these requirements.
As its key architectural principle, it splits the monitoring functionality into two parts: programmable capsules, which are attached to the actual data probes or sensors; and analysis modules, which are loaded into the capsules to perform data aggregation and preprocessing. The capsules may be included in any component providing interesting data and hence can be located throughout the system, as illustrated in Figure 1.

Figure 1: Possible monitor capsule locations for system-wide monitoring (M marks a possible location for a hardware monitor, e.g., at the cores and I/D caches, on on-chip and off-chip buses, at the L2/L3 cache, crossbar, and DRAM/coherency controller, at the memory banks and DRAM chips, at the I/O bridge and I/O bus, and at storage and network interfaces)

Each capsule is connected to one or more data probes or sensors embedded within the monitored component (see also Figure 2). These probes provide the respective data in event form, e.g., memory bus transactions, coherence events, disk accesses, accesses to the reorder buffer of out-of-order microprocessors, updates to the branch predictor, or rollback events in case of misspeculation. The event, including any relevant parameters such as addresses, values, or event types, is received by the Architecture Hardware Interface and translated to a standardized interface connecting the capsule and the analysis modules, the Monitor Hardware Interface. A module loaded into the programmable part of a capsule uses this interface to receive monitored data. This use of a system-wide, standardized interface allows the execution of any analysis module in any capsule, independent of its location and concrete data source, and thereby enables the reuse of analysis and aggregation techniques.

Analysis modules are loaded and instantiated from a system-wide library of modules. This library can be dynamically extended and can contain application-specific modules. Library management and concrete module selection are left to the system components or tools using Owl for their monitoring purposes. Once loaded, a module may further be customized through memory-mapped configuration registers in each capsule. When activated, the capsule directs the probed data to the module, where it is preprocessed, analyzed, aggregated, compressed, or even sampled, which may be the desired functionality rather than a limitation imposed by the monitor framework. The results of this analysis step are forwarded to the capsule for storage.

Figure 2: Architecture of a monitoring capsule (probes such as a network interface, cache bus, or TLB feed the Architecture Hardware Interface; monitor modules from a repository are uploaded into the capsule's reconfigurable logic behind the Monitor Hardware Interface; an output interface writes results back; and a software infrastructure handles evaluation and reconfiguration)
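To make the capsule/module split concrete (cf. Figure 2), the following behavioral C sketch models the flow just described: probe events arrive over the standardized interface, the loaded module aggregates them, and the capsule autonomously writes back the occasional, much smaller result. All type and function names are our own illustrative assumptions, not the actual hardware interface; in the real design the module slot is reconfigurable logic rather than a function pointer, but the division of responsibilities is the same.

    /* Behavioral sketch (not the actual hardware interface) of the        */
    /* capsule/module split described above. Field and function names are  */
    /* assumptions made for illustration only.                             */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Standardized event as delivered over the Monitor Hardware Interface. */
    typedef struct {
        uint8_t  type;       /* e.g., read, write, coherence event          */
        uint64_t address;    /* transaction address, if any                 */
        uint64_t value;      /* optional payload                            */
    } owl_event;

    /* Result packet handed back to the capsule for autonomous write-back.  */
    typedef struct {
        size_t  len;
        uint8_t payload[64];
    } owl_result;

    /* An analysis module consumes events and occasionally emits a (much    */
    /* smaller) result; it is loaded into the capsule's reconfigurable      */
    /* logic and configured via memory-mapped registers.                    */
    typedef struct {
        void *state;
        bool (*process)(void *state, const owl_event *ev, owl_result *out);
    } owl_module;

    /* Capsule main loop, behaviorally: translate the raw probe data into   */
    /* an owl_event, let the module aggregate, and inject only aggregated   */
    /* results into main memory (ring-buffer write-back, described below).  */
    void capsule_step(owl_module *m, const owl_event *ev,
                      void (*writeback)(const owl_result *))
    {
        owl_result r;
        if (m->process(m->state, ev, &r))   /* true => a result is ready    */
            writeback(&r);                  /* autonomous, no CPU interrupt */
    }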
Due to Owl's programmability, each module contains application-specific data aggregation and filtering techniques that are uploaded directly to the location of the data probes. This enables the modules to acquire and process every individual event without system perturbation. Only the results of the data aggregation are written to main memory, and the total size of these results is usually significantly smaller than the total size of all observed events. Further, the storage of the monitor results is initiated autonomously by the monitor module: it hands any data packet containing monitor data to the capsule through a standardized interface. The capsule then generates a memory packet, injects it into the memory system, and stores it in a contiguous area of main memory organized as a ring buffer. Consumers of monitor data can then read the data from this memory region and process it asynchronously. The ring buffer itself is reserved by the operating system, and its size depends on both the location of the capsule and the expected event rate. In order to avoid buffer overruns, each capsule has the ability to signal consumers using interrupts when a ring buffer is about to become full.

3.2 Providing Programmability

The programmable nature of the modules is the key to their flexibility and filtering capabilities. Several means of achieving programmability exist, e.g., dynamic selection of predefined components, use of microprogrammable processing cores, or use of reconfigurable logic in the form of FPGAs. We consider the first option too restrictive, since only a predefined set of monitoring modules would exist. The latter two options provide similar capabilities, and we will explore both avenues. Here we focus on FPGA-based solutions, since they provide a low-level and direct interface to the hardware. Furthermore, any concurrency expressible in hardware designs is directly exposed to the analysis modules and thereby permits an easy and efficient use of pipelining. This can be used to ensure a high event handling rate despite complex analysis operations for each event.

Reconfigurable hardware is increasingly used in modern architectures to complement general-purpose processors. An industry trend towards hybrid-reconfigurable systems indicates the potential and viability of architectures like Owl.
For example, SRC Computer platforms are architected with Direct Execution Logic (DEL), comprised of dynamically reconfigurable functional elements (Multi-Adaptive Processors) intended to maximize the parallelism of a given application code. OctigaBay Systems (now part of Cray, Inc.) uses one Xilinx Virtex-II Pro FPGA per node as an application-specific accelerator to perform vector operations. In the embedded market, several chip manufacturers provide single-chip solutions combining processor cores, such as PowerPC or ARM, with FPGAs, enabling easy and efficient customization of processors [2]. In general, we see a trend towards faster and more efficient FPGAs, and recent studies have shown that some FPGAs can compete with the clock frequencies of most modern microprocessors [28].

3.3 Design Considerations for Modules

Monitoring efficiency depends on a module's ability to intelligently preprocess and "semantically" reduce data traffic, i.e., to extract the essential information before injecting the result into the memory system. Further, hardware complexity limitations restrict the nature of the analysis techniques that are loaded into reconfigurable logic. The design of each module must reflect this balance between traffic reduction and hardware complexity.

In general, modules are not intended as full-featured analytical engines, but as semantic preprocessors and partial data analyzers. As a result, the data stream will be reduced by filtering interesting events and aggregating data as needed by the consumer. This contrasts with traditional compression techniques or static aggregation steps, which do not make use of semantic information. As a consequence, they either push the processing completely into software or potentially lose interesting events. Recognizing the power and flexibility inherent in preprocessing data close to the source, within the computational and memory constraints of the capsule, is, as yet, a subtle art. Future work must systematically address the division of labor between an analysis module and high-level software analysis tools.

3.4 Programming Monitoring Modules

Programming FPGAs using low-level hardware description languages is a complex task often only accessible to the advanced hardware designer. In order to make Owl accessible to a wider range of users, we must take further steps. While not part of the initial design, we present a few directions we are pursuing in ongoing work.

For the common user, we will provide a comprehensive module library containing common analysis modules. Users can extend this library at any time with new modules developed by any programmer. The result is similar to kernel modules, which can be loaded into the kernel at runtime to extend the system's capabilities without the user having to know details about the kernel itself or implementation details of such modules.

In addition, we will work on high-level abstractions to design or compose monitoring modules. As a first step, we are developing a toolset that enables users to combine monitor submodules (e.g., histogram generators, counters, compression algorithms, or pattern detectors) into new modules. Once a module is composed, the toolset generates and compiles the new hardware design as specified by the user and adds it to the module library.
As a next step, we will investigate high-level programming approaches, such as C-to-HDL compilers, to make Owl's full flexibility available to the user. Several projects and products already address such compilation approaches, and we plan to build on top of them. So far, these approaches have had limited success when applied in a general scenario, due to inefficiencies when compiling arbitrary algorithms into hardware. In Owl, however, we are faced with a less complex problem, since all modules will naturally follow a given pattern—they target online event processing. We expect such compiler solutions to be more efficient when used to generate Owl modules.

4. USAGE SCENARIOS FOR Owl

Owl provides a versatile and highly flexible monitoring infrastructure that can be used for a variety of scenarios. It can be used to implement existing hardware counters as well as most hardware monitors currently proposed in the literature. Owl is therefore downward compatible, providing a true superset of extant monitoring functionality. Further, Owl allows users to transfer hybrid hardware/software solutions into a single hardware module, eliminating the often costly software component. For instance, sampling and profiling based on performance counters can be implemented inside the capsule connected to the associated data probe.

In addition, Owl can be used to implement previously infeasible or overly expensive analysis techniques. In the following we describe three such examples that aggregate and pre-analyze memory traffic: the creation of memory access traces for logging and debugging, the generation of memory access and cache miss histograms for memory performance tuning, and data structure access monitoring and pattern recognition. These represent three typical classes of algorithms, namely classic filtering and compression; partial, custom aggregation for the creation of histograms; and pattern detection and extraction.

For these examples, we assume Owl capsules are distributed in the memory system. More specifically, we assume a three-level memory hierarchy and attach a capsule between each level, as illustrated in Figure 3. We focus on monitoring the memory system initially, since it remains one of the most significant performance bottlenecks in modern architectures. The overall Owl architecture, however, can be deployed system-wide, and modules similar to those described below can be used in other system components.

Figure 3: Multi-level monitoring in the memory hierarchy: location of monitoring capsules (between the CPU, L1 cache, L2 cache, and memory, feeding a cross-capsule analysis)

4.1 Memory Access Logging

Software-based memory trace facilities [10, 21, 29] are heavily used for workload characterization and simulation. In most cases, however, they are implemented based on architectural simulation and hence are limited by high simulation overhead and by the fact that they fail to capture complete system-level behavior. These limitations can be overcome using a monitoring module designed for trace generation. To avoid excessive memory traffic induced by full logging, monitoring modules can perform lossless compression, ranging from well-established schemes, such as run-length encoding, to more sophisticated approaches including semantic trace compression [7] or load-value-predictor-based compression [3]. All these can be implemented in hardware to reduce traffic significantly.
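As a concrete illustration of the simplest of these schemes, the following C sketch models run-length encoding of constant-stride address runs, the kind of logic a trace-generation module could realize in the capsule's reconfigurable fabric; the data layout and function names are assumptions made for illustration, not a specification of the hardware design.

    /* Minimal software model of one simple trace-compression scheme named   */
    /* above: run-length encoding of constant-stride address runs. A real    */
    /* Owl module would implement equivalent logic in hardware; this sketch  */
    /* is illustrative only.                                                 */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t base;    /* first address of the run        */
        int64_t  stride;  /* constant difference within run  */
        uint32_t count;   /* number of accesses in the run   */
    } addr_run;

    /* Compress a trace of n addresses into runs; returns the number of runs. */
    size_t rle_compress(const uint64_t *trace, size_t n, addr_run *out)
    {
        if (n == 0) return 0;
        size_t runs = 0;
        addr_run cur = { trace[0], 0, 1 };
        for (size_t i = 1; i < n; i++) {
            int64_t d = (int64_t)(trace[i] - trace[i - 1]);
            if (cur.count == 1)            { cur.stride = d; cur.count = 2; }
            else if (d == cur.stride)      { cur.count++; }
            else { out[runs++] = cur; cur = (addr_run){ trace[i], 0, 1 }; }
        }
        out[runs++] = cur;
        return runs;
    }

    int main(void)
    {
        /* Two strided scans compress from 7 trace entries to 2 runs.         */
        uint64_t trace[] = { 0x1000, 0x1040, 0x1080, 0x10c0,
                             0x5000, 0x5040, 0x5080 };
        addr_run runs[8];
        size_t n = rle_compress(trace, 7, runs);
        for (size_t i = 0; i < n; i++)
            printf("base=%#llx stride=%lld count=%u\n",
                   (unsigned long long)runs[i].base,
                   (long long)runs[i].stride, (unsigned)runs[i].count);
        return 0;
    }

Decompression simply re-expands each (base, stride, count) triple, so the scheme is lossless while cutting the injected traffic roughly in proportion to the average run length.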
The modules can also be used to perform aggressive filtering, logging only accesses corresponding to particular code or address regions or occurring within a given time window. In the extreme case, a module only maintains a short-term access log in the form of a ring buffer. Triggered by certain events, e.g., illegal accesses, this log is written to memory and delivered to a corresponding tool for further analysis. This mechanism provides new means for debugging at all system levels, e.g., through transparent assertion testing or value inspection. It can also be applied for security tests, e.g., detection of buffer overruns. In the latter case, the monitoring module is provided by the operating system as a system component to transparently and non-intrusively control any running application.

4.2 Memory Access Histograms

The ability to monitor the addresses of memory accesses allows the creation of memory access histograms, as depicted in Figure 4 (left) for a sort algorithm (RADIX from SPLASH-2 [32]). These histograms show the number of accesses per address or cache block at each level of the memory hierarchy over a period of time and thereby enable observations on individual data structures. In the example, the different access behavior of the two main arrays (cache blocks 200-8500 and 8500-16500), as well as of the global configuration data (cache blocks 16500-17300), is clearly distinguishable.

Furthermore, the availability of histograms for each level of the memory hierarchy enables the computation of cache miss rates on a per-address basis, as shown in Figure 4 (right). The example shows that both arrays are faced with high L1 cache miss rates of around 30%, interleaved with regular spikes to particular blocks with almost 100% miss rates. Through reverse mapping of addresses to data structures, this information is used to identify data structures with poor cache behavior and hence to focus performance analysis and optimizations.

Figure 4: Memory access histogram (left) and cache miss rate diagram (right) for the sort algorithm (radix sort)
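As a software model of what such a histogram module computes, the sketch below keeps one access counter per cache block at the L1 and L2 probes and derives a per-block L1 miss rate from the two histograms (each L1 miss appears as an access at the capsule between L1 and L2). The fixed-size, directly indexed arrays and all names are simplifying assumptions for illustration; Section 5.1 describes the associative counter organization actually evaluated.

    /* Illustrative software model of the per-block miss-rate computation    */
    /* enabled by per-level access histograms; fixed-size arrays and direct  */
    /* indexing by cache-block number are simplifications for clarity.       */
    #include <stdint.h>

    #define NBLOCKS 17300u        /* block range covered, roughly as in Figure 4 */

    static uint32_t l1_hist[NBLOCKS];   /* accesses seen by the CPU-L1 capsule   */
    static uint32_t l2_hist[NBLOCKS];   /* accesses seen by the L1-L2 capsule,   */
                                        /* i.e., L1 misses for that block        */

    /* Each capsule increments its own histogram as addresses pass by.           */
    void observe(uint32_t *hist, uint64_t block)
    {
        if (block < NBLOCKS)
            hist[block]++;
    }

    /* Consumer-side post-processing: per-block L1 miss rate, which can be       */
    /* mapped back to data structures to focus optimization.                     */
    double l1_miss_rate(uint64_t block)
    {
        if (block >= NBLOCKS || l1_hist[block] == 0)
            return 0.0;
        return (double)l2_hist[block] / (double)l1_hist[block];
    }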
4.3 Dynamic Pattern Recognition and Reduction

Periodic program behavior presents a rich opportunity for program characterization and optimization [13, 24]. Loops involving non-affine iterative steps or indirect accesses are a barrier to static periodicity detection, and long-range correlations may be obscured by procedure boundaries. Online approaches are attractive in combating these limitations and can be implemented within the proposed framework in the form of analysis modules. They are capable of detecting periodic behavior and delivering these observations for consequent system optimization, such as prefetching. At the same time, the repeating character of detected patterns can be used for "semantic" data reduction, since each pattern, once recognized, can be represented by a single data point.

Such aggregation is significant, as it reduces the amount of data that need be communicated by the monitor, visualized by the user, stored on disk or in memory, or analyzed by a tool or algorithm. A pattern detection module first recognizes arbitrary, repeating memory patterns and then performs a semantic data reduction, representing each complex pattern as a single point.

We apply this technique to an off-line Alpha 21264 trace of ammp, an n-body molecular dynamics code from the SPEC 2000 suite. More specifically, our study focuses on the mm_fv_update_nonbon function, which dominates computation time for many inputs, including the SPEC reference input set used here. A monitoring module first extracts repeating subsequences from a stream of load target addresses by applying a standard longest common subsequence algorithm. Assuming generous 32-bit table entries to record distances between load addresses, a module implementing such a pattern detection algorithm would require a pattern match table with 200×100 entries (approximately 80KB in raw form). The elimination of dead table entries, the application of partial evictions, and the use of width-adaptive registers will decrease this space requirement.

The results of pattern detection and aggregation for ammp are shown in Figure 5. The top pane of the figure shows a trace of addresses from the execution of mm_fv_update_nonbon. Applied to these data, the monitoring module captures a number of recurring patterns. The most interesting one is highlighted in the middle pane, which is a blown-up section of the top band in the original trace. These patterns iterate over a larger address space, interrupted by gaps. These observations can be related to dynamic control flow such as if statements (i.e., the gaps in addresses) and loops (i.e., the repetition of the pattern).

Figure 5: ammp substructure: trace of mm_fv_update_nonbon (top), trace restricted to the address range of the top band with the repeating access pattern highlighted (middle), extended trace with the access pattern aggregated to a single point (bottom)

Having recognized the pattern, little additional information is gained by including every one of its constituent accesses in a trace. Rather, the aggregation of the entire pattern to a single point reveals the underlying periodicity (as shown in the bottom pane of Figure 5) and at the same time significantly reduces the amount of data transferred from the monitor to the ring buffer. In this example, the 72 individual accesses comprising the pattern are reduced to a single representative access to the base address of the pattern, resulting in an injection rate of 1/72. The behavior so exposed indicates that a second round of pattern matching should be applied, which would distill the entire loop body to only a few points. Applying this multi-resolution approach recursively allows users to expose higher-level program structure and periodicity without the usual attendant increase in data.
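The C sketch below is a deliberately simplified stand-in for such a module: instead of the longest-common-subsequence matching used in our study, it only recognizes an immediately repeating sequence of address deltas up to a bounded period, but it illustrates the aggregation step in which, once a pattern is recognized, each iteration is represented by a single access. Constants and names are illustrative assumptions.

    /* Simplified stand-in for the pattern-detection module: recognizes an    */
    /* immediately repeating sequence of address deltas (bounded period) and  */
    /* emits only one representative access per recognized iteration.         */
    #include <stdint.h>
    #include <string.h>

    #define MAX_PERIOD 100                  /* matches the assumed table width */
    #define HISTORY    (2 * MAX_PERIOD + 1)

    static int64_t  delta[HISTORY];     /* sliding window of recent deltas     */
    static uint64_t last_addr;
    static int      nseen;              /* addresses observed so far (capped)  */
    static int      suppress;           /* accesses left in current iteration  */

    /* Returns p if the most recent p deltas exactly repeat the p deltas       */
    /* before them, or 0 if no repetition is visible yet.                      */
    static int repeating_period(void)
    {
        for (int p = 2; p <= MAX_PERIOD && 2 * p < nseen; p++)
            if (memcmp(delta + HISTORY - p, delta + HISTORY - 2 * p,
                       (size_t)p * sizeof(int64_t)) == 0)
                return p;
        return 0;
    }

    /* Called once per observed load address. Outside a recognized pattern     */
    /* every address is emitted; once a repetition of period p is detected,    */
    /* only one representative access per iteration is emitted, so the         */
    /* injection rate drops to roughly 1/p (1/72 for the ammp pattern above).  */
    void observe_load(uint64_t addr, void (*emit)(uint64_t))
    {
        if (nseen > 0) {
            memmove(delta, delta + 1, (HISTORY - 1) * sizeof(int64_t));
            delta[HISTORY - 1] = (int64_t)(addr - last_addr);
        }
        last_addr = addr;
        if (nseen < HISTORY)
            nseen++;

        if (suppress > 0) {             /* inside an already recognized iteration */
            suppress--;
            return;
        }
        emit(addr);                     /* representative access (or plain event) */
        int p = repeating_period();
        if (p > 0)
            suppress = p - 1;           /* aggregate the rest of this iteration   */
    }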
5. EXPERIMENTS AND RESULTS

In order to be successful, a reconfigurable monitoring infrastructure like Owl must facilitate the implementation of monitoring modules with low hardware complexity and guarantee low system perturbation. In the following, we present experimental results showing that Owl satisfies these conditions.

5.1 Hardware Complexity

To study the complexity of monitoring modules, we implemented a VHDL version of the histogram module described above. Memory histogram generation conceptually requires an individual counter for each memory location. We use an associative counter array to avoid maintaining a large number of counters. The module contains a small set of counters that are dynamically associated with observed addresses. When the number of observed addresses exceeds the number of available counters, the module frees the least recently used counter, writes its contents to main memory as a partial result, and reuses the freed counter for new events. This drastically reduces the required frequency of writebacks to memory and ensures low system perturbation.

The concrete number of counters used within the module depends on the capsule location and the observed traffic patterns. Across a range of applications, our experiments show that 32 to 128 counters are sufficient to achieve an injection rate of less than 1/8 [23]. Using a Xilinx XC4085XLA with 40K gates, the analysis logic in its current version consumes about 66% of the available resources. Considering that modern FPGA chips contain up to 8M gates, this monitor can be realized with a modest chip real-estate budget. Furthermore, aggressively pipelining the design of the analysis module enables the monitor to execute at the maximum frequency of the FPGA.
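A behavioral C model of this associative counter array is sketched below; it captures the dynamic binding of counters to observed blocks and the LRU-driven flush of partial results, which the consumer later sums per block. The structure layout and the choice of 64 counters are assumptions for illustration (within the 32 to 128 range reported above); the actual module implements the equivalent logic in reconfigurable hardware.

    /* Software model of the associative counter array in the histogram       */
    /* module: a small number of counters is dynamically bound to observed    */
    /* block addresses, and the least recently used counter is flushed to     */
    /* memory as a partial result when a new address needs a counter.         */
    #include <stdint.h>
    #include <stddef.h>

    #define NCOUNTERS 64               /* 32 to 128 suffice per our experiments */

    typedef struct {
        uint64_t block;                /* cache-block address bound to counter  */
        uint32_t count;                /* accesses observed for that block      */
        uint64_t last_use;             /* timestamp for LRU replacement         */
        int      valid;
    } counter_t;

    static counter_t ctr[NCOUNTERS];
    static uint64_t  now;              /* logical time = number of events seen  */

    /* writeback() models the capsule injecting a partial histogram entry      */
    /* into the in-memory ring buffer; the consumer sums entries per block.    */
    void observe_block(uint64_t block,
                       void (*writeback)(uint64_t blk, uint32_t cnt))
    {
        now++;
        size_t victim = 0;
        for (size_t i = 0; i < NCOUNTERS; i++) {
            if (ctr[i].valid && ctr[i].block == block) {    /* hit: count it   */
                ctr[i].count++;
                ctr[i].last_use = now;
                return;
            }
            /* remember the best victim: prefer a free slot, else the true LRU */
            if (!ctr[victim].valid) continue;
            if (!ctr[i].valid || ctr[i].last_use < ctr[victim].last_use)
                victim = i;
        }
        if (ctr[victim].valid)                              /* flush LRU as a  */
            writeback(ctr[victim].block, ctr[victim].count);/* partial result  */
        ctr[victim] = (counter_t){ .block = block, .count = 1,
                                   .last_use = now, .valid = 1 };
    }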
5.2 System Perturbation

To characterize the perturbation of monitored applications, we implemented the Owl framework within the memory system of SimpleScalar (SimAlpha version 4.0) [9] and varied injection rates to simulate different monitoring modules. We implement capsules at each level of the memory hierarchy using two different techniques for injecting the recorded data into the memory stream: the first simply injects the traffic into the memory system at the location of the capsule as regular memory packets, while the second ensures that monitor packets bypass all on-chip caches and uses a separate off-chip memory controller to store the monitor data. The former minimizes the architectural impact of introducing Owl and hence is a suitable solution for commodity systems, while the latter results in less perturbation but comes at a significantly higher implementation cost, and hence is only an option in specialized high-performance systems.

Our simulator is configured as closely as possible to the validated model of a Compaq DS-10L Alpha Server, as described in previous studies [8, 9]. The memory system is a 64KB, two-way associative L1 cache with 64B lines and a three-cycle latency, followed by a 2MB direct-mapped L2 cache with a 13-cycle latency. The benchmarks are selected from the SPEC 2000 benchmarks to achieve a broad coverage of behaviors, and they use the SPEC reference input sets.

In order to achieve reasonable simulation times we rely on SimPoint [24]. More specifically, we use SimPoint in the original version with multiple simpoints per code [5] to guarantee the highest possible accuracy. The complete configuration of the benchmarks along with the input sets for the cycle-accurate simulation is shown in Table 1. In case of multiple possible input sets, we choose the one with the lowest error rate under SimPoint execution.

Benchmark    Type     Input        #Instr.
164.gzip     SPECint  ref/program  256B
171.swim     SPECfp   ref          1563B
175.vpr      SPECint  ref/route    240B
176.gcc      SPECint  ref/expr     15.2B
177.mesa     SPECfp   ref          492.2B
179.art      SPECfp   ref/470      198.9B
186.crafty   SPECint  ref          264B
188.ammp     SPECfp   ref          1924B
254.gap      SPECint  ref          473B
256.bzip2    SPECint  ref/program  58B

Table 1: SPEC benchmark parameters

The discussion of modules in Section 4 reflects the dependence of traffic injection rates on the analysis technique employed by the module: 1/1 for full address logging, 1/8 as an upper bound for memory access histogram generation, and 1/64 for pattern matching. The configuration space is shown in Table 2 and the results in Figure 6.

baseline: no monitoring activated
1/1 injection rate; naive injection (naive)
1/8 injection rate; naive injection (naive)
1/64 injection rate; naive injection (naive)
1/1 injection rate; separate off-chip memory (sepmem)
1/8 injection rate; separate off-chip memory (sepmem)
1/64 injection rate; separate off-chip memory (sepmem)

Table 2: Test cases

Most codes are capable of hiding injection rates of 1/8 and less without system perturbation, while some of the more memory-intensive codes require rates as low as 1/64. However, these codes can be executed at higher injection rates if the user can tolerate higher system perturbation. Similar to the selected results above, introducing a separate monitoring memory could eliminate most perturbation for higher injection rates. This overhead is still orders of magnitude lower than that of any software scheme. In addition, as described in Section 4, most applications of logging schemes do not require full logs over the application runtime, but rather concentrate on specified subregions. This will reduce the amount of traffic the system has to deal with, enabling full access logging in these cases.

Figure 6: IPC across several SPEC benchmarks (and their arithmetic mean) with varying configurations (legend: no monitoring; ir=1/1, L1-L2, naive; ir=1/8, all, naive; ir=1/64, all, naive; ir=1/1, L1-L2, sepmem; ir=1/8, all, sepmem; ir=1/64, all, sepmem)

6. CONCLUSIONS AND FUTURE WORK

Researchers have proposed myriad analysis techniques, and practitioners have made efficient use of these for system adaptation and optimization. Nonetheless, the capabilities of current systems fall far short of what is needed and expected for reliable, efficient, complex computing systems. Most existing approaches suffer from the inflexibility and limitations of the monitoring hardware or rely on purely software implementations with extreme overhead costs. All of these techniques would benefit from the existence of a unifying, general-purpose monitoring architecture that allows a better balance of functionality between hardware and software, and even allows the user to configure each technique to the scenario at hand, to be activated on demand.

In this paper, we propose such a flexible, general infrastructure for reconfigurable monitoring, and we illustrate how it can be used for understanding memory behavior to optimize access patterns and data placement. Further, we show that such monitoring can be implemented with very low overhead, causing little or no system perturbation, and thus minimal observable performance penalties. The framework itself is novel: it separates system probes from data analysis functionality, allowing the latter to be dynamically controlled by the user in the form of analysis modules. For instance, in the memory monitoring examples, these modules perform online preprocessing of the observed memory addresses, and deliver results to the consumer for post-processing—without necessitating any process interruptions.
Our results show that a monitoring system with autonomous data delivery has a relatively small impact on system performance and that with lower injection rates the overhead becomes negligible. The feasibility study performed in the course of this work demonstrates the viability of the general approach. As the framework was designed as a general monitoring facility, we believe its success in the specific context of memory analysis will extend to more pervasive system-wide monitoring—and towards better understanding of system behavior in general.

7. ACKNOWLEDGMENTS

This work was supported by the National Science Foundation under award ITR/NGS-0325536 and by a DOE fellowship, provided under grant number DE-FG02-97ER25308. Part of this work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48 (UCRL-CONF-209855).

8. REFERENCES

[1] J. M. Anderson, L. M. Berc, J. Dean, S. Ghemawat, M. R. Henzinger, S.-T. A. Leung, R. L. Sites, M. T. Vandevoorde, C. A. Waldspurger, and W. E. Weihl. Continuous profiling: Where have all the cycles gone? In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 357–390, Oct. 1997.
[2] D. Andrews, D. Niehaus, and P. Ashenden. Programming models for hybrid CPU/FPGA chips. IEEE Computer, 37(1):118–120, Jan. 2004.
[3] M. Burtscher and M. Jeeradit. Compressing extended program traces using value predictors. In Proceedings of the 2003 International Conference on Parallel Architectures and Compilation Techniques, pages 159–169, Oct. 2003.
[4] B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data placement. In Proceedings of the 8th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 139–149, Oct. 1998.
[5] B. Calder, T. Sherwood, E. Perelman, and G. Hamerley. Simpoint. http://www.cs.ucsd.edu/~calder/simpoint/, 2003.
[6] M. L. Corliss, E. C. Lewis, and A. Roth. DISE: A programmable macro engine for customizing applications. In Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 362–373, June 2003.
[7] L. DeRose, K. Ekanadham, J. Hollingsworth, and S. Sbaraglia. SIGMA: A simulator infrastructure to guide memory analysis. In Proceedings of IEEE/ACM Supercomputing '02, Nov. 2002.
[8] R. Desikan, D. Burger, and S. Keckler. Measuring experimental error in multiprocessor simulation. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 266–277, June 2001.
[9] R. Desikan, D. Burger, S. Keckler, and T. Austin. Sim-alpha: A validated, execution-driven Alpha 21264 simulator. Technical Report TR-01-23, Department of Computer Sciences, The University of Texas at Austin, 2001.
[10] Digital Equipment Corporation. ATOM User Manual, Mar. 1994.
[11] C. Ding and K. Kennedy. Improving cache performance in dynamic applications through data and computation reorganization at run time. In Proceedings of the 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 229–241, May 1999.
[12] C. Ding and Y. Zhong. Compiler-directed run-time monitoring of program data access. In First ACM SIGPLAN Workshop on Memory System Performance (MSP), pages 1–12, June 2002.
[13] E. Duesterwald, C. Cascaval, and S. Dwarkadas. Characterizing and predicting program behavior and its variability. In Proceedings of the 2003 International Conference on Parallel Architectures and Compilation Techniques, pages 220–231, Sept. 2003.
[14] Intel. Intel Itanium Architecture Software Developer's Manual, 2000.
[15] Intel. Intel Architecture Software Developer's Manual Volume 3: System Programming Guide, 2002.
[16] W. Karl, M. Leberecht, and M. Schulz. Optimizing data locality for SCI-based PC-clusters with the SMiLE monitoring approach. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 169–176, Oct. 1999.
[17] J. Marathe, F. Mueller, T. Mohan, B. de Supinski, S. McKee, and A. Yoo. METRIC: Tracking down inefficiencies in the memory hierarchy via binary rewriting. In Proceedings of the First Annual Symposium on Code Generation and Optimization, pages 289–300, Mar. 2003.
[18] M. Martonosi, D. W. Clark, and M. Mesarina. The SHRIMP performance monitor: Design and applications. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (Sigmetrics '96), pages 61–69, May 1996.
[19] M. Martonosi, D. Ofelt, and M. Heinrich. Integrating performance monitoring and communication in parallel computers. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems (Sigmetrics '96), pages 138–147, May 1996.
[20] T. Mohan, B. de Supinski, S. McKee, F. Mueller, A. Yoo, and M. Schulz. Identifying and exploiting spatial regularity in data memory references. In Proceedings of IEEE/ACM Supercomputing '03, Nov. 2003.
[21] A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint multiprocessor simulation toolkit for Intel x86 architectures. In Proceedings of the 1996 International Conference on Computer Design, Oct. 1996.
[22] M. Prvulovic and J. Torrellas. ReEnact: Using thread-level speculation mechanisms to debug data races in multithreaded codes. In Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 110–121, June 2003.
[23] M. Schulz, J. Tao, J. Jeitner, and W. Karl. A proposal for a new hardware cache monitoring architecture. In Proceedings of the Workshop on Memory Systems Performance (MSP 2002), June 2002.
[24] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the 10th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 45–57, Oct. 2002.
[25] B. Sprunt. The basics of performance-monitoring hardware. IEEE Micro, pages 64–71, July/Aug. 2002.
[26] B. Sprunt. Pentium 4 performance-monitoring features. IEEE Micro, pages 72–82, July/Aug. 2002.
[27] Sun Microsystems. Ultra-SPARC-IIi User's Manual, 1997.
[28] J. Teifel and R. Manohar. Highly pipelined asynchronous FPGAs. In Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays, pages 133–142, Feb. 2004.
[29] J. Veenstra and R. Fowler. MINT: A front end for efficient simulation of shared-memory multiprocessors. In Proceedings of the 2nd International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 201–207, Jan. 1994.
[30] R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham. Efficient software-based fault isolation. In Proceedings of the 14th ACM Symposium on Operating Systems Principles, pages 203–216, Dec. 1993.
[31] R. Wisniewski, P. Sweeney, K. Sudeep, M. Hauswirth, E. Duesterwald, C. Cascaval, and R. Azimi. Performance and environment monitoring for whole-system characterization and optimization. In P=ac2 Conference on Power/Performance Interaction with Architecture, Circuits, and Compilers, pages 1–10, Oct. 2004.
[32] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, June 1995.
[33] M. Xu, R. Bodik, and M. Hill. A flight data recorder for enabling full-system multiprocessor deterministic replay. In Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 122–135, June 2003.
[34] Y. Zhong, C. Ding, and K. Kennedy. Reuse distance analysis for scientific programs. In Proceedings of the Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, Mar. 2002.
[35] Y. Zhong, S. G. Dropsho, and C. Ding. Miss rate prediction across all program inputs. In Proceedings of the 2003 International Conference on Parallel Architectures and Compilation Techniques, pages 79–90, Sept. 2003.