
Panopticon: A Scalable Monitoring System

Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT '10)


Duncan Clough, Stefano Rivera, Michelle Kuttel, Patrick Marais, Vincent Geddes
Department of Computer Science, University of Cape Town, Private Bag X3, Rondebosch, 7701, South Africa
[email protected] [email protected] [email protected] [email protected] [email protected]

ABSTRACT

Monitoring systems are necessary for the management of anything beyond the smallest networks of computers. While specialised monitoring systems can be deployed to detect specific problems, more general systems are required to detect unexpected issues and track performance trends. While large fleets of computers are becoming more common, few existing, general monitoring systems have the capability to scale to monitor these very large networks. There is also an absence of systems in the literature that cater for visualisation of monitoring information on a large scale. Scale is an issue in both the design and presentation of large-scale monitoring systems. We discuss Panopticon, a monitoring system that we have developed, which can scale to monitor tens of thousands of nodes, using only commodity equipment. In addition, we propose a novel method for visualising monitoring information on a large scale, based on general techniques for visualising massive multi-dimensional datasets. The monitoring system is shown to be able to collect information from up to 100 000 nodes. The storage system is able to record and output information from up to 25 000 nodes, and the visualisation is able to simultaneously display all this information for up to 20 000 nodes. Optimisations to our storage system could allow it to scale a little further, but a distributed storage approach combined with intelligent filtering algorithms would be necessary for significant improvements in scalability.

Categories and Subject Descriptors

C.2.3 [Computer-Communication Networks]: Network Operations—Network monitoring; H.3.3 [Information Systems]: Information Storage and Retrieval—Information Search and Retrieval: Retrieval models; H.5.2 [Information Systems]: User Interfaces—Graphical user interfaces (GUI)

General Terms

Design, Management

Keywords

Monitoring System, Scalability, Visualisation

1. INTRODUCTION

The last decade has seen the emergence of massive data centres composed of tens of thousands of nodes. The scale of data centres run by companies such as Amazon and Google has expanded rapidly as consumers demand ever more resources. As systems begin spanning data centres, large-scale monitoring becomes essential. With an ever increasing number of nodes, redundancy has moved outside the individual server, giving more importance to effective management and monitoring of an entire fleet of servers. However, few of the mature monitoring systems in constant use in system administration are designed for supervision of very large networks. There are a variety of monitoring systems for identifying and detecting different classes of problem.
While specialised monitoring systems can be deployed to detect most specific problems, more general systems are required to handle unexpected systemic issues [4]. General monitoring systems usually measure metrics such as CPU usage, free memory, network traffic, and availability [12]. Given such a set of metrics from a fleet of nodes (computers), system load, application performance, and outages can be readily interpreted. This has applications ranging from diagnosing performance bottlenecks in a single node to optimising cluster usage in Capacity Computing.

Designing a scalable and efficient monitoring system is not a trivial task. A naïve implementation may impose a heavy burden on a network, wasting a resource that the monitoring system is intended to preserve [14]. A large-scale fleet will produce a massive amount of information that will need to be efficiently stored and quickly displayed. Panopticon is our experimental system that is able to non-intrusively monitor, efficiently store and quickly retrieve metrics from a very large fleet. As a prototype, it was not required to implement all the features of a general-purpose monitoring system.

The monitoring and storing of metrics on a large scale is not the only difficulty; effectively visualising them is a problem that is often overlooked. If not displayed effectively, this information can easily overwhelm the user, hindering their ability to identify problems [4]. Traditionally, graphs are used for visualising metrics and are very effective on a small scale. However, on a large scale, graphs become visually limited by the number of curves that can be plotted simultaneously. Therefore, a new method of visualising monitoring information is required — one that can deal with massive sets of data, and takes into account the multi-dimensional nature of the information. A purely large-scale overview would not provide enough node-specific detail to be useful for identifying problems in individual nodes. Panopticon's Visualisation component presents a solution to these issues with our Node-map, which provides information at multiple levels of detail, and our Metric-map, which shows an overview of the fleet status. Thus, the visualisation provides a general overview of the status of all nodes, while also allowing access to the metrics of a specific node. Live changes in metric values are highlighted for easy identification, and historic information on monitored nodes is also accessible.

We evaluated Panopticon by monitoring a live test cluster of 56 computers over a period of 4 months, as well as with simulated data for significantly larger fleets. Our analysis shows that the Panopticon approach can scale to very large fleets, with the potential to scale to millions of monitored nodes and be extended into a viable production monitoring system.

∗ Now at Amazon.com Development Centre South Africa

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAICSIT '10, October 11-13, Bela Bela, South Africa. Copyright 2010 ACM 978-1-60558-950-3/10/10 ...$10.00.

2. RELATED WORK

For small scale monitoring, RRDTool is widely used for storage of metrics and their visualisation [13]. Its approach of storing individual, time-granularity-specific metric values in their own separate databases has been very effective, but quickly becomes inefficient when dealing with thousands of computers [12]. As the scale increases, communicating information from nodes to the databases also becomes problematic.

CARD [1] is an early example of a monitoring system that employs a hierarchical approach specifically to improve scalability. This approach was modernised by Ganglia [12], with a multi-level tree of aggregation nodes pulling information up from monitoring agents on individual computers. Communication bottlenecks were avoided by only storing information in top-level RRDTool databases, thus eliminating network usage during information retrieval. Despite having a scalable architecture for collecting information, reliance on RRDTool storage limited the system to below 10 000 nodes. This highlights the need for scalability to be central in the design of all aspects of a large-scale monitoring system.

Another approach is to distribute the storage of metrics across all nodes in the hierarchy, with queries being pushed down the hierarchy to be serviced by all interested nodes. Astrolabe [18] showed that this method is particularly scalable, with the system capable of handling a theoretical maximum of 300 000 nodes. Unfortunately, Astrolabe was designed for data mining and not for visualising detailed information, so queries are limited to aggregates in order to reduce network usage. We improve on this limitation, making individual metric values from all nodes viewable.

A high level of fault tolerance is important to any massively scalable monitoring system. On a very large scale, problems become inevitable and agents have to gracefully recover from these problems [12]. Both Ganglia and Astrolabe use multiple levels of redundancy to increase robustness. Astrolabe goes further, using a randomised point-to-point message exchange to limit reliance on single channels.

The third aspect of a scalable monitoring system, visualising the monitoring information in a scalable manner, is often overlooked. Simultaneously visualising data from tens of thousands of computers leads to a user being presented with so much information that it is extremely difficult, if not impossible, to comprehend it effectively. A fleet of a million nodes would only compound this issue. Traditional graphs become too cluttered to effectively display information [11]. Tools such as OVIS-2 [3] suggest that statistical methods should be used to highlight important information. While this is effective at reducing information overload, our research specifically investigated techniques for visualising all the information rather than a subset. OVIS-2 also includes a physical 3D representation of a cluster, with components being coloured individually according to metric values. Although the physical representation is very useful in relation to actual clusters, it also limits the visual scalability.

Our monitoring information is multi-dimensional, as there are multiple metrics to consider. This compounds the problem of the large amount of information to be expected from a large fleet. In order to build an effective, scalable visualisation for our monitoring system, we looked at general visualisation techniques for massive, multi-dimensional datasets. Pixel-oriented aggregation [9, 16, 15] solves the problem of loss of detail at multiple zoom levels by rearranging information into a more suitable format for the current zoom level. Hierarchical aggregation [7, 11] attempts to group related information to provide a meaningful overview of, and context for, substructures. These techniques allow a visualisation to provide a broader context to smaller subsets of data. Since nodes within a fleet have a logical and physical layout, aggregation is appropriate to monitoring systems. Orthogonal projections from higher dimensions into understandable 3D or 2D coordinates help reduce the complexity of multi-dimensional datasets, but have an associated loss of information [2]. ShapeVis [11] uses projections to group objects with similar multi-dimensional characteristics in similar 3D spatial locations — a concept that inspired our Metric-map. Parallel Coordinates [8, 10] are a popular method of rendering multi-dimensional data, but require aggregation techniques for better scaling [11]. We therefore decided not to include Parallel Coordinates in our visualisations. Perspective manipulation and user interaction can also be used to enhance a visualisation: filtering [2], linking and brushing [17] (propagating selection between different visualisations), zooming and distortion [11] are examples of methods that can improve a visualisation. Although each of the discussed techniques can be individually successful, it is a combination of techniques that often results in the most successful visualisations [10]. Our visualisations were, therefore, designed to use multiple techniques.

3. DESIGN & IMPLEMENTATION

Panopticon is the combination of three separate components, each focussing on one part of a much larger problem: Visualisation, Storage & Retrieval, and Monitoring & Collection (Figure 1). Each component runs as a separate entity, and communicates with the other components over the network. This allowed each component to be developed and evaluated independently. As a prototype system, the aim of Panopticon is to investigate scalability in the order of tens of thousands of nodes. Many features that would be desirable in production monitoring systems are considered non-vital to our research.

Figure 1: Overview of the Panopticon system architecture, showing the separate Visualisation, Storage & Retrieval, and Collection components.

3.1 Monitoring Component

A monitoring system is responsible for measuring metrics of monitored nodes and collating the data without adversely affecting the performance of those nodes or their network [12]. It should be lightweight, so that it does not interfere with activities on nodes and is able to continue reporting even when nodes are under intense load. It must be highly reliable and robust, as reporting failures will be perceived as node failures. It should also require as little maintenance as possible, so that it is not an administrative burden on a network. The importance of these characteristics only increases with scale [18].

In a similar approach to Astrolabe [18], the monitoring and aggregation system is a hierarchical network of agents (the node-tree). The agent is installed as a daemon on each node. It is responsible for observing and recording host system metrics, and providing them to other nodes on request. Internal nodes in the tree poll their children for recent metrics. This information is propagated up through the tree, recording the route within the data. The root node receives metrics from every node in the system while only having to communicate with a few child nodes. The system can handle unreliable hardware and networks by using multiple parent nodes for each child, all the way up to multiple roots. Duplicate data is automatically reconciled by ignoring the oldest data for a particular node.

For our purposes, we selected a small set of metrics similar to Ganglia's [12]: uptime, usage, load (1, 5, and 15 minute averages), network usage, and free memory. These metrics are commonly used by monitoring systems, and are easily available on POSIX systems. The exact metrics monitored would have a negligible effect on our scalability evaluation, which is dependent on data quantity rather than content.

The inter-agent communication uses a very simple client-server binary protocol. The only supported operations are requests for all available metrics on a particular node (both aggregated and locally observed) and a status report on the availability of aggregated data. For communication with other systems, agents could run a Web Service Interface. Requests followed the REST architecture [6] and data was encoded in JSON [5]. This interface was used between the root node and the Storage and Retrieval system.
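The node-tree aggregation step lends itself to a short illustration. The following is a minimal Python sketch (not from the paper, whose agents were written in D) of the behaviour described above: an internal node polls its children, merges their records with its own locally observed metrics, and reconciles duplicates by keeping only the newest record per node. Names such as `poll_child` and the record fields are hypothetical.

```python
import time

def local_metrics(address):
    """Observe metrics on this host (stubbed with constants here)."""
    return {
        "address": address,
        "timestamp": time.time(),
        "uptime": 12345,
        "load_1": 0.4, "load_5": 0.3, "load_15": 0.2,
        "net_usage": 0.01,
        "free_memory": 512,
    }

def merge(records, incoming):
    """Keep the newest record per node address (duplicate reconciliation)."""
    for rec in incoming:
        addr = rec["address"]
        if addr not in records or rec["timestamp"] > records[addr]["timestamp"]:
            records[addr] = rec
    return records

def aggregate(address, children, poll_child):
    """One polling round of an internal node in the node-tree.

    `poll_child` is a callable returning a child's aggregated records; in the
    real system this would go over the binary client-server protocol.
    """
    records = {address: local_metrics(address)}
    for child in children:
        try:
            merge(records, poll_child(child))
        except OSError:
            # An unreachable child simply contributes nothing this round;
            # another parent may still report its subtree.
            continue
    return list(records.values())

# Toy demonstration: a parent aggregating two leaf children.
print(len(aggregate("parent-0", ["child-0", "child-1"],
                    poll_child=lambda child: [local_metrics(child)])))
```

Because records carry their own timestamps, the same reconciliation rule works whether duplicates arrive from redundant parents or from redundant roots.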
3.2 Storage & Retrieval Component

Determining the cause of a problem on a network often requires looking into the past and comparing against previous behaviour. Therefore, as well as providing up-to-date information on the current state of a fleet, a monitoring system should store any collected information for later use. We required Panopticon to store all live metrics at least once every 5 minutes, and to retrieve metrics as requested by the Visualisation component in real time. The Storage & Retrieval component requires a high degree of availability, as it provides critical information without which system administrators are blind.

Our storage system was built around a high-performance SQL database, MySQL. Having a centralised MySQL server is an obvious limit to scalability and reliability: it is a single point of contention and failure. However, the simplicity of this approach in our proof-of-concept is considered to be enough of a gain to outweigh these issues. A time granularity of 5 minutes is widely used by monitoring systems (it is MRTG's default [13]). It strikes a reasonable balance between information detail and network and storage overhead. In early tuning experiments, our RDBMS was shown to be capable of inserting 10 million rows in 170 ± 32 s. This gave us an upper bound of being able to capture the information from 14 million nodes at a 5-minute resolution. Since this was well above our desired scale, a centralised RDBMS was considered sufficient for our system: it provided a good starting point for testing, while having reasonable scalability.

Storage and retrieval features are implemented separately. Storage is accomplished via a daemon polling multiple root aggregation nodes in parallel, and storing retrieved metrics in the database. Incoming data is timestamped on the storage node (to remove the requirement that every participating node have accurate time) and quantised into 5-minute buckets.
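As an illustration of the bucketing described above, this minimal Python sketch (not from the paper) stamps a record on arrival and floors the timestamp to a 5-minute boundary.

```python
import time

BUCKET_SECONDS = 300  # 5-minute resolution, as used by Panopticon

def bucket_timestamp(arrival_time=None, bucket=BUCKET_SECONDS):
    """Quantise an arrival time (seconds since the epoch) down to its bucket."""
    if arrival_time is None:
        arrival_time = time.time()  # stamped on the storage node, not the agent
    return int(arrival_time) - (int(arrival_time) % bucket)

# All records arriving within the same 5-minute window share one bucket value.
print(bucket_timestamp(1_000_000_123))  # -> 999999900
```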
Only two database tables are used: a node list table (with static information and a last-seen time stamp), and a full historical archive table. The archive table is indexed with a primary key on the (address, time stamp) tuple so that full table scans are never needed for the supported queries (described below).

Retrieval is handled by a separate daemon, answering requests from a visualisation front-end and sending updates as they become available. The protocol is text-based, for ease of debugging. As the number of monitored nodes may be very large, the responses are delta-compressed where possible. The commands supported are: selecting metrics and/or nodes of interest, and enabling or disabling live status updates. Historical queries can be performed for fleet status at a specific point in time, or for a set of consecutive metric values from a specific node that are suitable for a time-axis graph.
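The paper names the two tables but does not give their schema; the sketch below is one plausible layout, written against SQLite from the Python standard library so that it runs self-contained (Panopticon itself used MySQL), with hypothetical metric columns and a made-up address. The composite primary key on (address, bucket time stamp) is what serves the per-node, time-ordered query directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Node list: static information plus a last-seen time stamp.
    CREATE TABLE nodes (
        address    TEXT PRIMARY KEY,
        hostname   TEXT,
        last_seen  INTEGER
    );
    -- Full historical archive: one row per node per 5-minute bucket.
    CREATE TABLE archive (
        address    TEXT    NOT NULL,
        bucket_ts  INTEGER NOT NULL,
        load_1     REAL,
        net_usage  REAL,
        free_mem   INTEGER,
        PRIMARY KEY (address, bucket_ts)
    );
""")

# Fleet status at a specific point in time: one bucket, all nodes.
fleet_at = "SELECT address, load_1, net_usage, free_mem FROM archive WHERE bucket_ts = ?"

# Consecutive values for one node, suitable for a time-axis graph;
# this is the query shape the (address, bucket_ts) primary key matches.
node_series = ("SELECT bucket_ts, load_1 FROM archive "
               "WHERE address = ? AND bucket_ts BETWEEN ? AND ? ORDER BY bucket_ts")

conn.execute("INSERT INTO archive VALUES ('10.0.0.1', 999999900, 0.4, 0.01, 512)")
print(conn.execute(node_series, ("10.0.0.1", 999999000, 1000000200)).fetchall())
```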
3.3 Visualisation Component

The central issue addressed in this work is the problem of monitoring and visualising a huge number of nodes simultaneously. So, above all, the visualisation must be scalable. We also required low-level information to still be easily accessible within our system. To meet these needs, we designed two different overview visualisations (the Node-map and the Metric-map), combining visualisation techniques discussed in Section 2. Additionally, we provided traditional per-node historical graphs. The key difference between the two visualisations is their approach to solving the problems of viewing low- and high-level information. The visualisations are also able to interact through two-way selection propagation.

The Visualisation component aims to present information about the state of the fleet to the user, so that the user can identify problems with the fleet and their potential causes. New information needs to be presented to the user as soon as it is available, and historic data also needs to be accessible. Graphs can become too cluttered if too many computers are being monitored, and are therefore not applicable to Panopticon as the main visualisation (see Figure 2). However, as graphs are still useful when only looking at a single monitored machine, a historical graphing tool is included for that purpose. Nodes can be selected in both visualisations, and have their historic information plotted according to various time granularities.

Figure 2: A graph of memory usage in only 20 machines over a week, highlighting the problem of graph clutter.

3.3.1 The Node-Map

The Node-map is designed to be the main source of low-level information, while also providing an aggregated view for a general summary of a fleet's status. Nodes are aggregated by logical layout and displayed in multiple levels of detail, as shown in Figure 3. Metrics are grouped as being above or below an adjustable threshold. This reduces the amount of information presented by the Node-map, reducing the information that needs to be processed by the user. The Node-map also briefly highlights any nodes that have had changes in their metric values, so that a user can easily notice any changes that occur (Figures 5 and 6).

Figure 3: The three different levels of detail of the Node-map (Node Detail, Group Detail and Group Summary).

Figure 5: An example of how selection and update highlighting are visualised on the Node-map.

Figure 6: Update highlighting shown in the context of a fleet, where areas of change are easily identifiable.

When dealing with large numbers, viewing each node individually becomes infeasible because (i) too much information is then presented to the user, and (ii) most window tool-kits struggle to render tens of thousands of shapes in real time. Therefore, once the user has zoomed out, a group of nodes is replaced with a summary of the status of those nodes. This significantly reduces the amount of information that the user has to process, and reduces the complexity of the scene to be rendered. The summary is designed to highlight the number of dead nodes, and the number of nodes on either side of the threshold for all three metrics being monitored. The dead nodes are represented by a black rectangle at the top of the group. The width of the rectangle is set to the group width, and the height is calculated by

\[ \mathit{dead\_height}_g = \begin{cases} \max\!\left(\frac{d_g}{n_g},\ 5\%\right) \times \mathit{height}_g & \text{if } d_g > 0 \\ 0 & \text{otherwise} \end{cases} \]

where a group g has n_g nodes, d_g of which are dead. In the case where a tiny proportion of the total nodes are dead, we increase the proportion of the height to 5% of the group's height so that the black rectangle will at least be noticeable by the user. The remaining space is then divided into three columns to represent information about each of the three metrics (in this case we are using CPU, Network and RAM usage, but these could be any three arbitrary metrics). An example is shown in Figure 4.

Figure 4: Information in the Group Summary level of detail is summarised using an adjustable threshold value, and represents an overview of activity within a Group (dead nodes, and nodes above or below the threshold for CPU, Network and RAM).

When at this level of detail, the user is able to select entire groups of nodes. Selecting a group has the same effect as individually selecting every node within the group. Since individual nodes are not visible, the selection is aggregated as well. Instead of a border, a semi-transparent blue rectangle is drawn on top of the Group Summary, representing the percentage of nodes within the group that are currently selected (Figure 5).
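A minimal Python sketch of the Group Summary layout rule above, assuming hypothetical group fields (a dead-node count and per-metric above-threshold counts; the exact split of each metric column is our guess, not given in the paper). It only computes rectangle heights, leaving rendering to the GUI toolkit.

```python
def dead_height(n_g, d_g, height_g):
    """Height of the black 'dead nodes' rectangle for a group.

    At least 5% of the group height whenever any node is dead, so that even a
    tiny proportion of dead nodes remains visible.
    """
    if d_g <= 0:
        return 0.0
    return max(d_g / n_g, 0.05) * height_g

def group_summary(n_g, d_g, above_threshold, height_g):
    """Split the remaining height of each metric column at the threshold.

    `above_threshold` maps metric name -> number of live nodes above the
    adjustable threshold (a hypothetical structure, not from the paper).
    """
    dead = dead_height(n_g, d_g, height_g)
    remaining = height_g - dead
    live = n_g - d_g
    columns = {}
    for metric, above in above_threshold.items():
        frac = (above / live) if live else 0.0
        columns[metric] = {"above": frac * remaining,
                           "below": (1.0 - frac) * remaining}
    return {"dead": dead, "columns": columns}

# Example: a 200-node group with 3 dead nodes, drawn 120 pixels tall.
print(group_summary(200, 3, {"cpu": 40, "network": 10, "ram": 120}, 120.0))
```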
3.3.2 The Metric-Map

The Metric-map was designed to give a high-level overview of the status of the fleet and, as a result of this focus, specific details are not directly accessible. The base of the Metric-map is a hexagon, with coloured circles representing information about the fleet using their position, area, opacity and borders (Figures 7, 8 and 9).

Figure 7: The features of the Metric-map (corners labelled CPU, Net and RAM; examples of bins with few and many nodes, high and low metric values, and full and partial selections).

Figure 8: An example scenario on the Metric-map: a large group of nodes with high RAM and network usage, another with high RAM and CPU usage, no nodes with only high network usage, and many small dots indicating varied usage.

Figure 9: Visualising differences in fleet status on the Metric-map: a large change (a problem in 60% of nodes in the system) versus a minor change (a normal update with no problems).

When nodes are given to the Metric-map, their metrics are quantised (rounding down) to percentages in the set Q = {0%, 20%, 40%, 60%, 80%, 100%}. Then, nodes with the same values across all three metrics are put into bins, where each bin can be represented by a tuple b = (m_1, m_2, m_3) ∈ Q³.

For placement, the higher an m_i is, the more it is "pulled" towards its corresponding corner, μ_i. The exact position p ∈ R² of a bin b is calculated as the sum of each m_i multiplied by its corresponding μ_i:

\[ p = \sum_{i=1}^{3} m_i \vec{\mu}_i \]

This calculation does not result in a unique p for all b. For example, (1%, 1%, 1%) and (40%, 40%, 40%) both map to the point (0, 0). Therefore the Metric-map also adjusts the alpha intensity based on the overall load for a given bin, with higher intensity corresponding to higher load. The alpha intensity α_b of a given bin b is calculated as the average of the m_i, thresholded to a maximum of 1:

\[ \alpha_b = \min\!\left(1,\ \frac{1}{3}\sum_{i=1}^{3} m_i\right) \]

This method was used because it is intuitive to have richer colours linked to more activity, and fainter colours linked to less activity.

Bins are plotted as circles on the diagram. The radius of each circle is logarithmically proportional to the number of nodes represented by the bin. A logarithmic scale allows the Metric-map to scale well as the number of nodes increases. Bins are selectable; clicking on a bin will select all nodes within that bin and highlight it with a thick black border. In the case that there is a partial selection within a bin, the bin is only partially bordered. The exact formula for the degrees of arc of the border of a bin b is

\[ \theta_b = \begin{cases} \max\!\left(\frac{s_b}{n_b} \times 360,\ 1\right) & \text{if } s_b > 0 \\ 0 & \text{otherwise} \end{cases} \]

where s_b is the number of selected nodes in b, and n_b is the total number of nodes in b. In the case where s_b/n_b is very small, we make sure that a selection is visible by making θ_b at least 1.
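A small Python sketch of the bin placement rules above. The three corner vectors μ_i are assumed here to lie 120° apart on the unit circle (the paper only says each metric has a corresponding corner of the hexagon), metrics are taken as fractions in [0, 1] rather than percentages, and the radius scale factor is arbitrary; quantisation, position and alpha follow the formulas in the text.

```python
import math
from collections import Counter

# Hypothetical corner vectors for the three metrics, 120 degrees apart.
MU = [(math.cos(a), math.sin(a))
      for a in (math.pi / 2, math.pi / 2 + 2 * math.pi / 3, math.pi / 2 + 4 * math.pi / 3)]

def quantise(metric):
    """Round a metric in [0, 1] down to Q = {0, 0.2, 0.4, 0.6, 0.8, 1.0}."""
    return math.floor(metric * 5) / 5

def bin_of(cpu, net, ram):
    return (quantise(cpu), quantise(net), quantise(ram))

def position(b):
    """p = sum_i m_i * mu_i (equal metrics cancel out to the centre)."""
    x = sum(m * mu[0] for m, mu in zip(b, MU))
    y = sum(m * mu[1] for m, mu in zip(b, MU))
    return (x, y)

def alpha(b):
    """Average of the metrics, thresholded to a maximum of 1."""
    return min(1.0, sum(b) / 3)

def radius(count, base=4.0):
    """Logarithmically proportional to the number of nodes in the bin."""
    return base * math.log1p(count)

# Example: bin a tiny fleet of (cpu, net, ram) samples and lay out the circles.
fleet = [(0.45, 0.05, 0.70), (0.41, 0.08, 0.66), (0.95, 0.90, 0.20)]
bins = Counter(bin_of(*node) for node in fleet)
for b, count in bins.items():
    print(b, position(b), alpha(b), radius(count))
```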
3.4 System Design Methodology

An iterative methodology was followed for the development of our system. Firstly, the simplest working solution was put together, using scripting tools where possible. Additional functionality was developed as, and when, required by other components in the system. Components were developed in different languages, as each component was built using the most appropriate tools for its task. The Monitoring system was written in D, a low-level systems programming language. The Database interface was implemented in Python, a scripting language. C++ and the Qt toolkit were used for the Visualisation front-end, using Qwt for historical graphing. OpenGL (via the Qt Graphics View Framework) was used for the main visualisations. Development and testing were done on Linux.

4. EVALUATION

Since access to upwards of 500 nodes was infeasible, simulated data was used at various stages to represent a very large fleet. Each component was subjected to an individual evaluation and tested with simulated data as well as data from our live test fleet over a period of four months. Panopticon was tested on two separate clusters, as well as a single WAN cluster made by combining both of the separate clusters (see Table 1). The TSL cluster was an active university computer laboratory being used by undergraduate students. Load was periodically generated on the EC2 cluster by using a custom parallel n-body simulation.

Table 1: Overview of our test fleets.
System        nodes   clusters   location
EC2 Cluster   20      1          off-site
TSL           36      1          on-site
Everything    56      2          mixture

4.1 Monitoring and Collection

In testing the monitoring system, we were faced with a practical problem: we had around 100 computers at our disposal, but wanted to see if our system could cope when 10 000 or more nodes were monitored. Therefore, we simulated fake zones, which then aggregated their fake information to their parent aggregation nodes. Since the exact values of the information being monitored did not affect the behaviour of the collection system, the simulated results give a reliable indication of expected performance on real hardware. All reported results are based on an average of 9 test runs.

4.2 Storage and Retrieval

Over the four-month test period the Storage and Retrieval system collected around one million rows of data (i.e. node measurements) at a resolution of 5 minutes per node. Raw storage component rates were determined by simulating a 10 000 node fake zone on a single root node and polling it continuously. Since the actual values of the data do not affect the database performance (within certain limits), pseudo-random data was used. The performance of the data retrieval subsystem was evaluated for a fleet of a million nodes. Similarly, pseudo-random data was used, as the only effect the data has on performance is its delta-compressibility.

4.3 Visualisation

The Visualisation component was evaluated by two expert users: a departmental systems administrator from our university and a developer from Amazon EC2 Web Services. While using our prototype system, these expert users were asked questions regarding the efficacy of the visualisations, with their responses being recorded. More detailed usability testing was not applicable, as Panopticon was developed to be a proof of concept, not a production system. The expert users were shown the system in two different configurations: with live reported data to demonstrate the real-time aspects, and with simulated data to demonstrate scalability. For the live data, 120 nodes were visualised due to intentional duplication in the aggregation component's configuration. Of these, around 55 nodes were active and reporting. Since there were at most only a few hundred computers at our disposal, a simulated monitoring system was required in order to test the Visualisation component with large fleets (we simulated 15 000 nodes).

Table 2: Results for scalability evaluation. CPU usage was negligible and excluded.
Node Count   Bandwidth (Mbps)   Memory (MiB)
100          0.03               2.44
500          0.13               4.72
1 000        0.25               8.72
5 000        1.22               16.76
10 000       2.59               31.76
25 000       6.20               74.76
50 000       12.28              126.76
75 000       18.32              188.76
100 000      21.41              310.76

Figure 10: The resources required of Panopticon's top aggregation node — they are well within the limits of current hardware.

Table 3: Time taken for visualisation queries on 1 million nodes, averaged over 100 tests.
Query           Wall Time
Full State      31.9 ± 0.3 s
Update          19.9 ± 0.2 s
Rewind          13.2 ± 0.1 s
Historic Data   < 0.002 s

Unlike the other components, which could be tested with random data, the Visualisation component requires realistic and sensible data for any meaningful evaluation to take place. However, implementing a back-end that can (i) realistically simulate a real-time monitoring system, (ii) supply realistic and consistent historic data, and (iii) have metrics that realistically relate to each other, is an extremely complex task requiring prolonged access to many more computers than we had access to. As a compromise, we generated pseudo-realistic monitoring information, with relative ease, by assigning simulated nodes to a set of characteristic usage classes (e.g. web servers, simulation clusters, general-use computer laboratories). The nodes are collected into groups according to usage classes, as could be expected in an orderly data centre. Each class was assigned common failure modes through consultation with expert users.
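The paper does not show its generator; the following is a rough Python sketch of the compromise described above, with made-up usage classes, mean loads and failure rates purely for illustration (the real classes and failure modes came from consultation with expert users).

```python
import random

# Hypothetical usage classes: mean metric levels and a per-sample failure probability.
USAGE_CLASSES = {
    "web_server":   {"cpu": 0.30, "net": 0.60, "ram": 0.50, "p_fail": 0.001},
    "sim_cluster":  {"cpu": 0.90, "net": 0.20, "ram": 0.70, "p_fail": 0.005},
    "teaching_lab": {"cpu": 0.15, "net": 0.10, "ram": 0.30, "p_fail": 0.010},
}

def sample_node(usage_class, rng=random):
    """One pseudo-realistic metric sample for a node of the given class."""
    spec = USAGE_CLASSES[usage_class]
    if rng.random() < spec["p_fail"]:
        return None  # node appears dead for this interval
    jitter = lambda mean: min(1.0, max(0.0, rng.gauss(mean, 0.1)))
    return {"cpu": jitter(spec["cpu"]), "net": jitter(spec["net"]), "ram": jitter(spec["ram"])}

def simulated_fleet(class_sizes, rng=random):
    """Group simulated nodes by usage class, as in an orderly data centre."""
    fleet = {}
    for usage_class, size in class_sizes.items():
        fleet[usage_class] = [sample_node(usage_class, rng) for _ in range(size)]
    return fleet

fleet = simulated_fleet({"web_server": 5000, "sim_cluster": 8000, "teaching_lab": 2000})
print(sum(len(nodes) for nodes in fleet.values()), "simulated nodes")
```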
5. RESULTS AND CONCLUSIONS

In our four-month real-world monitoring exercise, our monitoring agents proved themselves to be robust and to have a low system overhead. CPU and memory usage on monitored nodes was found to be negligible, even when polling once every 15 seconds. Aside from some minor MySQL dead-lock issues, which occurred even at the lowest transactional isolation level, everything ran well and was stable. During the last two months, the Visualisation component regularly connected to the retrieval system and was able to view the state of our test fleet.

In evaluations with simulated data, the performance requirements of aggregation nodes were shown to be well within the limits of consumer hardware; Table 2 and Figure 10 show our system was easily capable of aggregating information from 100 000 nodes to a single aggregation node on a LAN. Further protocol improvements, such as incorporating data compression, would be required for the system to be applicable to WANs or larger fleets.

The Storage component produced approximately 100 MiB of data and indices in the four-month test, yielding an average of 30 KiB per node per day. Therefore, 1 TiB of storage would be able to hold a 90-day monitoring history for just under 400 000 computers — a trivial expense when considering the cost of maintaining a fleet of that size. When dealing with these quantities of information, we found it important to ensure that database queries avoid full table scans and only return relevant information wherever possible.

The monitoring system communicated with the storage via a text-based JSON protocol. JSON was chosen for simplicity and readability, but parsing the information sent from the monitoring system became a significant bottleneck. Using Python's simplejson parser, our system was able to poll 10 000 nodes in 95 ± 5 s, leading to a limit of 25 000 nodes when monitoring at a 5-minute resolution.

The performance of the data retrieval subsystem was evaluated for a virtual fleet of a million nodes (see Table 3). It takes 32 s for a Full State query to complete, which is acceptable since it is a once-off start-up cost. Update times of 20 s are also suitable, given the 5-minute monitoring resolution and the fact that this is a background operation that the user is not aware of. Historic queries are instantaneous and therefore well within acceptable limits. Only the rewind time of 13 s is unacceptable, as near real-time response times are required for replay functionality to be immediately useful. Despite these limitations, our system showed that an ordinary RDBMS could be used to store and retrieve information from a monitoring system on the order of 100 000 nodes in real time. We suggest that in future work a distributed storage system be investigated to achieve further scalability. While a centralised RDBMS is suitable in terms of storage requirements for a million-node fleet, other approaches would have to be followed to improve retrieval performance. We suggest either a decentralised, non-ACID-compliant storage system or a high degree of partitioning across multiple servers.

In the Visualisation component evaluations, overall, the expert users reported our visualisations to be useful when monitoring up to 20 000 computers.
Although initially unfamiliar, the Metric-map was said to be "surprisingly intuitive" to use. One user stressed the importance of providing context to a current visualisation, through access to historic information. While the Replay Tool provided basic functionality in this regard, it was suggested that the visualisations should incorporate a view of recent history. It was also noted that the expert users did not make use of the Node Detail level of detail. Table 4 lists all significant expert user feedback.

Table 4: Reported feedback from our two expert users on the Visualisation component, obtained during an interview while they experimented with the system.
Node-map
  Positive comments: Intuitive to use; Layout is simple to understand; Effective group summary; Rewind ability is valuable.
  Suggested improvements: Recent History view.
Metric-map
  Positive comments: Surprisingly intuitive to use; Provides a useful overview; Replay Tool very useful.
  Suggested improvements: Historic Information; Metric-map should be the emphasised feature in the GUI; Update highlighting; More complex graphing tool; Recent History as part of main visualisation.
General
  Positive comments: Visualisations very effective for 100s of nodes; Effective up to 20 000 nodes; Selection propagation between visualisations is valuable.
  Suggested improvements: More than 3-level hierarchies for visualisation; Update highlights should also appear on the Metric-map.

The expert users reported that the Visualisation component provided an effective visualisation of the overall fleet status. The Metric-map was seen to be better at representing overall information, with the Node-map being better suited to individual information. However, the interaction between the two visualisations, through selection propagation, was seen as the most important aspect of both.

Including easily accessible recent history provides a valuable perspective on current information. Users proposed that this would be convenient for identifying possible issues with the current state of a fleet. Making this recent history visible alongside the real-time visualisation is a possible extension to enhance the Visualisation component, but would also require improvements to the data-retrieval process. While our system focussed on providing an alternative to graphing as a means of viewing monitoring information, graphs were still seen as necessary for viewing specific, long-term, historic information.

Features such as logical grouping, aggregation, multiple levels of detail and selection propagation between independent views enhanced our system's scalability. Information filtering techniques would be useful to further increase the scalability of the Visualisation component.

As a proof-of-concept system, Panopticon was successful. All components were shown to be able to scale effectively into the tens of thousands of monitored nodes, mainly limited by the Visualisation component and information retrieval speeds.

6. REFERENCES

[1] E. Anderson and D. Patterson. Extensible, scalable monitoring for clusters of computers. In LISA '97: Proceedings of the 11th USENIX Conference on System Administration, pages 9–16, 1997.
[2] D. Asimov. The grand tour: a tool for viewing multidimensional data. SIAM Journal of Scientific and Statistical Computing, 6(1):128–143, January 1985.
[3] J. M. Brandt, B. Debusschere, A. C. Gentile, J. R. Mayo, P. P. Pébay, D. Thompson, and M. H. Wong. OVIS-2: A robust distributed architecture for scalable RAS. In Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, pages 1–8, April 2008.
[4] J. M. Brandt, A. C. Gentile, D. J. Hale, and P. P. Pébay. OVIS: A tool for intelligent, real-time monitoring of computational clusters. In Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium, April 2006.
[5] D. Crockford. RFC 4627: The application/json Media Type for JavaScript Object Notation (JSON). IETF, The Internet Society, July 2006.
[6] R. T. Fielding. Architectural Styles and the Design of Network-Based Software Architectures. PhD thesis, University of California, Irvine, 2000.
[7] J. Goldstein and S. F. Roth. Using aggregation and dynamic queries for exploring large data sets. In B. Adelson, S. Dumais, and J. Olson, editors, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Celebrating Interdependence, pages 23–29. ACM, April 1994.
[8] A. Inselberg and B. Dimsdale. Parallel coordinates: A tool for visualizing multidimensional geometry. In A. Kaufman, editor, Proceedings of the 1st Conference on Visualization '90, pages 361–378. IEEE Computer Society Press, October 1990.
[9] D. A. Keim. Designing pixel-oriented visualization techniques: Theory and applications. IEEE Transactions on Visualization and Computer Graphics, 6(1):59–78, January 2000.
[10] D. A. Keim. Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics, 8(1):1–8, January 2002.
[11] M. Kreuseler, N. Lopez, and H. Schumann. A scalable framework for information visualization. In Proceedings of the IEEE Symposium on Information Visualization 2000, page 27. IEEE Computer Society, 2000.
[12] M. L. Massie, B. N. Chun, and D. E. Culler. The Ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 30(7):817–840, 2004.
[13] T. Oetiker. MRTG: the multi router traffic grapher. In Proceedings of the 12th USENIX Conference on System Administration, pages 141–148, December 1998.
[14] F. Sacerdoti, M. Katz, M. Massie, and D. Culler. Wide area cluster monitoring with Ganglia. In Proceedings of the IEEE International Conference on Cluster Computing, pages 289–298. IEEE Press, December 2003.
[15] J. Schneidewind, M. Sips, and D. A. Keim. An automated approach for the optimization of pixel-based visualizations. Information Visualization, 6(1):75–88, March 2007.
[16] M. Sips, J. Schneidewind, D. A. Keim, and H. Schumann. Scalable pixel-based visual interfaces: Challenges and solutions. In Proceedings of the Tenth International Conference on Information Visualization. IEEE Press, July 2006.
[17] C. Stolte, D. Tang, and P. Hanrahan. Polaris: A system for query, analysis, and visualization of multidimensional relational databases. IEEE Transactions on Visualization and Computer Graphics, 8(1), January 2002.
[18] R. Van Renesse, K. P. Birman, and W. Vogels. Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining. ACM Transactions on Computer Systems, 21(2):164–206, May 2003.