With the growth of social media, embedded sensors, and “smart” devices, those responsible for managing resources during emergencies, such as weather-related disasters, are transitioning from an era of data scarcity to one of data deluge. During a crisis, emergency managers must aggregate diverse data to assess the situation on the ground, evaluate response plans, advise state and local agencies, and inform the public. We make the case that social graph analysis and natural language modeling in real time are paramount to distilling useful intelligence from the large volumes of data available to crisis response personnel. Using ground-truth information from social media data surrounding Hurricane Sandy's 2012 landfall in New York City, we test and evaluate our real-time analytics platform for identifying immediate and critical information that increases situational awareness during disastrous events.
2014 KDD Workshop on Learning about Emergencies from Social Information
During real-world crises, individuals use their online social networks in several ways: to make observations about ongoing events, to request information from their networks and local communities, and to gain situational intelligence about unfolding events. In doing so, users often express highly useful emergency-relevant information, whether intentionally or not. Unfortunately, the massive amount of data generated makes filtering and processing useful social media artifacts nearly impossible with naive methods such as simple keyword searches. To address this problem, we offer two contributions. First, we demonstrate the efficacy of applying a generalizable and meaningful emergency-related ontology for large-scale reasoning over social media text. Second, we show how topic models trained on a rich data set (here, tweets collected during Hurricane Sandy) can immediately provide insightful knowledge for disparate disasters.
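As a rough, hypothetical illustration of the topic-modeling step (not the paper's actual pipeline), a model fit on Hurricane Sandy tweets can be reused to score messages from a different disaster; the tiny corpus, topic count, and preprocessing below are placeholders.

```python
# Hypothetical sketch: fit LDA topics on Hurricane Sandy tweets, then score
# tweets from another event against those topics. Corpus, topic count, and
# preprocessing are placeholders, not the paper's data or settings.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sandy_tweets = [
    "power out across lower manhattan tonight",
    "subway tunnels flooded near battery park",
    "need a shelter that accepts pets in brooklyn",
    "gas lines stretching for blocks in staten island",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sandy_tweets)

lda = LatentDirichletAllocation(n_components=3, random_state=0)  # topic count is a guess
lda.fit(X)

# Score tweets from a different disaster against the Sandy-trained topics.
new_tweets = ["where can we find shelter after the tornado"]
print(lda.transform(vectorizer.transform(new_tweets)))  # per-tweet topic mixture
```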
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million, respectively, which corresponds to more than a 2X performance improvement over previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyzing massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
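For readers unfamiliar with the underlying kernel, here is a minimal sequential sketch of Brandes-style betweenness centrality in Python; the lock-free, cache-friendly parallel version described in the abstract is not reproduced here.

```python
# Minimal sequential sketch of Brandes' betweenness centrality, the kernel that
# the paper parallelizes; adj maps each vertex to an iterable of its neighbors
# in an unweighted graph.
from collections import deque, defaultdict

def betweenness(adj):
    bc = defaultdict(float)
    for s in adj:
        stack, preds = [], defaultdict(list)
        sigma = defaultdict(int); sigma[s] = 1          # shortest-path counts
        dist = {s: 0}
        q = deque([s])
        while q:                                        # BFS from source s
            v = q.popleft(); stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = defaultdict(float)
        while stack:                                    # dependency accumulation
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc                                           # halve for undirected graphs
```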
Social networks produce an enormous quantity of data. Facebook consists of over 400 million active users sharing over 5 billion pieces of information each month. Analyzing this vast quantity of unstructured data presents challenges for software and hardware. We present GraphCT, a Graph Characterization Toolkit for massive graphs representing social network data. On a 128-processor Cray XMT, GraphCT estimates the betweenness centrality of an artificially generated (R-MAT) graph with 537 million vertices and 8.6 billion edges in 55 minutes, and of a real-world graph (Kwak et al.) with 61.6 million vertices and 1.47 billion edges in 105 minutes. We use GraphCT to analyze public data from Twitter, a microblogging network. Twitter's message connections appear primarily tree-structured, acting as a news dissemination system. Within the public data, however, are clusters of conversations. Using GraphCT, we can rank actors within these conversations and help analysts focus attention on a much smaller data subset.
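GraphCT's own interface is not shown here; as a small stand-in, the same source-sampling idea behind approximate betweenness centrality is available through the k parameter of NetworkX's betweenness_centrality (the random graph and sample size below are placeholders).

```python
# Stand-in example (not GraphCT): estimate betweenness centrality by sampling
# source vertices, the same approximation idea at a toy scale.
import networkx as nx

G = nx.gnm_random_graph(10_000, 50_000, seed=1)           # placeholder graph
approx_bc = nx.betweenness_centrality(G, k=256, seed=1)   # 256 sampled sources
top = sorted(approx_bc, key=approx_bc.get, reverse=True)[:10]
print(top)                                                 # ten highest-ranked vertices
```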
We present a new approach for parallel massive graph analysis of streaming, temporal data with a dynamic and extensible representation. Handling the constant stream of new data from health care, security, business, and social network applications requires new algorithms and data structures. We examine data structure and algorithm trade-offs that extract the parallelism and high performance necessary for rapidly updating analysis of massive graphs. Static implementations of analysis kernels often rely on specific structure in the input data; maintaining the specific structures for each possible kernel under high data rates imposes too great a performance price. A case study with clustering coefficients demonstrates that incremental updates can be more efficient than global recomputation. Within this kernel, we compare three methods for dynamically updating local clustering coefficients: a brute-force local recalculation, a sorting algorithm, and our new approximation method using a Bloom filter. On 32 processors of a Cray XMT with a synthetic scale-free graph of 2^24 ≈ 16 million vertices and 2^29 ≈ 537 million edges, the brute-force method processes a mean of over 50,000 updates per second and our Bloom filter approaches 200,000 updates per second.
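A much-simplified, assumed sketch of the Bloom-filter idea (not the Cray XMT implementation): when edge (u, v) is inserted, the number of newly closed triangles can be estimated by testing v's neighbors against a Bloom filter built from u's neighbors, trading exact set intersection for a compact, insert-only membership structure.

```python
# Simplified sketch of the approximation idea: count common neighbors of u and v
# (triangles closed by the new edge) via a Bloom filter over u's neighbor set.
# adj is assumed to map each vertex to a set of neighbors.
import hashlib

class Bloom:
    def __init__(self, bits=1 << 16, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)
    def _positions(self, item):
        for i in range(self.hashes):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.bits
    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def new_triangles(adj, u, v):
    """Approximate number of triangles closed by inserting edge (u, v)."""
    bloom = Bloom()
    for n in adj[u]:
        bloom.add(n)
    return sum(1 for n in adj[v] if n in bloom)   # may overcount (false positives)
```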
Current online social networks are massive and still growing. For example, Facebook has over 500 million active users sharing over 30 billion items per month. The scale of these data streams has outstripped traditional graph analysis methods. Real-time monitoring for anomalies may require dynamic analysis rather than repeated static analysis. The massive state behind multiple persistent queries requires shared data structures and flexible representations. We present a framework based on the STINGER data structure that can monitor a global property, connected components, on a graph of 16 million vertices at rates of up to 240,000 updates per second on 32 processors of a Cray XMT. For very large scale-free graphs, our implementation uses novel batching techniques that exploit the scale-free nature of the data and runs over three times faster than prior methods. Our framework handles, for the first time, real-world data rates, opening the door to higher-level analytics such as community and anomaly detection.
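As a toy illustration only (assumed, not the STINGER framework itself), tracking connected components under an insertion-only edge stream reduces to union-find; handling edge deletions and the batching techniques described above require considerably more machinery.

```python
# Toy sketch: maintain connected components over a stream of edge insertions
# with union-find. Deletions (the hard case the framework addresses) would need
# extra bookkeeping or periodic recomputation, not shown here.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.count = n                              # current number of components
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
            self.count -= 1

uf = UnionFind(16)                                  # placeholder vertex count
for u, v in [(0, 1), (1, 2), (5, 6)]:               # placeholder edge stream
    uf.union(u, v)
print(uf.count)                                     # 13 components remain
```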
We present a new parallel algorithm that extends and generalizes the traditional graph analysis metric of betweenness centrality to include additional non-shortest paths according to an input parameter k. Betweenness centrality is a useful kernel for analyzing the importance of vertices or edges in a graph and has found uses in social networks, biological networks, and power grids, among others. k-betweenness centrality captures the additional information provided by paths whose length is within k units of the shortest path length. These additional paths provide robustness that is not captured in traditional betweenness centrality computations, and they may become important shortest paths if key edges are missing in the data. We implement our parallel algorithm using lock-free methods on a massively multithreaded Cray XMT. We apply this implementation to a real-world data set of pages on the World Wide Web and show the importance of the additional data incorporated by our algorithm.
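Reading from the abstract alone, the quantity can be sketched as follows (notation assumed; the paper's exact treatment of the non-shortest paths may differ):

```latex
% Hedged sketch of k-betweenness centrality as described in the abstract:
% \sigma^{(k)}_{st} counts s--t paths of length at most d(s,t)+k, and
% \sigma^{(k)}_{st}(v) counts those passing through v.
\[
  BC_k(v) \;=\; \sum_{s \neq v \neq t} \frac{\sigma^{(k)}_{st}(v)}{\sigma^{(k)}_{st}},
  \qquad
  \sigma^{(k)}_{st} \;=\; \bigl|\{\, s\text{--}t \text{ paths } P : |P| \le d(s,t) + k \,\}\bigr|.
\]
```

With k = 0 this reduces to standard betweenness centrality over shortest paths.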
We present a new lock-free parallel algorithm for computing betweenness centrality of massive complex networks that achieves better spatial locality compared with previous approaches. Betweenness centrality is a key kernel in analyzing the importance of vertices (or edges) in applications ranging from social networks, to power grids, to the influence of jazz musicians, and is also incorporated into DARPA HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph analytics. We design an optimized implementation of betweenness centrality for the massively multithreaded Cray XMT system with the Threadstorm processor. For a small-world network of 268 million vertices and 2.147 billion edges, the 16-processor XMT system achieves a TEPS rate (an algorithmic performance count for the number of edges traversed per second) of 160 million per second, which corresponds to more than a 2× performance improvement over the previous parallel implementation. We demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for the large IMDb movie-actor network.
The goal of this effort is to design software for the analysis of massive-scale spatio-temporal interaction networks using multithreaded architectures such as the Cray XMT. The Center launched in July 2008 and is led by Pacific Northwest National Laboratory.