There is a trend towards multicore or manycore processors in computer architecture design. In addition, several parallel programming models have been introduced. Some extract concurrent threads implicitly whenever possible, resulting in fine-grained threads. Others construct threads by explicit user specifications in the program, resulting in coarse-grained threads. How these two mechanisms impact performance remains an open question. Implicitly constructed fine-grained threads exhibit more overhead due to additional thread scheduling, thread communication, and thread context switches. However, they also increase the flexibility in scheduling. Therefore, computation resources can be utilized further and workloads are more balanced among cores. Moreover, if scheduled properly, concurrent fine-grained threads may exhibit more data affinity than coarse-grained threads. In most parallel architectures, the last-level cache is typically shared among all the cores. Therefore, it...
Proceedings of the 11th ACM Conference on Computing Frontiers, 2014
Understanding workload behavior plays an important role in performance studies. The growing complexity of applications and architectures has increased the gap among application developers, performance engineers, and hardware designers. To reduce this gap, we propose SKOPE, a SKeleton framework for Performance Exploration, which produces a descriptive model of the semantic behavior of a workload; the model can infer potential transformations and help users understand how workloads may interact with and adapt to emerging hardware. SKOPE models can be shared, annotated, and studied by a community of performance engineers and system designers; they offer readability in the frontend and versatility in the backend. SKOPE can be used for performance analysis, tuning, and projection. We provide two example use cases. First, we project GPU performance from CPU code without GPU programming or access to the hardware, and are able to automatically explore transformations; the projected best-achievable performance deviates from the measured performance by 18% on average. Second, we project the multi-node scaling trends of two scientific workloads and achieve a projection accuracy of 95%. This contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
The history of parallel computing shows that good performance is heavily dependent on data locality. Prior knowledge of data access patterns allows for optimizations that reduce data movement, achieving lower data access latencies. Compilers and runtime systems, however, have difficulties in speculating on locality issues among threads. Future multicore architectures are likely to present a hierarchical model of parallelism, with multiple threads on a core and multiple cores on a chip. With such a system, data affinity and localization become even more important to efficiently use per-core resources. We show how an application programming interface (API) with the right abstractions can conveniently indicate data locality and that a system can use this information to place threads in a way that minimizes cache miss rates and interconnect traffic. This information is often well understood and easily expressed by the programmer but is typically lost to the system, forcing runtime environments to rediscover it on the fly, a far more costly approach. Our system is particularly well suited for the trend in manycore architectures towards large numbers of simple cores connected by a decentralized interconnect fabric.
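To illustrate the idea of letting programmer-supplied locality hints drive thread placement, here is a minimal sketch. The function and parameter names (`assign_to_cores`, `shared_data`) are hypothetical illustrations, not the paper's actual API: threads that declare the same data block are greedily co-located so their shared working set stays within one core's cache.

```python
def assign_to_cores(threads, shared_data, num_cores):
    """Hypothetical sketch: co-locate threads that touch the same
    data block, so shared working sets stay in one core's cache.

    threads:     list of thread names
    shared_data: map from thread name to the data block it accesses
                 (the programmer-supplied locality hint)
    num_cores:   number of available cores
    """
    placement = {}       # thread -> core
    core_of_block = {}   # data block -> core that owns it
    next_core = 0
    for t in threads:
        block = shared_data[t]
        if block not in core_of_block:
            # First thread touching this block picks the next core
            # round-robin; later threads follow it.
            core_of_block[block] = next_core % num_cores
            next_core += 1
        placement[t] = core_of_block[block]
    return placement
```

A runtime with this mapping avoids rediscovering sharing patterns on the fly; threads `t0` and `t1` below share block `A` and land on the same core.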
Dynamic Warp Subdivision for Non-Speculative Runahead SIMT Gather. Jiayuan Meng, Kevin Skadron, Department of Computer Science, University of Virginia. Background: SIMT architecture. Scalar threads are grouped into warps that operate with a common instruction sequence in lockstep (e.g., NVIDIA's Tesla architecture [1]).
This paper describes the performance analysis of the light field refocusing algorithm running on different hardware platforms, including the Intel Pentium 4 with SSE2 (Streaming SIMD Extensions), a GPU, and the Cell Broadband Engine. Each platform has unique features, making it interesting to compare their performance on such an application and to see how much advantage or disadvantage each one has over the others.
People like magic. It is fun to imagine that a cloud is shaped like Mickey Mouse, or that a pattern of leaves and flowers resembles a human face. In fiction movie or cartoon production, we sometimes like to see similar visual effects. In this paper, we define the problem as follows: given two images, one for pattern and one for shape, we output another image that draws the shape provided in the shape image using the patterns defined in the pattern image. This is a typical texture synthesis problem with user interaction.
Computer games have become a driving application in the personal computer industry. For computer architects designing general-purpose microprocessors, understanding the characteristics of this application domain is important to meet the needs of this growing market demographic. In addition, games and 3D-graphics applications are some of the most demanding personal computer programs.
Diminishing returns in single-thread performance have forced a reevaluation of priorities in microprocessor design. Recent architectures have foregone deeper pipelining in favor of multiple cores per chip and multiple threads per core. The day approaches when processors with hundreds or thousands of cores are commonplace, but programming models for these manycore architectures lag far behind the architectures themselves. We are developing Fractal, a manycore architecture and associated programming model we call relaxed streaming. Relaxed streaming allows flexible and convenient stream access, implicit memory management and dependency enforcement, and the decoupling of sequential and parallel phases of execution. This paper presents relaxed streaming in the context of our Fractal API, discussing the benefits of a relaxed streaming model over more traditional streaming models, especially in terms of convenience and ease of use.
In this paper we present an application that displays high-resolution images on a multi-projector tiled high-resolution display wall. Our goal is to enable users to view high-resolution image data, such as photos from satellites and microscopes. Panoramas can also be played cyclically. Users can pan across the image, scale the image, and play panoramas interactively.
Emerging applications such as scientific computation, media processing, machine learning, and data mining are commonly computation- and data-intensive [1], and they usually exhibit abundant parallelism. These applications motivate the design of throughput-oriented many- and multi-core architectures that employ many small and simple cores and scale up to large thread counts. The cores themselves are also typically multi-threaded.
Architectures that aggressively exploit SIMD often have many datapaths execute in lockstep and use multi-threading to hide latency. They can yield high throughput in terms of area and energy efficiency for many data-parallel applications. To balance productivity and performance, many recent SIMD organizations incorporate implicit cache hierarchies. Examples of such architectures include Intel's MIC, AMD's Fusion, and NVIDIA's Fermi.
SIMD organizations have been shown to provide high throughput for data-parallel applications. They can operate multiple datapaths under the same instruction sequencer, with the set of operations proceeding in lockstep sometimes referred to as a warp and a single lane referred to as a thread. However, the ability of SIMD to gather from disparate addresses instead of aligned vectors means that a single long-latency memory access will suspend the entire warp until it completes.
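The lockstep stall can be sketched with a toy cost model (a simplified illustration with invented latency numbers, not a simulation of any real GPU): under lockstep semantics a gather costs the warp as much as its slowest lane, whereas lanes that could proceed independently would not all wait on one cache miss.

```python
def warp_gather_cycles(lane_latencies):
    # Lockstep SIMD: the warp cannot retire the gather until the
    # slowest lane's load returns, so one miss stalls every lane.
    return max(lane_latencies)

def independent_lane_cycles(lane_latencies):
    # If lanes could proceed independently, each lane waits only on
    # its own load; average progress is bounded per-lane.
    return sum(lane_latencies) / len(lane_latencies)

# Hypothetical 4-lane warp: lane 2 misses in cache (invented numbers).
LANE_LATENCIES = [2, 2, 300, 2]
```

With these assumed latencies, the lockstep warp pays 300 cycles even though three of four lanes hit in cache, which is the motivation for techniques such as warp subdivision.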
As systems grow larger and computation is further spread across nodes, efficient data communication is becoming increasingly important to achieve high throughput and low power consumption for high performance computing systems. However, communication efficacy not only depends on application-specific communication patterns, but also on machine-specific communication subsystems, node architectures, and even the runtime communication libraries.
Applications often have a sequence of parallel operations to be offloaded to graphics processors; each operation can become an individual GPU kernel. Developers typically explore a variety of transformations for each kernel. Furthermore, it is well known that efficient data management is critical in achieving high GPU performance and that "fusing" multiple kernels into one may greatly improve data locality.
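Why fusion helps locality can be shown with a minimal NumPy sketch (a conceptual illustration, not the paper's framework): two separate elementwise "kernels" stream an intermediate array through memory, while the fused version computes the same result in a single pass over the data.

```python
import numpy as np

def two_kernels(a):
    # Kernel 1 writes an intermediate array b back to memory;
    # kernel 2 must then re-read b, roughly doubling memory traffic.
    b = a * 2.0
    c = b + 1.0
    return c

def fused_kernel(a):
    # The fused kernel applies both operations in one pass, so the
    # intermediate value conceptually stays in registers or cache.
    return a * 2.0 + 1.0
```

Both versions are semantically equivalent; the payoff of fusion is purely in data movement, which is why the transformation is safe to explore automatically.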
Future general-purpose architectures will scale to hundreds of cores. In order to accommodate both latency-oriented and throughput-oriented workloads, the system is likely to present a heterogeneous mix of cores. In particular, sequential code can achieve peak performance with an out-of-order core, while parallel code achieves peak throughput over a set of simple, in-order (IO) or single-instruction, multiple-data (SIMD) cores. These large-scale, heterogeneous architectures form a prohibitively large design space, including not just the mix of cores, but also the memory hierarchy, coherence protocol, and on-chip network (OCN).
We propose GROPHECY, a GPU performance projection framework that can estimate the performance benefit of GPU acceleration without actual GPU programming or hardware. Users need only to skeletonize pieces of CPU code that are targets for GPU acceleration. Code skeletons are automatically transformed in various ways to mimic tuned GPU codes with characteristics resembling real implementations. The synthesized characteristics are used by an existing analytical model to project GPU performance. The cost and benefit of GPU development can then be estimated according to the transformed code skeleton that yields the best projected performance. With GROPHECY, users can leap toward GPU acceleration only when the cost-benefit makes sense. The framework is validated using kernel benchmarks and data-parallel codes in legacy scientific applications. The measured performance of manually tuned codes deviates from the projected performance by 17% in geometric mean.
International Journal of Parallel Programming, 2011
Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared memory systems, we establish a performance model using NVIDIA's Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 95% of the optimal speedup.
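The ghost-zone trade-off can be demonstrated on a tiny 1D stencil (a minimal NumPy sketch with a 3-point averaging stencil; the function names and the tile geometry are illustrative, not taken from the paper): widening a tile by `ghost` cells per side lets it run `ghost` iterations without any halo exchange, at the cost of redundantly recomputing the ghost cells.

```python
import numpy as np

def step(u):
    # One iteration of a 1D 3-point averaging stencil;
    # boundary cells are held fixed.
    v = u.copy()
    v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
    return v

def run(u, steps):
    # Global reference: iterate the stencil on the full array.
    for _ in range(steps):
        u = step(u)
    return u

def tile_with_ghost(u, lo, hi, ghost, steps):
    # Widen the tile [lo, hi) by `ghost` cells per side, run
    # `steps` iterations locally with no exchange (valid while
    # steps <= ghost, since error from the fixed local edges
    # creeps inward one cell per step), then keep only the
    # tile's own cells.
    wlo = max(lo - ghost, 0)
    whi = min(hi + ghost, len(u))
    wide = run(u[wlo:whi].copy(), steps)
    off = lo - wlo
    return wide[off:off + (hi - lo)]
```

For an interior tile with `ghost >= steps`, the locally computed cells match the global run exactly, which is the correctness condition the ghost-zone size selection must respect.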
Papers by Jiayuan Meng