VtkSMP: Task-based Parallel Operators for VTK Filters
Project-Team Moais
Research Report N° 8245, February 2012
Abstract: Scientific simulations produce more and more memory-consuming datasets. The required processing resources need to keep pace with this increase. Though parallel visualization algorithms with strong performance gains have been developed, there is a need for a parallel programming environment tailored for scientific visualization algorithms that would help application programmers move to multi-core programming.
Facing the challenge of genericity, standard solutions like Pthreads, MPI, OpenCL or OpenMP often fail to offer environments that are both easy to handle and lead to efficient executions.
In this paper, we study the parallelization of patterns commonly used in VTK algorithms and propose a new multi-threaded plugin for VTK that eases the development of parallel algorithms. Focusing on code reusability, we provide tools that implement these patterns in a generic way. With little code modification, we experiment with these patterns on a transform filter, a brute-force isosurface extraction filter and a min/max tree accelerated isosurface extraction. Results show that we can speed up execution by up to 30 times on a 48-core machine.
Key-words: parallelism, parallel runtimes, work stealing, scientific visualization, VTK.
Introduction
The size of data produced by scientific simulations is growing at a steep rate. Post-processing tools, including scientific visualization libraries, are urged to evolve accordingly to cope with these datasets. Simple PCs as well as large supercomputers are today built around multi-core processor architectures. Taking advantage of their processing power requires a core-level parallelization.
Though many parallel algorithms have been proposed to supplement sequential ones, many commonly used libraries still do not support efficient multi-core parallel executions. The VTK scientific visualization library is one of them. Besides the important effort required to revisit a large sequential code base, another issue is probably the lack of a standard parallel programming environment.
These kinds of programming environments face two main issues: providing a programming model that enables the programmer to swiftly shift from their sequential programming habits to parallel ones, and a runtime system that ensures efficient executions even with a moderate optimization effort. Usual runtime systems generally lack one of these requirements.
Our research goal is to provide a fully functional plugin for VTK offering tools that can be used to parallelize any VTK algorithm. Driven by genericity and code reusability, we provide high-level parallel operators that easily replace sequential patterns. Our plugin can handle different runtimes, chosen at compile time. We study the performance of our plugin so that, whatever parallel runtime it uses, the operators give the best of their capabilities.
In this paper we study three different parallelization patterns. The first one targets loops with independent iterations (foreach loops) that produce independent data chunks of known size, which can be written directly into a global data structure without concurrency-related issues. Next, we look at foreach loops that produce data with an unknown memory footprint; a merge of partial results can thus be necessary to produce a compact data structure. Finally, we propose to parallelize a tree traversal, trees being a common acceleration data structure. Results show that work-stealing runtimes are better at ensuring both efficiency and optimal resource utilization on any kind of problem. They achieve speed-ups of 30 on a 48-core machine for very irregular problems that expose severe load imbalance.
The following sections of this paper present an overview of multi-threaded approaches for scientific visualization and give general guidance on work-stealing approaches (Sections 2 and 3). We then present our parallelization method based on data access patterns in Section 4 and its efficiency on VTK filters in Section 5. We conclude the paper in Section 6.
Related Work
Message passing, as popularized by the MPI standard, is a classical parallel programming model well adapted to distributed memory machines. It however requires explicitly managing data distribution across processes. For stencil-like algorithms needing visibility on the state of neighbors, data on the border of the partitioned domains are duplicated to reduce the data exchange overhead. These ghost cells increase the complexity of the code and the memory usage, while on a shared memory system this data duplication could be avoided. Moreover, MPI implementations on shared memory machines tend to suffer from high overheads. Ahrens et al. [ABM+01] relied on MPI to propose a two-level parallelization of VTK applications. Given that the data to process can be partitioned (using ghost cells if necessary), their model supports data parallelism through a duplication of the processing pipeline. Additionally, for each pipeline, data can be partitioned in blocks that are streamed down the pipeline, enabling concurrent executions of the different stages. It however does not support dynamic load balancing, relying on the user to define the partitioning policy.
HyperFlow [VOS+10, VOC+12] adopts similar goals but in a more modern context. It proposes thread-level streaming and a pipeline duplication strategy, thus supporting shared memory architectures.
This thread-level parallelization makes it possible to avoid ghost cells, given that the necessary protection steps are taken to avoid race conditions. If the programmer provides a GPU implementation for some filters, the HyperFlow runtime is also able to defer computation to the GPUs available on the machine.
Similarly, an early approach for shared memory machines was proposed in [ALS+00]. The authors propose a thread-based interface for VTK, hiding some details of thread handling. But directly programming at the thread level is today recognized as a low-level approach that is error-prone [Lee06].
The OpenMP standard relies on code annotations that the programmer uses to identify possible sources of parallelism, typically indicating when the iterations of a loop are independent and can thus be executed concurrently. The compiler and runtime can then extract parallelism. This model avoids the drawbacks of thread-level programming. But OpenMP relies on static load balancing strategies, impairing performance when the workload varies dynamically due to the nature of the algorithm or a varying availability of cores. Version 3 of OpenMP was extended with the concept of tasks [DFA+09], but current implementations rely on a centralized list scheduling that incurs high overheads.
Work-stealing, through libraries like Cilk [FLR98], TBB [Rei07] or X-KAAPI [GBP07, LMDG11], is emerging as an alternative of choice for shared memory programming. Relying on a dynamic load balancing runtime, these libraries can cope with irregular applications and unsteady core availability. They also propose a programming model based on tasks rather than threads to delimit the potential parallelism. The runtime then takes care of distributing these tasks on the processing cores, relying on low-overhead mechanisms like a distributed task heap. To our knowledge, only a limited set of papers have relied on these runtimes to parallelize scientific visualization filters [TDR10], and so far no generic approach has emerged. In the following section we recall the basics of work-stealing and the related programming model. Work-stealing has also been experimented with for balancing the workload between the different cores of a GPU for some specific algorithms like octree construction for particle sorting [CT08]. It is also supported on the Intel Xeon Phi accelerator through Cilk and TBB. Task-oriented programming relying on work-stealing scheduling thus appears as a possible unified paradigm for ensuring portability across CPUs and GPUs.
The GPU is also a target of choice for accelerating visualization filters. The programming model, based on CUDA or OpenCL, is SIMD-oriented, often requiring significant programming efforts compared to a more classical sequential or multi-core programming approach. The OpenCL standard ensures functional portability to different devices, including multi-core CPUs, but the code needs to be reworked to reach good performance.
The Piston approach [StLA12] proposes a programming environment for scientific visualization filters relying on the Thrust library [thr12]. Thrust offers parallel versions of operators on STL-like vector and list data structures for execution on GPUs through CUDA, but also on multi-core CPUs with OpenMP and Intel TBB. As far as we know, only CUDA and OpenMP have been tested in the Piston context. Good performance is reached for regular parallelism, but we can expect to face the performance limitations of OpenMP on CPUs when dynamic load balancing is required. Testing with TBB would be very interesting to validate the benefits of work-stealing in the Thrust context.
the parallel work or a centralized task list used to dynamically adapt to core availability. Even if these approaches fit well problems that have either a regular workload or a non memory-intensive data access scheme, this is not the case for many mesh manipulation filters.
4.1 ForEach
Given some requirements, a loop with independent iterations is very easy to parallelize. Indeed, with n the size of the loop, one can see such a loop as n independent tasks. Thus giving each core n / numberOfCores elements to compute is the easiest way to parallelize this pattern. However such an approach can lead to load imbalance issues: if a few iterations take much more time than the others, some cores may finish their computation before the others and wait for them. Dynamic scheduling and work-stealing techniques can efficiently handle such problems.
But distributing work to each processor is not the only thing to care about when dealing with independent loops. One also has to make sure to allocate enough memory for the output data structures and to use thread-safe methods while computing each element. Moreover, handling write offsets may be a non-trivial task. These aspects are thus left to the user. Our plugin then guarantees to divide the iteration range of the loop in the best way with respect to the chosen runtime.
Our plugin provides the vtkSMP::ForEach operator. As with Intel TBB's tbb::parallel_for, one has to move the sequential body of the loop into a freshly built class (e.g. Figure 1). This has the advantage of separating the algorithm (the pattern) from the computation. It also exposes the pieces of code that need to be carefully managed (memory allocation, thread-safe methods, and so on). Writing the class is a matter of checking the sequential behaviour. Calling the operator then only needs the range of the computation and an instance of that class.
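As an illustration, a minimal sketch of such a functor is given below. Only the void operator()(vtkIdType id) const signature is imposed by vtkSMP::ForEach; the member layout and the arithmetic (the linear part of the transform, as applied to vectors) are illustrative assumptions, and in the plugin the class is a VTK object created with New() and Delete() as in Figure 1.

#include <vtkDataArray.h>

// Hedged sketch of the functor used in Figure 1. In the plugin it derives from
// a functor base class whose exact name is not given here; only
// operator()(vtkIdType) const is the signature required by vtkSMP::ForEach.
class VcsModificator
{
public:
  vtkDataArray* inVcs;   // input vectors
  vtkDataArray* outVcs;  // output array, pre-allocated by the caller
  double (*matrix)[4];   // the 4x4 transform elements (this->Matrix->Element)

  // Called concurrently by the runtime, once per tuple id in the range.
  void operator()( vtkIdType id ) const
  {
    double in[3], out[3];
    inVcs->GetTuple( id, in );
    for ( int i = 0; i < 3; ++i )  // vectors only receive the linear part of the transform
      out[i] = matrix[i][0] * in[0] + matrix[i][1] * in[1] + matrix[i][2] * in[2];
    outVcs->SetTuple( id, out );   // safe: each id writes a distinct, pre-allocated tuple
  }
};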
4.1.1 The Work-Stealing Approach
When the void operator() (vtkIdType id) const operation is very regular, we faced a scalability issue due to steals. Indeed, each steal adds a small overhead to the computation, loading the memory bus to retrieve suitable data. The number of steals can grow significantly as we get closer to the end of the computation. These steals, grabbing a small workload, tend to incur more overhead than computation speed-up. To avoid the performance drops induced by this behavior, the work-stealing scheduler defines a threshold representing the minimal range for a computation, i.e. the grain. The problem is thus to set the grain to the optimal value. If the grain is too coarse, load balancing becomes ineffective.
void vtkSMPTransform::TransformVectors( vtkDataArray* inNms,
                                        vtkDataArray* outNms )
{
  vtkIdType n = inNms->GetNumberOfTuples();
  this->Update();

  VcsModificator* myvectorsmodificator = VcsModificator::New();
  myvectorsmodificator->inVcs  = inNms;
  myvectorsmodificator->outVcs = outNms;
  myvectorsmodificator->matrix = this->Matrix->Element;

  vtkSMP::ForEach( 0, n, myvectorsmodificator );

  myvectorsmodificator->Delete();
}

Figure 1: Parallel version of vtkSMPTransform::TransformVectors using the vtkSMP::ForEach operator.
Previous studies [PP95] showed that the maximum amount of parallelism is reached when the grain is Θ(√n). Indeed, the number of tasks created is at most n/grain, and these tasks are limited to a range containing at least grain iterations. Thus the critical path, i.e. the sequential part of the computation, is Θ(grain + n/grain). The optimum of this function is reached when the grain is √n. Experiments confirm that a √n grain leads to the best performance.
To track and remove duplicate points of partial meshes, we used VTK's built-in locators. We extended the behaviour of vtkMergePoints to support parallel operations. Basically, this locator keeps track of existing points thanks to a spatially-guided hash table. The bounding box of the resulting mesh is regularly divided into a 3D grid and each voxel of this grid maps to an entry in the table. Each entry stores the indices of the points that belong to this voxel. Knowing the index of the entry associated with a point is simple mathematics involving the point and bounding box coordinates.
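For illustration, this mapping from a point to its bucket can be sketched as follows; the helper name, the per-axis division counts and the flat indexing order are assumptions, not vtkMergePoints' exact code.

// Hedged sketch: map a point to its voxel (bucket) index in the regular grid
// covering the bounding box. divs[] holds the number of divisions per axis.
inline long BucketIndex( const double p[3], const double bounds[6], const int divs[3] )
{
  long ijk[3];
  for ( int a = 0; a < 3; ++a )
  {
    double t = ( p[a] - bounds[2*a] ) / ( bounds[2*a + 1] - bounds[2*a] ); // normalize to [0,1]
    long i = static_cast<long>( t * divs[a] );
    if ( i < 0 ) i = 0;                  // clamp points lying exactly on the bounds
    if ( i >= divs[a] ) i = divs[a] - 1;
    ijk[a] = i;
  }
  // Flatten (i,j,k) into a single hash-table entry index.
  return ijk[0] + divs[0] * ( ijk[1] + static_cast<long>( divs[1] ) * ijk[2] );
}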
Figure 2: Inner representation of vtkMergePoints (left, an array of vtkIdList*) and our first implementation of vtkSMPMergePoints (right, the same array plus an array of vtkMutexLock*). Each vtkIdList is referred to as a bucket.
The simplest use of our locator for a parallel merge of all partial meshes is, for each thread, to insert its points one by one into the locator. This solution obviously produces wrong results if no care is taken. Our solution is to use a hash table of mutexes in addition to the one of point ids, as shown in Figure 2. Each point insertion is thus preceded by a lock operation. Since conflicts only occur for points that lie on the boundary of a mesh, overheads stay low. However, the memory footprint is twice that of the sequential object.
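A minimal sketch of this lock-protected insertion follows, assuming the per-bucket mutex array of Figure 2 (right); the wrapper function and the way the bucket index is obtained (e.g. as in the previous sketch) are illustrative, while InsertUniquePoint is the standard vtkPointLocator interface that vtkMergePoints provides.

#include <vtkMergePoints.h>
#include <vtkMutexLock.h>

// Hedged sketch: one vtkMutexLock per bucket guards concurrent insertions.
// 'bucket' is the voxel index of x; the locator is assumed to be safe for
// concurrent insertions into distinct buckets.
vtkIdType InsertPointLocked( vtkMergePoints* locator,
                             vtkMutexLock** bucketLocks,
                             long bucket, const double x[3] )
{
  vtkIdType ptId;
  bucketLocks[bucket]->Lock();           // serialize access to this voxel only
  locator->InsertUniquePoint( x, ptId ); // returns the existing id if x was already inserted
  bucketLocks[bucket]->Unlock();
  return ptId;
}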
Using only one locator in the case of filters that already need one per thread is not very productive. We thus took inspiration from flat combining [HIST10] to benefit from these arrays of buckets. By iterating over each voxel, one thread can merge all equivalent buckets in a row. The locks that were protecting voxels from concurrent accesses are thus no longer needed. The only remaining requirement is an atomic increase of the number of points in the output vtkPoints. But this approach imposes a constraint on our framework: a thread must be able to retrieve data from any other. Moreover, our case study shows that more than 90% of the buckets are empty. Taking care of these two particularities leads to initiating the merge of the buckets corresponding to a voxel only from a thread that actually has data in such a bucket. To do so, each thread looks for the first non-empty bucket that it owns. If no other thread has already merged the buckets associated with the same voxel, it retrieves the data associated with this voxel from the other threads and performs the merge in the output data structures. Otherwise it discards this bucket and looks for the next one that contains data.
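A compact sketch of this traversal is given below; GetBucket, MergeBucketInto and the claimed flags are illustrative placeholders for the plugin's internals, only the overall scheme (skip voxels where the thread owns no data, let the first owner merge every thread's bucket, increase the output point count atomically) comes from the description above.

#include <atomic>

class vtkIdList;
class vtkSMPMergePoints;
vtkIdList* GetBucket( vtkSMPMergePoints*, long );       // hypothetical helper
void MergeBucketInto( vtkSMPMergePoints*, vtkIdList* ); // hypothetical helper

// Hedged sketch of the flat-combining-inspired merge executed by each thread.
void MergeOwnedBuckets( int myThread, int numThreads, long numVoxels,
                        vtkSMPMergePoints* const* locators,  // one locator per thread
                        std::atomic<bool>* claimed,          // one flag per voxel
                        vtkSMPMergePoints* output )
{
  for ( long v = 0; v < numVoxels; ++v )
  {
    if ( GetBucket( locators[myThread], v ) == 0 )
      continue;                            // we own no data for this voxel (>90% of buckets are empty)
    bool expected = false;
    if ( !claimed[v].compare_exchange_strong( expected, true ) )
      continue;                            // another thread already merged this voxel
    for ( int t = 0; t < numThreads; ++t ) // merge every thread's bucket for this voxel;
      MergeBucketInto( output, GetBucket( locators[t], v ) ); // the output point count is increased atomically
  }
}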
Our framework provides both merge capabilities. Filters that use a locator to produce their output points usually end up using a thread-local locator when parallelized. In that case, the latter technique can use these locators to perform the point insertion bucket by bucket. Otherwise the function builds a new locator and uses it as described for the first technique.
5.2 vtkTransformFilter
An example of a VTK algorithm that performs a loop with independent iterations over cells and/or points is the vtkTransformFilter. It applies scales, translations and rotations to a mesh. The computation is performed by a vtkTransform that contains the description of the transformations to be applied. The vtkTransform is supplied to the vtkTransformFilter, and it is this vtkTransform that iterates over the input data to produce the output.
2 http://graphics.stanford.edu/data/3Dscanrep/
We modified vtkTransformFilter to perform the needed memory management before every computation, and we built the vtkSMPTransform class that mimics the behaviour of vtkTransform, except that the sequential loop is turned into a parallel ForEach.
Figure 4a shows the performance obtained on our AMD48 machine. Moreover, as we target both shared memory and NUMA machines, we have also studied the influence of data placement. Since the memory pages of the output data are touched only when each point or cell transform is computed, we put two vtkTransformFilters one after the other in our visualization benchmark. As a consequence, the first filter loads data from a single memory bank, leading to memory contention, while the second one has the memory pages of its input data interleaved across the memory banks thanks to the first filter. The second filter thus makes better use of the memory bandwidth. Results are shown in Figure 4b.
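A sketch of this benchmark pipeline is given below; the source and the transform parameters are placeholders, and vtkSMPTransform is assumed to expose the same transformation API as vtkTransform.

#include <vtkAlgorithm.h>
#include <vtkTransformFilter.h>
#include "vtkSMPTransform.h"   // plugin header (name assumed)

// Hedged sketch of the two-filter benchmark: the first filter's parallel loop
// performs the first touch of its output pages, so the second filter reads
// input pages already spread across the memory banks.
void RunTransformBenchmark( vtkAlgorithm* reader )  // 'reader' stands for any mesh source
{
  vtkSMPTransform* transform = vtkSMPTransform::New();
  transform->Scale( 2.0, 2.0, 2.0 );                 // placeholder transformation

  vtkTransformFilter* first  = vtkTransformFilter::New();
  vtkTransformFilter* second = vtkTransformFilter::New();
  first->SetTransform( transform );
  first->SetInputConnection( reader->GetOutputPort() );
  second->SetTransform( transform );
  second->SetInputConnection( first->GetOutputPort() );
  second->Update();                                   // executes both parallelized filters

  second->Delete(); first->Delete(); transform->Delete();
}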
Figure 4: Speed-ups for the execution of the parallelized loops within vtkTransformFilter: comparison between several runtimes (Kaapi, OpenMP, TBB; speed-up vs. number of cores, with the linear speed-up as reference) of the execution of the ForEach loop against the sequential for loop.
As expected, mapping the data close to their computing core enhances the speed-up and reduces memory contention. Results also show that work-stealing runtimes are at least as efficient as OpenMP for highly regular loops, which are known to be the stomping ground of the latter. Moreover, a side effect of work-stealing runtimes is that they tend to smooth out the memory contention issue.
5.3 vtkContourFilter
A widely-used filter that cannot compute the size of its output structure before the actual computation is the vtkContourFilter. Its purpose is to compute one or several isosurfaces on any kind of dataset. In its current implementation, vtkContourFilter is mainly a switch between several specific implementations depending on the type of the input dataset. The implementation parallelized in this study is the generic one, applied when no specific algorithm handles the input dataset.
The algorithm works as follows:
- for each cell in the mesh, compare the scalar values associated with the points of this cell with the isovalue being computed;
- if the isovalue is either above the maximum or below the minimum of these values, do not compute anything;
- otherwise create the part of the output mesh that corresponds to the values pattern.
Since the created portion of the output mesh depends both on the scalar distribution above and below the isovalue and on the topology of the cell, it is not possible to know how many points and cells will be created per input cell.
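A sketch of the per-cell rejection test is given below; the function name, the use of the first scalar component and the ContourCell call that would emit the output geometry are illustrative placeholders, not vtkContourFilter's exact code.

#include <vtkDataArray.h>
#include <vtkDataSet.h>
#include <vtkIdList.h>

void ContourCell( vtkDataSet*, vtkIdType, double );  // placeholder for the geometry generation

// Hedged sketch: min/max rejection test performed for each cell before any
// geometry is generated.
void ProcessCell( vtkDataSet* input, vtkDataArray* scalars,
                  vtkIdType cellId, double isovalue )
{
  vtkIdList* ptIds = vtkIdList::New();
  input->GetCellPoints( cellId, ptIds );

  double lo = VTK_DOUBLE_MAX, hi = -VTK_DOUBLE_MAX;
  for ( vtkIdType i = 0; i < ptIds->GetNumberOfIds(); ++i )
  {
    double s = scalars->GetComponent( ptIds->GetId( i ), 0 );
    if ( s < lo ) lo = s;
    if ( s > hi ) hi = s;
  }

  if ( isovalue >= lo && isovalue <= hi )   // the cell crosses the isovalue
    ContourCell( input, cellId, isovalue ); // emits this cell's piece of the isosurface

  ptIds->Delete();
}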
Figure 5: Speed-ups for the execution of the parallelized loop within vtkContourFilter (Kaapi, OpenMP, TBB; speed-up vs. number of cores, with the linear speed-up as reference). The difference between the runtimes is due to the high load of the first core.
The parallelized version of vtkContourFilter thus makes use of TLS and fuses the partial pieces of mesh. Figure 5 shows the execution of the surface creation (with merge) on our AMD48 platform. Speed-ups are relative to the for loop of the sequential algorithm. Since isocontouring is an algorithm that can often face load balancing issues, in this experiment we provided a very load-imbalanced input dataset. To do so, we computed the scalar values in such a way that only a few of the first cells can possibly contain points whose scalar values are both above and below the isovalues. Using a static partitioning of the iteration range thus leads to the first core being the only one to create the output mesh. As expected, work-stealing runtimes balance the load efficiently and outperform the static partitioning of OpenMP. Nevertheless, dynamic scheduling parameterized with the right partition size (grain) gives similar results.
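For the OpenMP backend, "dynamic scheduling parameterized with the right grain" refers to something along the following lines; the wrapper is a sketch, not the plugin's actual implementation, with the grain chosen around √n as discussed in Section 4.1.1.

#include <vtkType.h>

// Hedged sketch: one possible OpenMP counterpart of ForEach(first, last, op),
// with an explicit chunk size (grain) so dynamic scheduling has the right granularity.
template <class Functor>
void ForEachOpenMP( vtkIdType first, vtkIdType last, const Functor& op, int grain )
{
  #pragma omp parallel for schedule(dynamic, grain)
  for ( vtkIdType id = first; id < last; ++id )
    op( id );   // same functor body as with vtkSMP::ForEach
}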
5.3.1 Focus on the Merge Operator
The vtkContourFilter is a good place to experiment with the behaviours of our Merge operator since the isosurface computation makes use of one locator per thread.
Figure 6 presents the execution time of both merge implementations. The method using all the locators is faster than the one using only one. It also requires half as much memory since the array of vtkMutexLock is not required in this case. The only drawback is that not all the filters that would make use of the Merge operator in their parallel version can be parallelized using a thread-local locator.
Figure 6: Execution time (in seconds) of both merge implementations as a function of the number of cores.
Conclusion
We presented the basis of a framework for VTK that aims at creating parallel filters. It targets multi-core platforms and takes advantage of the reusability of sequential code.
Experiments showed guaranteed performance without deep optimization. The standard behaviour of our operators is efficient enough for someone with limited knowledge of parallelism to exploit them well. Moreover, specific behaviours of work-stealing runtimes such as X-KAAPI or TBB can be exploited to reach better performance. These runtimes are also highly optimized, so that the overhead of dynamic load balancing is small enough to be competitive with OpenMP's static partitioning for regular applications.
Our goal is to provide a full-fledged parallel programming environment dedicated to visualization filters. Next steps will focus on other classical algorithms like vtkStreamTracer and its subclasses and vtkExtractEdge.
We did not address the issue of parallelizing code on GPUs, load balancing still being delicate to perform efficiently on heterogeneous architectures. But this may change with new multi-core architectures for accelerators like the Intel MIC [Ska10] that integrate all the features necessary for an efficient work-stealing implementation.
Lastly, parallel composition may lead to better resource utilization. The idea is to exploit the asynchronous filter execution presented in HyperFlow [VOC+12] and merge it with our multi-core framework. We hope to maximize both CPU and memory bandwidth utilization without falling into the pitfall of over-utilizing the memory bus.
References
[ABM+01] James Ahrens, Kristi Brislawn, Ken Martin, Berk Geveci, C. Charles Law, and Michael Papka. Large-scale data visualization using parallel data streaming. IEEE Computer Graphics and Applications, pages 34-41, 2001.
[ALS+00] James Ahrens, Charles Law, Will Schroeder, Ken Martin, Kitware Inc, and Michael Papka. A parallel approach for efficiently visualizing extremely large, time-varying datasets. Technical report, 2000.
[BCOM+10] François Broquedis, Jérôme Clet-Ortega, Stéphanie Moreaud, Nathalie Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, and Raymond Namyst. hwloc: A generic framework for managing hardware affinities in HPC applications. In PDP 2010 - The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, Pisa, Italy, February 2010.
[BJK+96] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55-69, 1996.
[CT08] Daniel Cederman and Philippas Tsigas. On dynamic load balancing on graphics processors. In Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, GH '08, pages 57-64, 2008.
[DFA+09] Alejandro Duran, Roger Ferrer, Eduard Ayguadé, Rosa M. Badia, and Jesus Labarta. A proposal to extend the OpenMP tasking model with dependent tasks. Int. J. Parallel Program., 37:292-305, June 2009.
[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. SIGPLAN Not., 33:212-223, 1998.
[GBP07] Thierry Gautier, Xavier Besseron, and Laurent Pigeon. KAAPI: A thread scheduling runtime system for data flow computations on cluster of multi-processors. In Proceedings of PASCO '07, New York, NY, USA, 2007. ACM.
[HIST10] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '10, pages 355-364, New York, NY, USA, 2010. ACM.
[Lee06] Edward A. Lee. The problem with threads. Computer, 39(5):33-42, 2006.
[LMDG11] Fabien Le Mentec, Vincent Danjean, and Thierry Gautier. X-Kaapi C programming interface. Technical Report RT-0417, INRIA, 2011.
[LMGD11] Fabien Le Mentec, Thierry Gautier, and Vincent Danjean. The X-Kaapi's Application Programming Interface. Part I: Data Flow Programming. Technical Report RT-0418, INRIA, 2011.
[PP95] Victor Y. Pan and Franco P. Preparata. Work-preserving speed-up of parallel matrix computations. SIAM J. Comput., 1995.
[Rei07] J. Reinders. Intel Threading Building Blocks. O'Reilly & Associates, Inc., Sebastopol, CA, USA, first edition, 2007.
[RVK08]
[Ska10] Kirk Skaugen. Petascale to exascale: Extending Intel's HPC commitment. Technical report, ISC keynote, 2010.
[StLA12] Christopher Sewell, Li-ta Lo, and James Ahrens. Piston: A portable cross-platform framework for data-parallel visualization operators. In Eurographics Symposium on Parallel Graphics and Visualization, 2012.
[TDR10] Marc Tchiboukdjian, Vincent Danjean, and Bruno Raffin. Cache-efficient parallel isosurface extraction for shared cache multicores. In Eurographics Symposium on Parallel Graphics and Visualization, 2010.
[thr12]
[VOC+12] Huy Vo, Daniel Osmari, Joao Comba, Peter Lindstrom, and Claudio Silva. HyperFlow: A heterogeneous dataflow architecture. In Eurographics Symposium on Parallel Graphics and Visualization, 2012.
[VOS+10] Huy Vo, Daniel Osmari, Brian Summa, João Comba, Valerio Pascucci, and Cláudio Silva. Streaming-enabled parallel dataflow architecture for multicore systems. In Eurographics/IEEE-VGTC Symposium on Visualization, pages 1073-1082, June 2010.