An OpenCL Implementation for the Solution of the TDSE on GPU and CPU Architectures
Abstract
Open Computing Language (OpenCL) is a parallel processing language that is ideally suited for running parallel algorithms on
Graphical Processing Units (GPUs). In the present work we report on the development of a generic parallel single-GPU code for
the numerical solution of a system of first-order ordinary differential equations (ODEs) based on the OpenCL model. We have
applied the code in the case of the time-dependent Schrödinger equation of atomic hydrogen in a strong laser field and studied its
performance on NVIDIA and AMD GPUs against the serial performance on a CPU. We found excellent scalability and a significant
speed-up of the GPU over the CPU device, tending towards a factor of about 40, with significant speed-ups also expected against
multi-core CPUs.
Keywords: General Purpose Graphical Processing Unit (GPGPU) Programming, Taylor Series, Runge-Kutta methods,
Time-Dependent Schrödinger Equation, Quantum Dynamics, Ordinary differential equations
performance library routines. As a result of the possibilities with the GPGPU platform, it represents a hot topic within computational physics. Two main computational platforms for GPU computing are currently in the mainstream interest, namely the CUDA and OpenCL frameworks. At the moment CUDA is in heavy use in a number of different scientific areas, but interest in OpenCL is increasing, with tools also available to convert CUDA code into OpenCL code, such as the program Swan [24]. OpenCL is a language that was designed to suit the parallelism of GPUs. It is, in essence, very similar to CUDA, but in terms of features within the framework there are some significant differences, arising from CUDA being limited to a particular set of hardware from a particular manufacturer whilst OpenCL is not.

The purpose of this work is two-fold. The first is to pursue the development of a theoretical method to tackle demanding computational problems in the area of complex quantum systems under intense and ultrashort radiation fields. The second is to present a computational model which is only, we believe, in its starting phase, namely the development of a parallel computational model which does not discriminate between GPU and CPU architectures. In this sense, the present computational model is designed and is able to run on both CPU and GPU-based systems. To this end, the computational framework that we chose is based on the OpenCL language. The usage of GPUs is already common within fields such as quantum chemistry [25], and usage is flourishing in fields as varied as fluid dynamics and magnetohydrodynamics (see the introduction from [26] and citations within) and statistical physics [27], with some usage appearing in fields such as interacting many-particle systems [28].

Since CUDA is a more mature platform there exist routines which can optimize existing codes, such as the use of an existing accelerated FFT routine like the FFTW3 library used in Ref. [29]. It is worth noting that most implementations using GPUs are on the CUDA platform. Thus, for this additional reason, the present OpenCL implementation represents an important contribution in this newly emerged field.

The paper is organized as follows. In Sec. 2 a basic description of the OpenCL and GPU concepts and terminology is given. In Sec. 3 we formulate the physical problem under question in mathematical terms. In Sec. 4 we present the OpenCL implementation specific to the solution of a system of ODEs, while in Sec. 5 we apply our approach to the case of the hydrogen atom and present the benchmarking results. Finally, we have relegated some of the details about the hardware tested and the specific numerical algorithms employed in this study to the appendices.

[Figure 1 appears here: a schematic of an AMD GPU, showing 18 compute units of paired processing elements with a transcendental unit (T), and the memory regions Constant (64 KB), Global (1 GB), Global Cache (8 KB), Local (32 KB) and Private (256 KB).]

Figure 1: An AMD GPU model based on the data from Appendix A. The ALUs, each pair of which is grouped into a single double precision processing element, are in red; the transcendental unit, in orange and marked with a T, is not used. Each grey box represents one compute unit, of which there are 18. The pool of registers which forms private memory is shared amongst processing elements in a compute unit. The global cache caches global memory for use within the compute unit; it is not accessed explicitly. Also shown within the compute unit is the local memory, which is accessed explicitly and is the medium for communication within a compute unit. The global and constant memory shown are accessed by processing elements within compute units.

2. GPU programming and the OpenCL framework

GPUs are a type of compute device in OpenCL terminology. GPU architectures have blocks of processing elements. Processing elements are similar to cores except for a few key differences, since they are, or effectively are, arithmetic logic units (ALUs). A processing element will have access to a certain amount of memory which it exclusively accesses. This is known as private memory. Groups of processing elements execute in a SIMD fashion and may share a common memory which can be treated as local memory, as described later. These groups are known as compute units. GPU branch granularity is coarse grained because of the SIMD design. GPUs typically have a slower clock speed (700-900 MHz) in comparison to CPUs. Figure 1 demonstrates a typical configuration for an AMD GPU. The specific architectures provided by different vendors may vary, but the abstraction provided by OpenCL will hold.

Global memory is available for access to all compute units. Constant memory is a part of global memory which is not changed
by processing elements. A cache for global memory may also be available. Global memory is typically not integrated onto the same chip as the GPU. For a CPU, RAM typically uses DDR2 and the newer DDR3, whilst a GPU typically accesses GDDR5 global memory. For AMD hardware, GDDR5 memory has twice the bandwidth of DDR3 memory [30].

Communication between the GPU and CPU typically occurs over a PCI Express x16 connection. For the V7800 this gives a theoretical maximum bandwidth of 8 GB/s, while the peak realizable bandwidth is 6 GB/s.

2.1. OpenCL

OpenCL is a royalty-free open standard. Initially developed by Apple Inc., the standard is being actively developed and worked upon by the Khronos group, a large multi-vendor consortium which includes companies such as NVIDIA, AMD, IBM, ARM, Intel, Texas Instruments, Sony and others. The current implementations of OpenCL from Intel, AMD and NVIDIA are based on the 1.0 and 1.1 standards. The 1.2 standard was released on the 15th of November 2011.

OpenCL distinguishes between two types of code in any OpenCL program: the host code and the OpenCL C code. All code that is written in standard programming languages such as C or Fortran can be regarded as host code. A regular program with no connection to OpenCL can be viewed as containing entirely host code, for example. The host code interacts with OpenCL purely through function calls to the OpenCL library. This means that any compiler can be used to compile the host code, as long as it can link against the OpenCL library.

The OpenCL C code is written in an OpenCL variant of the latest ANSI/ISO standard of C, known as C99. The major differences between OpenCL C and C99 are the restrictions placed on the language. A key restriction is the lack of recursion, due to GPU hardware issues, and also that two or more dimensional arrays must be treated as one dimensional arrays when being used as arguments for kernel functions. Although complex numbers are supported by the C99 standard, they are not implemented in OpenCL C; instead, the user can create a complex structure containing two double precision elements, and it is then a relatively simple matter to define the relevant series of complex multiplication functions. This, however, is an undesirable additional step; it would be preferable if optimized implementations were used implicitly, as in C99. Other restrictions are listed in the OpenCL specification [31]. The OpenCL C code is the code that will actually be performing a particular computation on a particular target such as a GPU or a CPU.
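As an illustration of the workaround just described, a minimal double precision complex type and its multiplication could look as follows in OpenCL C; the names cdouble, cmul and cadd are our own and are not identifiers from the paper's code.

/* Minimal sketch of a double precision complex type for OpenCL C,
   since complex arithmetic is not provided by the language itself. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

typedef struct { double re; double im; } cdouble;

/* (a + ib)(c + id) = (ac - bd) + i(ad + bc) */
cdouble cmul(cdouble x, cdouble y)
{
    cdouble z;
    z.re = x.re * y.re - x.im * y.im;
    z.im = x.re * y.im + x.im * y.re;
    return z;
}

cdouble cadd(cdouble x, cdouble y)
{
    cdouble z = { x.re + y.re, x.im + y.im };
    return z;
}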
Whilst CUDA is portable amongst most operating systems, OpenCL is portable in the greater sense of not being limited to specific hardware as well as operating systems. Support is not dependent on a single vendor. Possible compute devices in OpenCL are not just limited to GPUs and CPUs; they can also include FPGAs, DSPs, the IBM Cell architecture and many more.

OpenCL C code will execute on any architecture but, in practice, it will require a slight code modification or possibly a partial rewrite to achieve good performance from one architecture to the next.

OpenCL C code is compiled during the runtime of the host code. The OpenCL C code can be specifically targeted to a particular instance of a problem; some aspects are known only at runtime of the host code. This information can be used at compile time for the OpenCL C code and thus the code can be optimized for that particular instance. In this way, the compiler can take advantage of what is known at runtime of the host code.

Memory objects such as one dimensional arrays can be created for use by the device code by function calls to the OpenCL library. A handle is returned to the object by the library, which can then be used to refer to the object in future function calls.

OpenCL as a framework provides for the execution of functions known as kernels, which are written in OpenCL C. A kernel is not directly called by the host code; instead, a call to a specific kernel with specific memory objects as arguments is placed in a queue on the host device when the clEnqueueNDRangeKernel() function is called. The particular implementation of the OpenCL standard will take care of all further details. For example, the implementation will decide when to pass a particular batch of function calls queued from the CPU to the hardware scheduler that is present on a GPU. The queue is said to be asynchronous.
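The host-side sequence described in the last three paragraphs can be sketched as below. The kernel name "deriv", the macro N_STATES and the buffer size are illustrative assumptions on our part; the API calls themselves (clCreateProgramWithSource, clBuildProgram, clCreateBuffer, clCreateKernel, clSetKernelArg, clEnqueueNDRangeKernel) belong to the standard OpenCL 1.1 C API. Error handling is omitted.

/* Host-side sketch: build the OpenCL C source at runtime with a
   problem-specific constant, create a memory object, and queue a kernel. */
#include <CL/cl.h>

void enqueue_example(cl_context ctx, cl_device_id dev,
                     cl_command_queue queue, const char *src,
                     size_t n_states)
{
    cl_int err;

    /* Problem-specific constants can be baked in at build time. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, "-D N_STATES=640", NULL, NULL);

    /* A handle to a global-memory buffer holding the coefficient vector. */
    cl_mem coeffs = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                   2 * n_states * sizeof(double),
                                   NULL, &err);

    cl_kernel kern = clCreateKernel(prog, "deriv", &err);
    clSetKernelArg(kern, 0, sizeof(cl_mem), &coeffs);

    /* The call is placed on the (asynchronous) queue; the runtime decides
       when the batch is handed to the GPU's hardware scheduler.
       n_states is assumed here to be a multiple of the work group size. */
    size_t global = n_states, local = 64;
    clEnqueueNDRangeKernel(queue, kern, 1, NULL, &global, &local,
                           0, NULL, NULL);
    clFinish(queue);
}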
Since the objective is parallelism, the aim is for multiple instances of the same kernel to be simultaneously executed with independent data so as to spread the workload. The hardware is set to assign instances of this execution, known as work items, particular identification numbers. Three sets of identification numbers are given: the local, group and global IDs. OpenCL combines work items into work groups. The minimum size of a work group for an AMD GPU is 64 work items. This minimum size is known as a wavefront in AMD terminology. For NVIDIA the minimum size is called a warp. AMD GPUs currently execute a half wavefront at a time on a compute unit for double precision instructions. The local ID of a work item represents its place within a work group. The purpose of the group ID is to represent a particular work group's position in relation to the other work groups. The global ID represents the position of a work item in relation to all other work items. For a specific work group size NGroup the global ID is equivalent to:

IDGlobal = IDGroup · NGroup + IDLocal

In this way a work item knows its place in the order of things in a manner similar to the concept of topologies in MPI. That is, there exists what can be thought of as a local topology between work items in a work group and a global topology between work groups in the domain of the problem. The local and global topologies can be one, two or three dimensional in their layout.
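Inside a kernel these IDs are obtained from built-in functions; the short hypothetical kernel below simply makes the relationship quoted above explicit for a one dimensional layout.

/* OpenCL C sketch: the built-in work-item functions and the identity
   get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0).
   The output buffer is assumed to have one entry per work item. */
__kernel void show_ids(__global int *out)
{
    size_t local_id  = get_local_id(0);   /* place within the work group  */
    size_t group_id  = get_group_id(0);   /* which work group             */
    size_t group_sz  = get_local_size(0); /* NGroup, the work group size  */
    size_t global_id = group_id * group_sz + local_id;

    out[get_global_id(0)] = (int)global_id;  /* equal to get_global_id(0) */
}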
A work item can only communicate with other work items in the same work group. Unlike in MPI, the creation of virtual topologies is not built in by the framework, though the equivalent can be implemented by an interested OpenCL C programmer. A work item communicates with other work items in its work group through the use of local memory; low latency memory that may be dedicated to a particular compute unit, or global
memory which is remapped to the work group. This communication approach through memory is similar to that of OpenMP. There is a limitation on communication between work items in different work groups; they cannot communicate with each other during the execution of a kernel. This limitation is due to the execution of the kernel by the many work items: the coherence of global memory cannot be known, since the order of execution of the work items is determined by the hardware scheduler and not the programmer.

When one queues kernel calls via an in-order execution queue it can be guaranteed that at the start of a kernel call all work items executing a particular kernel see a consistent view of memory, and so it can be said that the work items have been synchronized.

3. TDSE projected on the stationary system's eigenstate basis

Let us consider a multielectron (N-electron) atomic or molecular system described through the hamiltonian operator ĤN(r̃N), with the system's electron positions r̃N = (r1, r2, .., rN). We assume that the eigenstates of the hamiltonian, Φγ(r̃N), have been calculated as the solutions of the Schrödinger equation:

[ĤN(r̃N) − Eγ] Φγ(r̃N) = 0,  (1)

where with the index γ we represent all the quantum mechanical observables required to completely characterize the states. Let us now consider the TDSE of the above system subject to an external time-dependent field represented by the potential operator V̂(r̃N, t). The TDSE of the system in the presence of this external field is written as:

i ∂/∂t ψ(r̃N, t) = [ĤN(r̃N) + V̂(r̃N, t)] ψ(r̃N, t),  (2)

supplemented with the initial condition ψ(r̃N, t0) = ψ0. Thus, the main goal is to calculate the time-dependent wave function of the system, given the hamiltonian ĤN of the unperturbed system and the external potential V̂(t). To this end, since any physical state of a quantum mechanical system can be expanded in terms of its eigenstates, the N-electron wavefunction is written as:

ψ(r̃N, t) = Σγ Cγ(t) Φγ(r̃N),  (3)

where γ, in principle, includes both the bound and continuum states of the spectrum. At this stage, towards developing a method of calculating the TD wavefunction, we consider the standard approach which assumes that the system is enclosed in a box. Having assumed this, the bound and the continuum solutions of the system can now both be treated with a common indexing representing a now discretized spectrum. Formally, projection of the known discretized channel states Φγ onto the TDSE and subsequent integration over all spatial variables r̃N will lead to the following set of coupled partial differential equations for the radial motion in channels γ:

i d/dt Cγ(t) = Eγ Cγ(t) + Σγ′ Vγ,γ′(t) Cγ′(t),  (4)
Eγ = ⟨Φγ|ĤN|Φγ⟩,  (5)
Vγ,γ′(t) = ⟨Φγ|V̂(t)|Φγ′⟩.  (6)

By properly ordering the coefficients Cγ(t) into a column vector C(t), and the diagonal (Eγ) and non-diagonal (Vγ,γ′(t)) elements into a square matrix Ĥ(t), we may rewrite the TDSE governing the system-field dynamics, in matrix representation, as

iĊ(t) = Ĥ(t)C(t),  (7)

supplemented with the initial condition C(0) = C0. The latter set of equations for the coefficients, which represents a system of coupled ordinary differential equations (ODEs), is subject to numerical solution. The ODE form of the TDSE, no matter which system we have, allows us to utilize our solution algorithm at a very general level. Within this eigenstate representation of the system's TD wavefunction only two kinds of dynamical quantities are required for the solution: the eigenenergies and the transition matrix elements between the system's eigenstates. The key point is that all the information about the exact nature of the system, whether multi-electron or not, whether atomic or molecular, is contained in the values of the eigenenergies and the matrix elements, together with the required selection rules for the transitions. It is for this reason, namely that any TDSE, atomic or molecular, can be formally re-expressed as a system of ODEs, that the present computational algorithm is especially important, since it is designed to accept as input exactly the matrix elements of Ĥ which uniquely define the coupling between the system and the field.

3.1. Atomic/molecular system in linearly polarized radiation within the dipole approximation

Since we are interested in atomic or molecular targets under EM fields, one very important simplification can be achieved if we assume a linearly polarized radiation field along some axis, which without any loss of generality we may assume to be the z-axis in a Cartesian coordinate system. This assumption is used to determine the structure of the matrix involved in the TDSE system of equations (7). Now, let us make the channel index γ more specific, where γ represents a solution of the hamiltonian operator: the total angular momentum quantum number, given by L̂, with its component along the z-axis L̂z, and the total spin, given by Ŝ, with its component along the z-axis Ŝz. Thus, we write γ as γ = (E L S ML MS). This is the so-called LS representation of the atomic states, which is well suited to light atoms. Let us assume that the system starts from the state Φ0 = |E0, L0, S0, ML0, MS0⟩. It is well established, through the Wigner-Eckart theorem [32], that in the dipole approximation for the coupling of the radiation field the non-vanishing elements are between those states with the same total spin and the same magnetic angular and spin quantum numbers. Thus we
[Figure 2 appears here: the block-tridiagonal structure of the matrix, with diagonal blocks labelled 0, 1, 2, 3, 4 by angular momentum and off-diagonal dipole coupling blocks D01, D10, D12, D21, D23, ... linking adjacent angular momenta.]

where C(t) is the vector containing the N unknown complex coefficients and M(t) = −iH(t), where H(t) is an N × N symmetric matrix. At this point, it is worth noting that although the specific discussion is for the case of TDSE calculations, the framework can easily be applied to ordinary differential equations of the kind shown in Equation (9). Essentially the derivative is calculated through efficiently implementing the following matrix-vector multiplication:

−i[Ed + D(t)] C
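A minimal sketch of such a derivative kernel is given below. The storage layout assumed here (separate arrays for the diagonal energies and the dipole blocks, one work item per coefficient, and a real field factor a multiplying the dipole blocks) is our own illustration; the packed one-dimensional layout actually used is described in Sec. 5.

/* OpenCL C sketch of dC = -i [Ed + D(t)] C for a block-tridiagonal matrix
   over angular momentum blocks.  NL blocks of NS states each are assumed. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void tdse_deriv(__global const double *E,    /* NL*NS eigenenergies        */
                         __global const double *D,    /* (NL-1)*NS*NS dipole blocks */
                         __global const double2 *C,   /* coefficients (re, im)      */
                         __global double2 *dC,        /* output derivative          */
                         const double a,              /* field factor at time t     */
                         const int NL, const int NS)
{
    int gid = get_global_id(0);        /* one work item per coefficient */
    int L   = gid / NS;                /* angular momentum block        */
    int n   = gid % NS;                /* index within the block        */

    /* accumulate (H C)_gid: real part in hr, imaginary part in hi */
    double hr = E[gid] * C[gid].x;
    double hi = E[gid] * C[gid].y;

    if (L > 0) {                       /* coupling to block L-1 */
        __global const double *blk = D + (L - 1) * NS * NS;
        for (int m = 0; m < NS; ++m) {
            double d = a * blk[n * NS + m];
            hr += d * C[(L - 1) * NS + m].x;
            hi += d * C[(L - 1) * NS + m].y;
        }
    }
    if (L < NL - 1) {                  /* coupling to block L+1 (transpose of D_L) */
        __global const double *blk = D + L * NS * NS;
        for (int m = 0; m < NS; ++m) {
            double d = a * blk[m * NS + n];
            hr += d * C[(L + 1) * NS + m].x;
            hi += d * C[(L + 1) * NS + m].y;
        }
    }

    /* multiply by -i:  -i (hr + i hi) = hi - i hr */
    dC[gid] = (double2)(hi, -hr);
}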
4.1. Algorithm for Splitting up Generic Work

An algorithm is necessary that splits up a workload of NWork units into a specific number of units given by NWorker. A worker can be a work item or a work group. We get the start position for a specific worker:

  Remainder ← NWork mod NWorker
  Div ← NWork / NWorker
  Start ← Div ∗ IDWorker
  if IDWorker ≤ Remainder then
    Start ← Start + IDWorker
  else
    Start ← Start + Remainder
  end if

Now a check is performed to make sure a worker has not been assigned a value that is out of range of the available work. If this over-assignment has occurred the amount of work is adjusted for the worker:

  if IDWorker < Remainder then
    i ← 1
  else
    i ← 0
  end if
  Div ← MIN(NWork − Start, Div + i)
  End ← Start + Div

This stops workers accessing unallocated memory and also stops workers from performing duplicate calculations. If an exact number of workers is chosen, so that the division is assured to be correct, then this check is unnecessary.
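A direct C transcription of this division is sketched below; the function and struct names are our own, and the same logic can be used unchanged inside an OpenCL C kernel.

/* Sketch of the generic work-splitting algorithm of Sec. 4.1: nwork items
   are divided among nworker workers; worker id receives the half-open
   range [start, end).  Workers with id < remainder get one extra item,
   and the final clamp keeps the range inside the available work. */
typedef struct { int start; int end; } work_range;

work_range split_work(int nwork, int nworker, int id)
{
    int remainder = nwork % nworker;
    int div       = nwork / nworker;
    int start     = div * id + (id < remainder ? id : remainder);
    int extra     = (id < remainder) ? 1 : 0;

    int count = div + extra;
    if (count > nwork - start)        /* guard against over-assignment */
        count = nwork - start;

    work_range r = { start, start + count };
    return r;
}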
by a strong pulsed laser. This laser field couples states that
Splitting up Work Groups Amongst Blocks of Work. A block of
differ in angular momentum by 1 unit while the magnetic quan-
work is treated as a series of tasks that involve identical instruc-
tum number was set to zero (see Figure 2). We set the magnetic
tions being executed but with different data. Due to occupancy
quantum number to zero because we assume that the initial state
issues it may not be ideal to assign a full block of work to one
was the ground state of hydrogen where l = 0 and therefore
work group. The following algorithm was created to perform
ml = 0 and since ml is constant we do not need to take it into
this split up calculation:
account. There are 649 eigenstates associated with each angu-
if NGroups < NWorkB then lar momentum L in the basis that is used for the computations in
IDGroupB ← 0 this paper. Nine of these eigenstates are boundary states of the
NGroupB ← 1 B-Spline basis used which are fixed at 0; a total of 640 states
Call main algorithm in section 4.1 then are explicitly represented in the calculations for each an-
else gular momentum. The population of the continua represent the
NGroupB ← NGroups /NWorkB level of ionization after the laser field has passed.
IDGroupB ← IDGroup (mod NGroupB ) The EM field was modelled by a sine squared pulse, linearly
S tart ← IDGroup /NGroupB polarized along the z−axis:
End ← MIN(S tart + 1, NWorkB )
ω
end if E(t) = ẑE0 sin2 ( t)sin(ωt) (10)
2n
Where NGroups is the number of work groups available, IDGroup
is the ID of a particular work group and NWorkB is the amount where ω is the photon frequency and n is the number of cy-
of blocks of work to split up. After the algorithm has finished cles per pulse. The propagation was performed in the veloc-
each work group will be associated with a particular block of ity gauge where the dipole operator is expressed as Dv = −p ·
work which is denoted by an ID IDGroupB . The number of work A(t)/c. For the velocity gauge a five point gaussian quadra-
groups in the block is given by NGroupB . ture
R t integrates the E field to give the vector potential A(t) =
The call to the main algorithm is done where NGroups be- − t dt′ E(t′ ) at each time step. The integration was found to
0
comes NWorker , IDGroup becomes IDWorker and NWork is still perform as expected by comparing it to an analytical expression
used as the amount of work. for a sine squared pulse with a particular photon frequency. The
method works equally well for length gauge calculations where
the electric field E(t) gives the time dependence.
6
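For reference, a small sketch of how the field of Eq. (10) and its vector potential could be accumulated per time step is given below. The five Gauss-Legendre nodes and weights are standard values; the function names and the per-step accumulation strategy are our own assumptions rather than the paper's actual routine.

/* Sketch: sine-squared pulse E(t) of Eq. (10) and accumulation of the
   vector potential A(t) = -int_0^t E(t') dt' with a five-point
   Gauss-Legendre rule applied to each propagation step [t, t+dt]. */
#include <math.h>

double efield(double t, double e0, double omega, double ncyc)
{
    double s = sin(omega * t / (2.0 * ncyc));
    return e0 * s * s * sin(omega * t);
}

/* advance A(t) to A(t+dt) by the integral of -E over [t, t+dt] */
double advance_vecpot(double a, double t, double dt,
                      double e0, double omega, double ncyc)
{
    static const double x[5] = { -0.9061798459386640, -0.5384693101056831,
                                  0.0,
                                  0.5384693101056831,  0.9061798459386640 };
    static const double w[5] = {  0.2369268850561891,  0.4786286704993665,
                                  0.5688888888888889,
                                  0.4786286704993665,  0.2369268850561891 };
    double mid = t + 0.5 * dt, half = 0.5 * dt, sum = 0.0;
    for (int k = 0; k < 5; ++k)
        sum += w[k] * efield(mid + half * x[k], e0, omega, ncyc);
    return a - half * sum;
}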
The present GPU implementation of the Taylor and Runge-Kutta propagators (see appendix) was used for calculations in the case of atomic hydrogen. The accuracy and precision of the propagators were verified by comparing the photoelectron spectrum (PES) of the system to that of a known working propagator, which is based on a NAG Runge-Kutta solver. Since above threshold ionization (ATI) has occurred, the photoelectron spectrum is distinct.

In terms of the particular OpenCL implementation, the splitting of a block to work groups was made where the number of blocks of work NWorkB mentioned in section 4.1 corresponds to the number of angular momenta Ltot when we are using the basis representation. IDGroupB is the ID for a particular angular momentum L.

For the division of the work, initially the algorithm in section 4.1 is called. This will assign NGroupB work groups to each angular momentum block of coefficients CL. Following this, a call is necessary to divide the individual coefficients in CL amongst different work items. This is done through a call to the algorithm defined in section 4.1. The number of work groups and the work group sizes were chosen such that for every coefficient there would be one corresponding work item.

In the benchmarks shown the matrix was treated as a very large one-dimensional array. Each diagonal matrix block E_LD was passed, followed by the related superdiagonal dipole element block DL in row major form. In this form the superdiagonal blocks were transferred to the GPU but the subdiagonal blocks were not represented. Since the matrix is symmetric the subdiagonal block is not necessary. An implementation was also made where both subdiagonal and superdiagonal blocks were present, although the runtime was longer.
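The packing just described can be summarised by the following indexing sketch; the layout (for each angular momentum L, the Nl diagonal energies followed by the Nl x Nl superdiagonal dipole block in row major order) follows the text above, but the helper functions themselves are our own illustration.

/* Sketch of the 1D packing of the Hamiltonian used in the benchmarks. */
#include <stddef.h>

#define BLOCK_STRIDE(NL)   ((NL) + (size_t)(NL) * (NL))

size_t energy_index(size_t L, size_t n, size_t NL)
{
    return L * BLOCK_STRIDE(NL) + n;                    /* E_L[n]          */
}

size_t dipole_index(size_t L, size_t row, size_t col, size_t NL)
{
    return L * BLOCK_STRIDE(NL) + NL + row * NL + col;  /* D_L[row][col]   */
}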
5.1. Benchmarking Results

Since OpenCL allows for both GPU and CPU execution, we have benchmarked GPU execution on an AMD FirePro V7800, on a single GPU of an NVIDIA Tesla S1070 Computing System node, and on a dual core Intel Xeon.

In comparing the Taylor propagator, a specific step size and order were chosen so that for every computation it can be guaranteed that the propagator will maintain unitarity. The step size chosen does not represent the optimal choice and so should not be used in comparison to other methods. What is of interest is how the method scales as the work size is linearly increased. Since the method is nearest neighbour, the computational overhead for the simulation also rises linearly. Any deviations from this linearity would be due to the limitations in the hardware or algorithms used.

By choosing to benchmark along the number of angular momenta in the basis set, the computational cost of the problem can be increased linearly by increasing the number of angular momenta. The computational cost is linear as the matrix vector calculation is, in this particular application, a nearest neighbour problem. The size of the Hamiltonian in terms of the number of double precision elements is (Nl + Nl^2)l, where Nl is the number of pairs of double precision elements required to represent the coefficients for each angular momentum. For example, for the case of Nl = 640 and l = 0, ..., 20 the hamiltonian has 8615040 double precision values, which require an array of about 65 MB.

Figure 3: Shown is a sample PES from simulations with 20 eV photons and a pulse of intensity 1 × 10^14 W cm^-2 for the three propagators that were compared. A high agreement is seen between the classic RK4 and the Taylor propagators. The Runge-Kutta-Fehlberg is markedly different from the other two methods, which mostly overlap.

An approximate comparison between the Taylor propagator and the Runge-Kutta method was made by comparing a 10th order Taylor propagator to a classic 4th order Runge-Kutta propagator and the 4th(5th) order embedded pair Runge-Kutta-Fehlberg (RKF) method. The 5th order solution was chosen from the RKF method. It has been noted in the literature that high order Taylor propagators with large step sizes perform better than lower order Taylor propagators with smaller time steps [34]. As a result of this a 10th order Taylor propagator was chosen. So that we can compare like with like, the step size of the Taylor propagator was altered so that an equal number of matrix vector calculations would be performed. Similarly, the 5th order solution from the RKF method was also performed with an adjusted time step.

As can be seen from Figure 4 there was no major discrepancy in the runtime of the Taylor and classic RK4 propagators, but there was a major discrepancy with the RKF 4(5) propagator, which we attribute to it being an embedded pair method. The RKF 4(5) propagator, of which we use the 5th order solution, also consistently deviates in the photoelectron spectrum, an example of which is seen in Figure 3. All three methods did contain the expected structure within the PES, but the RKF 4(5) method deviates in the expected intensities. By comparing the unitarity of the solution of the methods a lower bound on the error can be obtained. The Taylor propagator kept unitarity to a level of 1.9 × 10^-14, while the classic RK4 method kept unitarity to 2.3 × 10^-11 and the RKF 4(5) method deviated from unitarity by 2.5 × 10^-11. Step sizes for each method were
chosen so that an identical number of matrix vector calculations would be performed. The performance figures should not be used to decide on the choice of method; rather, they are used here to demonstrate that the number of matrix vector calculations, which corresponds to the number of kernels queued, appears to be the primary factor deciding the runtime speed of the algorithms.

Figure 4: A graph of the performance of the 10th order Taylor, 4th order classic RK4 and 5th order Runge-Kutta-Fehlberg propagators with time steps of 6.25 × 10^-3, 2.5 × 10^-3 and 3.75 × 10^-3 respectively. The time steps were chosen so that the same number of matrix vector calculations would be performed in all cases. Since the Fehlberg method is a multi-order method it requires slightly more computations and so does not scale as well, as step-size control is not implemented.

Figure 5: A graph of the performance of an AMD V7800 in comparison to one GPU compute device in the Tesla S1070 for a tenth order Taylor propagator. Shown are the effects of several different configurations of work items in each work group; the work group size is shown in brackets. The optimal number of work items per work group is architecture dependent: 64 was optimal for the AMD GPU, but for the NVIDIA GPU 192 work items per work group was optimal. A step size of 0.005 was chosen. The number of equations indicates the number of real equations, that is, every complex equation consists of 2 real equations.

It can be seen from Figure 5 that for the NVIDIA GPU, 192 work items gives the greatest reduction in runtime, whilst for the AMD GPU, 64 work items gives the best performance in these circumstances. The AMD GPU has a highly linear increase in run time, as expected from a consistent use of the computational resources.

When the GPU results are compared to the serial CPU results a clear trend is seen (Figure 6). The GPU based simulations using OpenCL scale better than the CPU; the runtime within the region shown in the figure for the particular pulse described, where x is the number of double precision elements in the vector of coefficients, is:

t_INTEL(x) = 0.14 x − 170
t_AMD(x) = 0.0032 x + 14
t_NVIDIA(x) = 0.010 x + 9.3

Figure 6: The runtime in seconds of the best performing NVIDIA and AMD configurations from Figure 5 together with an Intel Xeon.

The CPU timing must break down for smaller problems, but this is unimportant since the number of explicit states is 640, which means the smallest possible vector of coefficients is 2560 elements.

Figure 7 indicates the general trend which can be extrapolated from the above equations: the speedup for the AMD device tends towards a 40 times speedup, whilst the NVIDIA device tends towards a 14 times speedup, as the problem size increases. If, in what is most likely an overly optimistic scenario, one took the CPU scaling to be linear with the number of cores, this would still provide a speedup of the GPU of an order of magnitude in comparison to a multi-core system which, incidentally, would be more expensive to purchase. With the V7800 card the relationship would terminate at a matrix which can fit into an array of size 256 MB, because the largest single block of memory allocatable on the device is 256 MB.
Figure 7: AMD V7800 and NVIDIA T10 GPU speedups in comparison to the runtime on a single core of an Intel Xeon CPU. The speedup factor here is the ratio of runtimes tCPU/tGPU.

5.2. Optimization approaches

Although no optimization has been performed, this represents a future line of work when the need for further runtime reductions becomes an issue. Aligning memory accesses is the first step in any optimization. It makes certain that memory accesses occur on word aligned boundaries. This is achieved by ensuring that each row of the Hamiltonian and each angular momentum block of the vector C(t) are aligned.

If a large cache is available on the GPU then prefetching can be enhanced by explicit cache functions available in OpenCL. Alternatively, compute unit memory can be utilized. Compute unit memory accesses have a much lower latency than global memory accesses, by approximately an order of magnitude. Using compute unit memory also means that there will be less demand on the memory controllers.

On AMD hardware, for example, multiple accesses to a specific memory controller are serialized when there is a conflict. Atomic operations should be avoided if possible on the AMD architecture, as a single atomic operation can dramatically slow down all other memory operations. Unrolling loops can also help the compiler take advantage of the memory access structure, but it also increases register use [35].

Another direction to improve the present implementation is the use of more sophisticated propagation algorithms than the Taylor and the Runge-Kutta. An important candidate, worthy of consideration, should be the Arnoldi/Lanczos algorithm. Whilst other theoretical studies have remarked that the Taylor propagator is both simple and reliable [34], they argue that it is slower than the Lanczos propagator, mainly due to the smaller timesteps which the Taylor propagator requires [36, 16]. An optimized Taylor propagator should lend itself towards the construction of the Krylov subspace for a Lanczos propagator, since the matrix vector operations represent the most significant computational bottleneck in the method.

6. Conclusions

A vast number of problems can be formulated in terms of a system of first-order ODEs. For propagators in which matrix vector calculations represent a significant bottleneck, significant runtime reductions can be achieved by the use of GPUs through the OpenCL language. A number of strategies for optimization exist in OpenCL, which we discussed briefly for our particular case. Optimizing the existing code will require further work for an expected further order of magnitude improvement in runtime scalability. It also goes without saying that, with improvements in compilers and hardware, future trends should be for fine-tuning optimizations to be performed by sophisticated compilers and hidden behind generic functions.

7. Acknowledgements

This work has been funded under the Seventh Framework Programme, under the project HPCAMO/256601. It has been performed in association with the SCI-SYM center of DCU. The authors acknowledge Ireland's High Performance Computing Center (ICHEC) for their support in using ICHEC's computational resources. Both authors acknowledge support from the COST CM0702 action. LAAN is supported by the Science Foundation Ireland (SFI) Stokes Lectureship program.

Appendix A. OpenCL compute devices tested

AMD FirePro V7800. The AMD FirePro V7800 is a PCI-e x16 connected graphics card [37] with 288 processing cores. Each core consists of four arithmetic logic units (ALUs) and a transcendental unit, which are fed instructions through a Very Long Instruction Word (VLIW). The ALUs can be thought of in OpenCL terms as processing elements. For double precision the transcendental unit is not used and the remaining four are grouped into two double precision execution units; thus there are two double precision processing elements per processing core. This means that for practical purposes 576 double precision instructions can be executed simultaneously. For floating point calculations 1152 instructions can be executed. The processing cores are grouped into compute units. Obviously the actual number of instructions executed in a cycle is dependent on the form of the workload. A compute unit (a SIMD processor) consists of 16 of the processing cores; as a result there are 18 compute units. 1 GB of global memory is available, as well as 32 KB of local memory per compute unit. Each processing element has access to a pool of registers (256 KB per compute unit). Global memory is accessed with GDDR5. The core clock is 700 MHz.

NVIDIA Tesla S1070 Computing Systems. The NVIDIA Tesla S1070 computing system consists of multiple Tesla T10 GPUs which are based on the GeForce GTX 200 GPU [38]. Each GPU contains 240 scalar processing cores and 4 GB of memory [39]. Currently a single GPU is targeted.
Intel Xeon. The Intel Xeon W3503 [40] used is a 64-bit dual core CPU with a clock speed of 2.4 GHz, a 4 MB cache and support for DDR3 memory with a 25.6 GB/s memory bandwidth.

Appendix B. The Taylor and Runge-Kutta time propagators

The Truncated Taylor Series Propagator. Within this single-step algorithm the forwarded solution is obtained as:

C(t + dt) = Σ_{n=0}^{N} (dt^n / n!) C^(n)(t),  (B.1)

where C^(n)(t) is the n-th derivative of C(t) at time t. A recursive expression for the required n-th derivatives of the coefficient vector can be retrieved by successive integrations of Equation (4) as:

C^(n)(t) = (−i/n) H(t) C^(n−1)(t),  (B.2)

where the zeroth derivative is equal to C(t). To arrive at this expression one shall assume that the time derivative of the Hamiltonian itself, within the forwarded time interval dt, is much smaller than the rate of change of the coefficients. Particularly for the present problem, this latter assumption is an excellent approximation, provided that the chosen time step dt is much smaller than the field's period 2π/ω, in other words, dt << 2π/ω. It can be shown that this expression does indeed give an approximate form of the unitary operator. In practical calculations, the above expression consists of a Taylor series truncated to some order N, which in combination with the time step dt sets the order of accuracy of the solution. Finally, from this expression, after calculation of the derivatives of the coefficients at a known time t, they are combined together in a summation in order to calculate the wavefunction at the later time t + dt according to Equation (B.1). The calculation for each step consists of N matrix-vector multiplications and N vector additions onto the solution of the system one step later.
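A serial sketch of one such step is shown below, following the structure of Eqs. (B.1) and (B.2). The routine applying H(t) to a vector is passed in as a function pointer and is exactly the operation executed by the GPU kernels; the code itself and its names are illustrative, not the paper's implementation.

/* Sketch of one truncated Taylor step of order N for i dC/dt = H(t) C.
   Vectors are stored as interleaved (re, im) pairs of length 2*n;
   matvec(t, v, w, n) is assumed to compute w = H(t) v. */
#include <stdlib.h>

void taylor_step(double *c, double t, double dt, int n, int order,
                 void (*matvec)(double t, const double *v, double *w, int n))
{
    double *term = malloc(2 * n * sizeof(double));  /* current series term */
    double *work = malloc(2 * n * sizeof(double));
    for (int i = 0; i < 2 * n; ++i) term[i] = c[i];

    for (int k = 1; k <= order; ++k) {
        matvec(t, term, work, n);                   /* work = H(t) * term  */
        double fac = dt / (double)k;
        for (int i = 0; i < n; ++i) {
            /* multiply by -i*dt/k:  (-i)(x + iy) = y - ix */
            term[2*i]     =  fac * work[2*i + 1];
            term[2*i + 1] = -fac * work[2*i];
        }
        for (int i = 0; i < 2 * n; ++i) c[i] += term[i];  /* accumulate (B.1) */
    }
    free(term);
    free(work);
}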
Explicit Runge-Kutta Methods. Runge-Kutta methods have been some of the most widely used methods for solving ordinary differential equations [41]. Runge-Kutta methods use information from several steps to approximate a Taylor expansion [42, pg 906]. The class of explicit Runge-Kutta methods is expressed in the form [43]:

C(t + dt) = C(t) + dt Σ_{i=1}^{S} b_i C^(i),  (B.3)
C^(i) = f(t + c_i dt, C(t) + dt Σ_{j=1}^{i−1} a_{ij} C^(j)),  (B.4)

where f() represents the derivative of C. A number of Runge-Kutta methods exist, such as the classic fourth order Runge-Kutta method and the 4th(5th) order embedded pair Fehlberg method [44]. In the Runge-Kutta-Fehlberg method, 4th and 5th order steps are calculated using the same derivative calculation information; the difference between the two methods gives an indication of the local error size.
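For completeness, a corresponding sketch of the classic fourth-order Runge-Kutta step, which is Eq. (B.3) with S = 4 and the standard coefficients, is given below; deriv() is the same assumed derivative routine returning f(t, C) = −iH(t)C, and the code is illustrative rather than the paper's implementation.

/* Sketch of one classic RK4 step for dC/dt = f(t, C), Eq. (B.3) with S = 4.
   Vectors have length m real values (m = 2*n for complex coefficients);
   deriv(t, c, dc, m) is assumed to evaluate f(t, C) into dc. */
#include <stdlib.h>

void rk4_step(double *c, double t, double dt, int m,
              void (*deriv)(double t, const double *c, double *dc, int m))
{
    double *k1 = malloc(m * sizeof(double)), *k2 = malloc(m * sizeof(double));
    double *k3 = malloc(m * sizeof(double)), *k4 = malloc(m * sizeof(double));
    double *tmp = malloc(m * sizeof(double));

    deriv(t, c, k1, m);
    for (int i = 0; i < m; ++i) tmp[i] = c[i] + 0.5 * dt * k1[i];
    deriv(t + 0.5 * dt, tmp, k2, m);
    for (int i = 0; i < m; ++i) tmp[i] = c[i] + 0.5 * dt * k2[i];
    deriv(t + 0.5 * dt, tmp, k3, m);
    for (int i = 0; i < m; ++i) tmp[i] = c[i] + dt * k3[i];
    deriv(t + dt, tmp, k4, m);

    for (int i = 0; i < m; ++i)
        c[i] += dt * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]) / 6.0;

    free(k1); free(k2); free(k3); free(k4); free(tmp);
}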
References

[1] K. C. Kulander, Time-dependent hartree-fock of multiphoton ionization: Helium, Phys. Rev. A 36 (1987) 2726.
[2] P. Lambropoulos, P. Maragakis, J. Zhang, Two-electron atoms in strong fields, Physics Reports 305 (1998) 203.
[3] J. S. Parker, L. R. Moore, D. Dundas, K. T. Taylor, Double ionization of helium at 390 nm, J. Phys. B 33 (2000) L691–L698.
[4] J. S. Parker, B. J. S. Doherty, K. T. Taylor, K. D. Schultz, C. I. Blaga, L. F. DiMauro, High-energy cutoff in the spectrum of strong-field nonsequential double ionization, Phys. Rev. Lett. 96 (2006) 133001.
[5] H. Wabnitz, A. deCastro, P. Gurtler, T. Laarmann, W. Laasch, J. Schulz, T. Moller, Multiple ionization of rare gas atoms irradiated with intense vuv radiation, Phys. Rev. Lett. 94 (2005) 023001.
[6] A. A. Sorokin, S. V. Bobashev, K. Tiedtke, M. Richter, Multi-photon ionization of molecular nitrogen by femtosecond soft x-ray fel pulses, J. Phys. B: At. Mol. Opt. Phys. 39 (2006) L299.
[7] K. C. Kulander, Time-dependent theory of multiphoton ionization of xenon, Phys. Rev. A 38 (1988) 778.
[8] J. L. Krause, K. J. Schafer, K. C. Kulander, Optical harmonic generation in atomic and molecular hydrogen, Chem. Phys. Lett. 178 (1991) 573.
[9] N. A. Nguyen, A. D. Bandrauk, Electron correlation of one-dimensional H2 in intense laser fields: Time-dependent extended hartree-fock and time-dependent density-functional-theory approaches, Phys. Rev. A 73 (3) (2006) 032708.
[10] J. Caillat, J. Zanghellini, M. Kitzler, O. Koch, W. Kreuzer, A. Scrinzi, Correlated multielectron systems in strong laser fields: A multiconfiguration time-dependent hartree-fock approach, Phys. Rev. A 71 (2005) 012712.
[11] P. G. Burke, P. Francken, C. J. Joachain, R-matrix-floquet theory of multiphoton processes, J. Phys. B 24 (1991) 751.
[12] H. W. van der Hart, M. A. Lysaght, P. G. Burke, Time-dependent multielectron dynamics of ar in intense short laser pulses, Phys. Rev. A 76 (2007) 043405.
[13] X. Guan, O. Zatsarinny, K. Bartschat, B. I. Schneider, J. Feist, C. J. Noble, General approach to few-cycle intense laser interactions with complex atoms, Phys. Rev. A 76 (2007) 053411.
[14] H. W. van der Hart, M. A. Lysaght, P. G. Burke, Momentum distributions of electrons ejected during ultrashort laser interactions with multielectron atoms described using the r-matrix basis sets, Phys. Rev. A 77 (2008) 065401.
[15] L. A. A. Nikolopoulos, J. S. Parker, K. T. Taylor, Combined r-matrix eigenstate basis set and finite-difference propagation method for the time-dependent schrödinger equation: The one-electron case, Phys. Rev. A 78 (2008) 063420. doi:10.1103/PhysRevA.78.063420.
[16] M. Lysaght, L. Moore, L. Nikolopoulos, J. Parker, H. Hart, K. Taylor, Ab Initio Methods for Few- and Many-Electron Atomic Systems in Intense Short-Pulse Laser Light, in: A. D. Bandrauk, M. Ivanov (Eds.), Quantum Dynamic Imaging, CRM Series in Mathematical Physics, Springer New York, 2011, pp. 107–134. doi:10.1007/978-1-4419-9491-2_8.
[17] L. Moore, M. Lysaght, L. Nikolopoulos, J. Parker, H. van der Hart, K. Taylor, The rmt method for many-electron atomic systems in intense short-pulse laser light, Journal of Modern Optics 58 (13) (2011) 1132–1140. doi:10.1080/09500340.2011.559315.
[18] S. Pennycook, S. Hammond, G. Mudalige, S. Wright, S. Jarvis, On the acceleration of wavefront applications using distributed many-core architectures, The Computer Journal. doi:10.1093/comjnl/bxr073.
[19] Cuda (2012). URL www.nvidia.com/cuda
[20] Opencl. URL www.khronos.org/opencl
[21] Pgi cuda fortran compiler. URL http://www.pgroup.com/resources/cudafortran.htm
[22] Fortrancl. URL http://code.google.com/p/fortrancl/
[23] M. A. Marques, A. Castro, G. F. Bertsch, A. Rubio, octopus: a first-principles tool for excited electron–ion dynamics, Computer Physics Communications 151 (1) (2003) 60–78. doi:10.1016/S0010-4655(02)00686-0.
[24] M. Harvey, G. D. Fabritiis, Swan: A tool for porting cuda programs to opencl, Computer Physics Communications 182 (4) (2011) 1093–1099. doi:10.1016/j.cpc.2010.12.052.
[25] J. E. Stone, D. J. Hardy, I. S. Ufimtsev, K. Schulten, GPU-accelerated molecular modeling coming of age, Journal of Molecular Graphics and Modelling 29 (2) (2010) 116–125. doi:10.1016/j.jmgm.2010.06.010.
[26] R. Ueda, Y. Matsumoto, M. Itagaki, S. Oikawa, Effectiveness of gpgpu for solving the magnetohydrodynamics equations using the cip-mocct method, Plasma Fusion Res.
[27] M. Weigel, Simulating spin models on gpu, Computer Physics Communications, Special Edition for Conference on Computational Physics, Trondheim, Norway. doi:10.1016/j.cpc.2010.10.031.
[28] T. Kramer, Time-dependent approach to transport and scattering in atomic and mesoscopic physics, in: R. Bijker, O. Castaños, R. Jáuregui, O. Rosas-Ortiz (Eds.), American Institute of Physics Conference Series, Vol. 1334 of American Institute of Physics Conference Series, 2011, pp. 142–165.
[29] H. Bauke, C. H. Keitel, Accelerating the fourier split operator method via graphics processing units, Computer Physics Communications 182 (12) (2011) 2454–2463.
[30] GDDR5 Memory description. URL http://www.amd.com/us/products/technologies/gddr5
[31] Khronos Group, The OpenCL Specification, original draft Sept 2010 (June 2011). URL http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
[32] I. Sobel'man, Introduction to the theory of atomic spectra, Pergamon Press Ltd, Oxford, 1972.
[33] L. A. A. Nikolopoulos, Electromagnetic transitions between states satisfying free-boundary conditions, Phys. Rev. A 73 (2006) 043408. doi:10.1103/PhysRevA.73.043408.
[34] J. Parker, C. W. Clark, Study of a plane-wave final-state theory of above-threshold ionization and harmonic generation, J. Opt. Soc. Am. B 13 (2) (1996) 371–379. doi:10.1364/JOSAB.13.000371.
[35] AMD Accelerated Parallel Processing OpenCL, Tech. rep., AMD (Aug 2011). URL http://developer.amd.com/sdks/amdappsdk
[36] E. S. Smyth, J. S. Parker, K. Taylor, Numerical integration of the time-dependent Schrödinger equation for laser-driven helium, Computer Physics Communications 114 (1-3) (1998) 1–14. doi:10.1016/S0010-4655(98)00083-6.
[37] ATI FirePro V7800 Professional Graphics (2010). URL http://www.amd.com
[38] Tech. rep., NVIDIA (April 2010). URL http://www.nvidia.com/docs/IO/43395/SP-04154-001_v03.pdf
[39] Tesla S1070 Datasheet, NVIDIA (June 2008).
[40] Intel Xeon Processor W3503 (4M Cache, 2.40 GHz, 4.80 GT/s Intel QPI) (2009). URL http://ark.intel.com/products/40799
[41] J. R. Cash, A. H. Karp, A variable order runge-kutta method for initial value problems with rapidly varying right-hand sides, ACM Trans. Math. Softw. 16 (1990) 201–222.
[42] W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing, 3rd Edition, Cambridge University Press, 2007.
[43] U. M. Ascher, L. R. Petzold, Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations, SIAM, 1998. doi:10.2277/0898714125.
[44] E. Fehlberg, Low-order classical runge-kutta formulas with stepsize control and their application to some heat transfer problems, Tech. Rep. NASA TR R-315, NASA, George C. Marshall Space Flight Center, Marshall, Ala. (July 1969).