
Single Pulse Detection Algorithms for Real-time Fast Radio Burst Searches using GPUs

arXiv:1910.08324v2 [astro-ph.IM] 27 Apr 2020

Karel Adámek1 and Wesley Armour∗1

1 Oxford e-Research Centre, Department of Engineering Sciences, University of Oxford, 7 Keble Road, OX1 3QG, Oxford, United Kingdom

April 28, 2020

Abstract
The detection of non-repeating or irregular events in time-domain radio astronomy has
gained importance over the last decade due to the discovery of fast radio bursts. Existing or
upcoming radio telescopes are gathering more and more data and consequently the software,
which is an important part of these telescopes, must process large data volumes at high
data rates. Data has to be searched through to detect new and interesting events, often in
real-time. These requirements necessitate new and fast algorithms which must process data
quickly and accurately. In this work we present new algorithms for single pulse detection
using boxcar filters. We have quantified the signal loss introduced by single pulse detection
algorithms which use boxcar filters and, based on these results, we have designed two distinct
"lossy" algorithms. Our lossy algorithms use an incomplete set of boxcar filters to accelerate
detection at the expense of a small reduction in detected signal power. We present formulae
for signal loss, descriptions of our algorithms and their parallel implementation on NVIDIA
GPUs using CUDA. We also present tests of correctness, tests on artificial data and the
performance achieved. Our implementation can process SKA-MID-like data 266× faster than
real-time on an NVIDIA P100 GPU and 500× faster than real-time on an NVIDIA Titan V GPU,
with a mean signal power loss of 7%. We conclude with prospects for single pulse detection
in the beyond-SKA era of nanosecond time resolution radio astronomy.
Keywords — FRB — Transient detection — Astronomy data reduction — Computational as-
tronomy — Single pulse detection — GPU — CUDA

1 Introduction
The discovery of Fast Radio Bursts (FRBs) by Lorimer et al. [2007] and Rotating Radio Transients
(RRATs) by McLaughlin et al. [2006] has highlighted the importance of single pulse detection in
time-domain radio astronomy. FRBs and RRATs are rare, non-repeating or irregular events;
therefore their accurate detection is of great importance if we are to understand the nature of
the objects producing them. Furthermore, on modern radio telescopes the algorithms for single
pulse detection are required to process more and more data collected by these instruments. Also,
the rise of multiwavelength astronomy which requires real-time or near-real-time (Middleton et al.
[2017]) detections increases the requirements put on single pulse detection even more. All of this
necessitates the need for fast and accurate algorithms designed to detect single isolated pulses.
Single pulse detection algorithms are also important in searches for giant pulses and for irregular
pulses from nulling pulsars.
∗ E-mail address: [email protected]

The single pulse search was described by Cordes and McLaughlin [2003] as an exercise in
matched filtering. They employed a set of boxcar filters of widths 2^n. This technique was used
successfully in subsequent works (for example Burke-Spolaor and Bailes [2010], Cordes et al. [2006],
Deneva et al. [2009], Rubio-Herrera et al. [2013]). The boxcar filter approach was also adopted by
many software packages, such as Heimdall1, Seek2, Destroy3. In some cases it is beneficial to use
more targeted matched filters as suggested by Keane et al. [2010].
The single pulse detection problem is also being explored using machine learning and deep
learning techniques, for example by Connor and van Leeuwen [2018], Wagstaff et al. [2016], Zhang et al.
[2018]. A comparison of these techniques to traditional searches is, however, outside the scope of this
work.
It is becoming increasingly important to understand the associated sensitivity and signal loss
introduced by algorithms and software when considering the overall performance of a radio tele-
scope. As such, any sensitivity loss introduced by algorithms or software must be investigated.
This was discussed by Keane and Petroff [2015], where the authors compared the sensitivity of
different single pulse detection algorithms. Single pulse detection algorithms can have many dif-
ferent forms all with different sensitivities, and also different computational costs. In the case of
single pulse detection by boxcar filters, the decrease in sensitivity can occur when a detected pulse
is poorly matched by the filters used to extract it from the noise in which it is embedded.
In this work, we quantify sources of signal loss which occur when single pulses are detected
using boxcar filters. Based on these, we have designed two distinct methods to distribute and
create a limited set of boxcar filter widths in such a way that the signal loss is controlled. We
show that the methods presented significantly increase the computational efficiency of traditional
approaches and hence provide a significant reduction in execution time, which is of relevance to
real-time processing pipelines. We present formulae for calculating the signal loss introduced by
a given set and distribution of boxcar filters for an idealized model which uses rectangular pulses.
These formulae allow the user to tune the signal loss of their single pulse detection scheme.
We present calculations of the computational complexity and the number of memory accesses
associated with each method.
Based on our two methods we have designed and implemented two single pulse detection
algorithms for NVIDIA GPUs. These algorithms rely on reusing partial sums to calculate long
boxcar filters required for detection of wide single pulses. This increases computational efficiency
and decreases the number of memory accesses. We present optimization techniques used in the
implementation of these algorithms on GPUs. We also describe the advantages and disadvantages
of these algorithms. Reusing partial sums to increase computational efficiency is well known in the
radio astronomy community. It has been successfully used in the fast folding algorithm by Staelin
[1969] with the most recent implementation by Parent et al. [2018] (using Python). The reuse of
partial sums was also successfully used in the tree de-dispersion algorithm by Zackay and Ofek
[2017] who implemented tree de-dispersion using CPU, but also suggested an algorithm suitable
for GPUs.
This work is part of the AstroAccelerate software package Armour et al. [2019], a GPU opti-
mized time-domain processing pipeline for radio astronomy data. It contains GPU implementa-
tions of common signal processing tools used in time-domain radio astronomy, such as de-dispersion
by Armour et al. [2012b], a GPU implementation of the Fourier domain acceleration search by
Dimoudi et al. [2018], and also periodicity search with a GPU implementation of the harmonic
sum by Adámek and Armour [2018]. Aspects of the work have been used in Karastergiou et al.
[2015] and Mickaliger et al. [2018].
This paper is structured as follows. First, in section 2 we analyze the sensitivity of the single
pulse detection method which uses boxcar filters and derive formulae for quantifying the signal
loss. In section 3 we present two distinct algorithms and describe their implementation using the
CUDA language extension on NVIDIA GPUs. Our results are presented in section 4, where we
present the sensitivity of our algorithm when applied to rectangular, Gaussian and double Gaussian
1 https://sourceforge.net/projects/heimdall-astro/
2 https://github.com/SixByNine/psrsoft
3 https://github.com/evanocathain/destroy_gutted

pulse profiles with and without white noise. We prove the correctness of our implementation by
comparing the measured signal loss with predicted values. We also present the performance of
both algorithms and compare it to Heimdall, a GPU accelerated pipeline by Barsdell et al. [2012],
Jameson and Barsdell [2018]. We conclude the paper in section 5.

2 Single pulse detection


The aim of the single pulse search, which represents a whole set of techniques and methods, is to
find isolated pulses in input data. In this work we focus on the step responsible for recovering
the pulse, located somewhere in the input time-series. We will refer to this step as single pulse
detection or the SPD algorithm.
Single pulse detection, in general, relies on a matched filtering process, which is a convolution of
the input time-series x with a response function h. Convolution in the time domain is given by
y[n] = Σ_{i=0}^{T−1} h[i] x[n − i]   (1)

where n is the time sample, y is the filtered time-series and T is the length of the response function.
The response function is typically designed to detect pulses of a certain or similar shape of width
T. Significantly longer pulses of the same shape, or pulses of a different shape, require a different
response function h. A matched filtering approach thus requires multiple passes through the data
in order to cover the desired range of widths and shapes, and the convolution is computationally
expensive and offers little opportunity to reuse data4.
Because of the computational expense of matched filtering, we have decided to use boxcar filters
for our single pulse detection algorithm. If we assume that the pulse we would like to detect could
be located anywhere within the time-series, be of any shape, and have a wide range of widths,
the boxcar filter is a viable alternative. Boxcar filters are less sensitive than matched filters, but
offer two important advantages over the matched filter approach. They are independent of the
pulse shape and they allow us to reuse data and computations, which is critical when producing
an algorithm for execution on modern computer architectures, especially accelerator architectures
such as GPUs.
The boxcar filter is a simple running sum which can be expressed as
X_L[n] = Σ_{j=0}^{L−1} x[n + j] ,   (2)

where L is the boxcar width. This formalism allows us to reuse partial sums because we can form
a new longer boxcar filter from an appropriate combination of shorter boxcar filters.
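As a concrete (unofficial) illustration of this reuse, the running sum of equation (2) can be computed for every sample from a single array of prefix sums, and a wide boxcar can be assembled from adjacent narrower ones; the helper name `boxcar` is our own, not taken from the paper's code.

```python
import numpy as np

def boxcar(x, L):
    """X_L[n] = sum_{j=0}^{L-1} x[n+j] (eq. 2), for every valid n."""
    c = np.concatenate(([0.0], np.cumsum(x)))  # prefix sums of x
    return c[L:] - c[:-L]

x = np.arange(8, dtype=float)
# A width-4 boxcar equals the sum of two adjacent width-2 boxcars,
# which is exactly the partial-sum reuse described above:
X2, X4 = boxcar(x, 2), boxcar(x, 4)
assert np.allclose(X4, X2[:-2] + X2[2:])
```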
We can measure the strength of a sample by calculating the signal-to-noise ratio (SNR). The
SNR of the n-th sample from a time-series is calculated by the formula

SNR[n] = (x[n] − µ(x)) / σ(x) ,   (3)

where µ(x) and σ(x) are the mean and standard deviation of the underlying noise in the initial
time-series x. By applying a boxcar filter to a time-series we are creating a new time-series with different
mean µ(XL ) and standard deviation σ(XL ). The SNR for a pulse from this time-series is then
calculated as
SNR_L[n] = (X_L[n] − µ(X_L)) / σ(X_L) ,   (4)
4 This is because partial sums from which we can construct the output sample y[n] cannot be used to construct sample y[n + i], since the elements of x within these sums are weighted by the matched filter h.

where X_L[n] is given by equation (2). Each value X_L[n] of the new time-series is the output of a
boxcar filter whose bin width, or put another way, the number of accumulated samples from the
initial time-series x, is equal to the boxcar width.
We have chosen to adopt the SNR of the pulse as a figure of merit by which we measure how
well any SPD algorithm detects individual pulses.
The SNR produced by the algorithm which we call recovered SNR (RSNR), to distinguish it
from the true SNR of the pulse, is calculated using equation (4). The value of the RSNR depends
not only on the initial SNR of the injected pulse and its shape, but also on its position within
the time-series and its width. This is because to localize the pulse and sum the power contained
within it, we use an incomplete set of boxcar filters. That is, the boxcar filters do not cover every
possible pulse width at every possible time sample. As a consequence a pulse of the same shape
could be detected with different RSNR based on its position in time and its width. Lower RSNR
occurs when the injected pulse is not properly matched by these filters either in width or position
in time. The highest RSNR will be produced by the boxcar with width and position that best fits
the unknown signal.
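To make equations (3) and (4) concrete, the following sketch (our own illustration, assuming the white-noise values µ(X_L) = Lµ and σ(X_L) = √L·σ introduced later in section 2.1; `recovered_snr` is a hypothetical helper name) computes the RSNR of a time-series over a set of boxcar widths:

```python
import numpy as np

def recovered_snr(x, widths, mu=0.0, sigma=1.0):
    """Max of SNR_L[n] (eq. 4) over a set of boxcar widths, assuming white
    noise so that mu(X_L) = L*mu and sigma(X_L) = sqrt(L)*sigma."""
    c = np.concatenate(([0.0], np.cumsum(x)))
    best = -np.inf
    for L in widths:
        XL = c[L:] - c[:-L]                       # boxcar filter of width L
        snr = (XL - L * mu) / (np.sqrt(L) * sigma)
        best = max(best, snr.max())
    return best

# A width-8 rectangular pulse of unit amplitude is best matched by L = 8:
x = np.zeros(64); x[10:18] = 1.0
print(recovered_snr(x, [1, 2, 4, 8, 16]))  # ≈ 2.83, i.e. 8/sqrt(8)
```

A mismatched width (L = 4 or L = 16) sums either too little of the pulse or too much noise, which is exactly the signal loss quantified below.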

2.1 Idealized signal model


In order to compare algorithms, to evaluate sensitivity loss and to simplify sensitivity analysis, we
have introduced an idealized model (simplified toy model) of the input signal.
The simplest form of signal which fits this role is a time-series containing a single rectangular pulse,
without any noise, where all samples except those of the pulse are set to zero. The pulse is described
by its position ts within the time-series, by its width S, and by its amplitude A. The mean (µ)
and standard deviation (σ) which are required in the calculation of the SNR (eq. (3)) are set to
µ(x) = 0 and σ(x) = 1 to simulate the presence of white noise.
To get the value of the mean µ(X_L) and standard deviation σ(X_L) for longer boxcar filters we
use the white noise approximation. That is, the mean and standard deviation of the time-series
after application of the boxcar filter of width L are given as

µ(X_L) = L µ(x) ,
σ(X_L) = √L σ(x) .   (5)

For the white noise approximation we get µ(X_L) = 0 and σ(X_L) = √L.
For the purpose of comparing the sensitivity of the SPD algorithm to different pulse widths
we normalize the amplitude (A) of the rectangular pulse to

A(S) = C/√S ,   (6)

where C is a normalization constant. The normalization ensures that the RSNR produced by the
boxcar filter which fits the rectangular pulse perfectly is the same regardless of the pulse width S.
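A quick numeric check of this normalization (our own sketch, not from the paper; `matched_rsnr` is a hypothetical helper name): with A(S) = C/√S, the perfectly matched boxcar recovers the same RSNR = C for every pulse width.

```python
import numpy as np

def matched_rsnr(S, C=10.0):
    """RSNR of an ideal rectangular pulse of width S and amplitude
    A(S) = C/sqrt(S) (eq. 6), detected by the perfectly matched boxcar of
    width L = S, using the white-noise model mu(X_S) = 0, sigma(X_S) = sqrt(S)."""
    A = C / np.sqrt(S)
    X_S = S * A                  # the matched boxcar sums the whole pulse
    return X_S / np.sqrt(S)      # eq. (4)

print([round(matched_rsnr(S), 10) for S in (2, 4, 8, 16)])  # [10.0, 10.0, 10.0, 10.0]
```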

2.2 Sensitivity analysis


We quantify the sensitivity of the SPD algorithm in terms of signal loss. The signal loss Ψ is
given as the fraction of the pulse’s true SNR (SNRT ) which was not detected by the algorithm.
That is, the difference between RSNR detected by the algorithm and the true value SNRT of the
injected pulse divided by SNRT ,

Ψ = (SNR_T − RSNR)/SNR_T = 1 − RSNR(S, L, d_s)/SNR_T ,   (7)
where the value of RSNR depends on the pulse width, width of the boxcar filter used for detection
and on the position of the boxcar filter with respect to the pulse’s position.
As the position of the signal is unknown, we must apply the boxcar filter of width L to the
whole time-series x, though not necessarily at every time sample. If we are

only interested in the highest RSNR detected, for a fixed boxcar width L, then this reduces the
sensitivity analysis to a problem of two consecutive boxcar filters separated by L_s time samples;
we call this distance the boxcar separation. This problem, which is depicted graphically in Figure 1,
is then repeated throughout the whole time-series x. This allows us to avoid studying individual
boxcar filters, but still gives enough flexibility to evaluate the sensitivity or signal loss introduced
by the SPD algorithm for any signal present in the time-series.

[Figure 1: two consecutive boxcar filters B_i and B_i+1 of width L, separated by L_s samples, covering the time-series]

Figure 1: Layout of the boxcar filters covering a time-series x. Boxcars (B_i, B_i+1) are shown as
hatched boxes. Depending on the value of L_s (here L_s = 2), multiple boxcars can overlap one
another. Only two are shown here.

In our analysis of the sensitivity of SPD algorithms (for a detailed description see appendix A)
we have, using the idealized signal model, identified two quantities of the RSNR which can be
used to describe the sensitivity of an SPD algorithm.
The first is the lowest RSNR, RSNR_min(S), detected by the SPD algorithm under the worst
conditions for a given pulse width S. This acts as an infimum (greatest lower bound) for the RSNR
detected by the SPD algorithm for any pulse of a given width S. Along with it, we define the
worst-case scenario signal loss

Ψ_w(S) = 1 − RSNR_min(S)/SNR_T ,   (8)
that is the greatest signal loss introduced by the SPD algorithm.
The second quantity is the highest possible value of RSNR (RSNRmax (S)) which could be
recovered by the SPD algorithm under the best possible circumstances for the given pulse width
S. This acts as a supremum (least upper bound) for RSNR which could be detected by the SPD
algorithm. To accompany this quantity we have defined the systematic signal loss

Ψ_l(S) = 1 − RSNR_max(S)/SNR_T ,   (9)

which is always introduced by the SPD algorithm for a pulse of width S.

2.2.1 Parameters of the SPD algorithm


We have also determined two parameters which describe the SPD algorithm and have direct
implication on the sensitivity. We characterize the SPD algorithm by the set of boxcar filter
widths B = {L1 , L2 , . . . , Lmax } and by boxcar separation Ls for each boxcar width L ∈ B.
The number of different boxcar widths, that is, the sparsity of the set B, affects the value of
RSNR_max. We can increase RSNR_max (decrease the systematic signal loss Ψ_l) by including more
boxcar filter widths in the set performed by the SPD algorithm. The set B also affects RSNR_min,
though to a lesser degree.
The value of the boxcar separation Ls is most important for the value of RSNRmin . In order
to increase RSNRmin (decrease the worst case signal loss Ψw ) we have to decrease the boxcar
separation Ls .
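The interplay of B and L_s can be explored numerically. The brute-force sketch below (our own illustration; `rsnr_grid` is a hypothetical helper, not part of the paper's code) slides a unit-SNR rectangular pulse of width S across all offsets relative to a grid of boxcars of widths B placed every L_s samples, returning RSNR_min(S) and RSNR_max(S); the signal losses of equations (8) and (9) then follow as Ψ_w = 1 − RSNR_min and Ψ_l = 1 − RSNR_max.

```python
import numpy as np

def rsnr_grid(S, B, Ls):
    """RSNR_min(S) and RSNR_max(S) for a unit-SNR rectangular pulse of width S
    (amplitude A = 1/sqrt(S)), detected by boxcars of widths B spaced Ls apart."""
    A = 1.0 / np.sqrt(S)
    best_per_offset = []
    for p in range(Ls):                        # pulse occupies samples [p, p+S)
        best = 0.0
        for L in B:
            for t in range(0, p + S, Ls):      # boxcar occupies samples [t, t+L)
                overlap = min(t + L, p + S) - max(t, p)
                if overlap > 0:
                    best = max(best, A * overlap / np.sqrt(L))
        best_per_offset.append(best)
    return min(best_per_offset), max(best_per_offset)

# Perfectly matched case: no signal loss at all.
lo, hi = rsnr_grid(S=4, B=[4], Ls=1)
print(1 - hi, 1 - lo)   # systematic and worst-case signal loss: 0.0 0.0
```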

The SPD algorithm may consist of multiple iterations where both B and Ls may differ. We
assume that the time-series in which we want to detect pulses is completely covered by boxcar
filters of a given width L.

2.2.2 Importance of maximum and minimum RSNR


There are several reasons why RSNR_max/min(S) is important. Firstly, from the supremum and
infimum properties of RSNR_max/min we get RSNR_max(S) ≥ RSNR_min(S). That is, we cannot
increase RSNR_min(S) above RSNR_max(S). Since RSNR_min(S) is mainly improved by decreasing
the boxcar separation L_s, this means that decreasing L_s may yield diminishing returns. This is shown
in Figure 2.

[Figure 2: recovered SNR (and recovered SNR in % of SNR_T) versus signal width in samples, with curves for boxcar separations L_s = L, L/2, L/4, L/8 compared to RSNR_max]

Figure 2: Behavior of RSNR_min(S) using different boxcar separations L_s. This shows that
RSNR_min(S) is limited by RSNR_max(S). In order to increase RSNR_min(S) further, thus increasing
the sensitivity of an SPD algorithm, we have to increase RSNR_max(S) first.

Secondly, the RSNRmax/min(S) gives a hint at how we can increase or decrease the sensitivity
of an algorithm, but also how to trade sensitivity for computational performance.
Lastly, by knowing RSNR_max/min(S) for given SPD algorithms we can compare and rank
them. It can also serve as a verification tool to check whether an implementation of a given algorithm
works as expected. In this case the algorithm must not produce any RSNR value which would lie
outside the limits given by RSNRmax/min (S).

2.2.3 Error in detected pulse width and time


The error in the detected width of the pulse for the SPD algorithm can be expressed as

ε_W = |S_T − L| / S_T ,   (10)
where ST is the true pulse width and L is the boxcar width which has detected the pulse, i.e. the
detected width.

Similarly we can define the error in detected position in time relative to its true position (time
localization) as a fraction of its true width ST as

ε_t = |t_T − t| / S_T ,   (11)
where tT is the true pulse position in time and t is the boxcar time position.

3 Single pulse detection algorithms


The single pulse detection algorithm may be required to scan for very long pulse widths S_end with
a related maximum boxcar filter width L_end. This could lead to unnecessary precision and longer
execution time, as more boxcar filters than necessary would be calculated. This is why both of
our proposed SPD algorithms execute in multiple iterations with different parameters and inputs.
This allows us to control precision and increase performance. In order to decrease the sensitivity
of the SPD algorithm and to lower the amount of data that is processed, we use decimation in
time (by a factor D).
To distinguish quantities from different iterations we decorate them with an index i, starting
with i = 0. We assume for simplicity that the decimation factor D does not change during the
execution of the algorithm; D^(i) means D to the power of i, not an iteration index. Thus
we have:

• x^i is the input time-series for a given iteration (x^0 = x is the initial time-series)

• B^i = {L^i_1, L^i_2, …, L^i_m, …, L^i_max} is the set of boxcar widths used in the ith iteration

• X^i_L[n] is the value of the boxcar filter of width L at time sample n

• X^i[n] = X^i_{L_max}[n] is the value of the boxcar filter at the end of the iteration

• S_end is the maximum desired pulse width to be searched for

• L_end is the maximum boxcar width calculated by the SPD algorithm; L_end is the boxcar
  width of the nearest longer boxcar filter to the value of S_end

The output of the SPD algorithm is the highest RSNR value detected by the boxcar filters.
We have chosen to separate the output by iteration, that is the output for ith iteration is

Y^i[n] = max_{L∈B^i} SNR(X^i_L[n]) ,   (12)

where the SNR value is calculated using equation (4). In addition to the SNR value, we assume
that the SPD algorithm provides the width W^i of the boxcar which produced the highest SNR
value.
When designing an algorithm that is suitable for execution on GPUs, it is important to have
enough parallelism to ensure high utilization of the GPU hardware. Modern GPUs can process
many thousands of threads concurrently. The memory available for a single thread on the GPU
is extremely limited, either in the form of registers or the amount of shared memory available.
Therefore cooperation between threads within a single threadblock5 is important. The amount of
resources (for example the amount of local memory) consumed by a threadblock and how many
threadblocks can be executed concurrently then depends on the number of threads per threadblock.
5 A threadblock is a set of threads that can cooperate with each other.

3.1 BoxDIT algorithm
The algorithm we call BoxDIT is based on the ideal SPD algorithm. The ideal SPD algorithm
performs boxcar filters of all widths up to a maximum L_max at each element of the initial
time-series, and thus detects any rectangular pulse with S < L_max at its true SNR value. The
ideal SPD algorithm has poor performance due to its unnecessarily high sensitivity. The BoxDIT
algorithm trades some sensitivity for performance by introducing a decimation step which reduces
the number of boxcar filter widths performed and increases the boxcar separation L_s.

Algorithm 1: Pseudo-code for the BoxDIT algorithm. The algorithm performs the I iterations
required to reach the desired boxcar width L_end. The quantity X^i represents the values of boxcars
at the end of the boxcar filter step. These values are used to build up longer boxcar filters
in higher iterations. Function Boxcars() calculates boxcar filters up to L^i_max and function
Decimate() performs decimation in time.
Input: x^0;
Output: Y, W;
L^0_s = 1;
L^0_MAX = 0;
for i = 0 to I do
    Boxcar separation for this iteration;
    L^i_s = L^{i−1}_s · D = D^(i);
    Calculate boxcar filters on decimated data x^i;
    Boxcars(Y^i[n], W^i[n], X^{i+1}, x^i, X^i, L^i_max);
    Decimate data for the next iteration;
    Decimate(x^{i+1}, x^i);
    Width of the longest boxcar calculated in this iteration;
    L^i_MAX = L^{i−1}_MAX + D^(i) L^i_max;
end

The BoxDIT algorithm (algorithm 1) is a sequence of iterations, where each iteration has two
steps. The first step is to perform all boxcar filters up to Limax with respect to time-series xi , at
every point of the input data xi . From the perspective of the initial time-series x0 the algorithm
calculates boxcar filters with a width step D(i) at every D(i) point of the initial time-series. For
example, if Limax = 4 for all i then: boxcar filters of width L = 1, 2, 3, 4 are calculated at every
point; after decimation, boxcar filters of width L = 6, 8, 10, 12 are calculated at every second point
and so on. Thus, each subsequent iteration detects wider pulses in the input data. This step uses
the previously calculated boxcar filter data to calculate longer boxcar filters.
The second step is to decimate the time-series x^i in time by a factor D to create the time-series
x^{i+1} for the next iteration. This is repeated until L_end is reached. The decimation in time
used in this paper is given by

x^{i+1}[n − L^i_max/D] = Σ_{j=0}^{D−1} x^i[Dn + j] ,   (13)

for all n − L^i_max/D ≥ 0. That is, the 0th element of the decimated time-series is set to the
position of the next time sample which must be added to the partial sum containing the 0th
element of the initial time-series x^0. This is shown in Figure 3. Such a decimation has the
advantage that the maximum width L^i_max performed at any iteration of the BoxDIT algorithm
has to fulfil only the condition that it is divisible by D. During the decimation step we also reduce
the number of time samples to N_i = N_{i−1}/D = N_0/D^(i); this is important as the higher
iterations work with fewer points. Thus each BoxDIT iteration is characterized only by L^i_max
and D.
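A scalar sketch of the decimation step of equation (13) (our own illustration, assuming the shift convention described above; `decimate` is a hypothetical helper name):

```python
import numpy as np

def decimate(x, Lmax, D=2):
    """Decimation in time per eq. (13): sums of D consecutive samples, shifted
    so that output element 0 is the first partial sum not already covered by
    the longest boxcar (width Lmax) of the current iteration."""
    s = Lmax // D                                  # shift, in decimated samples
    n = len(x) // D - s                            # reduced number of samples
    return np.array([sum(x[D * (m + s) + j] for j in range(D)) for m in range(n)])

print(decimate(np.arange(8.0), Lmax=4))  # x[4]+x[5] and x[6]+x[7]
```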

[Figure 3: schematic of iterations i = 0 to 3 showing the decimated time-series and the accumulated maximum widths L^0_MAX = 4, L^1_MAX = 4 + 8 = 12, L^2_MAX = 4 + 8 + 16 = 28]

Figure 3: Decimations and maximum widths calculated at each iteration of the BoxDIT algorithm.
Here L^i_max = 4 and D = 2. In red, we see the parts of the time-series which are omitted by the
decimations in time, i.e. the total shift.

The width of the boxcar filter calculated by iteration i is given as

L^i_M = Σ_{k=0}^{i−1} D^(k) L^k_max + D^(i) L^i_m .   (14)

This can also be used to calculate the maximum boxcar width L^i_MAX for a given iteration i.
The value of the boxcar filter at time sample n for iteration i is given as

X_L[n] = X^{i−1}[n] + Σ_{j=0}^{L^i_m − 1} x^i[n^i_s + j] ,   (15)

where n^i_s = n/D^(i).


The advantage of the BoxDIT algorithm is that the sensitivity of the algorithm is easily adjusted
by changing the maximum boxcar width Limax calculated by the boxcar step of the algorithm. The
disadvantage of the BoxDIT algorithm is that it has higher memory bandwidth requirements since
in addition to the decimated input data it also needs the values of the longest boxcar filter at every
point, which in effect doubles the size of the input data.

3.1.1 BoxDIT GPU implementation


The GPU implementation of the BoxDIT algorithm performs both steps, calculation of boxcar
filters and decimation, in one GPU kernel. The implementation must be able to extend the already
calculated boxcar filters by using values from the previous iteration. Such an implementation can
be used for any iteration of the BoxDIT algorithm without change.
As input, we have the time-series x^i, the values of the longest boxcar filter from the previous
iteration X^{i−1}, and the maximum boxcar width L^i_max calculated in this iteration. As output
we expect the highest RSNR Y^i for every point of the input time-series, the associated boxcar
width W^i, the values of the longest boxcar calculated X^i, and the decimated time-series for the
next iteration x^{i+1}. We have used the decimation factor D = 2 for the GPU implementation of BoxDIT.
Calculating all boxcars up to a given maximum width L^i_max at a given point is equivalent to
performing a prefix sum, or scan. In this case, the scan has to be performed at every point of
the input time-series. We have found a standard prefix sum algorithm such as the Hillis-Steele
scan (Hillis and Steele [1986]) to be slow even when used on cached data. The reason is that a
series of independent scan operations does not cooperate by reusing data. Our implementation
focuses on increased cooperation between independent prefix sums within the GPU thread-block.
The pseudo-code for the parallel GPU implementation of the BoxDIT algorithm is given in
Algorithm 2 and a graphical sketch is presented in Figure 4. This figure shows how we divide work
among GPU threads and how threads cooperate on the calculation of independent prefix sums.
As a first step, each thread calculates a much shorter prefix sum of length w_m. That is, each
thread calculates partial sums B[n; w], where n is a time sample and w = 1, 2, …, w_m is the length

Algorithm 2: BoxDIT pseudo code.
Input: x^i, X^i, n;
Output: x^f, X^f, Y^f, Y^f_L;
shared float s_input[nThreads];
shared float s_SNR[nThreads];
shared int s_widths[nThreads];
float Bw[w_m], SNR, temp;
int w, width;
int t = threadIdx.x;
Calculate initial partial sums (boxcar filters) up to width w_m;
for k = 0 to w_m do
    Bw[k] = Σ_{j=0}^{k} x^i[n + t + j];
    Calculate SNR values corresponding to partial sums Bw;
    Calculate_SNR(X^i[n + t] + Bw, SNR, width);
    Compare SNR values and store the highest SNR and its width;
    Compare_SNR(SNR, width, s_input, s_widths);
end
Bw[w_m] is stored into shared memory;
s_input[t] = Bw[w_m];
for w = 4 to B^i_max do
    if t − w ≥ 0 then
        We are changing SNR values at position t − w, i.e. values Bw are used to build up
        higher boxcar widths at position t − w;
        temp = s_input[t − w];
        SNR = s_SNR[t − w];
        width = s_widths[t − w];
        for k = 0 to w_m do
            Bw[k] = Bw[k] + temp;
            Calculate_SNR(X^i[n + t − w] + Bw, SNR, width);
            Compare_SNR(SNR, width, s_input, s_widths);
        end
        s_SNR[t − w] = SNR;
        s_widths[t − w] = width;
    end
    w = w + 4;
end

[Figure 4: threads operating on consecutive samples; threads S_3, S_2, S_1 and A each hold partial sums B[n, w], and thread A combines them into B[n − w_m, w_m + w], B[n − 2w_m, 2w_m + w], and B[n − 3w_m, 3w_m + w]]

Figure 4: Graphical sketch of the BoxDIT algorithm. Each thread operates on a single sample;
threads A, S_1, S_2, and S_3 are emphasized. At the beginning, each thread calculates a small scan
of size w_m (here w_m = 4) and stores these values in the thread's registers. This is represented by
the rectangles of increasing size below every fourth thread. Each thread then uses sums calculated
by threads with smaller indices to calculate a scan of the required size. Thread A uses the largest
sums (w_m long) from threads S_1,2,3 to calculate prefix sums of lengths 8, 12, and 16.

of the partial sum in number of time samples. These partial sums B[n; w] are stored in the
GPU registers6 of a given thread and serve as accumulators used for the calculation of longer
partial sums. The value of the last partial sum B[n; w_m] is also stored in shared memory7,
because its value has to be distributed among other threads. The maximum SNR Y^i[n] for a
given time sample n and the boxcar width W^i[n] are also stored in shared memory as they
will be modified by other threads.
After this initial phase, each thread reads the partial sum B[n − w_m; w_m] from its appropriate
place in shared memory. Then, using the accumulators stored in its registers (partial sums
B[n; w], for w = 1, 2, …, w_m), each thread is able to produce prefix sums of length up to
w = 2w_m, that is, partial sums B[n; w] where w = 1, 2, …, 2w_m. These partial sums are
calculated as

B[n − w_m; w_m + w] = B[n − w_m; w_m] + B[n; w] .   (16)

The accumulators in the thread's registers are then updated to the new values representing partial
sums B[n − w_m; w_m + w]. The thread calculates an SNR value for each accumulator and compares
it to the highest detected SNR for the sample n − w_m. If the new SNR is higher than the old SNR,
the highest SNR Y^i[n] is updated together with the boxcar width W^i[n].
With each following iteration, each thread calculates longer partial sums for time samples
which are further from its starting time sample.
The number of iterations required to calculate all the required boxcars is J = L^i_max/w_m. The
6 GPU registers are the fastest type of GPU memory, but private to an individual thread.
7 Shared memory is a fast user managed cache on the GPU which all threads in a thread block can access.

algorithm used in our implementation can be expressed by the equation

X^i_{L^i_m}[n] = Σ_{k=0}^{j−1} B[n + k·w_m; w_m] + B[n + j·w_m; w] ,   (17)

where j = ceil(L^i_m/w_m).
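The reuse pattern of equation (17) can be checked with a scalar sketch (our own, not the GPU code): a prefix sum of length L starting at sample n is assembled from whole w_m-wide blocks plus one remaining partial block. Here we take j and w as the quotient and remainder of L/w_m, which reproduces the sum's structure; `scan_via_blocks` is a hypothetical helper name.

```python
import numpy as np

def scan_via_blocks(x, n, wm, L):
    """Sum of x[n : n+L] built, as in eq. (17), from j full blocks
    B[n + k*wm; wm] plus one partial block B[n + j*wm; w], L = j*wm + w."""
    j, w = divmod(L, wm)
    B = lambda start, width: x[start:start + width].sum()  # partial sum B[start; width]
    return sum(B(n + k * wm, wm) for k in range(j)) + B(n + j * wm, w)

x = np.arange(16.0)
assert scan_via_blocks(x, n=2, wm=4, L=6) == x[2:8].sum()  # one full block + w = 2
```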


At the last step of the algorithm, the maximum value of the SNR Y^i[n], the boxcar width
W^i[n], and the longest accumulated partial sum, representing the value of the boxcar filter of width
L_MAX[n], are stored into device memory.
The intermediate partial sums produced by the algorithm are not stored to device memory, as
they are needed only for finding the maximum SNR for every input sample.
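To make the interplay between registers and shared memory concrete, the following is a minimal serial sketch of one BoxDIT extension step. This is not the authors' CUDA code: the function name `boxdit_iteration` and the list-of-lists layout are illustrative, with each index n standing in for one thread that holds widths 1..wm in registers and reads its neighbour's B[n − wm; wm], as in equation (16).

```python
def boxdit_iteration(x, wm):
    """Boxcar sums B[n; w] for w = 1..2*wm, built as in equation (16)."""
    n_samples = len(x)
    B = [[0.0] * (2 * wm + 1) for _ in range(n_samples)]
    # initial phase: each "thread" n accumulates widths 1..wm in its registers
    for n in range(n_samples):
        acc = 0.0
        for w in range(1, wm + 1):
            if n + w - 1 < n_samples:
                acc += x[n + w - 1]
                B[n][w] = acc
    # extension phase: read the neighbour's B[n; wm] (shared memory on the
    # GPU) and extend: B[n - wm; wm + w] = B[n - wm; wm] + B[n; w]
    for n in range(wm, n_samples):
        for w in range(1, wm + 1):
            B[n - wm][wm + w] = B[n - wm][wm] + B[n][w]
    return B
```

On input x = 0, 1, 2, . . . with wm = 2, the extension produces, e.g., B[0; 4] = B[0; 2] + B[2; 2] without touching the original samples again.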

3.2 IGRID algorithm


Our second algorithm, which we call IGRID, is based on the decimation in time SPD algorithm
(DIT algorithm). The DIT algorithm has poor sensitivity, with an average signal loss of 20% and
a worst-case signal loss of 40%, but it has high performance. In terms of RSNR, both RSNRmax
and RSNRmin are low. The IGRID algorithm decreases signal loss at the expense of performance.
The DIT algorithm, shown in Algorithm 3, is a sequence of decimations in time applied repeatedly
to the input time-series x^0, in effect producing boxcar filters of width L^i = 2^i, where
i is the number of decimations performed so far. Therefore we have B = {1, 2, 4, . . . , 2^I}, where
I is the total number of iterations performed. The distance between these boxcar filters is the
same as the width, L^i_s = 2^i. Based on our analysis in section 2 we can say that large values of
Ls are mostly responsible for low RSNRmin values, while sparse coverage of the boxcar widths B is
responsible for low RSNRmax values. The IGRID algorithm corrects for these shortcomings.

Algorithm 3: Decimation in time (DIT) algorithm. Function Calculate SNR calculates the
SNR based on the boxcar filter value, the boxcar width, and the mean and standard deviation.
Function Compare SNR compares SNR values for the same sample and stores the highest SNR.

    Input: x^i, n;
    Output: Y[n], W[n];
    float SNR, temp;
    int width;
    int J — number of iterations performed by the kernel;
    SNR = 0;
    for j = 1 to J do
        Calculate SNR (x^j[n], SNR);
        width = 2^j;
        Compare SNR (SNR, width, Y[n], W[n]);
        x^{j+1}[n] = x^j[2n] + x^j[2n + 1];
    end
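As a hedged illustration, the DIT loop above can be modelled serially in Python. This is not the GPU kernel: the function name `dit_search` is illustrative, the best detection is mapped back to the start of each decimated bin, and the SNR uses the paper's white-noise approximation (a width-w sum of zero-mean unit-variance noise has standard deviation sqrt(w)).

```python
import math

def dit_search(x, iterations):
    """Return (best_snr, best_width) per original sample (coarse DIT grid)."""
    best = [(0.0, 0) for _ in x]
    level = list(x)
    for j in range(1, iterations + 1):
        width = 2 ** j
        # decimate in time: each output sample sums two input samples
        level = [level[2 * n] + level[2 * n + 1]
                 for n in range(len(level) // 2)]
        for n, value in enumerate(level):
            snr = value / math.sqrt(width)   # white-noise normalisation
            start = n * width                # position in the original series
            if snr > best[start][0]:
                best[start] = (snr, width)
    return best
```

A width-4 unit pulse aligned at sample 0 is best matched after two decimations, where the width-4 boxcar recovers SNR = 4/sqrt(4) = 2.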

One way of decreasing Ls is to perform an additional DIT operation on the input data x^i, producing
an additional time-series x^{i+1} which is shifted by one sample (2^i samples from the perspective
of the initial data). That is, for the initial time-series we would create one more time-series x^1_1[n] =
x^0[2n + 1] + x^0[2n + 2] in addition to the already existing time-series x^1_0[n] = x^0[2n] + x^0[2n + 1],
which results from the unshifted DIT operation. Let's call these decimated time-series
layers and decorate them with a subscript indicating their shift in samples with regard
to the initial time-series x^0. That is, x^i_h is a layer in iteration i with shift h. Let's also
introduce the IGRID step, which is a set of layers with different shifts but the same decimation.
In general, if we have a parent layer x^i_h which has been decimated i times, that is, each sample is a
sum of 2^i samples from the initial time-series x^0, we can use it to calculate two new child layers,
Figure 5: A graphical representation of our IGRID algorithm. The horizontal solid black lines
represent the time-series, where each slot is one time sample. The DIT algorithm is represented
by the time-series preceded by zero. Our IGRID algorithm mitigates the sensitivity loss of the DIT
algorithm by introducing additional layers of boxcar filters which have an additional time shift
associated with them.

one layer x^{i+1}_h with the same shift h as the parent's, and a second layer x^{i+1}_{h+d}, which is shifted by
d = 2^i initial time samples compared to the parent. This is shown in Figure 5, where we have:
• i = 0 layer: 0 → 0, 1
• i = 1 layer: 0 → 0, 2 and 1 → 1, 3
• i = 2 layer: 0 → 0, 4 and 2 → 2, 6; note that layers 1 and 3 are not needed for further
iterations
In this example, we have chosen not to use layers x^2_1 and x^2_3. This means that Ls increases, but we
do not need to keep these layers in memory and process them in further iterations.
Thus by using P layers we can decrease the distance between boxcar filters to Ls = 2^i/P. These
layers must be shifted by 2^i/P samples in order to have a constant step between each boxcar filter.
An uneven distribution of boxcar filters leads to increased signal loss in some parts of the input
time-series.
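The parent-to-child relation described above can be sketched as follows. This is an illustrative serial model, not the implementation: `child_layers` is a hypothetical name, and the parent is assumed to be any layer whose samples each sum 2^i original samples.

```python
def child_layers(parent):
    """Return the (same-shift, shifted) children of a decimated layer.

    same_shift[n] = parent[2n] + parent[2n+1]  (shift h, as the parent)
    shifted[n]    = parent[2n+1] + parent[2n+2] (shift h + 2^i)
    """
    half = (len(parent) - 1) // 2
    same_shift = [parent[2 * n] + parent[2 * n + 1] for n in range(half)]
    shifted = [parent[2 * n + 1] + parent[2 * n + 2] for n in range(half)]
    return same_shift, shifted
```

For the initial time-series (i = 0) this reproduces x^1_0 and x^1_1 as defined in the text.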
We can look at the algorithm differently using the time shifts alone, which now represent the
appropriate layers. This is shown in Figure 6. The layer dependencies (from Figure 5) have the
structure of a binary tree (for decimation factor D = 2). If we focus on the right branches (marked
in red) we see that after a few IGRID steps there are no layers dependent on layers located in
these right branches of the binary tree. If we are able to keep and process all layers from these
branches in local memory, then all we need to keep in device memory are the layers x^i_0.
The binary tree structure also hints at how we can deal with the second problem of the DIT
algorithm, which is the scarcity of the set of boxcar filter widths B. By decreasing Ls we increase
RSNRmin(S) but not RSNRmax(S), which could result in a situation where RSNRmin(S) cannot
increase further, since RSNRmin(S) ≤ RSNRmax(S). This is shown in Figure 2.
We can increase the number of calculated widths if we use boxcar filters from previous IGRID
steps and add them to boxcar filters from the current IGRID step. In other words, we traverse
the binary tree in an upward direction toward the root. The IGRID steps contain boxcar filters
of width L = 2^i, thus by using previous IGRID steps we can produce boxcar filters of width

    L = Σ_{k=0}^{d} a_k 2^{i−k} ,    (18)

Figure 6: Representation of the IGRID algorithm as a binary tree. Areas which are enclosed in
red contain layers which are not needed for the computation of the wider boxcar filters in the
following IGRID steps.


Figure 7: Effect of using the previous IGRID steps to increase the set of boxcar filter widths B
on RSNRmax (S).

where d is the depth, i.e. the number of previous IGRID steps used, and a_k = ±1 are variables
determining the width of the resulting boxcar filter. The values of a_k are further restricted to a_0 = 1,
for positive L, and a_1 = 1, since boxcars smaller than L = 2^i would have been calculated in the previous
IGRID step. This means that any given IGRID step, depending on the depth d, can produce any
boxcar width 2^i ≤ L < 2^{i+1}. An example is shown in Table 1 for a starting boxcar of width
L = 16. However, by increasing the depth d we increase the number of layers which the IGRID
algorithm needs to access and therefore share between IGRID steps. The effect on RSNRmax(S)
caused by increasing the set B is shown in Figure 7.

Table 1: Example of possible boxcar widths which could be calculated by traversing the binary
tree in an upward direction and adding partial sums from previous iterations using equation (18).

    Depth | Widths                          | Increment
    ------+---------------------------------+----------
      0   | 16                              |   —
      1   | 16, 24                          |  +8
      2   | 16, 20, 24, 28                  |  +4
      3   | 16, 18, 20, 22, 24, 26, 28, 30  |  +2
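The widths in Table 1 can be reproduced directly from equation (18). The sketch below is illustrative (the function names `widths_at_depth` and `reachable_widths` are not from the paper); it assumes, as the text states, a_0 = a_1 = 1 and free signs a_k = ±1 for k ≥ 2, and takes the union over all depths up to d, matching the rows of the table.

```python
from itertools import product

def widths_at_depth(i, d):
    """Widths from equation (18) at exactly depth d (a_0 = a_1 = 1)."""
    if d == 0:
        return {2 ** i}
    out = set()
    for signs in product((1, -1), repeat=d - 1):   # choices of a_2 .. a_d
        a = (1, 1) + signs
        out.add(sum(a[k] * 2 ** (i - k) for k in range(d + 1)))
    return out

def reachable_widths(i, max_depth):
    """Union over depths 0..max_depth, as listed in Table 1 for 2^i = 16."""
    out = set()
    for d in range(max_depth + 1):
        out |= widths_at_depth(i, d)
    return sorted(out)
```

For i = 4 (starting width 16) and depth 3 this yields every even width from 16 to 30, i.e. the 2^i ≤ L < 2^{i+1} coverage described above.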

The advantage of the IGRID algorithm is that it has low memory bandwidth requirements (for
lower precision), as all data required by the algorithm is contained in the decimated input data
x^i. The disadvantage of the IGRID algorithm is that for higher precision we have to use previous
IGRID steps, which substantially changes the character of the algorithm (more input data, more
complicated calculation of boxcar filters). This makes the IGRID algorithm hard to implement
in a flexible way.

3.2.1 IGRID GPU implementation


We have implemented the IGRID algorithm with three levels of precision, abbreviated
as IG(d), where d is the depth. We have used three depths, d = 1, 2, 3, corresponding to the IGRID
algorithms IG(1), IG(2), and IG(3). Each implementation of the IGRID algorithm also uses a
different number of layers per IGRID step: 4 layers for IG(1), 8 layers for IG(2), and 16
layers for IG(3). We have implemented three different precisions
because it allowed us to optimize each case and get the highest performance.
The main problem of the IGRID algorithm, especially when it is performed with a higher depth
parameter, is the amount of data which needs to be shared between IGRID steps. For IG(1) we
do not need any additional data, since the input layer x^{i−1}_h can also be used to calculate the wider
boxcar filters. However, for IG(2) we need, in addition to the input layer x^{i−1}_h, also its parent x^{i−2}_h,
and finally for IG(3) we need the parents of the parents (x^{i−3}_h). This means that the size of the
input data for IG(2) is twice as big as for IG(1), and the size of the input data for IG(3) is four
times bigger than that of IG(1). All of this data needs to be read and written by each IGRID
step. This would slow down the code considerably and make it memory bandwidth bound.
In our implementation of the IGRID algorithm we have chosen to process more than one
IGRID step per execution of the GPU kernel. Algorithm 4 describes how the IGRID steps
are split into blocks. Processing more IGRID steps per GPU kernel is beneficial in a number
of ways. If we perform more IGRID steps together we not only remove device memory
accesses between individual IGRID steps, but we also reduce the amount of data shared between
consecutive GPU kernel executions, since each IGRID step halves the amount of data due
to the decimation in time. The data shared between IGRID steps within one GPU kernel execution
is stored in GPU shared memory.
The second benefit comes when we process the whole right branches marked in Figure
6. As discussed above, each right branch depends only on the layer with zero shift, but in order to
calculate a whole IGRID step we might need layers from different branches. This is demonstrated
by IGRID step two in Figure 6. We see that, to calculate a whole step, we need layers
with shifts h = {0, 2, 1, 3}, but the layers with shifts {2} and {1, 3} are from different branches. If we
did not calculate the entire right branch within one execution we would have to save all or some
of the intermediate data to device memory.
Therefore we have chosen to always calculate whole sub-sections of the binary tree, starting at
the layer with zero shift and going down to a chosen depth, even if it means calculating parts of

Algorithm 4: Pseudo-code for the IGRID SPD algorithm. Variable I is the number of IGRID
steps required for the single pulse search, which is calculated from the maximum boxcar width Lend,
which is user defined; SpK is the number of IGRID steps performed by the GPU kernel; and
A is the number of layers used.

    calculate number of kernel executions K = I/SpK;
    loop through kernel executions;
    for k = 0 to K do
        GPU Kernel begin
            loop through IGRID steps performed by kernel;
            for s = 0 to SpK do
                loop through A layers;
                for a = 0 to A do
                    calculations of layers;
                end
            end
        end
    end

the tree which are later discarded. Recalculating parts of the binary tree allows us to save device
memory bandwidth and increases performance in the process. This is shown in Figure 8. The
other, indirect benefit is that we can achieve higher sensitivity, since we calculate parts of the tree
which would not otherwise be calculated.
The calculation of the binary tree itself inside the GPU kernel exploits the property of decimation
in time, where the output decimated time-series is half the size of the input. After decimation
we are able to fit two decimated layers into the memory space instead of one. The process of
calculating these two child layers from the parent layer can be performed in one operation. If we
use a running sum of size two (a boxcar filter of width two) we can calculate the two child layers
in-place, but they will be mixed together, one in the even samples, the other in the odd samples. Figure 9
depicts how this calculation of a sub-section of the binary tree is performed. As we perform more
decimations we can fit more layers into the same memory space, but the individual layers also become
more intertwined, as we see in Figure 9 for later IGRID steps.
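The in-place trick described above can be sketched serially. This is an illustrative stand-in for the parallel shared-memory kernel (the name `decimate_in_place` is hypothetical): a width-two running sum over the parent buffer produces both child layers interleaved in the same memory space, the same-shift child in the even samples and the shifted child in the odd samples.

```python
def decimate_in_place(buf):
    """Apply a width-two boxcar at every position of the parent layer.

    Afterwards out[2n] holds the same-shift child sample and out[2n+1]
    the child sample shifted by one parent sample.
    """
    out = buf[:]                      # serial stand-in for parallel threads
    for n in range(len(buf) - 1):
        out[n] = buf[n] + buf[n + 1]  # running sum of size two
    return out
```

A usage example: for the parent [1, 2, 3, 4, 5] the even slots give the same-shift child [3, 7] and the odd slots the shifted child [5, 9], so no extra buffer is needed for the second layer.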
The GPU kernel uses shared memory to calculate these decimations. A thread which performs
the boxcar filter of width two has to add samples with an increasing stride in order to use samples
from the correct layer. Each thread always writes into the same memory address. Thus during the
calculation a thread processes time samples from different layers, as shown in Figure 9, where,
for example, the 7th thread calculates the 8th time sample from the zero-shifted layer, then the 4th
time sample from the layer which is shifted by one sample, and so on. Accesses to shared memory
are without bank conflicts. The algorithm for IG(2) is shown in Algorithm 5.

3.3 Computational and memory complexity


The number of operations and the number of memory accesses for one iteration of both proposed
algorithms are summarised in Table 2.

4 Results
In this section we present a selection of representative configurations for each algorithm and
compare them based on performance and signal loss. Implementations presented here assume
fp32 floating point numbers for the input.
First we present signal loss for each algorithm and configuration for the idealized signal model
(sec. 2.1) using a rectangular pulse. We verify the correctness of our implementation of both

Algorithm 5: The IGRID IG(2) algorithm. Function Calculate SNR calculates SNR and
Compare SNR compares SNR for the same sample.

    Input: x^i_h, x^{i−1}_h, n, i;
    Output: x^{i+1}_h, x^{i+1}_{h+2^i}, Y^i, Y^i_L;
    B0 and Bd represent boxcar filters;
    float SNR, B0, Bd;
    int width;
    calculate two child layers;
    d = 2^i;
    width = 2d;
    B0 = x^i_h[2n] + x^i_h[2n + 1];
    Bd = x^i_h[2n + 1] + x^i_h[2n + 2];
    highest SNR is stored in initial data coordinates;
    SNR = Calculate SNR (B0, width);
    Compare SNR (SNR, width, Y^i[2dn], Y^i_L[2dn]);
    SNR = Calculate SNR (Bd, width);
    Compare SNR (SNR, width, Y^i[2dn + d], Y^i_L[2dn + d]);
    storing to memory;
    x^{i+1}_h[n] = B0;
    x^{i+1}_{h+d}[n] = Bd;
    traversing binary tree: first depth;
    width = width + d;
    B0 = B0 + x^i_h[2n + 2];
    Bd = Bd + x^i_h[2n + 3];
    Calculate SNR (. . .);
    Compare SNR (. . .);
    traversing binary tree: second depth;
    width = width + d;
    B0 = B0 + x^i_h[2n + 2] + x^{i−1}_h[4n + 6];
    B0 = B0 + x^{i−1}_h[4n + 4];
    Bd = Bd + x^i_h[2n + 3] + x^{i−1}_h[4n + 8];
    Bd = Bd + x^{i−1}_h[4n + 6];
    Calculate SNR (. . .);
    Compare SNR (. . .);
Figure 8: Representation of how IGRID kernels compute layers and progress to the next iteration.
Black circles represent layers which we must calculate in order to reach the desired sensitivity. Red
circles show layers which are calculated but not written to device memory (and thus recalculated later
by the next iteration), and green circles are layers which are calculated but not needed; these are
used to increase sensitivity.


Figure 9: Visualization of the IGRID GPU kernel. Iterations of the IGRID algorithm are separated
by thick horizontal dashed lines. The numbers show to which layer a given time sample belongs.
The coloured numbers use the same colour coding as in Figure 8: black shows a layer we have
to calculate, red a time sample from a layer which is calculated but not used, and green a layer
which is calculated but is not necessary to achieve the desired sensitivity.

algorithms using the idealized signal model by comparing the measured signal losses Ψ^m_l and Ψ^m_w
(RSNRmax, RSNRmin) to the predicted values of the signal loss Ψl and Ψw. We also present the error
in the detected width of the pulse and the error in the detected position in time relative to its
true position (time localization), which are introduced by each algorithm.
Next we present RSNR and error in pulse width and time localization for the idealized sig-
nal model with Gaussian and double Gaussian pulse profiles. These pulse shapes are a better
approximation of the pulses encountered in real observations.
The recovered SNR and detected width for rectangular, Gaussian and double Gaussian pulse
shapes embedded in white noise are presented afterwards. For these results we have
used pulses of SNR = {8, 5, 3}.

Table 2: Number of operations and number of memory accesses of selected SPD algorithms, where
B is the number of boxcar filters performed in one iteration.

    Algorithm            | # floating point operations | # global memory accesses
    ---------------------+-----------------------------+-------------------------
    Ideal boxcar filter  | N^2                         | N^2
    Decimation in time   | N log2(N)                   | 4N log2(N)
    BoxDIT               | BN log2(N/B + 1)            | 6N log2(N/B + 1)
    IGRID (IG(1))        | N log2(N)                   | 4N log2(N)

Following this we present the performance of our algorithms on two generations of NVIDIA
GPUs, the P100 and the Titan V. The performance of the algorithm is measured in the number of
dedispersion trials8 (DM trials) which could be performed in real-time for sampling time ts = 64µs.
We also present the dependency of this number on the maximum width performed by the algorithm.
Lastly we present a comparison of AstroAccelerate with Heimdall and its single pulse search
algorithm. We also demonstrate how we calculate the mean and standard deviation which are then
used for the calculation of the recovered SNR.
The results presented here are for the three variants of the IGRID algorithm which we have described
at the beginning of section 3.2.1. To match these three IGRID configurations we have chosen three
configurations of the BoxDIT algorithm, each with a different ratio of sensitivity to performance. We
can change the sensitivity of the BoxDIT algorithm by changing the value of the maximum boxcar
width calculated in the boxcar filter step of the algorithm. These configurations are: BD-32, which
calculates a maximum width of L^i_max = 32 between decimations for any i; BD-16, with a maximum
width of L^i_max = 16; and BD-8, which uses a maximum width of L^i_max = 8.

4.1 Measurement of signal loss


To measure the signal loss of a given SPD algorithm we have to find the RSNR produced by the
algorithm and then calculate signal loss using equation (7). The value of RSNR depends not only
on the SNR of the injected pulse but also on its position within the time-series. This is because to
localize the pulse and sum the power it contains, we use an incomplete set of boxcar filters. That
is, the boxcar filters do not cover every possible pulse width at every possible time sample. Lower
RSNR occurs when the injected pulse is not properly matched by these filters.
Therefore, to accurately measure RSNRmax/min we have to create data where a pulse of a given
width is present in all relevant positions in time. By sliding the pulse along its whole width
in time we can evaluate all of these relevant positions for a pulse of a given width. The test data
used for the measurement of RSNRmax/min consist of a set of time-series where each pulse
shift is stored in its own time-series.
The test data are then processed by the SPD algorithm. From each time-series we select the
highest RSNR and from these we select the maximum, which is the RSNRmax , and the minimum
which is the RSNRmin for a given pulse width.
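The measurement procedure above can be sketched with a brute-force reference search. This is not the GPU code and the names (`measure_rsnr`, the width set, the sample count) are illustrative: a rectangular pulse of width S, normalised to SNR_T under the white-noise approximation, is slid through all S phase offsets, and for each offset the best boxcar response over a given (possibly incomplete) set of widths is recorded.

```python
import math

def measure_rsnr(S, widths, snr_true=16.0, n_samples=256):
    """Return (RSNR_max, RSNR_min) over all phase shifts of a width-S pulse."""
    amplitude = snr_true / math.sqrt(S)   # so a width-S boxcar gives SNR_T
    per_shift_best = []
    for shift in range(S):                # slide the pulse along its width
        x = [0.0] * n_samples
        for t in range(S):
            x[shift + t] = amplitude
        best = 0.0
        for w in widths:
            for n in range(n_samples - w):
                snr = sum(x[n:n + w]) / math.sqrt(w)  # white-noise sigma
                best = max(best, snr)
        per_shift_best.append(best)
    return max(per_shift_best), min(per_shift_best)
```

With a complete set of widths the pulse is always matched exactly and RSNR_max = RSNR_min = SNR_T; restricting the width set, as the lossy algorithms do, lowers the recovered values.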
To evaluate the value of the mean and standard deviation for longer boxcar filters we have
used the white noise approximation as discussed in section 2.1 and shown in equation (5).
When we measure RSNRmax and RSNRmin using our idealised signal model, the pulse will,
as it slides along its width, be detected by boxcar filters placed at different times. Since we select
the first maximum (minimum) we introduce a bias into our measurement of time localization,
meaning that pulses are detected earlier than their true position in time.
When Gaussian noise is present in the test data we must modify how we measure RSNRmax/min.
Since we select the highest SNR for each pulse position, this would favour pulses which are enhanced
by the addition of the noise. Hence, when noise is present we use an averaged RSNR from multiple
measurements of the pulse's RSNR at the same position. This way we can mitigate the effect of
8 A DM trial is a time-series which is the output of the de-dispersion transform algorithm; it corrects for the
effects of the interstellar medium.

the noise to some degree. Measurements made with noise will always have a tendency toward higher
values of RSNRmax/min, because we select the maximum SNR. For the SNR calculation we have
also used the white noise approximation (5).

4.2 Signal loss for idealized signal


The measured RSNRmax and RSNRmin, together with the measured signal loss (eq. (7)), the error in
detected width (eq. (10)) and the error in time localization (eq. (11)) of the rectangular pulse, using
our idealized signal model, are shown for our chosen configurations of the BoxDIT algorithm in
Figure 10, and for the IGRID algorithm in Figure 11. Both errors are expressed as a percentage
of the true pulse width.
The errors in time localization shown in Figures 10 and 11 are biased toward earlier pulse
detection, as discussed above.
The comparison to the predicted signal loss, calculated by equations (24) and (31), is
shown in Figures 10 and 11 as grey and black lines. The measured systematic signal loss Ψ^m_l must
be higher than or equal to the predicted Ψl, while the measured worst signal loss Ψ^m_w must be lower than or
equal to the predicted worst signal loss Ψw for a given pulse width. In essence, any measured value
of the RSNR must lie between the values for Ψl and Ψw.
Lastly, in Figures 10 and 11, we also present the cumulative average signal loss ⟨ΨX(S0)⟩ up to
a given pulse width S0, which is calculated as

    ⟨ΨX(S0)⟩ = ( Σ_{s=1}^{S0} ΨX(s) ) / (S0 − 1) .    (19)
This represents the mean signal loss ⟨ΨX(S0)⟩ one can expect for all pulses with width S < S0.
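Equation (19) is a running average and can be transcribed directly. The sketch below is illustrative (`cumulative_average_loss` is a hypothetical name); as an assumption not stated in the text, the S0 = 1 case returns the bare sum to avoid dividing by S0 − 1 = 0.

```python
def cumulative_average_loss(psi):
    """Cumulative average of equation (19); psi[s-1] holds Psi_X(s)."""
    out = []
    running = 0.0
    for s0, value in enumerate(psi, start=1):
        running += value
        # <Psi_X(S0)> = sum_{s=1}^{S0} Psi_X(s) / (S0 - 1)
        out.append(running / (s0 - 1) if s0 > 1 else running)
    return out
```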

4.2.1 Discussion
The averaged signal loss for our chosen configurations of the BoxDIT algorithm (Fig. 10) ranges
from 1% for configuration BD-32 to 3% for configuration BD-8. For IGRID the averaged signal
loss starts at 2% for configuration IG(3) and increases to 12.5% for IG(1). The sudden decreases
in RSNR (increases in signal loss), which occur at different places for different configurations, are
caused by the decimation in time. The decimation in time increases the boxcar separation Ls, which in
the case of width and time localization determines the time resolution we are able to achieve. The
boxcar separation represents the minimum increment in the measured width and the index of the
time sample. Thus it is more significant for shorter pulse widths as it is a proportionally bigger
part of the pulse’s width. This is why the error in the width and time index decreases with pulse
width.
Our IGRID algorithm behaves in a similar way (Figure 11). The IGRID algorithm has, in
general, a higher signal loss and a higher error in width and time localization than our BoxDIT
algorithm. This is because IGRID places boxcar filters more sparsely than BoxDIT.
A comparison of the measured signal loss values Ψ^m_l and Ψ^m_w to the predicted values of the
signal loss Ψl and Ψw for both algorithms is shown on the left in Figure 10 for the BoxDIT algorithm
and in Figure 11 for the IGRID algorithm. Both algorithms give the same values for Ψl as predicted.
However, both algorithms have a better worst signal loss Ψw than is predicted by equation (31).
This can be explained in each case. In the case of the IGRID algorithm this is because we calculate
more layers than are necessary (section 3.2.1), which results in a smaller signal loss for some pulse
widths. For the BoxDIT algorithm the higher RSNRmin is caused by boxcar filters of a shorter
width which cover the pulse completely (best case); these have a higher RSNR, and thus a lower signal loss,
than the boxcar filters considered in our sensitivity analysis for the worst case.

4.3 Other pulse shapes


The profile of a real signal is usually far from rectangular, which is why we have included results
from studies of two other pulse profiles: a Gaussian shape and a combination of two Gaussians. We

Figure 10: BoxDIT: Measured signal loss Ψl and Ψw and the associated RSNRmax/min (left), and the
error in measured width and the time difference relative to the true position of the pulse, both as
a percentage of the real pulse width (right). These results are for our idealized signal model for
three different configurations of our BoxDIT algorithm. A positive value of the error for the detected
width indicates that the detected width is longer than the real width; a negative value indicates
that it is shorter than the real width. For localization in time, negative values indicate that the
pulse was detected at an earlier time than the actual time at which the pulse occurred; a positive
value indicates that the pulse was detected later. Together with the measured signal loss (red and
blue) we also show the predicted values of the systematic signal loss Ψl in grey (located under the
red line) and the worst signal loss Ψw in black. Lastly we also show the cumulative average signal
loss in violet and orange.


Figure 11: IGRID: Measured signal loss Ψl and Ψw and associated RSNRmax/min (left) and error
in measured width (as a percentage of the real pulse width), time difference relative to the true
position of the pulse (as a percentage of the real pulse width) (right). These results are for our
idealized signal model for three different configurations of our IGRID algorithm. A positive value
of the error for the detected width indicates that the detected width is longer than the real width,
a negative value indicates that it is shorter than the real width. For localization in time, negative
values indicate that the pulse was detected at earlier time than the actual time at which the pulse
occurred, a positive value indicates that the pulse was detected later. Together with measured
signal loss (red and blue) we also show the predicted values of systematic signal loss Ψl in grey
(located under red line) and worst signal loss Ψw in black. Lastly we also show the cumulative
average signal loss in violet and orange.

have measured the RSNR, the error in detected width and the error in time localization using the
idealized signal model. For these alternative pulse shapes we cannot use equation (7) to calculate the
signal loss, because these pulses have a non-uniform distribution of power between the samples of the pulse.
Although the pulse is still normalized to the constant SNRT, the non-uniform distribution means
we can sum most of the pulse power with much shorter boxcar filters, which overestimates the SNR of
the pulse, resulting in an RSNR higher than SNRT.
These pulse shapes are inserted in a similar fashion to the rectangular pulse shape. The input
time-series is zeroed (all elements are set to zero) and the pulse is then injected at the required
position.


Figure 12: The three different pulse shapes used in this article. The pulses are shown with
resolutions of 5, 10 and 20 samples. The shaded region in each graph shows the samples of the
pulse which were summed by the boxcar filter which produces the highest RSNR. The number of
samples covered by the shaded area is given in the top right corner.

To produce these pulses we have sampled the normal distribution given by

    f(x) = (1/√(2πσ²)) exp( −(x − µ)² / (2σ²) ) ,    (20)

where the mean is set to µ = S/2 and the variance, σ², depends on the shape type and the width
of the pulse. For a Gaussian-like shape we have used σ² = S/(8 ln 2) Brandt [2014], and for the
double Gaussian profile we have used σ² = (S/4)/(8 ln 2) and σ² = (S/5)/(8 ln 2). The Gaussian
profiles are then sampled at discrete intervals to produce values which are then injected into the
idealized dataset. These pulses are shown in Figure 12. The signal is normalised such that when
all its samples are summed together, i.e. a boxcar of the signal width is applied to it, it will produce
RSNR = C = 16.
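The pulse injection above can be sketched as follows. This is illustrative (the name `gaussian_pulse` is hypothetical): it samples equation (20) at integer positions with µ = S/2 and the single-Gaussian variance σ² = S/(8 ln 2) quoted in the text, then rescales so that a boxcar of the full signal width recovers RSNR = C under the white-noise approximation (sigma of a width-S sum being √S).

```python
import math

def gaussian_pulse(S, C=16.0):
    """Width-S Gaussian pulse normalised so sum(pulse)/sqrt(S) == C."""
    mu = S / 2.0
    var = S / (8.0 * math.log(2.0))   # variance used in the text
    samples = [math.exp(-((x - mu) ** 2) / (2.0 * var)) for x in range(S)]
    scale = C * math.sqrt(S) / sum(samples)   # enforce the normalisation
    return [s * scale for s in samples]
```

Because the power is concentrated near µ, a boxcar shorter than S already sums most of it, which is exactly why the recovered SNR can exceed SNR_T for these shapes.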
The results for the BoxDIT algorithm in the most sensitive configuration BD-32 are presented
in Figure 13. We present results for the fastest configuration of our IGRID algorithm (IG(1))
in Figure 14. We also show an alternative figure of merit to the signal loss in the form of the
recovered fraction of a signal’s power in Figure 15 for both algorithms.


Figure 13: Measured RSNR values and error in pulse width and time localization for different
pulse shapes for BoxDIT BD-32 configuration.

Figure 14: Measured RSNR values and error in pulse width and time localization for different
pulse shapes for IGRID IG(1) configuration.

[Figure: recovered fraction of signal power (RSNRmax, RSNRmin) versus pulse width [samples], for Gaussian and double Gaussian pulses.]

Figure 15: Fraction of power recovered by both algorithms: the BoxDIT algorithm in configuration
BD-32 (left) and the IGRID algorithm in configuration IG(1) (right).

4.3.1 Discussion
Figures 13 and 14 show that the RSNR for pulse shapes other than rectangular is higher than
the maximum for the rectangular pulse. This is due to the non-uniform distribution of power
throughout these pulse shapes, where the power is concentrated into fewer samples. This means
that a shorter boxcar filter can sum most of the power contained in the pulse, which results
in a higher RSNR. The fluctuations in RSNR at short widths are caused by the poor discrete
representation of the pulse shapes, shown in Figure 12.
The error in measured width and position in time is presented in Figures 13 and 14. We see
that boxcar filters systematically under-estimate the pulse width when applied to a non-rectangular
pulse. The error in the measured width is consistently about 40% of the true pulse width for the
single Gaussian and about 30% for the double Gaussian. These findings are consistent with the fact
that Gaussian-like pulse profiles tend to be detected with boxcar filters shorter than their true
width, due to the non-uniform distribution of power in the pulse.
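This width under-estimate can be reproduced with a brute-force boxcar search; the sketch below (our own illustrative code, assuming unit-variance noise and the sum/√S normalisation) finds the boxcar width and position with the highest RSNR for a discretely sampled Gaussian pulse:

```python
import numpy as np

def best_boxcar(x):
    """Apply every boxcar width and position to x and return the
    (RSNR, width, start) triple with the highest RSNR (sigma = 1)."""
    best = (-np.inf, 0, 0)
    for w in range(1, len(x) + 1):
        sums = np.convolve(x, np.ones(w), mode='valid')
        rsnr = sums / np.sqrt(w)          # boxcar SNR normalisation
        i = int(np.argmax(rsnr))
        if rsnr[i] > best[0]:
            best = (float(rsnr[i]), w, i)
    return best

# Gaussian pulse of true width S = 20, normalised to RSNR = 16 at width S
S = 20
t = np.arange(S)
var = S / (8.0 * np.log(2.0))
p = np.exp(-(t - S / 2.0) ** 2 / (2.0 * var))
p *= 16.0 * np.sqrt(S) / p.sum()

rsnr, w, start = best_boxcar(p)
# the winning boxcar is shorter than 20 samples and yields RSNR > 16
```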
The error in time localization is also presented in Figures 13 and 14. The time delay in detection is
again expressed as a percentage of the true pulse width. Positive values indicate that the pulse is
detected later in time, while negative values indicate that the pulse was detected earlier in time.
These results show that Gaussian pulses are detected later than their true position in time. This
is also shown in Figure 12, which highlights the portion of the pulse which results in the highest
RSNR detection. If we consider a Gaussian pulse that is 20 samples wide, we see that it is detected
by a boxcar filter with a width of 11 samples, centred about the peak. This means that the pulse is
detected with a time delay of 4 or 5 samples, which represents a 20% or 25% time delay relative
to the true pulse width of 20 samples.
The power recovered by the BoxDIT (BD-32) and IGRID (IG(1)) algorithms for pulses of
Gaussian-like shapes is presented in Figure 15. We see that, to produce the highest RSNR, the
algorithm sums on average 85% of the signal's power for the Gaussian pulse profile and 94% for the
double Gaussian pulse profile. These results show that boxcar filters which sum all the power
contained within the pulse have a lower RSNR and thus are not selected as candidates.
The noticeable increase in the width error visible in Figure 14 for the double Gaussian pulse
profile, as well as the drop in the fraction of recovered power in Figure 15, occurs because the IGRID
algorithm (IG(1)) detects only the first of the two Gaussians present in the pulse. The error
in time localization, which is unchanged, indicates that it is the first Gaussian that is detected.
For other configurations of our IGRID algorithm this was not observed, because those
configurations have a denser set B as well as a shorter boxcar separation Ls and are able to
better match the entire pulse.

4.4 Time-series with white noise


To verify the results of our SPD algorithms when there is noise present in the input data we have
inserted Gaussian noise (mean µ = 0, standard deviation σ = 1) into the idealized signal and
measured RSNRmax/min. To measure RSNRmax/min we have used the method described in section
4.1. The measured values of RSNRmax/min for BoxDIT in configuration BD-32 and IGRID in
configuration IG(1) are presented in Figure 16. The injected pulses have SNR=16. The effect of
the noise on the measured RSNRmax/min is higher for pulses with lower SNR, as expected.
We have also used time-series with white noise to test SNR extraction. Examples for each
pulse shape are presented in Figure 17 for the rectangular pulse, and in Figure 18 for the Gaussian
and double Gaussian pulse. The pulses have SNR = {8, 5, 3}, width S = 20 samples and they
start at sample 100. For the rectangular pulse, the detection should occur at sample 110 by a
boxcar filter of width 20 samples. For other pulse shapes the boxcar width is shorter as shown in
Figure 12. The addition of the white noise means that the single pulses shown in these figures do not
represent the overall sensitivity of our algorithms.
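The injection itself can be sketched as follows (a rectangular pulse in unit-variance Gaussian noise, with the matched boxcar evaluated at the known position; illustrative code, not the test harness used in the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

S, C = 20, 8.0                       # pulse width [samples] and injected SNR
x = rng.normal(0.0, 1.0, 1024)       # white noise, mu = 0, sigma = 1

# rectangular pulse starting at sample 100: an amplitude of C/sqrt(S) per
# sample makes a width-S boxcar recover SNR = C in the noiseless case
x[100:100 + S] += C / np.sqrt(S)

# matched boxcar at the true position (mean 0, sigma 1 assumed known)
rsnr = x[100:100 + S].sum() / np.sqrt(S)
print(rsnr)                          # close to 8, shifted by the noise realisation
```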

4.4.1 Discussion
The RSNR reported by our BoxDIT algorithm in configuration BD-32 and by our IGRID IG(1)
algorithm are presented in Figure 16. The introduction of Gaussian white noise into our idealized
model with a rectangular pulse had only a marginal effect on the RSNR detected by both algorithms.
This is, for the most part, due to the initial SNR of the injected pulse, which was SNRT = 16.
We also see that the RSNR produced by both SPD algorithms fluctuates about RSNRmin and often goes

[Figure: recovered SNR versus pulse width [samples] for rectangular, Gaussian, and double Gaussian pulses; ideal RSNRmax/min compared with RSNRmax/min in white noise.]

Figure 16: Comparison of RSNRmax/min for our idealized signal without noise (gray and black)
with RSNRmax/min for signals with Gaussian white noise (red and blue).
[Figure: RSNR and detected boxcar width [samples] versus time sample, for injected pulses with SNR = 8, 5, and 3.]

Figure 17: Results for rectangular pulse detections with different SNR embedded in white
noise. The left graph shows an injected rectangular pulse with different initial SNR. It is split to show
the RSNR values on the left and boxcar widths on the right. The points are colour coded by the
RSNR. There may be more than one SNR value per sample because the SPD algorithm runs in
multiple iterations and the results of all iterations are shown. The same results are shown from a wider
perspective in the graph on the right-hand side; we see that the SNR=3 pulse would be hard
to distinguish from background noise. This result is for the BoxDIT algorithm with maximum search
width w = 8000 samples.

below RSNRmin. These variations are caused by fluctuations in the detection boxcar's width and
its position in time.
Studying other pulse shapes, we see that the RSNR reported by BoxDIT is higher than in the
case without noise. This is most visible for the Gaussian pulse shape. This again is due to the
pulse shapes having a non-uniform distribution of power in their profile, hence they are detected
with boxcar filters shorter than the actual pulse width. This makes it easier for the added noise to
change the behaviour of the SPD algorithm by changing the distribution of power throughout the
pulse. For example, by decreasing the peak of the Gaussian pulse shape, the boxcar which detects
the pulse will be wider, as it needs to accumulate power from more samples, and in combination with
our normalization this will decrease the RSNR closer to 16. On the other hand, by slightly increasing
the peak of the pulse profile, the pulse will be detected by a shorter boxcar, producing a higher RSNR.
We see that in the case of the rectangular pulse (fig. 17) the detection occurs at the correct
position in time for SNR=8 and SNR=5. For SNR=3 the highest RSNR is detected
at time sample 116, but the detected width underestimates the actual pulse width. This detection
is mostly due to noise. The pulse itself is detected correctly at time sample 110 with width 20,

[Figure: RSNR and detected boxcar width [samples] versus time sample, for Gaussian (left) and double Gaussian (right) pulses with SNR = 8, 5, and 3.]

Figure 18: Results for detections of the Gaussian pulse with different initial SNR embedded
in white noise (left set of graphs), and for the double Gaussian (right set of graphs).
Each graph shows RSNR on the left and detected widths on the right. The points are colour coded
by the RSNR. Results are for the BoxDIT algorithm with maximum search width w = 8000 samples.

but it has a lower RSNR. With decreasing pulse SNR the noise has a stronger effect on the
detected width and position of the pulse, as would be expected. Figure 17 also shows the same
data from a wider perspective, with all candidates above a given threshold. For SNR=3 we see that
the injected pulse is detected correctly, however it is hard to distinguish from background noise.

4.5 Performance
We have evaluated the performance of our algorithms and their implementation in different con-
figurations by measuring the execution time. As a derived metric we have calculated the number
of DM trials that can be searched in real-time using a sampling time ts = 64µs. Processing data in
real-time means performing all of the required operations on the data faster than we acquire the
data. The number of DM trials processed in real-time, NDM, is calculated as the ratio of the
number of time-samples processed per second by the SPD algorithm, NP, to the number of acquired
time-samples per second, NS = 1/ts. That is

N_{DM} = \frac{N_P}{N_S} . \qquad (21)
The number of DM trials processed in real-time depends on the telescope sampling time; for a
sampling time other than ts = 64µs, the number of DM trials searched in real-time will
be different. For example, for ts = 128µs the number of DM trials processed in real-time will be
multiplied by a factor of two: N_{DM}^{(128)} = 2 N_{DM}^{(64)}.
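Equation (21) is straightforward to evaluate; the throughput figure below is purely illustrative, not a measured value:

```python
def dm_trials_realtime(n_processed_per_s, t_sampling_s):
    """Eq. (21): N_DM = N_P / N_S, with N_S = 1 / t_s."""
    n_acquired_per_s = 1.0 / t_sampling_s
    return n_processed_per_s / n_acquired_per_s

n_p = 5e10   # hypothetical SPD throughput in time-samples per second

# doubling the sampling time doubles the number of real-time DM trials
assert dm_trials_realtime(n_p, 128e-6) == 2 * dm_trials_realtime(n_p, 64e-6)
print(dm_trials_realtime(n_p, 64e-6))   # 3.2 million DM trials in real-time
```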
The number of DM trials searched in real-time also depends on the maximum width of the
boxcar filter Lend used for detection of the pulses. In the following results we have used Lend = 8192
samples, which for a sampling time ts = 64µs means a boxcar filter which is approximately half
a second long. When searching for shorter pulses, the performance of the SPD algorithm can be
increased by using shorter Lend .
We present the performance of our implementations on two NVIDIA GPUs. The first is the
TESLA P100, a scientific card from the Pascal generation. The second is the Titan V,
a high-end prosumer card from the Volta generation. We have not included any card
from the current Turing generation, since that generation is not designed for scientific tasks and
workloads. The hardware specifications of the two GPUs are summarized in Table 3.
The scaling of execution time with an increasing number of time-samples (per DM trial) and with
an increasing number of DM-trials, for the P100 and the Titan V, is presented in Figure 19. The
calculated number of DM-trials processed in real-time for different sampling times is presented in
Figure 20.

27
Table 3: GPU card specifications. The shared memory bandwidth is calculated as BW(bytes/s) =
(bank bandwidth (bytes)) × (clock frequency (Hz)) × (32 banks) × (# multiprocessors).

P100 TITAN V
Total CUDA Cores 3584 5120
Streaming Mul. (SMs) 56 80
Core Clock 1303 MHz 1455 MHz
Memory Clock 1406 MHz 850 MHz
Device m. bandwidth 720 GB/s 652 GB/s
Shared m. bandwidth 9121 GB/s 14550 GB/s
FP32 performance 9.3 TFLOPS 12.3 TFLOPS
GPU memory size 16 GB 12 GB
TDP 250 W 250 W
CUDA version 10.1 10.1
Driver version 418.39 415.27

[Figure: execution time [ms] versus number of DM trials (left) and number of time-samples (right), for configurations BD-8, BD-16, BD-32, IG1, IG2, and IG3.]

Figure 19: Measured execution time (P100 and Titan V) for both BoxDIT and IGRID algorithms
for increasing number of DM trials with 131072 time-samples per DM trial is on the left. Execution
time for a varying number of time-samples per DM trial for 256 DM trials is shown on the right.
All results are for a maximum boxcar width Lend = 8192 samples. In both cases our algorithms
scale linearly.

The comparison of the two algorithms with respect to their signal loss and achieved performance
is presented in Figure 21. This figure combines the results presented in Figure 10 and Figure 11 with
the number of DM trials processed in real-time using a sampling time ts = 64µs, and allows us
to compare the algorithms' signal loss/performance ratio.
Performance as a function of the maximum boxcar width searched by the SPD algorithm is pre-
sented in Figure 22. In the case of the IGRID algorithm, these results depend on how many IGRID
steps we perform per GPU thread-block.

4.5.1 Discussion
The execution time for both algorithms is presented in Figure 19. Both algorithms scale linearly
with the number of DM trials as well as with the number of time-samples. This is because the
SPD problem is separable into independent parts, at the level of DM-trials as well as within a
single DM-trial. Thus adding more time samples or DM-trials is equivalent.
The calculated number of DM-trials processed in real-time for the Titan V GPU is presented
in Figure 20. We see that for a sampling time of ts = 64µs the IG(1) algorithm is capable of
processing three million DM-trials in real-time and the BD-32 algorithm is capable of processing
almost six hundred thousand DM-trials in real-time. When considering radio-astronomy beyond

[Figure: number of DM trials processed in real-time versus sampling time (32µs–256µs) for BD-8, BD-16, BD-32, IG1, IG2, and IG3, with an inset for sampling times of 64ns–1µs.]

Figure 20: The number of DM-trials processed in real-time as a function of sampling time.

the SKA era, for extremely short sampling times such as ts = 256ns, the real-time performance of
our IG(1) algorithm is about ten thousand DM-trials in real-time. This, however, has to be taken
in the context of the whole pipeline in which the SPD algorithm is used.
The comparison of both SPD algorithms based on their real-time performance for ts = 64µs
and their cumulative averaged signal loss is presented in Figure 21. When considering the trade-
off between signal loss and performance, both algorithms perform well: for a roughly two-fold
increase in signal loss, the performance roughly doubles. There are some exceptions
to this rule. For example, BD-8 running on the Titan V has almost the same performance as BD-16.
Also note that IG(1) on the Titan V offers a performance increase of only 1.4× when compared to IG(2).
Figure 21 also shows the different behaviour of the GPUs used and the differences between the two
algorithms. The GPU resource utilization as reported by the NVIDIA visual profiler for selected
configurations of the BoxDIT and IGRID algorithms is summarised in Table 4.

[Figure: number of DM trials in real-time versus integrated signal loss in %, for the BoxDIT (BD-8, BD-16, BD-32) and IGRID (IG1, IG2, IG3) configurations.]

Figure 21: The number of DM trials processed in real-time (ts = 64µs) plotted against the
integrated signal loss for different configurations of the BoxDIT and IGRID algorithms. Results
are for P100 (left) and TITAN V (right).

[Figure: number of DM trials in real-time versus maximum width searched, for BD-32, BD-16, BD-8, IG1, IG2, and IG3.]

Figure 22: Dependence of performance on the maximum width computed by the SPD algorithm
used. Results are for P100 (left) and TITAN V (right).

Table 4: Compute/memory utilization of the GPUs used, per configuration, as reported by the
NVIDIA visual profiler.

P100 TITAN V
BoxDIT BD-32 90%/15% 85%/45%
BoxDIT BD-16 80%/35% 80%/85%
BoxDIT BD-8 80%/55% 45%/85%
IGRID IG(3) 75%/25% 85%/45%
IGRID IG(2) 85%/35% 90%/75%
IGRID IG(1) 85%/55% 65%/85%

When we compare utilisation of both algorithms we see that they are similar on both GPUs.
On the P100 GPU both algorithms are limited by compute performance. On the Titan V GPU
they are limited by global memory bandwidth for higher signal loss configurations (BD-8, IG(1)),
and by compute performance for lower signal loss configurations (BD-32, IG(3)). Algorithms with
lower signal loss have to calculate a denser set of boxcar filters (more different widths) which are
closer together. This requires more compute, but they use the same amount of input data. Despite
similar GPU utilization IGRID is the better performing algorithm.
There are three main reasons why the IGRID algorithm is faster than BoxDIT. Firstly, the
IGRID algorithm requires a smaller number of boxcar filters than the BoxDIT algorithm, because
IGRID decimates the input data after each IGRID step. If we compare the BoxDIT BD-8 and
IGRID IG(2) algorithms, the number of boxcar filters required by IGRID IG(2)
is half9 that required by BoxDIT BD-8. On the P100 GPU, where both algorithms are
limited by compute, we see that the IG(2) configuration has a lower signal loss but higher
performance, while compute utilization is almost the same for both algorithms. This shows that
IGRID is more economical in the number of boxcar filters required to achieve a similar signal loss.
The IGRID algorithm also requires less input data. This is because in the IGRID algorithm
the input time-series xi serves two functions: first, its values represent the values of the previously
calculated boxcar filters, which are used to build even longer boxcar filters; secondly, it serves
as sample values, i.e. elements which are added to the existing boxcar filters to build up longer
boxcars. In the case of the BoxDIT algorithm we have to store these quantities in separate input
arrays.
On Titan V both algorithms in their less sensitive variants are limited by device memory
bandwidth. When we compare IGRID IG(2) with BoxDIT BD-8 we again see that IGRID is the
better performing algorithm with lower signal loss while both algorithms are limited by device
memory bandwidth.
The IGRID algorithm has some disadvantages as well. For high efficiency each configuration
needs to be implemented individually, and the complexity and size of the input required by IGRID
increase with decreasing signal loss. Therefore extending IGRID to even higher sensitivities (lower
signal loss) might be difficult. Furthermore, the signal loss of the IGRID algorithm depends on
the parameters of the implementation. Even though it does not exceed the theoretically
predicted worst-case signal loss Ψw, analysing data processed by the same configuration of the
IGRID algorithm (same depth) but with different parameters might cause issues.
This problem is not shared by the BoxDIT algorithm. All presented configurations of our
BoxDIT algorithm are served by a single implementation, which takes as input only the number
of boxcar filters to be performed (or the maximum size of the boxcar filter) per iteration.
Furthermore, these may change from iteration to iteration, which enables the user to decrease the
signal loss for certain pulse widths while increasing it for others.
Figure 22 shows how performance depends on the maximum boxcar filter width Lend used.
We see that the IGRID algorithm is much less sensitive to increases in Lend. This is due to the
lower number of GPU kernel executions needed by the IGRID algorithm to reach the desired boxcar
width.
Lastly, we present fractions of real-time for SKA-like data. For SKA-like data we assume
that the single pulse detection pipeline will process 6000 DM-trials per second with a sampling time
ts = 64µs. The BoxDIT algorithm in configuration BD-32 is capable of processing SKA-like data
33× faster than real-time on the P100 and 106× faster than real-time on the Titan V. The IGRID
algorithm in the IG(1) configuration is capable of processing SKA-like data 266× faster than
real-time on the P100 GPU and 500× faster than real-time on the Titan V GPU. If the final SKA
single pulse detection pipeline runs exactly in real-time with no spare processing time, the IG(1)
algorithm would take 0.4% of the processing time on the P100 GPU and 0.2% on the Titan V GPU.
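These percentages follow directly from the speed-up factors: a pipeline running X× faster than real-time consumes a 1/X fraction of the real-time budget.

```python
# speed-up factors over real-time quoted above for SKA-like data
# (6000 DM-trials per second, t_s = 64 us)
speedups = {"BD-32 / P100": 33, "BD-32 / Titan V": 106,
            "IG(1) / P100": 266, "IG(1) / Titan V": 500}

# fraction of the real-time budget, in per cent, is 100 / X
fractions = {name: 100.0 / x for name, x in speedups.items()}
for name, pct in fractions.items():
    print(f"{name}: {pct:.1f}% of the real-time budget")
# IG(1): 0.4% on the P100 and 0.2% on the Titan V
```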
9 The BD-8 performs 10 iterations with decimation after each step, and in each step performs 8 boxcar filters.
This gives 8 · 2 · N boxcar filters. The IGRID performs 14 IGRID steps, of which the first 6 are at full resolution
and all subsequent IGRID steps are decimated. This gives 6N + 2N = 8N boxcar filters.

4.6 Heimdall pipeline
In order to compare the RSNR obtained by our SPD algorithms, which are part of AstroAccelerate, with
another signal processing pipeline, we have chosen to use Heimdall. Heimdall is a GPU accelerated
transient detection pipeline developed by Jameson and Barsdell [2018], Barsdell et al. [2012].
AstroAccelerate's single pulse search pipeline contains these steps: first, data are de-dispersed;
then the mean and standard deviation are calculated; this is followed by the SPD algorithm; lastly,
candidates are selected using a peak-finding algorithm. Since the SNR recovered by the pipeline
also depends on these other steps, we will briefly describe them as well.

4.6.1 De-dispersion transform


The de-dispersion transform used in AstroAccelerate was developed by Armour et al. [2012a].
This implementation of the de-dispersion transform requires that the DM trials are calculated in
groups with a constant step in DM. In contrast, the de-dispersion transform used in the Heimdall
pipeline has an adaptive step and so can have a varying step in DM for every DM-trial. This means
that with AstroAccelerate we cannot fully match the de-dispersion scheme used by Heimdall (the
Heimdall scheme should be more accurate).

[Figure: DM step [pc/cm3] versus DM [pc/cm3] for Heimdall and AstroAccelerate.]

Figure 23: Comparison between the DM steps used by Heimdall and AstroAccelerate.

We have based the AstroAccelerate de-dispersion scheme on the one used by Heimdall; a
comparison of the two is presented in Figure 23. Furthermore, the input time series for each DM-
trial can be decimated in time. This decimation step is matched for both codes, apart from the
first group of DM trials which, in the case of AstroAccelerate, is performed at full time resolution.
Heimdall, in contrast, only performs the first DM trial at full time resolution.

4.6.2 Calculation of mean and standard deviation


The correct calculation or estimation of the mean and the standard deviation of the base level
noise in the input data is essential because these values are used to calculate the RSNR. Only
those detections with RSNR above a user defined number of standard deviations are considered
significant and potential candidate detections. To ensure we have a fast, flexible and robust GPU
code to calculate mean and standard deviation we have chosen to implement a streaming algorithm
by Chan et al. [1983] in CUDA for NVIDIA GPUs. This algorithm has several advantages: it is
numerically stable and more precise than conventional methods, and it is very well suited to a
parallel implementation.
We estimate the true value of the standard deviation of the underlying base level noise by
point-wise outlier rejection. Each point is compared with the current estimate of the mean and
standard deviation and rejected if it is significantly different (the threshold is user defined). This
is repeated until convergence.
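A sketch of both steps in Python rather than CUDA (function names are ours): the pairwise combination rule of Chan et al. [1983] for merging partial (count, mean, M2) results, followed by the iterative point-wise outlier rejection with a user-defined threshold k:

```python
import numpy as np

def chan_combine(a, b):
    """Merge two partial results (n, mean, M2), where M2 is the sum of
    squared deviations from the mean (Chan et al. 1983)."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return (n, mean, m2)

def mean_std(x):
    """Pairwise (tree) reduction, mirroring how partial results from GPU
    thread blocks would be combined; done sequentially here."""
    parts = [(1, float(v), 0.0) for v in x]
    while len(parts) > 1:
        parts = [chan_combine(parts[i], parts[i + 1]) if i + 1 < len(parts)
                 else parts[i] for i in range(0, len(parts), 2)]
    n, mean, m2 = parts[0]
    return mean, float(np.sqrt(m2 / n))

def clipped_mean_std(x, k=3.0, max_iter=10):
    """Iteratively reject points further than k standard deviations from
    the current estimate, then re-estimate, until convergence."""
    keep = np.ones(len(x), dtype=bool)
    mean, std = mean_std(x)
    for _ in range(max_iter):
        new = np.abs(x - mean) < k * std
        if np.array_equal(new, keep):
            break
        keep = new
        mean, std = mean_std(x[keep])
    return mean, std
```

Applied to a noise time-series containing a bright pulse, the clipped estimate stays close to the standard deviation of the underlying noise, while the naive estimate is inflated by the pulse.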
Each application of a boxcar filter to the DM trial changes its mean and standard deviation.
Since we cannot calculate the standard deviation after each boxcar filter, because it would be too
computationally expensive, we need to approximate the changes in the standard deviation introduced
by the application of boxcar filters. Therefore, we calculate values of the standard deviation only for
time-series which are decimations in time of the input time-series, i.e. we calculate the standard
deviation only for boxcar filters with widths equal to powers of two. We then interpolate the
values of the standard deviation for all intermediate boxcar widths. An example of the results produced
by this method is presented in Figure 24, where we have used SIGPROC10 'fake' to generate the input
data.
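The approximation can be sketched as follows (our own illustrative code, not the AstroAccelerate implementation; white-noise input assumed, for which the true standard deviation after summing w samples grows as √w):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1 << 16)

# standard deviations for boxcar widths that are powers of two, obtained
# by successive decimation (pairwise summing) of the time-series
widths, stds = [], []
d = x.copy()
w = 1
while w <= 256:
    widths.append(w)
    stds.append(float(d.std()))
    d = d[0::2] + d[1::2]        # decimate by two: sums of adjacent pairs
    w *= 2

# linearly interpolate sigma for all intermediate boxcar widths
sigma = np.interp(np.arange(1, 257), widths, stds)
# sigma[99] (width 100) is close to sqrt(100) = 10 for white noise
```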

[Figure: standard deviation versus boxcar width [samples], comparing boxcar planes and time-decimated planes without outlier rejection; top panel: data generated by fake (P=22.69ms, DM=49.8pc/cm3, SNR=8), bottom panel: no signal.]

Figure 24: Difference between standard deviation values calculated from the time-series after
applying a boxcar of a given width and standard deviations calculated from the time-series after
successive decimations in time. Data were generated by SIGPROC 'fake'. Top line: a time-series
that contains a strong signal; bottom line: a time-series without a signal present.

4.6.3 Comparison to Heimdall pipeline


We have compared the performance, reported SNR values and reported widths produced by As-
troAccelerate to those produced by Heimdall. Both codes' standard output provides SNR values
and pulse widths. The execution time of Heimdall's SPD algorithm is measured using Heimdall's
internal benchmark. To measure the execution time of the SPD algorithm within AstroAccelerate
we have used the NVIDIA visual profiler.
The output of the Heimdall pipeline was compared to the AstroAccelerate SNR output in two
different scenarios. The first scenario focused on the detection of single pulses with fixed initial
SNR and increasing pulse width; the second on the response of both pipelines to pulses of fixed
width but decreasing initial SNR. For both scenarios, SIGPROC fake was used to generate an
observation file containing a fake pulsar. AstroAccelerate's output contains many more data points
per detection when compared with Heimdall's output. This is because we compare the full Heimdall
pipeline with AstroAccelerate's basic candidate selection (peak-finding). In order to keep the figures
simple and readable, we do not display most of the points produced by AstroAccelerate and instead
show only the highest SNR candidates.
10 https://sourceforge.net/projects/sigproc/
In the first scenario, we have compared both codes using four different pulse widths (created by
changing the duty cycle parameter in SIGPROC fake) with fixed initial SNR. The SIGPROC fake
file was generated using period P = 1000ms, DM = 90pc/cm3, SNRinit = 8 and a varying duty
cycle. The recovered SNR for each case is presented in Figure 25 and the reported widths of the pulses
are shown in Figure 26. Our implementation delivers consistent SNR values for all peaks,
independent of the position of the pulse within the input data. AstroAccelerate also reports the
pulse width more precisely. We see that the Heimdall SPD algorithm, which uses a sequence of
boxcar filter widths equal to powers of two (and hence cannot accurately represent all pulse widths),
reports varying RSNR. This is because in some cases the boxcar filter width is shorter (or longer)
than the actual width of the injected pulse, or the injected pulses sit in between adjacent boxcars
used by Heimdall's SPD algorithm, or both. This is shown in Figure 26, where Heimdall reports
noticeably longer or shorter pulse widths. In both cases, where the SPD algorithm does not sum
all samples of the pulse (shorter boxcar filter) or adds too many noise samples (longer boxcar
filter), the reported SNR will be lower than the true value. In the case where the position of the
pulse in the time series straddles adjacent boxcar filters, a similar effect will occur, even if those
boxcars have a width equal to the pulse.
The second scenario compares the response of the SPD algorithms to different initial SNR
values of the pulsar, by processing a SIGPROC fake file which contained a pulsar with parameters
P = 100ms, DM = 50pc/cm3 and a variable initial SNR. The pulse width was set to 32 samples,
which is an ideal width for the SPD algorithms in both pipelines. The results are presented in
Figure 27. Our implementation delivers consistent results which are similar to those from
the Heimdall pipeline. The values of RSNR for initial pulsar SNRinit = 8 and SNRinit = 4 are
higher for the Heimdall pipeline, but for lower initial pulsar SNR values the RSNR reported by
both pipelines is similar. The values of RSNR depend on the correct estimation of the mean and
the standard deviation of the underlying noise, which is where the two pipelines differ. To obtain
these results we have used an approximation of the mean and standard deviation with outlier
rejection, where samples with σ > 2 were rejected.
[Figure: SNR versus time [s] for AstroAccelerate and Heimdall, one panel per pulse width.]

Figure 25: Comparison of the SNRs recovered by Heimdall and AstroAccelerate in the single pulse
case. Pulses are, from left, 100, 200, 400, and 800 samples wide.

For our performance comparison, we have used the same two NVIDIA GPU cards (P100 and
Titan V). We present the speed-up for different maximum DM values searched, i.e. the ratio
of Heimdall's SPD execution time to the BoxDIT execution time in configuration BD-16. Our code
is faster by an order of magnitude, achieving almost 100× speed-up on the P100 and almost 200×
[Figure: width [samples] versus time [s] for AstroAccelerate and Heimdall, one panel per pulse width.]

Figure 26: Comparison of the pulse widths reported by Heimdall and AstroAccelerate in the single
pulse case. Pulses are, from left, 100, 200, 400, and 800 samples wide.

[Figure: SNR versus time [s] for AstroAccelerate and Heimdall, one panel per initial pulsar SNR.]

Figure 27: Comparison of the SNRs recovered by Heimdall and AstroAccelerate in the pulsar case.
The initial SNR of the pulsar is, from left, 8, 4, 2, and 1.

speed-up on the Titan V. This is shown in Figure 28. For lower values of maximum DM searched
our speed-up is lower, because the first group of DM-trials in AstroAccelerate is processed
at full time resolution, in contrast to Heimdall which processes only the first DM-trial at full time
resolution.
[Figure: speed-up versus max DM [pc/cm³] for the P100 and Titan V]

Figure 28: Speed-up of our SPD implementation over Heimdall as a function of the maximum DM searched.

5 Conclusions
In this work, we focused on single pulse detection using an incomplete set of boxcar filters. We
have quantified properties of these single pulse detection (SPD) algorithms and quantified errors
introduced by them.
We have described the signal loss of the SPD algorithm in terms of the systematic signal loss and in terms of the worst case signal loss; these change with pulse width and with the parameters of the SPD algorithm. The systematic signal loss represents the loss in signal SNR which will occur even under the best possible circumstances. This signal loss is different for different pulse widths and it occurs because the SPD algorithm uses an incomplete set of boxcar filters. The worst case signal loss represents the largest signal loss possible for a given pulse width and SPD algorithm. These two quantities represent lower and upper bounds on the possible signal loss for a given pulse width and SPD algorithm.
We have also identified two governing parameters of the single pulse detection algorithm. These are the set of boxcar widths used for detection and the boxcar separation Ls, which is the distance between two neighbouring boxcar filters of the same width. The systematic signal loss depends only on how dense (lower signal loss) or sparse (higher signal loss) the set of boxcar filter widths is. The worst case signal loss depends strongly on the boxcar separation Ls and weakly on the density of the set of boxcar filter widths.
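As a toy illustration of these two parameters (a sketch under the idealized signal model of this paper, not the GPU implementation; the function names and the injected pulse are ours), a search over a width set B with separation Ls might look like:

```python
import numpy as np

def spd_boxcar(x, widths, sep):
    """Toy single pulse detection: for each boxcar width L, sum L
    samples starting every sep(L) samples, convert the sum to an SNR
    assuming zero-mean, unit-variance noise, and keep the best
    (SNR, width, start) candidate."""
    csum = np.concatenate(([0.0], np.cumsum(x)))  # prefix sums
    best = (0.0, None, None)
    for L in widths:
        for start in range(0, len(x) - L + 1, sep(L)):
            snr = (csum[start + L] - csum[start]) / np.sqrt(L)
            if snr > best[0]:
                best = (snr, L, start)
    return best

rng = np.random.default_rng(1)
x = rng.standard_normal(4096)
x[992:1024] += 16.0 / np.sqrt(32)   # rectangular pulse, S = 32, SNR = 16
B = [2**t for t in range(8)]        # boxcar widths, powers of two
snr, L, start = spd_boxcar(x, B, lambda L: L)  # separation Ls = L
print(L, start)   # likely the matched boxcar: width 32 near sample 992
```

A denser width set or a smaller Ls raises the recovered SNR at the cost of more boxcar evaluations, which is exactly the trade-off quantified above.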
Based on the importance of these two parameters we have designed two distinct single pulse detection algorithms. We have also presented our parallel implementation of these algorithms on many-core architectures, namely NVIDIA GPUs. We have verified the correctness of these implementations and discussed their performance.
Our first algorithm is called BoxDIT; it calculates boxcar filters at every point of the input time-series, which is progressively decimated. Decimation steps increase performance, but also increase the signal loss. This algorithm is designed to be flexible and to minimise the systematic signal loss Ψl. It is better suited for situations where high sensitivity (low signal loss) is required, where it offers good performance. This makes it optimal for processing large amounts of archival data. The disadvantage of the algorithm is that performance does not increase accordingly with increasing signal loss.
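The decimation idea can be illustrated with a short sketch (a simplification for exposition, not the AstroAccelerate kernel): each decimation level sums neighbouring pairs of samples, so a width-1 boxcar at level k corresponds to a width-2^k boxcar on the original time-series.

```python
import numpy as np

def decimate_in_time(x, levels):
    """Progressive pairwise decimation: level k holds sums of 2^k
    consecutive samples of the original series."""
    series = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        y = series[-1]
        if len(y) % 2:          # drop a trailing sample if needed
            y = y[:-1]
        series.append(y[0::2] + y[1::2])
    return series

x = np.arange(8, dtype=float)          # samples 0..7
levels = decimate_in_time(x, 3)
print(levels[1])   # sums of neighbouring pairs: 1, 5, 9, 13
print(levels[3])   # one width-8 boxcar over the whole series: 28
```

Each level halves the number of samples to scan, which is the source of the speed-up, while the coarser time grid is the source of the added signal loss.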
The second algorithm is called IGRID and it is based on a distribution of boxcar filters using a binary tree structure. This algorithm has higher performance than our BoxDIT algorithm at the cost of higher signal loss, but for low signal loss this algorithm is slower. IGRID is best suited for real-time search scenarios where performance is critical. The disadvantage of the algorithm is that decreasing the signal loss increases its memory footprint, which leads to lower performance.
Using IGRID in its less sensitive configuration we are able to process SKA-MID-like data 266× faster than real-time on a P100 GPU and 500× faster than real-time on a Titan V GPU. Even when considering the beyond SKA era of ultra-high time resolution radio astronomy, with the sampling time decreased to nanosecond scales (ts = 512 ns), the IGRID algorithm on a Titan V GPU would be able to process 22000 DM-trials in real-time.
The BoxDIT algorithm is part of the AstroAccelerate software package [Armour et al., 2019], and AstroAccelerate with the BoxDIT algorithm was used by Mickaliger et al. [2018].

Acknowledgements
This work has received support from an STFC Grant (ST/R000557/1). The authors would like
to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility
in carrying out this work. The authors would also like to express thanks to Aris Karastergiou for
valuable discussions and input.

References
Adámek, K. and Armour, W. (2018). A GPU implementation of the harmonic sum algorithm. In
Lorente, N. P. F., editor, ADASS XXVIII, page arXiv:1812.02647.
Armour, W., Adámek, K., Dimoudi, S., Novotný, J., Carels, C., and Ouannoughi, N. (2019).
AstroAccelerate.
Armour, W. et al. (2012a). A GPU-based Survey for Millisecond Radio Transients Using
ARTEMIS. In Ballester, P., Egret, D., and Lorente, N. P. F., editors, ADASS XXI, volume 461
of ASP Conf. Ser., page 33.
Armour, W., Karastergiou, A., Giles, M., Williams, C., Magro, A., Zagkouris, K., Roberts, S.,
Salvini, S., Dulwich, F., and Mort, B. (2012b). A GPU-based Survey for Millisecond Radio
Transients Using ARTEMIS. In Ballester, P., Egret, D., and Lorente, N. P. F., editors, ADASS
XXI, volume 461 of Astronomical Society of the Pacific Conference Series, page 33.
Barsdell, B. R., Bailes, M., Barnes, D. G., and Fluke, C. J. (2012). Accelerating incoherent
dedispersion. Monthly Notices of the RAS, 422:379–392.
Brandt, S. (2014). Data Analysis: Statistical and Computational Methods for Scientists and
Engineers. 4th Ed. Springer, Cham.
Burke-Spolaor, S. and Bailes, M. (2010). The millisecond radio sky: transients from a blind
single-pulse search. MNRAS, 402:855–866.
Chan, T. F., Golub, G. H., and LeVeque, R. J. (1983). Algorithms for computing the sample
variance: Analysis and recommendations. The American Statistician, 37(3):242–247.

Connor, L. and van Leeuwen, J. (2018). Applying Deep Learning to Fast Radio Burst Classifica-
tion. The Astronomical Journal, 156:256.
Cordes, J. M., Freire, P. C. C., Lorimer, D. R., Camilo, F., Champion, D. J., Nice, D. J., Ra-
machandran, R., Hessels, J. W. T., Vlemmings, W., van Leeuwen, J., Ransom, S. M., Bhat,
N. D. R., Arzoumanian, Z., McLaughlin, M. A., Kaspi, V. M., Kasian, L., Deneva, J. S., Reid, B.,
Chatterjee, S., Han, J. L., Backer, D. C., Stairs, I. H., Deshpande, A. A., and Faucher-Giguère,
C.-A. (2006). Arecibo Pulsar Survey Using ALFA. I. Survey Strategy and First Discoveries.
The Astrophysical Journal, 637:446–455.
Cordes, J. M. and McLaughlin, M. A. (2003). Searches for Fast Radio Transients. ApJ, 596:1142–
1154.
Deneva, J. S., Cordes, J. M., McLaughlin, M. A., Nice, D. J., Lorimer, D. R., Crawford, F., Bhat,
N. D. R., Camilo, F., Champion, D. J., Freire, P. C. C., Edel, S., Kondratiev, V. I., Hessels,
J. W. T., Jenet, F. A., Kasian, L., Kaspi, V. M., Kramer, M., Lazarus, P., Ransom, S. M., Stairs,
I. H., Stappers, B. W., van Leeuwen, J., Brazier, A., Venkataraman, A., Zollweg, J. A., and
Bogdanov, S. (2009). Arecibo Pulsar Survey Using ALFA: Probing Radio Pulsar Intermittency
And Transients. The Astrophysical Journal, 703:2259–2274.
Dimoudi, S., Adamek, K., Thiagaraj, P., Ransom, S. M., Karastergiou, A., and Armour, W.
(2018). A GPU Implementation of the Correlation Technique for Real-time Fourier Domain
Pulsar Acceleration Searches. The Astrophysical Journal Supplement Series, 239(2):28.
Hillis, W. D. and Steele, Jr., G. L. (1986). Data parallel algorithms. Commun. ACM, 29(12):1170–
1183.
Jameson, A. and Barsdell, B. (2018). Heimdall - A Transient Detection Pipeline.
Karastergiou, A., Chennamangalam, J., Armour, W., Williams, C., Mort, B., Dulwich, F., Salvini,
S., Magro, A., Roberts, S., Serylak, M., Doo, A., Bilous, A. V., Breton, R. P., Falcke, H.,
Grießmeier, J.-M., Hessels, J. W. T., Keane, E. F., Kondratiev, V. I., Kramer, M., van Leeuwen,
J., Noutsos, A., Osłowski, S., Sobey, C., Stappers, B. W., and Weltevrede, P. (2015). Limits on
fast radio bursts at 145 MHz with ARTEMIS, a real-time software backend. Monthly Notices of the
Royal Astronomical Society, 452(2):1254–1262.
Keane, E. F., Ludovici, D. A., Eatough, R. P., Kramer, M., Lyne, A. G., McLaughlin, M. A.,
and Stappers, B. W. (2010). Further searches for Rotating Radio Transients in the Parkes
Multi-beam Pulsar Survey. Monthly Notices of the RAS, 401:1057–1068.
Keane, E. F. and Petroff, E. (2015). Fast radio bursts: search sensitivities and completeness.
Monthly Notices of the RAS, 447:2852–2856.
Lorimer, D. R., Bailes, M., McLaughlin, M. A., Narkevic, D. J., and Crawford, F. (2007). A bright
millisecond radio burst of extragalactic origin. Science, 318(5851):777–780.
McLaughlin, M. A., Lyne, A. G., Lorimer, D. R., Kramer, M., Faulkner, A. J., Manchester, R. N.,
Cordes, J. M., Camilo, F., Possenti, A., Stairs, I. H., Hobbs, G., D’Amico, N., Burgay, M., and
O’Brien, J. T. (2006). Transient radio bursts from rotating neutron stars. Nature, 439:817–820.
Mickaliger, M. B., Jankowski, F., Rajwade, K., Adamek, K., Armour, W., Bassa, C., Breton, R. P.,
Caleb, M., Driessen, L., Karastergiou, A., Kramer, M., Morello, V., Sanidas, S., Stappers, B.,
and Walker, C. (2018). Upper limits on radio afterglow emission and previous outbursts for
the very bright FRB180309 from observations with the Lovell Telescope. The Astronomer’s
Telegram, 11606:1.
Middleton, M. J., Casella, P., Gandhi, P., Bozzo, E., Anderson, G., Degenaar, N., Donnarumma, I.,
Israel, G., Knigge, C., and Lohfink, A. (2017). Paving the way to simultaneous multi-wavelength
astronomy. New Astronomy Reviews, 79:26–48.

Parent, E., Kaspi, V. M., Ransom, S. M., Krasteva, M., Patel, C., Scholz, P., Brazier, A.,
McLaughlin, M. A., Boyce, M., Zhu, W. W., Pleunis, Z., Allen, B., Bogdanov, S., Caballero, K.,
Camilo, F., Camuccio, R., Chatterjee, S., Cordes, J. M., Crawford, F., Deneva, J. S., Ferdman,
R., Freire, P. C. C., Hessels, J. W. T., Jenet, F. A., Knispel, B., Lazarus, P., van Leeuwen, J.,
Lyne, A. G., Lynch, R., Seymour, A., Siemens, X., Stairs, I. H., Stovall, K., and Swiggum, J.
(2018). The implementation of a fast-folding pipeline for long-period pulsar searching in the
PALFA survey. The Astrophysical Journal, 861(1):44.
Rubio-Herrera, E., Stappers, B. W., Hessels, J. W. T., and Braun, R. (2013). A search for radio
pulsars and fast transients in M31 using the Westerbork Synthesis Radio Telescope. MNRAS,
428:2857–2873.
Staelin, D. H. (1969). Fast folding algorithm for detection of periodic pulse trains. IEEE Proceed-
ings, 57:724–725.
Wagstaff, K. L., Tang, B., Thompson, D. R., Khudikyan, S., Wyngaard, J., Deller, A. T.,
Palaniswamy, D., Tingay, S. J., and Wayth, R. B. (2016). A Machine Learning Classifier
for Fast Radio Burst Detection at the VLBA. PASP, 128(8):084503.
Zackay, B. and Ofek, E. O. (2017). An accurate and efficient algorithm for detection of radio
bursts with an unknown dispersion measure, for single-dish telescopes and interferometers. The
Astrophysical Journal, 835(1):11.
Zhang, Y. G., Gajjar, V., Foster, G., Siemion, A., Cordes, J., Law, C., and Wang, Y. (2018).
Fast radio burst 121102 pulse detection and periodicity: A machine learning approach. The
Astrophysical Journal, 866(2):149.

A Sensitivity analysis
Here we present a detailed description of our derivation of the sensitivity analysis for a given SPD algorithm. We use the signal-to-noise ratio (SNR) as defined in equations (3) and (4) as a figure of merit for pulse detection by a given SPD algorithm. To distinguish the SNR produced by the SPD algorithm from the true SNR of the injected pulse we call it the recovered SNR, RSNR. To simplify the sensitivity analysis we have used an idealized signal model as defined in section 2.1, which includes the normalization of the injected rectangular pulse as given by equation (6). Furthermore, we assume that the SPD algorithms return only the highest RSNR for a given sample, that is, if multiple boxcar filters of different widths are applied, only the boxcar filter which has produced the highest RSNR is returned by the SPD algorithm.
Using the idealized signal we can simplify the SNR calculation (4) and define the RSNR of a boxcar filter which acts on a rectangular pulse. When a boxcar filter encounters the pulse it will give a value $X_L = d_s A$, where $d_s$ is the portion of the pulse (in the number of time samples) that has been summed by the boxcar filter, or pulse coverage. The pulse coverage is an integer value between zero and the pulse width, $0 \le d_s \le S$. Using the formula for SNR (4) together with the white noise approximation (5), mean ($\mu(x) = 0$) and standard deviation ($\sigma(x) = 1$) for the idealized signal, and with the normalization of the pulse's amplitude $A(S)$ (6), we can write the recovered SNR (RSNR) by the boxcar filter of width $L$ as

$$\mathrm{RSNR}(S, L, d_s) = \frac{d_s A(S)}{\sqrt{L}} = \frac{d_s C}{\sqrt{L}\sqrt{S}}. \qquad (22)$$

The RSNR recovered by the boxcar filter in general depends on the signal width $S$, the boxcar width $L$, and on the portion of the pulse covered by the boxcar filter $d_s$. The dependence on the pulse coverage $d_s$ is another way of saying that the RSNR depends on the pulse's position.
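Equation (22) is straightforward to evaluate numerically. The following minimal sketch (assuming the amplitude normalization A(S) = C/√S from equation (6)) computes the RSNR of a single boxcar:

```python
import math

def rsnr(S, L, ds, C=16.0):
    """Recovered SNR of a boxcar of width L that sums ds samples of a
    rectangular pulse of width S and true SNR C (equation 22).
    Assumes zero-mean, unit-variance white noise and the amplitude
    normalization A(S) = C / sqrt(S) of equation (6)."""
    assert 0 <= ds <= S
    A = C / math.sqrt(S)            # normalized pulse amplitude
    return ds * A / math.sqrt(L)    # equals ds * C / (sqrt(L) * sqrt(S))

# A matched boxcar (L = S = ds) recovers the full SNR C:
print(rsnr(32, 32, 32))   # -> 16.0
# Half coverage by a boxcar twice as wide loses a factor sqrt(2):
print(rsnr(16, 32, 16))   # -> 11.3137... (= 16/sqrt(2))
```

Here ds for a given pulse position is simply the overlap between the pulse and the boxcar window.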

A boxcar filter which does not intersect with the injected rectangular pulse will sum up to zero ($X_L = 0$ from eq. (2)), that is $d_s = 0$ and thus $\mathrm{RSNR}(S, L, d_s) = 0$ as well.
To quantify RSNRmax (S) and RSNRmin (S) we first have to define the worst case detection
event and the best case detection event.

A.0.1 Best detection case


By investigating the best detection case and the signal loss of the SPD algorithm under the best possible circumstances we are also investigating the minimal signal loss which is introduced by the SPD algorithm.
The maximum RSNR(S, L) for a fixed S and L occurs when a boxcar filter covers the whole pulse, or as much of the pulse as is possible if the pulse is wider than the boxcar detecting it. There are three possible configurations, depending on the size of the pulse S with respect to the size of the boxcar filter L. These are S < L, S > L and S = L. The situation is summarized in Figure 29, which only shows the first two cases.

[Figure: the case S < L with d_s = S, and the case S > L with d_s = L]

Figure 29: The best case detection event. The pulse is shown as lines with circles at the ends and
the boxcar is shown as a hatched square.

The values of the pulse coverage $d_s$ and the RSNR, using equation (22), for all three possible cases are:

• $S < L$: then $d_s = S$ and $\mathrm{RSNR}_{S<L} = \sqrt{S}\,C/\sqrt{L}$

• $S = L$: then $d_s = S = L$ and $\mathrm{RSNR}_{L} = C$

• $S > L$: then $d_s = L$ and $\mathrm{RSNR}_{S>L} = \sqrt{L}\,C/\sqrt{S}$

As the SPD algorithm may consist of many different boxcar filter widths, we first determine the response $\mathrm{RSNR}_{max}(S; L)$ of a single boxcar filter of width $L$ to a different pulse width $S$. We define $\mathrm{RSNR}_{max}(S; L)$, a function of $S$ parametrized by $L$, as a combination of the cases we have discussed above, that is

$$\mathrm{RSNR}_{max}(S; L) = \begin{cases} \frac{\sqrt{S}}{\sqrt{L}}\,C & : S < L \\ \frac{\sqrt{L}}{\sqrt{S}}\,C & : S \ge L \end{cases} \qquad (23)$$

This quantity gives us the highest RSNR possible for a given boxcar of width L for any pulse of
width S. The behavior of RSNRmax (S; L) for a fixed boxcar filter of width L is shown in Figure
30.

To find the response of the whole SPD algorithm to a pulse of width S we need to find the maximum value of $\mathrm{RSNR}_{max}(S; L)$ produced by boxcar filters of every boxcar width $L$ used by the SPD algorithm. That is

$$\mathrm{RSNR}_{max}(S) = \max_{L \in B} \mathrm{RSNR}_{max}(S; L). \qquad (24)$$

[Figure: RSNRmax(S; L = 32) versus signal width S, with the S < L and S > L branches; C = 16]

Figure 30: The response of a boxcar filter of width L = 32 to different pulse widths, that is, the behaviour of RSNRmax(S; L). The peak is located at L = S, where the boxcar filter adds up all of the pulse power.

An example of the behavior of RSNRmax(S) is shown in Figure 31. We see that the form of RSNRmax(S) depends on the number of boxcar widths we combine together into the SPD algorithm. Thus, in order to increase (decrease) RSNRmax(S) we need to add (remove) boxcar filters of different widths to (from) the set B.
The associated systematic signal loss Ψl is calculated using equation (7). This represents the minimal signal loss which is always introduced by the SPD algorithm for a given pulse width S.
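Equations (23) and (24) translate directly into code; a short sketch (with C = 16 and the power-of-two width set B of Figure 31, both assumptions of this example) is:

```python
import math

def rsnr_max_single(S, L, C=16.0):
    """Best-case response of a single boxcar of width L (equation 23)."""
    return math.sqrt(S / L) * C if S < L else math.sqrt(L / S) * C

def rsnr_max(S, widths, C=16.0):
    """Best-case response of the whole SPD algorithm (equation 24):
    the maximum over the set of boxcar widths B."""
    return max(rsnr_max_single(S, L, C) for L in widths)

B = [2**t for t in range(8)]       # B = {1, 2, ..., 128}
print(rsnr_max(32, B))             # -> 16.0, a width in B is fully recovered
print(round(rsnr_max(48, B), 3))   # -> 13.856, a width between 32 and 64
```

The second value illustrates the systematic loss: a 48-sample pulse falls between the widths 32 and 64 in B, so even the best-placed boxcar recovers only part of the true SNR.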

A.0.2 Worst detection case


To find the worst detection case we need to find a position of the pulse which minimizes RSNR(S, L, ds). Let us suppose that we have a rectangular pulse of arbitrary width S, which is detected by a boxcar filter of fixed width L. Let us also suppose that the starting time of our rectangular pulse is increasing, that is, the pulse is sliding forward in time. We assume that the boxcar filters of a given width are distributed evenly throughout the time-series with boxcar separation Ls. As the pulse slides forward in time it crosses the beginning of the next boxcar filter covering the time-series every Ls steps. When the pulse crosses the beginning of the next boxcar filter the situation is the same as it was Ls steps before. This means that the only shifts that matter happen between two consecutive boxcar filters, over a span of Ls samples. Thus for the investigation of the worst case detection we need to consider only two consecutive boxcar filters.
Depending on the pulse width we can have three cases:

S ≥ L + Ls : the pulse is wider than the extent of two consecutive boxcar filters. There is no way to shift the pulse such that it would not cover at least one boxcar filter. The pulse coverage is ds = L.

[Figure: RSNRmax(S) versus signal width S for C = 16, with the RSNR_{S<L} and RSNR_{S>L} branches for L = 32 highlighted]

Figure 31: Response (behaviour of RSNRmax(S)) of an SPD algorithm, where the widths of the boxcar filters are powers of two, that is B = {1, 2, ..., 2^t}, and the boxcar separation is Ls = L. The RSNRmax(S; L) for L = 32 is emphasized to illustrate how RSNRmax(S) is constructed.

S ≤ L − Ls : the pulse is shorter than the overlap of two consecutive boxcar filters. Wherever the pulse is, it will always be covered by at least one boxcar filter with pulse coverage ds = S.

L − Ls < S < L + Ls : the pulse lies between two consecutive boxcar filters.
The first two cases are shown in Figure 32 and the last case is shown in Figure 33.

[Figure: pulses of width S > L + Ls and S < L − Ls sliding across boxcars separated by Ls]

Figure 32: Pulses of width S ≤ L − Ls and S ≥ L + Ls are shown as lines ending with circles and boxcar filters are shown as grey hatched rectangles. The pulses shift in time from one boxcar to the next. The boxcar which is always covered by the pulse (for S ≥ L + Ls) or which always covers the pulse (for S ≤ L − Ls) is emphasized in black. Parameters are L = 5, Ls = 3, with pulses of width S = 8 and S = 2.

[Figure: pulse coverage pairs (d_s^L, d_s^U) as the pulse shifts: (2,4), (3,3), (4,1), (4,2), (3,3)]

Figure 33: Pulses are shown as lines ending with circles and boxcar filters are shown as hatched rectangles. A pulse of width L − Ls < S < L + Ls starts at the beginning of the lower boxcar and, as we shift the pulse in time, it moves to the starting point of the upper boxcar. The worst case is emphasized in red. The figure also shows the pulse coverage for both the lower and upper boxcars. Parameters are L = 5, Ls = 3, S = 4.

We see that from these three cases, the worst case can only occur when $L - L_s < S < L + L_s$. Our goal is then to find the position of the pulse which minimizes the RSNR.
Let us denote the boxcar which starts earlier in time as the lower boxcar, with pulse coverage $d_s^L$, and the boxcar next to it as the upper boxcar, with pulse coverage $d_s^U$. To continue our example with the pulse shifting in time, let us inject our rectangular pulse so that its beginning coincides with the beginning of the lower boxcar. As the pulse shifts in time, the pulse coverage of the lower boxcar $d_s^L$ will decrease while the pulse coverage of the upper boxcar $d_s^U$ will increase. Let us now suppose that the shift is continuous instead of discrete. In the continuous case, there will be a time when $d_s^L = d_s^U$; this is shown in Figure 33. If we shift the pulse in either direction, one of the pulse coverages will increase at the expense of the other. When performing single pulse detection we are interested only in the highest RSNR; for fixed $S$ and $L$, this means the highest pulse coverage $d_s$. In the worst case scenario we are looking for the lowest maximum produced by the SPD algorithm. This is achieved when both the lower and upper boxcar produce the same RSNR. Therefore the condition for the worst case detection is

$$d_s^L = d_s^U. \qquad (25)$$
For clarity it is useful to split the range $L - L_s < S < L + L_s$ into two further cases, $L - L_s < S < L$ and $L \le S < L + L_s$. For the range $L - L_s < S < L$ we can express the pulse coverage for both boxcars as

$$d_s^L = S - \alpha L_s, \qquad d_s^U = L - L_s + \alpha L_s, \qquad (26)$$

where $0 \le \alpha \le 1$ is the shift of the pulse in time. Using condition (25) together with equations (26) we get an expression for the parameter $\alpha$. Substituting $\alpha$ into one of the equations (26) gives the pulse coverage

$$d_s^< = S - \left\lfloor \frac{S - L + L_s}{2} \right\rfloor, \qquad (27)$$

where $\lfloor\ \rfloor$ represents the floor function, which rounds the argument down to the nearest lower integer. Rounding is necessary because we are working with discrete data. This is depicted in Figure 33.
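Equation (27) can be cross-checked against a brute-force search over all discrete pulse positions. A small sketch, using the Figure 33 parameters (L = 5, Ls = 3, S = 4):

```python
def worst_coverage(S, L, Ls):
    """Brute force the worst-case pulse coverage: boxcars of width L
    start at every multiple of Ls; for each discrete pulse shift the
    detector keeps the best-covering boxcar, and the worst case is the
    shift minimizing that best coverage."""
    worst = min(S, L)                  # coverage can never exceed this
    for x in range(Ls):                # shifts repeat with period Ls
        best = 0
        for p in range(-Ls * (L // Ls + 1), x + S + 1, Ls):
            overlap = max(0, min(p + L, x + S) - max(p, x))
            best = max(best, overlap)
        worst = min(worst, best)
    return worst

# For L - Ls < S < L, equation (27) predicts ds = S - floor((S-L+Ls)/2):
L, Ls, S = 5, 3, 4
print(worst_coverage(S, L, Ls))    # -> 3 (brute force)
print(S - (S - L + Ls) // 2)       # -> 3 (equation 27)
```

The same brute-force check also reproduces equation (29) for the range L ≤ S < L + Ls.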
For the second case, where $L \le S < L + L_s$, we can express the pulse coverage for both boxcars as

$$d_s^L = L - \alpha L_s, \qquad d_s^U = S - L_s + \alpha L_s. \qquad (28)$$

Solving equation (25) using (28) for $\alpha$ and then substituting into $d_s^L$ gives

$$d_s^> = L - \left\lfloor \frac{L + L_s - S}{2} \right\rfloor. \qquad (29)$$

[Figure: left, RSNRmin(S; L = 32) versus signal width S for Ls = L and Ls = L/2; right, RSNRmin(S) for the same two separations; C = 16]

Figure 34: Left: The worst case response RSNRmin(S; L) of a boxcar filter of width L = 32 for two values of the separation: Ls = L (black) and Ls = L/2 (grey). We see that a change in Ls has a noticeable effect on the value and behaviour of RSNRmin(S; L); thus, to increase (decrease) RSNRmin(S; L) we need to decrease (increase) the value of Ls. Right: The response of the whole SPD algorithm RSNRmin(S), where the boxcar widths are powers of two, B = {1, 2, ..., 2^t}; Ls = L is in black and Ls = L/2 is in grey.

We can obtain the same expression using $d_s^U$.
To summarize, we have the following cases:

• $S \le L - L_s$: $d_s = S$ and $\mathrm{RSNR} = \sqrt{S}\,C/\sqrt{L}$

• $L - L_s < S < L$: $d_s^< = S - \lfloor (S - L + L_s)/2 \rfloor$ and $\mathrm{RSNR} = d_s^<\,C/\sqrt{LS}$

• $L \le S < L + L_s$: $d_s^> = L - \lfloor (L + L_s - S)/2 \rfloor$ and $\mathrm{RSNR} = d_s^>\,C/\sqrt{LS}$

• $S > L + L_s$: $d_s = L$ and $\mathrm{RSNR} = \sqrt{L}\,C/\sqrt{S}$
After the investigation into all possible cases we can define the worst case response of a boxcar filter of width $L$ to a rectangular pulse of different widths, $\mathrm{RSNR}_{min}(S; L)$, as

$$\mathrm{RSNR}_{min}(S; L) = \begin{cases} \frac{\sqrt{S}}{\sqrt{L}}\,C & : S \le L - L_s \\ \frac{d_s^<}{\sqrt{LS}}\,C & : L - L_s < S < L \\ \frac{d_s^>}{\sqrt{LS}}\,C & : L \le S < L + L_s \\ \frac{\sqrt{L}}{\sqrt{S}}\,C & : S \ge L + L_s \end{cases} \qquad (30)$$

where $d_s^<$ and $d_s^>$ are given by equations (27) and (29).
The worst case response of the whole SPD algorithm, $\mathrm{RSNR}_{min}(S)$, is then given by

$$\mathrm{RSNR}_{min}(S) = \max_{L \in B} \mathrm{RSNR}_{min}(S; L). \qquad (31)$$

The behavior of RSNRmin(S; L) and RSNRmin(S) for two different values of Ls is shown in Figure 34. We see that RSNRmin(S) is also influenced by how many different widths of boxcar filters are used. Furthermore, we see that it is perhaps even more influenced by the distance between the boxcar filters, Ls.
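Equations (30) and (31) can be sketched in the same spirit (C = 16 and power-of-two widths are assumptions of the example; the separation is supplied as a function of L so that both the Ls = L and Ls = L/2 curves of Figure 34 can be reproduced):

```python
import math

def rsnr_min_single(S, L, Ls, C=16.0):
    """Worst-case response of one boxcar of width L and separation Ls
    (equation 30), with coverages from equations (27) and (29)."""
    if S <= L - Ls:
        return math.sqrt(S / L) * C
    if S >= L + Ls:
        return math.sqrt(L / S) * C
    if S < L:
        ds = S - (S - L + Ls) // 2    # equation (27)
    else:
        ds = L - (L + Ls - S) // 2    # equation (29)
    return ds * C / math.sqrt(L * S)

def rsnr_min(S, widths, sep, C=16.0):
    """Worst-case response of the whole SPD algorithm (equation 31)."""
    return max(rsnr_min_single(S, L, sep(L), C) for L in widths)

B = [2**t for t in range(8)]
# Halving the separation raises the worst-case RSNR, as in Figure 34:
print(round(rsnr_min(48, B, lambda L: L), 3))               # -> 9.798
print(round(rsnr_min(48, B, lambda L: max(L // 2, 1)), 3))  # -> 13.064
```

For a 48-sample pulse the worst-case RSNR rises from 9.798 to 13.064 when Ls is halved, confirming that the boxcar separation dominates the worst-case behaviour.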

