Training Full Spike Neural Networks Via Auxiliary Accumulation Pathway

Training Full Spike Neural Networks via Auxiliary Accumulation Pathway
Guangyao Chen 1 2 Peixi Peng 1 2 Guoqi LI 3 Yonghong Tian 1 2
Abstract 1. Introduction
In the past few years, Artificial Neural Networks (ANNs)
have achieved great success in many tasks (Krizhevsky et al.,
arXiv:2301.11929v1 [cs.NE] 27 Jan 2023
Due to the binary spike signals making converting 2012; Simonyan & Zisserman, 2015; Szegedy et al., 2015;
the traditional high-power multiply-accumulation Girshick et al., 2014; Liu et al., 2016; Redmon et al., 2016;
(MAC) into a low-power accumulation (AC) avail- Chen et al., 2021b;a; 2020; Ma et al., 2022; Chen & Chen,
able, the brain-inspired Spiking Neural Networks 2018). However, with ANNs getting deeper and larger,
(SNNs) are gaining more and more attention. computational and power consumption are growing rapidly.
However, the binary spike propagation of the Hence, Spiking Neural Networks (SNNs), inspired by bi-
Full-Spike Neural Networks (FSNN) with limited ological neurons, have recently received surging attention
time steps is prone to significant information loss. and are regarded as a potential competitor of ANNs due to
To improve performance, several state-of-the-art their high biological plausibility, event-driven property, and
SNN models trained from scratch inevitably bring low power consumption (Roy et al., 2019) on neuromorphic
many non-spike operations. The non-spike oper- hardware.
ations cause additional computational consump-
tion and may not be deployed on some neuro- To obtain an effective SNN, several works (Kim et al., 2018;
morphic hardware where only spike operation is Xing et al., 2019; Hwang et al., 2021; Hu et al., 2018b;
allowed. To train a large-scale FSNN with high Sengupta et al., 2019; Han et al., 2020; Lee et al., 2020;
performance, this paper proposes a novel Dual- Zheng et al., 2021; Samadzadeh et al., 2020; Rathi & Roy,
Stream Training (DST) method which adds a de- 2020; Rathi et al., 2020) are proposed to convert the trained
tachable Auxiliary Accumulation Pathway (AAP) ANN to SNN by replacing the raw activation layers (such as
to the full spiking residual networks. The accu- ReLU) with spiking neurons. Although this type of method
mulation in AAP could compensate for the infor- could achieve state-of-the-art accuracy on many image clas-
mation loss during the forward and backward of sification datasets, they often require a large number of
full spike propagation, and facilitate the training time steps which causes high computational consumption
of the FSNN. In the test phase, the AAP could be and also limit the application of SNN, and the direct pro-
removed and only the FSNN is remained. This motion of exploring the characteristics of SNN is limited.
not only keeps the lower energy consumption but Hence, a series of methods are proposed to train SNN from
also makes our model easy to deploy. Moreover, scratch. Based on the surrogate gradient backpropagation
for some cases where the non-spike operations are method (Tavanaei et al., 2019), several recent works train
available, the APP could also be retained in test deep SNN by improving the Batchnorm (Zheng et al., 2021)
inference and improve feature discrimination by or residual connection structure (Fang et al., 2021a; Zhou
introducing a little non-spike consumption. Ex- et al., 2023; Xiao et al., 2022; Deng et al., 2022), and nar-
tensive experiments on ImageNet, DVS Gesture, row the gap between SNN and ANN effectively. To obtain
and CIFAR10-DVS datasets demonstrate the ef- high performance, several SOTA SNN models (Fang et al.,
fectiveness of DST. 2021a; Zheng et al., 2021; Zhou et al., 2023) inevitably
bring many non-spike operations with ADD residual con-
nections. Although effective, these methods may suffer
two main drawbacks: Firstly, the energy efficiency advan-
*
Equal contribution 1 Department of Computer Science and tage of SNNs mainly comes from the fact that the binary
Technology, Peking University 2 Peng Cheng Laborotory 3 Institute spike signals make converting the traditional high-power
of Automation, Chinese Academy of Sciences. Correspondence multiply-accumulation (MAC) into a low-power accumula-
to: Peixi Peng <[email protected]>, Yonghong Tian <yh- tion (AC) available, the non-spike operations don’t fit this
[email protected]>.
characteristic and will bring high computation consumption.
Copyright 2023 by the author(s). Secondly, several neuromorphic hardware only supports
Training Full Spike Neural Networks with Auxiliary Accumulation Pathway
spiking operation, and these models cannot be deployed activation is first trained, then converts the ANN to an
directly (Horowitz, 2014). SNN by replacing ReLU with spiking neurons and adding
scaling operations like weight normalization and threshold
Hence, it is necessary to develop a full-spike neural network
balancing. Recently, several ANN2SNN methods (Han
(FSNN) that only contains spike operations. However, the
et al., 2020; Han & Roy, 2020; Deng & Gu, 2021; Li et al.,
binary spike propagation with limited time steps is prone to
2021) have achieved near loss-less accuracy for VGG-16
significant information loss, and limits the performance of
and ResNet. However, the converted SNN needs longer
FSNN. For example, the Spiking ResNet in Figure 1 is prone
time steps to rival the original ANN, and increases the
to vanishing gradient problems in deep networks (Fang et al.,
SNN’s computational consumption (Rueckauer et al.,
2021a). To compensate for the loss of information forward
2017). One of the backpropagation methods (Kim et al.,
and backward from full spike propagation, we propose a
2020) computes the gradients of the timings of existing
novel Dual-stream Training (DST) method, where the whole
spikes with respect to the membrane potential at the spike
network contains a full spike propagation stream and a aux-
timing (Comsa et al., 2020; Mostafa, 2017; Kheradpisheh
iliary spike accumulation stream. The former includes full-
& Masquelier, 2020; Zhou et al., 2021; Zhang & Li,
spike inference of FSNN, while the latter is a plug-and-play
2020). Another kind of backpropagation method gets the
Auxiliary Accumulation Pathway (AAP) to the FSNN. In
gradient by unfolding the network over the simulation
training, the AAP is able to compensate for the loss of infor-
time-steps (Lee et al., 2016; Huh & Sejnowski, 2018;
mation forward and backward from full spike propagation
Wu et al., 2018; Shrestha & Orchard, 2018; Lee et al.,
by spike accumulation, which could help alleviate the van-
2020; Neftci et al., 2019). As the gradient with respect
ishing gradient problem of the Spiking ResNet and improve
to the threshold-triggered firing is non-differentiable, the
the performance of the FSNN. Although the accumulation
surrogate gradient is often used.
in APP causes non-spike operation, the AAP could be re-
moved and only the FSNN remains in the test phase. In
other words, our model could act as an FSNN inference in 2.2. Spiking Residual Networks
practical application. This not only keeps the lower energy For ANN2SNN with ResNet, several methods made spe-
consumption but also makes our model easy to deploy in the cific normalizations for conversion. The residual structure in
neuromorphic hardware. Moreover, for some cases where ANN2SNN with scaled shortcuts is applied in SNN to match
the non-spike operations are available, the AAP could also the activations of the original ResNet (Hu et al., 2018a). Pre-
be retained in test inference and further improve feature vious ANN2SNN methods noticed the distinction between
discrimination by introducing a small number of non-spike plain feedforward ANNs and residual ANNs, and made spe-
MAC operations. It is notable that FSNN+AAP only brings cific normalizations for conversion. Then, Spike-Norm (Sen-
a linear rise in the non-spike AC computation with the model gupta et al., 2019) is proposed to balance the threshold of the
depth, resulting in less computational consumption. Spiking Neural Model and verified their method by convert-
We evaluate DSNN on both the static ImageNet ing ResNet to SNNs. Moreover, existing backpropagation-
dataset (Deng et al., 2009) and the neuromorphic DVS Ges- based methods use nearly the same structure as ResNet.
ture dataset (Amir et al., 2017), CIFAR10-DVS dataset (Li Several custom surrogate methods (Sengupta et al., 2019)
et al., 2017), CIFAR100 (Krizhevsky et al., 2009). The are evaluated on shallow ResNets. Threshold-dependent
experiment results are consistent with our analysis, indicat- batch normalization (td-BN) is proposed to replace naive
ing that the DST could improve the previous ResNet-based batch normalization (BN) (Ioffe & Szegedy, 2015) and suc-
and Transformer-based FSNNs to higher performance by cessfully trained Spiking ResNet-34/50 directly with surro-
simply increasing the network’s depth, and keeping efficient gate gradient by adding td-BN in shortcuts. SEW ResNet
computation consumption simultaneously. is the first method to increase the SNNs to more than 100
layers, but its use of additive operations in residual connec-
tions leads to a none-spike structure in the deep layer of
2. Related Work the network bringing higher computational consumption.
2.1. Spiking Neural Networks (Deng et al., 2022) introduces the temporal efficient train-
ing approach to compensate for the loss of momentum in
To train deep SNNs, ANN to SNN conversion the gradient descent with SG so that the training process
(ANN2SNN) (Hunsberger & Eliasmith, 2015; Cao can converge into flatter minima with better generalizability.
et al., 2015; Rueckauer et al., 2017; Sengupta et al., 2019; (Xiao et al., 2022) proposes online training through time for
Han et al., 2020; Han & Roy, 2020; Deng & Gu, 2021; SNNs, which enables forward-in-time learning by tracking
Stöckl & Maass, 2021; Li et al., 2021) and backpropagation presynaptic activities and leveraging instantaneous loss and
with surrogate gradient (Neftci et al., 2019) are the two gradients. Recently, (Zhou et al., 2023) considers leveraging
mainstream methods. For ANN2SNN, an ANN with ReLU both self-attention capability and biological properties of
𝑡 𝑡 𝑡 𝑡 𝑡 𝑡 𝑡 𝑡 𝑡 𝑡 𝑡 𝑡
𝑜𝑙−1 𝑜𝑙−1 = 𝑔(𝑠𝑙−1 , 𝑜𝑙−2 ) 𝑎𝑙−1 𝑜𝑙−1 = 𝑠𝑙−1 + 𝑜𝑙−2 𝑜𝑙−1 = 𝑔(𝑠𝑙−1 , 𝑜𝑙−2 ) 𝑎𝑙−1
Auxiliary Accumulation Pathway
Auxiliary Accumulation Pathway

SSA SSA
𝑠𝑙𝑡
Full-Spike Propagation
Full-Spike Propagation
Conv+BN Conv+BN
+
sn sn
+ 𝑔
Conv+BN Conv+BN
𝑠𝑙𝑡 MLP MLP 𝑡
𝑠𝑙+1
+ +
+sn +sn + 𝑔
𝑦𝑙 𝑜𝑙𝑡 𝑎𝑙𝑡 𝑜𝑙𝑡 𝑡

𝑜𝑙+1 𝑎𝑙𝑡
(a) Spiking ResNet (b) Spiking ResNet with DST (c) Spike Transformer (d) Spike Transformer with DST
Multiply Accumulate Operator Accumulation Operator
Figure 1. Residual blocks and Dual-stream blocks in Spiking ResNet (Fang et al., 2021a) and Spike Transformer (Zhou et al., 2023). (a)
the Spiking ResNet by replacing ReLU activation layers with spiking neurons. (b) The Spiking ResNet with Dual-stream training. (c)
The input of the Spike Transformer with Spiking Self-Attention (SSA) is non-spike data due to residual addition between blocks, which
brings additional multiply accumulative computation to SNN. (d) Spike Transformer with a dual-stream structure is divided into a spike
propagation pathway and a plug-and-play auxiliary accumulation pathway (AAP), which could compensate for full spike propagation.
Note AAP could be either retained or removed in model inference.
SNNs and proposes the Spiking Transformer (Spikformer). 3.2. Computational Consumption
With the sparsity of firing and the short simulation period,
3. Preliminaries SNN can achieve the calculation with about the same num-
ber of synaptic operations (SyOPs) (Rueckauer et al., 2017)
3.1. Spiking Neuron Model
rather than FLOPs, The number of synaptic operations per
The spiking neuron is the fundamental computing unit of layer of the network can be easily estimated for an ANN
SNNs. The dynamics of all kinds of spiking neurons can be from the architecture of the convolutional and linear lay-
described as follow: ers. For the ANN, a multiply-accumulate (MAC) computa-
tion takes place per synaptic operation. On the other hand,
Ht = SN (Vt−1 , Xt ), (1) specialized SNN hardware would perform an accumulated
St = Θ(Ht − Vth ), (2) computation (AC) per synaptic operation only upon the re-
Vt = Ht (1 − St ) + Vreset St , (3) ceipt of an incoming spike. Hence, the total number of
AC operations occurring in the SNN would be represented
where Xt is the input at time-step t, Ht and Vt denote the by the dot-product of the average cumulative neural spike
membrane potential after neuronal dynamics and after the count for a particular layer and the corresponding number of
trigger of a spike at time-step t, respectively. Vth is the synaptic operations. With the deepening of SNN and ANN,
firing threshold, Θ(x) is the Heaviside step function and the relative energy ratio gradually approaches a fixed value,
is defined by Θ(x) = 1 for x ≥ 0 and Θ(x) = 0 for which could be calculated as follows:
x < 0. St is the output spike at time-step t, which equals E(SNN) Eac
1 if there is a spike and 0 otherwise. Vreset denotes the ≈ T · fr · , (6)
E(ANN) Emac
reset potential. The function SN (·) in Eq. (1) describes the
neuronal dynamics and takes different forms for different where T and f r represent the simulation time and the aver-
spiking neuron models, which include the Integrate-and- age firing rate.
Fire (IF) model (Eq. (4)) and Leaky Integrate-and-Fire (LIF)
Eq. (6) assumes the SNN only contains {0, 1} spike AC
model (Eq. (5)):
operators. However, several SOTA SNNs employ many non-
Ht = Vt−1 + Xt , (4) spike MAC and their computation consumption could not
be estimated by Eq. (6) accurately. To estimate the compu-
1
Ht = Vt−1 + (Xt − (Vt−1 − Vreset )), (5) tation consumption of different SNNs more accurately, we
τ calculate the number of computations required in terms of
where τ represents the membrane time constant. Eq. (2) and AC and MAC synaptic operations, respectively. The main
Eq. (3) describe the spike generation and resetting processes, energy consumption of non-spike signals comes from the
which are the same for all kinds of spiking neuron models. MAC between neurons. The contribution from one neuron
to another requires a MAC for each timestep, multiplying vanishing and exploding gradient problems.
each non-spike activation with the respective weight before
adding it to the internal sum. In contrast, a transmitted otl = SN(fl (otl−1 )) + otl−1 = stl + otl−1 , (9)
spike requires only an accumulation at the target neuron,
where stl denotes the residual mapping learned as stl =
adding weight to the potential, and where spikes may be
SN(fl (otl−1 )). While this design brings performance im-
quite sparse. Therefore, for any SNN network F, the the-
provements, it inevitably brings in non-spike data and thus
oretical computational consumption can be determined by
MAC operations. In particular, for the ADD function, if
the number of AC and MAC operations (Oac , Omac ):
both stl and otl−1 are spike signals, its output otl will be a non-
E(F) = T · (f r · Eac · Oac + Emac · Omac ). (7) spike signal whose value range is {0, 1, 2}. As the depth of
the network increases, the range of signals transmitted to the
Moreover, we developed and open-sourced a tool to calcu- next layer of the network will also expand. Convolution re-
late the dynamic consumption of SNNs as syops-counter1 , quires much more computational overhead of multiplication
which can compute the theoretical amount of AC and MAC and addition when dealing with these non-spiking signals
operations. as shown in Figure 1(b). In this case, the network will incur
additional high computational consumption.
3.3. Spiking Residual Blocks
There are two main types of residual blocks for existing
4. Dual-stream Training
SNNs. The one replaces ReLU activation layers with spik- 4.1. Dual-stream SNN
ing neurons, which constructs FSNN but is prone to vanish-
ing gradient problems. The second uses the same addition Basic Block. Instead of the residual Block containing only
operation as the ANN, but leads to additional computation one path with respect to the input spike x, we initialize
consumption due to the appearance of non-spike signals. the input to two consistent paths st0 = at0 = x, where
stl represents the spike signal propagated between blocks,
Vanishing Gradient Problems of Spiking ResNet. Con- and atl represents the spike accumulation carried out on the
sider a Spiking ResNet with k sequential blocks to transmit output of each block. As illustrated in Figure 1(b) and (d),
stl , and the identity mapping condition is met by using the IF the Dual-stream Block can be formulated as:
neurons for residual connection with 0 < Vth ≤ 1, then we
(
t SN(fl (otl−1 ) + otl−1 ) = SN (stl + otl−1 )
have stl = stl+1 = ... = stl+k−1 = otl+k−1 . The gradient of ol = (10)
g(SN(fl (otl−1 )), otl−1 ) = g(stl , otl−1 )
the output of the (l + k − 1)-th residual block with respect
to the input of the l-th residual block could be calculated atl = stl + atl−1 , (11)
layer by layer:
where g represents an element-wise function with two spikes
∂otl+k−1 k−1
Y ∂otl+i k−1
Y tensors as inputs. Note that here we restrict g to be only the
= = Θ0 (stl+i − Vth ) corresponding logical operation function as shown in Ta-
∂stl ∂stl+i
i=0
(
i=0
(8) ble 1, so as to ensure that the input and output of g function
0 are spike trains. Here, Eq.(10) is the full-spike propaga-
0, if 0 < Θ (stl − Vth ) < 1
→ , tion pathway, which could be either of two types of spiking
1, if Θ0 (stl − Vth ) = 1
ResNet. Eq.(11) denotes the plug-and-play Auxiliary Accu-
where Θ(x) is the Heaviside step function and Θ0 (x) is mulation Pathway (AAP). During the inference phase, the
defined by the surrogate gradient. The second equality hold auxiliary accumulation could be removed as needed.
as otl+i = SN(stl+i ). In view of the fact that stl could only
take 0 or 1 with identity mapping, Θ0 (stl − Vth ) = 1 is not Downsampling Block. Remarkably, when the input and
satisfied for commonly used surrogate functions mentioned output of one block have different dimensions, the shortcut
in (Neftci et al., 2019). When using the common Sigmoid is set as convolutional layers with stride > 1, rather than the
(Neftci et al., 2019) function as the Heaviside step function, identity connection, to perform downsampling. The Spiking
the gradient vanishing problem would be prone be happen. ResNet utilize {Conv-BN} without ReLU in the shortcut.
SEW ResNet (Fang et al., 2021a) adds an SN in shortcut as
Spiking Residual Blocks with ADD. As illustrated in shown in Figure 2(b). Figure 2(b) shows the downsampling
Figure 1(c), the residual block for SNN can be formulated of auxiliary accumulation, that the overhead of multiply
with an ADD function (Fang et al., 2021a; Zhou et al., 2023), accumulation in Dual-stream Blocks mainly comes from the
which can implement identity mapping and overcome the downsampling of the spike accumulation signal. Fortunately,
the number of downsampling in a network is always fixed,
1
github.com/iCGY96/syops-counter so the MAC-based downsampling operation of the spike
𝑡 𝑡 𝑡 𝑡
𝑡
𝑜𝑙−1 𝑡
= 𝑔(𝑠𝑙−1 𝑡
, 𝑜𝑙−2 ) 𝑡
𝑎𝑙−1 𝑜𝑙−1 = 𝑔(𝑠𝑙−1 , 𝑜𝑙−2 ) 𝑎𝑙−1
Table 1. List of element-wise functions g.
Conv+BN
Operator g(xtl , stl )
sn Conv+BN 𝑓𝑙 Conv+BN AND stl ∧ xtl = stl · xtl
Conv+BN
sn relu
IAND (¬stl ) ∧ xtl = (1 − stl ) · xtl
𝑠𝑙𝑡 𝑠𝑙𝑡 OR stl ∨ xtl = stl + xtl − (stl · xtl )
sn
+ +
𝑔 𝑔 XOR stl ⊕ xtl = stl · (1 − xtl ) + xtl · (1 − stl )
𝑜𝑙𝑡 𝑎𝑙𝑡 𝑜𝑙𝑡 𝑎𝑙𝑡
(a) SEW Block with DST (b) Downsample Block with DST
The Element-wise Logical Residuals. For the second
Multiply Accumulate Accumulation
spike propagation of Eq.(10), different element-wise func-
Figure 2. The SEW Block (Fang et al., 2021a) and its Downsample tions g in Table 1 satisfy identity mapping. Specifically, for
blocks with a dual-stream structure. IAND, OR and XOR as element-wise functions g, iden-
tity mapping could be achieved by setting stl ≡ 0. Then
accumulation pathway will not increase with the increase of otl = g(stl , otl−1 ) = g(SN(0), otl−1 ) = g(0, otl−1 ) = otl−1 .
network depth. Therefore, the increased computational over- This is consistent with the conditions for auxiliary accumu-
head of DSNN with the increase of network depth mainly lation Eq.(11) to achieve identity mapping and is applicable
comes from the accumulation calculation. to all neuron models. In contrast, for AND as the element-
Training. For backpropagation, the gradient could be back- wise function g, stl should be ones to get identity mapping.
propagated to these spiking neurons through the auxiliary Then otl = 1 ∧ otl−1 = otl−1 . Although the input parameters
accumulation to prevent the vanishing gradient caused by of spiking neuron models can be adjusted to practice identity
deeper layers. Therefore, in the training phase, the out- mapping, it is conflicted with the auxiliary accumulation
puts of both auxiliary accumulation and spike propagation pathway for identity mapping. This is not conducive to
are used as the final output and the gradient is calculated maintaining the consistency of signal propagation between
according to a consistent objective function Lc : the two pathways, thus affecting the effects of training and
final recognition. Meanwhile, it is hard to control some
spiking neuron models with complex neuronal dynamics to
L(x, y) = Lc (Os , y) + Lc (Oa , y), (12)
generate spikes at a specified time step.
where Os and Oa are the final outputs of auxiliary accumu- 4.3. Vanishing Gradient Problem
lation and spike propagation respectively.
Auxiliary Accumulation Pathway. The gradient for the
∂atl+k−1
4.2. Identity Mapping auxiliary accumulation pathway is calculated as ∂atl
=
∂atl
As stated in (He et al., 2016b), satisfying identity mapping ∂atl
= 1 with identity mapping. Since the above gradient
is crucial to training a deep network. is a constant, the auxiliary accumulation path of Eq. (11)
could also overcome the vanishing gradient problems.
Auxiliary Accumulation Pathway. For auxiliary accu-
mulation of Eq.(11), identity mapping is achieved by when The Spiking Neurons Residuals. Moreover, consider a
stl ≡ 0, which can be implemented by setting the weights Spiking ResNet with AAP, the gradient of the output of
and the bias of the last BN layer in fl to zero. the (l + k − 1)-th residual block with respect to the input
of the l-th residual block with identity mapping could be
calculated:
The Spiking Neurons Residuals. The first spike propa-
k−1
gation of Eq.(10) based on Spiking ResNet could imple- ∂otl+k−1 ∂atl+k−1 Y ∂otl+i ∂atl
ment identity mapping by partial spiking neurons. When + = +
∂stl ∂atl ∂stl+i ∂atl
fl (ot−1
l ) ≡ 0, otl = SN(ot−1l ) 6= ot−1
l . To transmit ot−1
l
i=0
t−1 t−1 k−1
and make SN(ol ) = ol , the last spiking neuron (SN) Y ∂atl
in the l-th residual block needs to fire a spike after receiving = Θ0 (stl+i − Vth ) + (13)
i=0
∂atl
a spike and keep silent after receiving no spike at time-step (
t. It works for IF neurons described by Eq. (4). Specifi- 1, if 0 < Θ0 (stl − Vth ) < 1
→ ,
cally, we can set 0 < Vth ≤ 1 and Vt−1 = 0 to ensure that 2, if Θ0 (stl − Vth ) = 1
ot = 1 leads to Ht ≥ Vth , and ot = 0 leads to Ht < Vth .
Hence, we replaced all the spiking neurons at the residual Since the above gradient is a constant, the Spiking ResNet
connections with IF neurons. with AAP could alleviate the vanishing gradient problems.
The Element-wise Logical Residuals. When the identity parameters are consistent with SEW ResNet (Fang et al.,
mapping is implemented for the spiking propagation path 2021a). As shown in Table 2, two variations of our DST are
of Eq.(10), the gradient of the output of the (l + k)-th dual- evaluated: “FSNN (DST)” means the FSNN is trained by
stream block with respect to the input of the l-th DSNN the proposed DST and AAP is removed in the test phase, and
block could be calculated layer by layer: “DSNN” means the APP is retained to improve the discrimi-
nation of features. “FSNN-18” and ”DSNN-18” represents
k
∂otl+k−1 Y ∂g(stl+i , xtl+i ) the SNN is designed based on ResNet-18, and so on. Three
= types of methods are compared respectively: “A2S” repre-
∂xtl i=0
∂xtl+i
sents ANN2SNN methods, “FSNN” and “MPSNN” means
∂(1·xtl+i )
Q
k

 i=0 ∂xtl+i , if g = AND
Full Spike Neural Networks and Mixed-Precision Spike
Neural Networks respectively. These notations keep the

 t
 k ∂((1−0)·x l+i )
 Q
, if g = IAND

i=0 ∂xtl+i same in the below.
= Qk ∂((1+0−0)·xt ) = 1.
l+i


 i=0 ∂xtl+i
, if g = OR As shown in Table 2, we can obtain 3 following key findings:

Qk ∂((1+0)·x t
l+i )


i=0 ∂xt
, if g = XOR First, the SOTA ANN2SNN methods (Li et al., 2021; Hu
l+i
(14) et al., 2018b) achieve higher accuracies than FSNN as well
The second equality holds as identity mapping is achieved as other SNNs trained from scratch, but they use 64 and
by setting stl+i ≡ 1 for g = AND, and stl+i ≡ 0 for 87.5 times as many time-steps as FSNN respectively, which
g = IAND/OR/XOR. Since the gradient in Eq. (14) is a means that they require more computational consumption.
constant, the spiking propagation path of Eq.(10) overcomes Since most of these methods do not provide the trained
the vanishing gradient problems. models and their DCs are not available, we only used EC to
evaluate the consumption. From Table 2, these models with
5. Experiments larger time steps also have larger computational consump-
tion. (Meng et al., 2022) also achieves good performance
5.1. Computational Consumption with ResNet-18, but its SNN acquisition process is relatively
complex. It needs ANN to pre-train the model first and then
To estimate the computational consumption of different retrain the SNN. Although its time steps are less than other
SNNs, we calculate the number of computations required ANN2SNN methods, it is still 12.5 times that of DSNN.
in terms of AC and MAC synaptic operations, respectively. Due to ANN2SNN being a different type of method from
Moreover, (Rueckauer et al., 2017) integrates batch nor- ours, the comparisons are listed just for reference.
malization (BN) layers into the weights of the preceding
layer with loss-less conversion. Therefore, we ignore BN Second, for full-spike neural networks, FSNN (DST) out-
operations when calculating the number of MAC operations, performs Spiking ResNet (Zheng et al., 2021; Deng et al.,
resulting in a more efficient inference consumption. See 2022) even with lower computational consumption. The
Appendix A for details of the fusion process of convolution performance of FSNN (DST) has a major advantage over
with BN. To quantitatively estimate energy consumption, other FSNNs and also improves with the increasing depth
we evaluate the computational consumption based on the of the network. It indicates that the proposed DST is indeed
number of AC and MAC operations and the data for various helpful to train FSNN.
operations in 45nm technology (Horowitz, 2014), where Finally, the performance of DSNN is further improved when
EMAC and EAC are 4.6pJ and 0.9pJ respectively. Here, AAP is added to FSNN at the inference phase. At the same
we calculate the Dynamic Consumption (DC) of the SNN time, the DSNN offers superior performance compared to
by Eq.(8) based on its spike firing rate on the target dataset. the mixed-precision SEW ResNet (Fang et al., 2021a). Note
Moreover, we use the Estimated Consumption (EC) to es- that the performance gap between SEW ResNet-34 and
timate the theoretical consumption range from [0, 100%] SEW ResNet-50 is not big, and the computational consump-
spike firing rate, that is tion of SEW ResNet-34 is higher than SEW ResNet-50.
This phenomenon comes from that ResNet-34 and ResNet-
E(F) ∈ [T · Emac · Omac , T · (Eac · Oac + Emac · Omac )].
50 use BasicBlock (He et al., 2016a) and Bottleneck (He
(15)
et al., 2016a) as blocks respectively. As shown in Figure 1
(b), the first-layer convolution of each block is regarded as
5.2. ImageNet Classification a MAC computation operation due to the non-spike data
We validate the effectiveness of our Dual-stream Training generated by ADD function. However, the first-layer con-
method on image classification of ImageNet (Deng et al., volution of BasicBlock is much larger than the synaptic
2009) dataset. The IF neuron model is adopted for the static operation of Bottleneck, which causes the computational
ImageNet dataset. For a fair comparison, all of our training consumption of SEW ResNet-34 to be greater than that of
Table 2. Comparison with previous Spiking ResNet on ImageNet. † denotes the estimated dynamic consumption based on the spike firing
rate provided in the corresponding paper. A2S represents ANN2SNN methods, FSNN and MPSNN mean Full Spike Neural Networks
and Mixed-Precision Spike Neural Networks respectively. FSNN (DST) represent the FSNN is trained by the proposed DST and DSNN
means APP is retained in the test phase.
Network Methods Acc@1 T EC(mJ) Oac (G) Omac (G) DC(mJ)
PreAct-ResNet-18 (Meng et al., 2022) A2S 67.74 50 [1.43, 77.84] - - -
Spiking ResNet-34 (Rathi et al., 2020) A2S 61.48 250 [7.31, 805.79] - - -
Spiking ResNet-34 (Li et al., 2021) A2S 74.61 256 [7.45, 825.29] - - -
Spiking ResNet-34 with td-BN (Zheng et al., 2021) FSNN 63.72 6 [0.69, 19.85] 5.34 0.15 5.50 †
Spiking ResNet-50 with td-BN (Zheng et al., 2021) FSNN 64.88 6 [1.10, 22.59] 6.01 0.24 6.52 †
Spiking ResNet-34 (Deng et al., 2022) FSNN 64.79 6 [0.69, 19.85] 5.34 0.15 5.50 †
FSNN-18 (DST) FSNN 62.16 4 [0.55, 4.31] 1.69 0.12 2.07
FSNN-34 (DST) FSNN 66.45 4 [0.55, 7.64] 3.42 0.12 3.63
FSNN-50 (DST) FSNN 67.69 4 [0.55, 10.42] 3.14 0.12 3.38
FSNN-101 (DST) FSNN 68.38 4 [0.55, 20.64] 4.42 0.12 4.53
SEW ResNet-18 (ADD) (Fang et al., 2021a) MPSNN 63.18 4 [12.65, 16.40] 0.51 2.75 13.11
SEW-ResNet-34 (ADD) (Deng et al., 2022) MPSNN 68.00 4 [29.72, 36.78] 0.86 6.46 30.50 †
DSNN-18 MPSNN 63.46 4 [0.92, 4.67] 1.69 0.20 2.44
DSNN-34 MPSNN 67.52 4 [0.92, 8.00] 3.42 0.20 4.00
DSNN-50 MPSNN 69.56 4 [6.30, 16.17] 3.20 1.37 9.18
DSNN-101 MPSNN 71.12 4 [6.30, 26.39] 4.48 1.37 10.33
ResNet-50. In contrast, DSNN ensures that the input and

Table 3. Comparison with Spiking ResNet methods on DVS Ges-
output of each block are spike data through the dual-stream
ture and CIFAR10-DVS. Once the ADD function is used as g(·),
mechanism, so that the theoretical computational consump- it will bring non-spiking operation to SEW (Fang et al., 2021a).
tion increases linearly with the increase of network depth. The comparison with it is listed just for reference.
Networks g(·) DVS Gesture CIFAR10-DVS T
5.3. DVS Classification ACC/DC(mJ)
SEW (Fang et al., 2021a) ADD 97.92/17.09 74.4/16.71 16
DVS Gesture. We also compare our method with SEW SEW (Fang et al., 2021a) IAND 95.49/1.48 - 16
ResNet-7B-Net (Fang et al., 2021a) on the DVS Gesture td-BN (Zheng et al., 2021) - 96.87/- 67.8/- 40/10
dataset (Amir et al., 2017), which contains 11 hand gestures FSNN-7 (DST) AND 55.56/2.59 70.9/3.57 16
from 29 subjects under 3 illumination conditions. We use the FSNN-7 (DST) OR 96.18/1.04 74.8/3.51 16
FSNN-7 (DST) XOR 96.53/1.19 74.0/3.41 16
similar network structure 7B-Net in (Fang et al., 2021a). As FSNN-7 (DST) IAND 97.57/1.10 73.1/4.65 16
shown in Table 3, FSNN-7 (DST) with IAND obtains better
performance than other FSNN methods, such as SEW with
IAND and td-BN, demonstrating the DST is effective to
5.4. Further Analysis
train FSNN. In addition, even though SEW utilizes IAND as
g(·) which brings non-spike operations, our FSNN-7 (DST) Computational Consumption Here, we analyze the
still achieves comparable performance, and only requires a computational consumption advantages of DSNN. First,
tenth of the computational consumption of SEW. the networks in most ANN2SNN methods are full-spike,
where few MAC operations are from batch normalization
and conversion of images to spikes. However, their com-
CIFAR10-DVS. We also evaluate Spiking ResNet models putational consumption is still very large due to the large
on the CIFAR10-DVS dataset (Li et al., 2017), which is time steps. DSNN and FSNN has fewer time steps than
obtained by recording the moving images of the CIFAR-10 ANN2SNN methods, which means that its theoretical max-
dataset on an LCD monitor by a DVS camera. We use Wide- imum consumption is much smaller than ANN2SNN as
7B-DSNN which is a similar network structure Wide-7B- shown in Table 2. Second, the DSNN has lower EC and
Net in (Fang et al., 2021a). As shown in Table 3, FSNN-7 DC than Spiking ResNet based on addition (SEW ResNet),
(DST) achieves better performance and lower computational and achieves better performance. The additive-based SEW
consumption than the previous Spiking ResNet (Zheng et al., ResNet will increase its AC and MAC as the network gets
2021) and SEW ResNet (Fang et al., 2021a). deeper, which will bring about a multifold increase in com-
67.91
Spiking ResNet50 65.72
57.66
67
61.86
64.3
62.32 (a) FSNN (DST) (b) DSNN
50 55 60 65 70
Figure 4. The the t-SNE (Maaten & Hinton, 2008) plots of embed-
w/ DST + SA w/ DST w/o DST
ding features on DVS Gesture.
Figure 3. Training Spiking ResNet from scratch on ImageNet Spike Transformer. We also test the role of DST on the
with/without Dual-Stream Training. Transformer on CIFAR100. The latest method Spikeformer
(Zhou et al., 2023) utilizes ADD as its residual connec-
putational consumption. In contrast, the main MAC oper-
tion and achieve SOTA performance. However, the ADD
ation for DSNN comes from the downsampling process of
introduces additional computational consumption from non-
accumulated signals in spike accumulation. Therefore, a
spike data. As shown in Table 4, once the IAND is replaced
deeper DSNN only increases the AC operation, and its MAC
by IAND, the performance of the full-spike Spikeformer
operation is constant as shown in Table 2. Finally, it also
will drop. After introducing APP as shown in Figure 1, the
reveals for the first time the positive relationship between
Spikeformer with a full-spike signal achieves similar perfor-
computational consumption and the performance of SNNs.
mance to the original mixed-precision Spikeformer. Notably,
In addition, the performance of DSNN increases gradually
there is no downsampling in the Spikeformer, which fur-
with the increase in computational consumption. It indicates
ther exploits the advantages of AAP while not introducing
the added computational consumption of deeper DSNN is
additional MAC operations. The small amount of MAC op-
meaningful.
erations comes mainly from the image-to-spike conversion
and the classifier operations.
Vanishing Gradient Problem. As mentioned in Sec-
tion 3.3, Spiking ResNet suffers from a vanishing gradi-
ent problem. As shown in Figure 3, the performance of Table 4. Learning Spike Transformer (Zhou et al., 2023) with DST.
the Spiking ResNet gradually decreases as the number of ADD will bring additional computational consumption from non-
network layers increases. Network depth did not bring spike data. Spikformer (IAND) is the full-spike version of Spik-
additional gain to the original Spiking ResNet. With the former, and Spikformer (IAND) w/ DST means the full-spike
Spikformer is trained by our DST.
addition of DST, the performance of the Spiking ResNet
gradually increases with increasing network depth. Also, Networks Method Acc DC(mJ)
the performance is superior with the addition of AAP. As Spikformer (ADD) MPSNN 77.21 5.27
demonstrated in Eq.(13), DST could help alleviate the van- Spikformer (IAND) FSNN 75.54 0.82
ishing gradient problem of the Spiking ResNet. Spikformer (IAND) w/ DST FSNN 76.96 0.84
Feature Enhancement. Here, we analyze the effects of

auxiliary accumulation in terms of improving feature dis- 6. Conclusion & Outlook
crimination. Moreover, AAP increases the feature discrim-
inant of forward as shown in Figure 4. Under fewer time In this paper, we point out that the main contradiction of
steps, the spike feature is almost a binary feature, because FSNNs comes from information loss of full spike propaga-
its discrete property limits the discriminant of its feature. tion. Therefore, we propose the Auxiliary Accumulation
The discrimination of features from spike propagation is not Pathway with consistent identity mapping which is able to
enough, which affects the overall performance. On the other compensate for the loss of information forward and back-
hand, we could increase the number of deep spike features ward from full spike propagation, so as to keep computation-
by increasing time steps, so as to improve the discrimination ally efficient and high-performance recognition simultane-
by increasing the number of features, which often leads to ously. The experiments on the ImageNet, DVS Gesture, and
too high computational consumption as shown in Table 2. CIFAR10-DVS indicate that DST could improve the ResNet-
More importantly, AAP separates the computation of AC based and Transformer-baed SNNs trained from scratch in
operations from the MAC computation of recognition, al- both accuracy and computation consumption. Looking for-
lowing SNN to take full advantage of its low computational ward, this SNN with a dual-stream structure may provide a
consumption. reference for the design of neuromorphic hardware, which
could improve computational efficiency by separating spike Deng, S. and Gu, S. Optimal conversion of conventional
and non-spike computation. In addition, the proposed dual- artificial neural networks to spiking neural networks. In
streams mechanism is similar to the dual-streams object International Conference on Learning Representations
recognition pathway in the human brain, which may provide (ICLR), 2021. URL https://openreview.net/
a new potential direction for SNN structure design. forum?id=FZ1oTwcXchK.
Deng, S., Li, Y., Zhang, S., and Gu, S. Temporal effi-
References
cient training of spiking neural network via gradient re-
Amir, A., Taba, B., Berg, D., Melano, T., McKinstry, J., weighting. In International Conference on Learning Rep-
Di Nolfo, C., Nayak, T., Andreopoulos, A., Garreau, G., resentations, 2022.
Mendoza, M., Kusnitz, J., Debole, M., Esser, S., Del-
bruck, T., Flickner, M., and Modha, D. A low power, Ding, X., Guo, Y., Ding, G., and Han, J. Acnet: Strengthen-
fully event-based gesture recognition system. In Proceed- ing the kernel skeletons for powerful cnn via asymmetric
ings of the IEEE Conference on Computer Vision and convolution blocks. In Proceedings of the IEEE Interna-
Pattern Recognition (CVPR), pp. 7243–7252, 2017. tional Conference on Computer Vision, pp. 1911–1920,
2019.
Cao, Y., Chen, Y., and Khosla, D. Spiking deep convolu-
tional neural networks for energy-efficient object recog- Ding, X., Zhang, X., Ma, N., Han, J., Ding, G., and Sun,
nition. International Journal of Computer Vision, 113(1): J. Repvgg: Making vgg-style convnets great again. In
54–66, 2015. Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 13733–13742, 2021.
Chen, G. and Chen, Z. Saliency detection by superpixel-
based sparse representation. In Advances in Multimedia Fang, W., Yu, Z., Chen, Y., Huang, T., Masquelier, T., and
Information Processing–PCM 2017: 18th Pacific-Rim Tian, Y. Deep residual learning in spiking neural net-
Conference on Multimedia, Harbin, China, September works. Advances in Neural Information Processing Sys-
28-29, 2017, Revised Selected Papers, Part II 18, pp. tems, 34:21056–21069, 2021a.
447–456. Springer, 2018.
Fang, W., Yu, Z., Chen, Y., Masquelier, T., Huang, T., and
Chen, G., Qiao, L., Shi, Y., Peng, P., Li, J., Huang, T., Tian, Y. Incorporating learnable membrane time constant
Pu, S., and Tian, Y. Learning open set network with to enhance learning of spiking neural networks. In Pro-
discriminative reciprocal points. In Computer Vision– ceedings of the IEEE/CVF International Conference on
ECCV 2020: 16th European Conference, Glasgow, UK, Computer Vision (ICCV), pp. 2661–2671, 2021b.
August 23–28, 2020, Proceedings, Part III 16, pp. 507–
522. Springer, 2020. Fang, W., Chen, Y., Ding, J., Chen, D., Yu, Z.,
Zhou, H., Tian, Y., and other contributors. Spiking-
Chen, G., Peng, P., Ma, L., Li, J., Du, L., and Tian, Y.
jelly. https://github.com/fangwei123456/
Amplitude-phase recombination: Rethinking robustness
spikingjelly, 2022.
of convolutional neural networks in frequency domain. In
Proceedings of the IEEE/CVF International Conference
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich fea-
on Computer Vision, pp. 458–467, 2021a.
ture hierarchies for accurate object detection and semantic
Chen, G., Peng, P., Wang, X., and Tian, Y. Adversarial segmentation. In Proceedings of the IEEE Conference
reciprocal points learning for open set recognition. IEEE on Computer Vision and Pattern Recognition (CVPR), pp.
transactions on pattern analysis and machine intelligence, 580–587, 2014.
2021b. doi: 10.1109/TPAMI.2021.3106743.
Han, B. and Roy, K. Deep spiking neural network: En-
Comsa, I. M., Potempa, K., Versari, L., Fischbacher, T., ergy efficiency through time based coding. In European
Gesmundo, A., and Alakuijala, J. Temporal coding in Conference on Computer Vision (ECCV), pp. 388–404,
spiking neural networks with alpha synaptic function. 2020.
In International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 8529–8533. IEEE, 2020. Han, B., Srinivasan, G., and Roy, K. Rmp-snn: Residual
membrane potential neuron for enabling deeper high-
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, accuracy and low-latency spiking neural network. In Pro-
L. Imagenet: A large-scale hierarchical image database. ceedings of the IEEE/CVF Conference on Computer Vi-
In 2009 IEEE conference on computer vision and pattern sion and Pattern Recognition (CVPR), pp. 13558–13567,
recognition, pp. 248–255. Ieee, 2009. 2020.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual e2e5096d574976e8f115a8f1e0ffb52b-Paper.
learning for image recognition. In Proceedings of the pdf.
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 770–778, 2016a. Krizhevsky, A., Hinton, G., et al. Learning multiple layers
of features from tiny images. 2009.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings
in deep residual networks. In European Conference on Krizhevsky, A., Sutskever, I., and Hinton, G. E.
Computer Vision (ECCV), pp. 630–645. Springer, 2016b. Imagenet classification with deep convolutional
neural networks. In Advances in Neural Infor-
Horowitz, M. 1.1 computing’s energy problem (and what we mation Processing Systems (NeurIPS), pp. 1097–
can do about it). In 2014 IEEE International Solid-State 1105, 2012. URL https://proceedings.
Circuits Conference Digest of Technical Papers (ISSCC), neurips.cc/paper/2012/file/
pp. 10–14. IEEE, 2014. c399862d3b9d6b76c8436e924a68c45b-Paper.
Hu, Y., Tang, H., and Pan, G. Spiking deep residual net- pdf.
works. IEEE Transactions on Neural Networks and
Lee, C., Sarwar, S. S., Panda, P., Srinivasan, G., and Roy, K.
Learning Systems, 2018a.
Enabling spike-based backpropagation for training deep
Hu, Y., Tang, H., Wang, Y., and Pan, G. Spiking deep resid- neural network architectures. Frontiers in Neuroscience,
ual network. arXiv preprint arXiv:1805.01352, 2018b. 14, 2020.
Huh, D. and Sejnowski, T. J. Gradient descent for Lee, J. H., Delbruck, T., and Pfeiffer, M. Training deep
spiking neural networks. In Advances in Neural spiking neural networks using backpropagation. Frontiers
Information Processing Systems (NeurIPS), pp. 1440– in Neuroscience, 10:508, 2016.
1450, 2018. URL https://proceedings.
neurips.cc/paper/2018/file/ Li, H., Liu, H., Ji, X., Li, G., and Shi, L.
185e65bc40581880c4f2c82958de8cfe-Paper. Cifar10-dvs: An event-stream dataset for object
pdf. classification. Frontiers in Neuroscience, 11:309,
2017. ISSN 1662-453X. doi: 10.3389/fnins.2017.
Hunsberger, E. and Eliasmith, C. Spiking deep networks 00309. URL https://www.frontiersin.org/
with lif neurons. arXiv preprint arXiv:1510.08829, 2015. article/10.3389/fnins.2017.00309.
Hwang, S., Chang, J., Oh, M.-H., Min, K. K., Jang, T., Li, Y., Deng, S., Dong, X., Gong, R., and Gu, S. A free
Park, K., Yu, J., Lee, J.-H., and Park, B.-G. Low-latency lunch from ann: Towards efficient, accurate spiking neu-
spiking neural networks using pre-charged membrane po- ral networks calibration. In International Conference on
tential and delayed evaluation. Frontiers in Neuroscience, Machine Learning (ICML), volume 139, pp. 6316–6325,
15:135, 2021. 2021. URL https://proceedings.mlr.press/
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating v139/li21d.html.
deep network training by reducing internal covariate shift.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu,
In International conference on machine learning, pp. 448–
C.-Y., and Berg, A. C. Ssd: Single shot multibox detector.
456. PMLR, 2015.
In European Conference on Computer Vision (ECCV), pp.
Kheradpisheh, S. R. and Masquelier, T. Temporal backprop- 21–37. Springer, 2016.
agation for spiking neural networks with one spike per
neuron. International Journal of Neural Systems, 30(06): Loshchilov, I. and Hutter, F. SGDR: stochastic gradient de-
2050027, 2020. scent with warm restarts. In International Conference on
Learning Representations (ICLR), 2017. URL https:
Kim, J., Kim, H., Huh, S., Lee, J., and Choi, K. Deep neural //openreview.net/forum?id=Skq89Scxx.
networks with weighted spikes. Neurocomputing, 311:
373–386, 2018. Ma, L., Peng, P., Chen, G., Zhao, Y., Dong, S., and Tian,
Y. Picking up quantization steps for compressed image
Kim, J., Kim, K., and Kim, J.-J. Unifying activation- classification. IEEE Transactions on Circuits and Systems
and timing-based learning rules for spiking neu- for Video Technology, 2022.
ral networks. In Advances in Neural Informa-
tion Processing Systems (NeurIPS), pp. 19534– Maaten, L. v. d. and Hinton, G. Visualizing data using
19544, 2020. URL https://proceedings. t-sne. Journal of machine learning research, 9(Nov):
neurips.cc/paper/2020/file/ 2579–2605, 2008.
Meng, Q., Xiao, M., Yan, S., Wang, Y., Lin, Z., and Luo, Z.- Sengupta, A., Ye, Y., Wang, R., Liu, C., and Roy, K. Going
Q. Training high-performance low-latency spiking neural deeper in spiking neural networks: Vgg and residual
networks by differentiation on spike representation. In architectures. Frontiers in neuroscience, 13:95, 2019.
Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 12444–12453, 2022. Shrestha, S. B. and Orchard, G. Slayer: Spike layer error
reassignment in time. Advances in neural information
Mostafa, H. Supervised learning based on temporal coding processing systems, 31, 2018.
in spiking neural networks. IEEE Transactions on Neural
Simonyan, K. and Zisserman, A. Very deep convolu-
Networks and Learning Systems, 29(7):3227–3235, 2017.
tional networks for large-scale image recognition. In
Neftci, E. O., Mostafa, H., and Zenke, F. Surrogate gradient International Conference on Learning Representations
learning in spiking neural networks: Bringing the power (ICLR), 2015. URL http://arxiv.org/abs/
of gradient-based optimization to spiking neural networks. 1409.1556.
IEEE Signal Processing Magazine, 36(6):51–63, 2019. Stöckl, C. and Maass, W. Optimized spiking neurons can
classify images with high accuracy through temporal cod-
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
ing with two spikes. Nature Machine Intelligence, 3(3):
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga,
230–238, 2021.
L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Rai-
son, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
L., Bai, J., and Chintala, S. Pytorch: An imperative style, Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
high-performance deep learning library. In Advances in novich, A. Going deeper with convolutions. In Proceed-
Neural Information Processing Systems (NeurIPS), pp. ings of the IEEE/CVF Conference on Computer Vision
8026–8037, 2019. URL https://proceedings. and Pattern Recognition (CVPR), pp. 1–9, 2015. doi:
neurips.cc/paper/2019/file/ 10.1109/CVPR.2015.7298594.
bdbca288fee7f92f2bfa9f7012727740-Paper.
pdf. Tavanaei, A., Ghodrati, M., Kheradpisheh, S. R., Masque-
lier, T., and Maida, A. Deep learning in spiking neural
Rathi, N. and Roy, K. Diet-snn: Direct input encoding with networks. Neural Networks, 111:47–63, 2019.
leakage and threshold optimization in deep spiking neural
Wu, Y., Deng, L., Li, G., Zhu, J., and Shi, L. Spatio-
networks. arXiv preprint arXiv:2008.03658, 2020.
temporal backpropagation for training high-performance
Rathi, N., Srinivasan, G., Panda, P., and Roy, K. Enabling spiking neural networks. Frontiers in Neuroscience, 12:
deep spiking neural networks with hybrid conversion and 331, 2018.
spike timing dependent backpropagation. In Interna- Xiao, M., Meng, Q., Zhang, Z., He, D., and Lin, Z. On-
tional Conference on Learning Representations (ICLR), line training through time for spiking neural networks.
2020. URL https://openreview.net/forum? In Advances in Neural Information Processing Systems,
id=B1xSperKvH. 2022.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You Xing, F., Yuan, Y., Huo, H., and Fang, T. Homeostasis-
only look once: Unified, real-time object detection. In based cnn-to-snn conversion of inception and residual
Proceedings of the IEEE Conference on Computer Vision architectures. In International Conference on Neural
and Pattern Recognition (CVPR), pp. 779–788, 2016. Information Processing, pp. 173–184. Springer, 2019.
Roy, K., Jaiswal, A., and Panda, P. Towards spike-based ma- Zhang, W. and Li, P. Temporal spike sequence
chine intelligence with neuromorphic computing. Nature, learning via backpropagation for deep spiking
575(7784):607–617, 2019. neural networks. In Advances in Neural Infor-
mation Processing Systems (NeurIPS), pp. 12022–
Rueckauer, B., Lungu, I.-A., Hu, Y., Pfeiffer, M., and Liu, 12033, 2020. URL https://proceedings.
S.-C. Conversion of continuous-valued deep networks to neurips.cc/paper/2020/file/
efficient event-driven networks for image classification. 8bdb5058376143fa358981954e7626b8-Paper.
Frontiers in Neuroscience, 11:682, 2017. pdf.
Samadzadeh, A., Far, F. S. T., Javadi, A., Nickabadi, A., Zheng, H., Wu, Y., Deng, L., Hu, Y., and Li, G. Go-
and Chehreghani, M. H. Convolutional spiking neural ing deeper with directly-trained larger spiking neu-
networks for spatio-temporal feature extraction. arXiv ral networks. In Proceedings of the AAAI Confer-
preprint arXiv:2003.12346, 2020. ence on Artificial Intelligence, volume 35, pp. 11062–
11070, 2021. URL https://ojs.aaai.org/

index.php/AAAI/article/view/17320.
Zhou, S., Li, X., Chen, Y., Chandrasekaran, S. T.,
and Sanyal, A. Temporal-coded deep spiking neu-
ral network with easy training and robust perfor-
mance. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 35, pp. 11143–11151,
2021. URL https://ojs.aaai.org/index.
php/AAAI/article/view/17329.
Zhou, Z., Zhu, Y., He, C., Wang, Y., Yan, S., Tian, Y., and
Yuan, L. Spikformer: When spiking neural network meets
transformer. 2023.
A. The Fusion of Conv + BN

The fusion of convolution and batch normalization. It should be noted that the spike signal becomes the floating point
once it has passed through the convolution layer, which leads to subsequent MAC operations from the Batch Normalization
(BN). However, the homogeneity of convolution allows the following BN and linear scaling transformation to be equivalently
fused into the convolutional layer with an added bias (Ding et al., 2019; 2021). Specifically, each BN and its preceding
convolution layer into a convolution with a bias vector. Let {W0 , B0 } be the kernel and bias converted from {W, µ, σ, γ, β},
we have
γ µγ
W0 = i W , B0i = − i i + β i . (16)
σi σi
Then it is easy to verify that,
bn(M ∗ W, µ, σ, γ, β) = (X ∗ W0 ) + B0 . (17)
Therefore, when calculating the theoretical computational consumption, ignore the consumption of the BN could be ignored.
B. Implementation Details
All experiments are implemented with SpikingJelly (Fang et al., 2022), which is an open-source deep learning framework
for SNNs based on PyTorch (Paszke et al., 2019). The source code is included in the supplementary. The hyper-parameters
of the DSNN for different datasets are shown in Table 5. The pre-processing data methods for three datasets are as follows:
Table 5. Hyper-parameters of the DSNN for ImageNet, DVS Gesture, and CIFAR10-DVS datasets. CA denotes the Cosine Annealing
(Loshchilov & Hutter, 2017).
Dataset Learning Rate Scheduler Epochs lr Batch Size T
ImageNet CA, Tmax = 320 320 0.1 128 4
DVS Gesture Step, Tstep = 64.γ = 0.1 192 0.005 16 16
CIFAR10-DVS CA, Tmax = 64 64 0.01 16 16
CIFAR100 CA 310 5e − 4 128 4
ImageNet. The data augmentation methods used in (He et al., 2016a) are also applied in our experiments. A 224×224 crop
is randomly sampled from an image or its horizontal flip with data normalization for train samples. A 224×224 resize and
central crop with data normalization are applied for test samples.
DVS Gesture. We utilize random temporal delete (Fang et al., 2021a) to relieve overfitting. Denote the sequence length as
T , we randomly delete T − Ttrain slices in the original sequence and use Ttrain slices during training. During inference we
use the whole sequence, that is, Ttest = T . We set Ttrain = 12, T = 16 in all experiments on DVS Gesture.
CIFAR10-DVS. We use the AER data pre-processing in (Fang et al., 2021b) for DVS128 Gesture. We do not use random
temporal delete because CIFAR10-DVS is obtained by static images.
CIFAR100. All hyperparameters are consistent with (Zhou et al., 2023).
35
35
30 30
25 25
20 20
15 15
10 10
5 5
0 0
1 2 3 4 5 6 7 1 2 3 4 5 6 7
OR XOR IAND AND OR XOR IAND AND
(a) Firing rates of 𝑠𝑙 (b) Firing rates of 𝑜𝑙
Figure 5. Firing rates of Dual-stream Blocks on DVS Gesture.

C. More Analysis
C.1. Analysis of Spiking Response of Dual-Stream Block
Furthermore, we analyze the firing rates of DSNN, which are closely related to computational power consumption. Figure 5(a)
shows the firing rates of ol for the Dual-Stream Block on DVS Gesture, where all spiking neurons in dual-stream blocks
have low firing rates, and the spiking neurons in the first two blocks even have firing rates of almost zero. All firing rates in
the dual-stream blocks are not larger than 0.5, indicating that all neurons fire on average not more than two spikes. The
firing rates of ol in the first few blocks are at a low level, verifying that most dual-stream blocks act as identity mapping. As
the depth of the network accelerates, the fire rate increases, increasing the ability to express features for better recognition.
C.2. Evaluation of Different Element-wise Functions

We evaluate all kinds of element-wise functions g on CIFAR10-DVS and DVS Gesture as shown in Table 3. As mentioned
above, both pathways of DSNN should achieve identity mapping under the same conditions, and IAND, OR, and XOR
(except AND) satisfy this. As shown in Table 3, the performance of AND is lower than other functions as expected. It also
indicates the necessity of identity mapping. Moreover, IAND, OR, and XOR all achieved relatively good performance
on CIFAR10-DVS and DVS Gesture. On the other hand, Figure 5 shows the fire rates for different element-wise g(·)
functions. For different element-wise g(·) functions, the output sl of each block is not obviously different. However, for
the fire rate after the element-wise g(·) function, the output ol gap of each block is obvious. As shown in Figure 5(b),
OR > XOR > IAND > AND for the firing rate. For datasets with different complexity, different firing rates can be
needed for better recognition. For example, IAND achieves the best performance for a simple DVS dataset. However, for
the slightly complex CIFAR10-DVS, OR has a higher rate to improve the feature expression ability, thus achieving the best
performance.
C.3. Gradients Check on Deeper DSNN

Eq. (11) and Eq. (12) analyze the gradients of multiple blocks with identity mapping. To verify that DSNN can overcome
the vanishing/exploding gradient, we check the gradients of the deeper DSNN with 50 layers. In this paper, the surrogate
gradient method (Neftci et al., 2019) is used to define Θ0 (x) , σ 0 (x) during error back-propagation, with σ(x) denote
the surrogate function. The surrogate gradient function we used in all experiments is σ(x) = π1 arctan( π2 αx) + 12 , thus
σ 0 (x) = 2(1+( απ αx)2 ) .
2
(a) Firing rate of ol (b) Firing rate of sl
Figure 6. The initial firing rates of output ol and sl in l-th block on 50 layer network.
Initial Firing Rates As the gradients of SNNs are significantly influenced by initial firing rates (Fang et al., 2021a), he
we analyze the firing rate. Figure 6 shows the initial firing rate of l-th block’s output ol , where the mutations occur due to
downsampling blocks. As shown in Figure 6(a), the silence problem happens in the DSNN with AND (yellow curve). When
using AND, otl = SN(f l (otl−1 )) ∧ otl−1 ≤ otl−1 . Since it is hard to keep SN(f l (otl−1 )) = 1 at each time-step t, the silence
problem may frequently happen in DSNN with AND. In contrast, compared with AND, using OR, XOR and IAND could
easily maintain a certain firing rate. Figure 6(b) shows the firing rate of sl = SN(f l (otl−1 )), which represents the output of
the last SN in l-th block. It shows that although the firing rate of ol in DSNN with OR, XOR and IAND could increase
constantly with the depth of networks in theory, the last SN in each block still maintains a stable firing rate in practice.
Vanishing Gradient To analysis the vanishing gradient, we set Vth = 1 and α = 2 in the surrogate function σ(x). In this
case, σ 0 (x) ≤ σ 0 (0) = σ 0 (1 − Vth ) = 1 and σ 0 (0 − Vth ) = 0.092 < 1, and transmitting spikes to SNs is prone to causing
∂L
vanishing gradient. The gradient amplitude ∂S l of each block is shown in Figure 7(a-d), and DSNN do not be affected no
matter what g we choose.This is caused that in the identity mapping areas, sl is constant for all index l, and the gradient
also becomes a constant as it will not flow through SNs. Compared with the case where only spike propagation exists
(Figure 7(e-h)), the gradient in DSNN is more stable. In addition, DSNN even could significantly improve AND which
would limit the neuron firing and affect the gradient. It indicates the effectiveness of the DSNN for the vanishing gradient.
Exploding Gradient Similar, here we set Vth = 1, α = 3 in the surrogate function σ(x) to analysis the exploding
gradient, where σ 0 (1 − Vth ) = 1.5 > 1 and transmitting spikes to SNs is prone to causing exploding gradient. As shown
in Figure 8 (a-d), exploding gradient problem in DSNN with OR, XOR, IAND, and AND is not serious. Compared to
Spiking ResNet without spike accumulation (Figure 8 (d-h)), DSNN has a more gradual gradient.
Silence Problem of AND Moreover, some vanishing gradient happens in the Spiking ResNet with AND as shown in
Figure 7(h) and 8(h) (The gradient of AND is 0), which is caused by the silence problem. The attenuation of the gradient
caused by the silent problem can be alleviated by the backpropagation of the spike accumulation path as shown in igure 7
and 8. The gradients of DSNN with OR, XOR, IAND, and AND increase slowly when propagating from deeper layers to
shallower layers.
In general, DSNN could overcome the vanishing or exploding gradient problem well through spike accumulation. At the
same time, spike accumulation accumulates the output spikes of each block, thus increasing the discriminability of the
features in network inference.
(a) OR (b) XOR (c) IAND (d) AND
(e) OR w/o SA (f) XOR w/o SA (g) IAND w/o SA (h) AND w/o SA
∂L
Figure 7. Gradient amplitude ∂sl
of l-th block when Vth = 1, α = 2 in the surrogate function σ(x).
(a) OR (b) XOR (c) IAND (d) AND
(e) OR w/o SA (f) XOR w/o SA (g) IAND w/o SA (h) AND w/o SA
∂L
Figure 8. Gradient amplitude ∂sl
of l-th block when Vth = 1, α = 3 in the surrogate function σ(x).

Training Full Spike Neural Networks Via Auxiliary Accumulation Pathway

Uploaded by

Copyright:

Available Formats

Training Full Spike Neural Networks Via Auxiliary Accumulation Pathway

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Training Full Spike Neural Networks Via Auxiliary Accumulation Pathway

Uploaded by

Copyright:

Available Formats

Training Full Spike Neural Networks via Auxiliary Accumulation Pathway

Guangyao Chen 1 2 Peixi Peng 1 2 Guoqi LI 3 Yonghong Tian 1 2

Auxiliary Accumulation Pathway

Auxiliary Accumulation Pathway

𝑦𝑙 𝑜𝑙𝑡 𝑎𝑙𝑡 𝑜𝑙𝑡 𝑡

ResNet-50. In contrast, DSNN ensures that the input and

Feature Enhancement. Here, we analyze the effects of

11070, 2021. URL https://ojs.aaai.org/

A. The Fusion of Conv + BN

OR XOR IAND AND OR XOR IAND AND

(a) Firing rates of 𝑠𝑙 (b) Firing rates of 𝑜𝑙

Figure 5. Firing rates of Dual-stream Blocks on DVS Gesture.

C.2. Evaluation of Different Element-wise Functions

C.3. Gradients Check on Deeper DSNN

(a) Firing rate of ol (b) Firing rate of sl

(a) OR (b) XOR (c) IAND (d) AND

(a) OR (b) XOR (c) IAND (d) AND

You might also like