4D Attention-based Neural Network for EEG Emotion Recognition
Guowen Xiao1 • Mengwen Ye2 • Bowen Xu1 • Zhendi Chen1 • Quansheng Ren1*

Abstract
Electroencephalograph (EEG) emotion recognition is a significant task in the brain-computer interface field. Although many deep learning methods have been proposed recently, it remains challenging to make full use of the information contained in the different domains of EEG signals. In this paper, we present a novel method, called the four-dimensional attention-based neural network (4D-aNN), for EEG emotion recognition. First, raw EEG signals are transformed into 4D spatial-spectral-temporal representations. Then, the proposed 4D-aNN adopts spectral and spatial attention mechanisms to adaptively assign weights to different brain regions and frequency bands, and a convolutional neural network (CNN) is utilized to deal with the spectral and spatial information of the 4D representations. Moreover, a temporal attention mechanism is integrated into a bidirectional Long Short-Term Memory (LSTM) network to explore the temporal dependencies of the 4D representations. Our model achieves state-of-the-art performance on the SEED dataset under intra-subject splitting. The experimental results demonstrate the effectiveness of attention mechanisms in different domains for EEG emotion recognition.

Keywords: EEG, emotion recognition, attention mechanism, convolutional recurrent neural network

1 Department of Electronics, Peking University, Beijing, China
2 School of Electrical Engineering, Beijing Jiaotong University, Beijing, China
*Corresponding author: Quansheng Ren (Email: [email protected])
Introduction

Emotion plays an important role in daily life and is closely related to human behavior and cognition (Dolan 2002). As one of the most significant research topics of affective computing, emotion recognition has received increasing attention in recent years for its applications in disease detection (Bamdad et al. 2015; Figueiredo et al. 2019), human-computer interaction (Fiorinia et al. 2020; Katsigiannis and Ramzan 2017), and workload estimation (Blankertz et al. 2016). In general, emotion recognition methods can be divided into two categories (Mühl et al. 2014). One is based on external emotion responses, including facial expressions and gestures (Yan et al. 2016), and the other is based on internal emotion responses, including the electroencephalograph (EEG) and electrocardiography (ECG) (Zheng et al. 2017). Neuroscientific research has shown that some major brain cortex regions are closely related to emotions, making it possible to decode emotions based on EEG (Brittona et al. 2006; Lotfia and Akbarzadeh-T 2014). EEG is non-invasive, portable, and inexpensive, so it has been widely used in the field of brain-computer interfaces (BCIs) (Pfurtscheller et al. 2010). Besides, EEG signals contain various spatial, spectral, and temporal information about emotions evoked by specific stimulation patterns. Therefore, more and more researchers have recently concentrated on EEG emotion recognition (Alhagry et al. 2017; Li and Lu 2009).

Traditional EEG emotion recognition methods usually extract hand-crafted features from EEG signals first and then adopt shallow models to classify the emotion features. EEG emotion features can be extracted from the time domain, frequency domain, and time-frequency domain. Jenke et al. conduct a comprehensive survey on EEG feature extraction methods by using machine learning techniques on a self-recorded dataset (Jenke et al. 2014). For classifying the extracted emotion features, many researchers have adopted machine learning methods over the past few years (Kim et al. 2013). Li et al. apply a linear support vector machine (SVM) to classify emotion features extracted from the gamma frequency band (Li and Lu 2009). Duan et al. extract differential entropy (DE) features, which are superior for representing emotion states in EEG signals (Shi et al. 2013), from multichannel EEG data and combine a k-Nearest Neighbor (KNN) classifier with an SVM to classify the DE features (Duan et al. 2013). However, shallow models require a lot of expert knowledge to design and select emotion features, limiting their performance on EEG emotion classification.

Deep learning methods have been demonstrated to outperform traditional machine learning methods in many fields such as computer vision, natural language processing, and biomedical signal processing (Abbass et al. 2018; Craik et al. 2019) for their ability to learn high-level features from data automatically (Krizhevsky et al. 2012). Recently, some researchers have applied deep learning to EEG emotion recognition. Zheng et al. introduce a deep belief network (DBN) to investigate the critical frequency bands and EEG signal channels for EEG emotion recognition (Zheng and Lu 2015). Yang et al. propose a hierarchical network to classify the DE features extracted from different frequency bands (Yang et al. 2018b). Song et al. use a graph convolutional neural network to classify the DE features (Song et al. 2020). Ma et al. propose a multimodal residual Long Short-Term Memory model (MMResLSTM) for emotion recognition, which shares temporal weights across the multiple modalities (Jiaxin Ma et al. 2019). To learn the bi-hemispheric discrepancy for EEG emotion recognition, Li et al. propose a novel bi-hemispheric discrepancy model (BiHDM) (Li et al. 2020). All of these deep learning methods outperform the shallow models.

Although deep learning emotion recognition models have achieved higher accuracy than shallow models, it is still challenging to fuse important information from different domains and to capture discriminative local patterns in EEG signals. In the past decades, many researchers have investigated the critical frequency bands and channels for EEG emotion recognition. Zheng et al. demonstrate that the β [14~31 Hz] and γ [31~51 Hz] bands are more related to emotion recognition than the other bands, and their model achieves the best performance when combining all frequency bands. They also conduct experiments to select critical channels and propose minimum pools of electrode sets for emotion recognition (Zheng and Lu 2015). To utilize the spatial information of EEG signals, Li et al. propose a 2D sparse map to maintain the information hidden in the electrode placement (Li et al. 2018). Zhong et al. introduce a regularized graph neural network (RGNN) to capture both local and global relations among different EEG channels for emotion recognition (Zhong et al. 2020). The temporal dependencies in EEG signals are also important for emotion recognition. For example, Ma et al. (Jiaxin Ma et al. 2019) apply LSTMs in their models to extract temporal features for emotion recognition. Shen et al. transform the DE features of different channels into 4D structures to integrate the spectral, spatial, and temporal information simultaneously and then use a four-dimensional convolutional recurrent neural network (4D-CRNN) to recognize different emotions (Shen et al. 2020). However, the differences among brain regions and frequency bands are not fully utilized in their work. To adaptively capture discriminative patterns in EEG signals, attention mechanisms have been applied to EEG emotion recognition. For instance, Tao et al. introduce a channel-wise attention mechanism that assigns the weights of different channels adaptively, along with an extended self-attention to explore the temporal dependencies of EEG signals (Tao et al. 2020). Jia et al. propose a two-stream network with attention mechanisms to adaptively focus on important patterns (Jia et al. 2020).

From the above, it can be observed that it is critical to integrate information from different domains and to adaptively capture important brain regions, frequency bands, and timestamps in a unified network for EEG emotion recognition.

In this paper, we propose a four-dimensional attention-based neural network named 4D-aNN for EEG emotion recognition. First, we transform raw EEG signals into 4D spatial-spectral-temporal representations which consist of several temporal slices. Different brain regions and frequency bands vary in their contributions to EEG emotion recognition, and the temporal dependencies of the 4D representations should also be considered. Therefore, we employ attention mechanisms on both a CNN and a bidirectional LSTM network to adaptively capture discriminative patterns. For the CNN model, the attention mechanism is applied to the spatial and spectral dimensions of each temporal slice so that the important brain regions and frequency bands can be captured. As for the bidirectional LSTM model, the attention mechanism is applied to utilize long-range temporal dependencies so that the importance of different temporal slices within one 4D representation can be fully explored.

The primary contributions of this paper are summarized as follows: a) We propose a four-dimensional attention-based neural network, which fuses information from different domains and captures discriminative patterns in EEG signals based on the 4D spatial-spectral-temporal representation. b) We conduct experiments on the SEED dataset, and the experimental results indicate that our model achieves state-of-the-art performance under intra-subject splitting.

The remainder of this paper is organized as follows. We describe our proposed method in the Method section. The dataset, experiment settings, results, ablation studies, and discussion are presented in the Experiment section. Finally, conclusions are given in the Conclusion section.

Method

Figure 1 illustrates the overall structure of 4D-aNN for EEG emotion recognition. It consists of the 4D spatial-spectral-temporal representation, the attention-based CNN, the attention-based bidirectional LSTM, and the classifier. We will describe the details of each part in sequence.

Fig. 1 The overall structure of 4D-aNN: the 4D spatial-spectral-temporal representation is fed to the attention-based CNN, followed by the attention-based bidirectional LSTM and the classifier.

4D spatial-spectral-temporal representation

The process of generating the 4D representation is depicted in Fig. 2. As previous works do (Shen et al. 2020; Yang et al. 2018a), we split the original EEG signals into T-second-long segments without overlapping. Each segment is assigned the same label as the original EEG signals. Then we decompose each segment into five frequency bands (i.e. δ [1~4 Hz], θ [4~8 Hz], α [8~14 Hz], β [14~31 Hz], and γ [31~51 Hz]) with fifth-order Butterworth filters. The Differential Entropy (DE) features and Power Spectral Density (PSD) features of all EEG channels, which have been proven to be effective for emotion recognition (Zheng et al. 2017), are extracted from the five frequency bands respectively with a 0.5 s window for each segment.

PSD is defined as

$h_P(X) = E[x^2]$  (1)

where $x$ is formally a random variable and, in this context, the signal acquired from a certain frequency band on a certain EEG channel.

The DE feature is capable of discriminating EEG patterns between low and high frequency energy, and is defined as

$h_D(X) = -\int_X f(x)\log(f(x))\,dx$  (2)

where $f(x)$ is the probability density function of $x$. If $x$ obeys the Gaussian distribution $N(\mu, \sigma^2)$, DE can simply be calculated by the following formulation:

$h_D(X) = -\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\right)dx = \frac{1}{2}\log(2\pi e\sigma^2)$  (3)

where $e$ and $\sigma$ are Euler's constant and the standard deviation of $x$, respectively.

Thus, we extract a 3D feature tensor $F_n \in R^{c \times 2f \times 2T}$, $n = 1, 2, \ldots, N$, from each segment, where $N$ is the total number of segments, $c$ is the number of EEG channels, $2f$ represents the DE and PSD features of the $f$ frequency bands, and $2T$ is derived from the 0.5 s window without overlapping. To utilize the spatial information of the electrodes, we organize all the $c$ channels as a 2D sparse map so that the 3D feature tensor $F_n$ is transformed into a 4D representation $X_n \in R^{h \times w \times 2f \times 2T}$, where $h$ and $w$ are the height and width of the 2D sparse map, respectively. The 2D sparse map of all the $c$ channels with zero-padding is shown in Fig. 3, which preserves the topology of the different electrodes. In this paper, we set $h = 19$, $w = 19$, and $f = 5$.

Fig. 2 The generation of the 4D spatial-spectral-temporal representation. For each T-second EEG signal segment, we extract DE and PSD features from different channels and frequency bands with a 0.5 s window. Then, the features are transformed into a 4D representation which consists of 2T temporal slices.

Fig. 3 The 2D sparse map with zero-padding of the 62 channels. The purpose of this organization is to preserve the positional relationships among the different electrodes.
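To make the construction described above concrete, the following sketch (our own illustrative code, not the authors' released implementation) filters one segment into the five bands, computes DE and PSD with 0.5 s windows, and scatters the per-channel features onto the 19 × 19 sparse grid; the `channel_grid` mapping and the ordering of DE and PSD planes along the 2f axis are assumptions, since the paper does not fix either.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 200                      # SEED signals are down-sampled to 200 Hz
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 51)}

def bandpass(x, low, high, fs=FS, order=5):
    """Fifth-order Butterworth band-pass filter applied along the time axis."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, x, axis=-1)

def de_psd(window):
    """DE and PSD of one 0.5 s window (channels x samples).
    PSD: h_P(X) = E[x^2]; DE under the Gaussian assumption: 0.5 * log(2*pi*e*sigma^2)."""
    psd = np.mean(window ** 2, axis=-1)
    de = 0.5 * np.log(2 * np.pi * np.e * np.var(window, axis=-1) + 1e-12)
    return de, psd

def segment_to_4d(segment, channel_grid, h=19, w=19, win=int(0.5 * FS)):
    """Turn one T-second segment (channels x samples) into an (h, w, 2f, 2T) tensor.
    channel_grid is a hypothetical electrode -> (row, col) map for the sparse grid of Fig. 3."""
    c, n_samples = segment.shape
    slices = n_samples // win                       # 2T temporal slices per segment
    x4d = np.zeros((h, w, 2 * len(BANDS), slices))
    for b, (low, high) in enumerate(BANDS.values()):
        banded = bandpass(segment, low, high)
        for t in range(slices):
            de, psd = de_psd(banded[:, t * win:(t + 1) * win])
            for ch in range(c):
                r, col = channel_grid[ch]
                x4d[r, col, 2 * b, t] = de[ch]      # DE feature plane of this band
                x4d[r, col, 2 * b + 1, t] = psd[ch] # PSD feature plane of this band
    return x4d
```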

Attention-based CNN
For a 4D spatial-spectral-temporal representation $X_n$, we extract the spatial and spectral information from each temporal slice $S_i \in R^{h \times w \times 2f}$, $i = 1, 2, \ldots, 2T$, with a CNN, explore the discriminative local patterns in the spatial and spectral domains with a convolutional attention module, and finally obtain its spatial and spectral representation. The attention module here is similar to the one proposed by Woo et al. (Woo et al. 2018), which was originally used to improve the representation power of CNN networks.

The structure of the attention-based CNN is shown in Fig. 4. It contains four convolutional layers, four convolutional attention modules, one max-pooling layer, and one fully-connected layer. The four convolutional layers have 64, 128, 256, and 64 feature maps with filter sizes of 5 × 5, 5 × 5, 5 × 5, and 3 × 3, respectively. Specifically, a convolutional attention module is used after each convolutional layer to apply the spatial and spectral attention mechanisms, and the details will be given later. We only use one max-pooling layer with a filter size of 2 × 2 after the last convolutional attention module to preserve more information and enhance the robustness of the network. Finally, the outputs of the max-pooling layer are flattened and fed to the fully-connected layer with 150 units. Thus, for each temporal slice $S_i$, we take the final output $P_i \in R^{150}$ as its spatial and spectral representation.

Fig. 4 The structure of the attention-based CNN. The upper half of each block in the figure gives the layer type, and the lower half denotes the shape of its output tensor.
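A minimal PyTorch-style sketch of this backbone is given below, assuming "same" padding (so the 19 × 19 maps keep their size until the final 2 × 2 pooling) and ReLU activations, neither of which the paper states explicitly; the convolutional attention modules are stubbed with `nn.Identity` here and sketched in the next subsection.

```python
import torch
import torch.nn as nn

class AttentionCNN(nn.Module):
    """Per-slice spatial-spectral encoder: four conv layers (64, 128, 256, 64 maps),
    an attention module after each conv, one 2x2 max-pooling, and a 150-unit FC layer."""
    def __init__(self, in_channels=10, attention=nn.Identity):  # in_channels = 2f = 10
        super().__init__()
        chans = [in_channels, 64, 128, 256, 64]
        kernels = [5, 5, 5, 3]
        blocks = []
        for i, k in enumerate(kernels):
            blocks += [nn.Conv2d(chans[i], chans[i + 1], k, padding=k // 2),
                       nn.ReLU(inplace=True),
                       attention()]           # placeholder for the convolutional attention module
        self.features = nn.Sequential(*blocks, nn.MaxPool2d(2))
        self.fc = nn.Linear(64 * 9 * 9, 150)  # 19x19 -> 9x9 after the 2x2 pooling

    def forward(self, s):                     # s: (batch, 2f, 19, 19), one temporal slice
        z = self.features(s)
        return self.fc(torch.flatten(z, 1))   # P_i in R^150

# quick shape check on a random slice
print(AttentionCNN()(torch.randn(4, 10, 19, 19)).shape)  # -> torch.Size([4, 150])
```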

Convolutional attention module

The convolutional attention module is applied after each convolutional layer to adaptively capture important brain regions and frequency bands. The structure of the convolutional attention module is shown in Fig. 5. It consists of two sub-modules, i.e. the spectral attention module and the spatial attention module.

For each convolutional layer above, its output is a 3D feature tensor $V \in R^{h_v \times w_v \times c_v}$, where $h_v$, $w_v$, and $c_v$ are the height of the 2D feature maps of $V$, the width of the 2D feature maps of $V$, and the number of the 2D feature maps of $V$, respectively. We take $V$ as the input of the convolutional attention module.

The spectral attention module is applied to identify valuable frequency bands for emotion recognition. Average pooling has been widely used to aggregate spatial information, and maximum pooling has been commonly adopted to gather distinctive features. Therefore, we shrink the spatial dimension of $V$ by a spatial-wise average pooling and a spatial-wise maximum pooling, which are defined as:

$C_{avg,i} = \frac{1}{h_v \times w_v}\sum_{h=1}^{h_v}\sum_{w=1}^{w_v} V_i(h, w), \quad i = 1, 2, \ldots, c_v$  (4)

$C_{max,i} = \max(V_i), \quad i = 1, 2, \ldots, c_v$  (5)

where $V_i \in R^{h_v \times w_v}$ denotes the 2D feature map in the i-th channel of $V$, $C_{avg,i}$ represents the element in the i-th channel of the spatial average representation $C_{avg} \in R^{c_v}$, $\max(Z)$ returns the largest element in $Z$, and $C_{max,i}$ is the element in the i-th channel of the spatial maximum representation $C_{max} \in R^{c_v}$. Subsequently, we implement the spectral attention with two fully-connected layers, a ReLU activation function, and a sigmoid activation function, which is defined as:

$A_{spectral,avg} = W_2^S(\mathrm{Relu}(W_1^S C_{avg}))$  (6)

$A_{spectral,max} = W_2^S(\mathrm{Relu}(W_1^S C_{max}))$  (7)

$A_{spectral} = \mathrm{sigmoid}(A_{spectral,avg} \oplus A_{spectral,max})$  (8)

$\mathrm{Relu}(x) = \max(x, 0)$  (9)

$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$  (10)

where $W_1^S$ and $W_2^S$ are learnable parameters, $\oplus$ denotes element-wise addition, and $A_{spectral} \in R^{1 \times 1 \times c_v}$ is the spectral attention. The elements of $A_{spectral}$ represent the importance of the corresponding 2D feature maps in the spectral domain. After generating the spectral attention $A_{spectral}$, the output of the spectral attention module can be defined as:

$V' = A_{spectral} \otimes V$  (11)

where $V'$ denotes the refined 3D feature tensor and $\otimes$ represents element-wise multiplication.

The spatial attention module is applied to identify valuable brain regions for emotion recognition. Firstly, we shrink the spectral dimension of $V'$ by spectral-wise average pooling and spectral-wise maximum pooling, which are defined as:

$SPA_{avg,(h,w)} = \frac{1}{c_v}\sum_{c=1}^{c_v} S_{h,w}(c), \quad h = 1, 2, \ldots, h_v;\ w = 1, 2, \ldots, w_v$  (12)

$SPA_{max,(h,w)} = \max(S_{h,w}), \quad h = 1, 2, \ldots, h_v;\ w = 1, 2, \ldots, w_v$  (13)

where $S_{h,w} \in R^{c_v}$ denotes the channel in the h-th row and w-th column of $V'$, $SPA_{avg,(h,w)}$ represents the element in the h-th row and w-th column of the spectral average representation $SPA_{avg} \in R^{h_v \times w_v \times 1}$, and $SPA_{max,(h,w)}$ is the element in the h-th row and w-th column of the spectral maximum representation $SPA_{max} \in R^{h_v \times w_v \times 1}$. In the following, we implement the spatial attention with a convolutional layer and a sigmoid activation function, which is defined as:

$SPA = \mathrm{Cat}(SPA_{avg}, SPA_{max})$  (14)

$A_{spatial} = \mathrm{Sigmoid}(\mathrm{Conv}(SPA))$  (15)

where $\mathrm{Cat}(SPA_{avg}, SPA_{max})$ denotes the concatenation of $SPA_{avg}$ and $SPA_{max}$ along the spectral dimension, $\mathrm{Conv}(SPA)$ represents the convolutional layer applied to $SPA$, and $A_{spatial} \in R^{h_v \times w_v \times 1}$ is the spatial attention. The elements of $A_{spatial}$ represent the importance of the corresponding regions in the spatial domain. Subsequently, the output of the spatial attention module can be defined as:

$V'' = A_{spatial} \otimes V'$  (16)

where $V'' \in R^{h_v \times w_v \times c_v}$ denotes the final output 3D feature tensor of the convolutional attention module.
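The module can be expressed compactly in PyTorch as below (a CBAM-style sketch following Eqs. (4)–(16) and Woo et al. 2018); the reduction ratio of the shared fully-connected layers and the 7 × 7 kernel of the spatial-attention convolution are assumptions, as the paper does not report them.

```python
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    """Spectral attention (Eqs. 4-11) followed by spatial attention (Eqs. 12-16)."""
    def __init__(self, channels, reduction=4, spatial_kernel=7):
        super().__init__()
        # shared MLP W2(Relu(W1 .)) applied to both pooled descriptors
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // reduction, channels))
        # single conv layer producing the spatial attention map
        self.conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, v):                               # v: (B, c_v, h_v, w_v)
        b, c, _, _ = v.shape
        # ---- spectral attention ----
        c_avg = v.mean(dim=(2, 3))                      # Eq. (4): spatial-wise average pooling
        c_max = v.amax(dim=(2, 3))                      # Eq. (5): spatial-wise max pooling
        a_spec = torch.sigmoid(self.mlp(c_avg) + self.mlp(c_max))   # Eqs. (6)-(8)
        v1 = v * a_spec.view(b, c, 1, 1)                # Eq. (11)
        # ---- spatial attention ----
        spa_avg = v1.mean(dim=1, keepdim=True)          # Eq. (12): spectral-wise average pooling
        spa_max = v1.amax(dim=1, keepdim=True)          # Eq. (13): spectral-wise max pooling
        a_spat = torch.sigmoid(self.conv(torch.cat([spa_avg, spa_max], dim=1)))  # Eqs. (14)-(15)
        return v1 * a_spat                              # Eq. (16)
```

Replacing the `nn.Identity` placeholders of the previous sketch with one `ConvAttention` per convolutional layer (with `channels` set to that layer's number of feature maps) recovers the structure shown in Fig. 4.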

Fig. 5 The top block is the overall structure of the convolutional attention module; it consists of the spectral attention module and the spatial attention module. The middle block represents the generation of the spectral attention. The bottom block denotes the generation of the spatial attention.

Attention-based bidirectional LSTM

For each temporal slice $S_i \in R^{h \times w \times 2f}$, $i = 1, 2, \ldots, 2T$, the final output of the attention-based CNN is $P_i \in R^{150}$. Since the variation between different temporal slices contains temporal information for emotion recognition, we utilize an attention-based bidirectional LSTM to explore the importance of different slices, as shown in Fig. 6.

A bidirectional LSTM connects two unidirectional LSTMs with opposite directions to the same output. Compared with a unidirectional LSTM, a bidirectional LSTM preserves information from both the past and the future, allowing it to understand the context better. In this paper, the bidirectional LSTM comprises two unidirectional LSTMs with 36 memory cells. The unidirectional LSTM for the positive time direction, LSTM$^P$, takes the output sequence of the attention-based CNN $P^P = (P_1, P_2, \ldots, P_{2T})$ as its input sequence, while the one for the negative time direction, LSTM$^N$, takes the reverse sequence $P^N = (P_{2T}, P_{2T-1}, \ldots, P_1)$ as its input sequence. The outputs of the i-th node of the two unidirectional LSTMs are $Y_i^P \in R^{36}$ and $Y_i^N \in R^{36}$, $i = 1, 2, \ldots, 2T$, respectively. Then, we concatenate $Y_i^P$ and $Y_{2T+1-i}^N$ as the output of the i-th node of the bidirectional LSTM, $Y_i \in R^{72}$. Different from traditional approaches that only use the output of the last node of an LSTM for classification or other applications, we take the outputs of all the bidirectional LSTM nodes $Y \in R^{2T \times 72}$ into consideration and explore the importance of different temporal slices with the temporal attention mechanism.

The temporal attention mechanism is implemented with two fully-connected layers, a ReLU activation function, and a softmax activation function, which is defined as:

$Tem_i = W_2^T(\mathrm{Relu}(W_1^T Y_i + b_1^T)) + b_2^T$  (17)

$A_{temporal} = \mathrm{softmax}(Tem)$  (18)

$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$  (19)

where $W_1^T$, $W_2^T$, $b_1^T$, and $b_2^T$ are learnable parameters, $Tem_i$ represents the i-th element of $Tem \in R^{2T \times 1}$, which projects $Y \in R^{2T \times 72}$ to a lower dimension, and $A_{temporal} \in R^{2T \times 1}$ is the temporal attention. The elements of $A_{temporal}$ represent the importance of the corresponding temporal slices. Subsequently, the high-level representation of the 4D sample $X_n$ can be defined as:

$L_n(e) = \sum A_{temporal} \otimes Y_e, \quad e = 1, 2, \ldots, 72$  (20)

where $Y_e \in R^{2T \times 1}$ denotes the e-th column of $Y \in R^{2T \times 72}$ and $L_n(e)$ is the e-th element of the high-level representation $L_n \in R^{72}$, which integrates the spatial, spectral, and temporal information of $X_n$.

Fig. 6 The top block is the structure of the bidirectional LSTM. We concatenate the outputs of LSTM$^P$ and LSTM$^N$ as the output of the bidirectional LSTM, $Y \in R^{2T \times 72}$. The middle block represents the projection of the outputs of the bidirectional LSTM. The bottom block denotes the generation of the temporal attention.
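A hedged sketch of this temporal branch is shown below, using PyTorch's bidirectional LSTM (36 memory cells per direction) and the attention of Eqs. (17)–(20) to pool the 2T node outputs into $L_n \in R^{72}$; the hidden width of the attention MLP (`attn_hidden`) is an assumption, as the paper does not state it.

```python
import torch
import torch.nn as nn

class TemporalAttentionBiLSTM(nn.Module):
    """Bidirectional LSTM over the 2T slice representations, followed by the
    temporal attention of Eqs. (17)-(20) that pools them into L_n in R^72."""
    def __init__(self, in_dim=150, hidden=36, attn_hidden=32):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Sequential(nn.Linear(2 * hidden, attn_hidden),
                                   nn.ReLU(inplace=True),
                                   nn.Linear(attn_hidden, 1))   # Tem_i, Eq. (17)

    def forward(self, p):                         # p: (B, 2T, 150), sequence of P_i
        y, _ = self.bilstm(p)                     # y: (B, 2T, 72), outputs of all nodes
        a = torch.softmax(self.score(y), dim=1)   # A_temporal, Eqs. (18)-(19)
        return (a * y).sum(dim=1)                 # L_n, Eq. (20): attention-weighted sum

# shape check: batch of 4 samples with 2T = 6 temporal slices
print(TemporalAttentionBiLSTM()(torch.randn(4, 6, 150)).shape)  # -> torch.Size([4, 72])
```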

Classifier

Based on the high-level representation $L_n$ of the EEG signals, we apply a fully-connected layer and a softmax activation function to predict the label of the 4D sample $X_n$, which can be defined as follows:

$Pre = \mathrm{softmax}(W^p L_n + b^p)$  (21)

where $W^p$ and $b^p$ are learnable parameters and $Pre \in R^C$ denotes the probabilities of $X_n$ belonging to each of the $C$ classes. The class with the largest probability is taken as the predicted label of 4D-aNN.

Experiment

In this section, we first introduce a widely used dataset. Then, the experiment settings are described. Finally, the results on the dataset are reported and discussed.

SEED Dataset

The SEED dataset (Zheng and Lu 2015) contains three different categories of emotion data: positive, neutral, and negative. For each kind of emotion, 5 film clips that are about 4 minutes long and can elicit the desired target emotion are selected. 15 healthy subjects (7 males and 8 females, aged 23.27 ± 2.37) take part in the EEG signal collection experiment. 3 groups of experiments are conducted for each subject, and each experiment consists of 15 clip-viewing processes. Each clip-viewing process can be divided into four stages: a 5-second hint of start, a 4-minute clip period, a 45-second self-assessment, and a 15-second rest period. The order of the 15 clips is arranged so that two clips eliciting the same emotion are not shown consecutively. The EEG signals in the experiments are recorded by a 62-channel ESI NeuroScan system and down-sampled to 200 Hz. Besides, the EEG signals seriously contaminated by electromyography (EMG) and electrooculography (EOG) are removed manually. Then, a band-pass filter between 0.3 and 50 Hz is applied to filter the noise.

Settings

The proposed 4D-aNN takes a 4D segment $X_n \in R^{h \times w \times 2f \times 2T}$ as the input. In this paper, we adopt the 2D sparse map with $h = 19$ and $w = 19$ to maintain the positional relationships of the electrodes. As shown in previous works, the combination of all 5 frequency bands contributes to better results, so we set $f = 5$. For each experiment, we set the length of the segments T to 3 s, obtaining about 1128 samples per experiment. Then, we conduct a fivefold cross-validation on each experiment and calculate the average classification accuracy (ACC) and standard deviation (STD) of the 3 experiments for each subject. The average ACC and STD of all subjects are taken as the final performance of our method. We train the 4D-aNN on an NVIDIA GTX 1080 GPU. Adam optimization is applied to minimize the loss function. We set the learning rate to 0.0003, the batch size to 12, and the maximum number of epochs to 150.
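Putting the pieces together, the following sketch wires the per-slice encoder, the temporal branch, and the softmax classifier of Eq. (21), and reproduces the training configuration quoted above; it reuses the `AttentionCNN` and `TemporalAttentionBiLSTM` sketches from the Method section, and the cross-entropy loss is an assumption, since the paper does not name its loss function.

```python
import torch
import torch.nn as nn

class FourDANN(nn.Module):
    """End-to-end 4D-aNN sketch: per-slice attention CNN, temporal branch, classifier head."""
    def __init__(self, encoder, temporal, n_classes=3):
        super().__init__()
        self.encoder, self.temporal = encoder, temporal
        self.head = nn.Linear(72, n_classes)               # Eq. (21): Pre = softmax(W^p L_n + b^p)

    def forward(self, x):                                   # x: (B, 2T, 2f, 19, 19)
        b, t = x.shape[:2]
        p = self.encoder(x.flatten(0, 1)).view(b, t, -1)    # P_i for every temporal slice
        return self.head(self.temporal(p))                  # logits; softmax is applied in the loss

# training configuration quoted in the Settings subsection (loss function assumed)
model = FourDANN(encoder=AttentionCNN(), temporal=TemporalAttentionBiLSTM())  # sketches above
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)   # learning rate 0.0003
criterion = nn.CrossEntropyLoss()                           # batch size 12, at most 150 epochs
```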


Baseline Models

⚫ HCNN (Li et al. 2018): It uses a hierarchical CNN architecture for EEG emotion recognition, taking 2D DE feature maps extracted from the γ band as inputs. HCNN only considers the spatial information of EEG signals.
⚫ BiHDM (Li et al. 2020): It considers the asymmetric differences between the two hemispheres for EEG emotion recognition.
⚫ RGNN (Zhong et al. 2020): It takes the biological topology among different brain regions into consideration to capture both global and local relations among different EEG channels.
⚫ 4D-CRNN (Shen et al. 2020): It builds DE features extracted from EEG signals into 4D feature structures and uses a convolutional recurrent neural network to extract spatial, spectral, and temporal features for EEG emotion recognition.
⚫ SST-EmotionNet (Jia et al. 2020): It uses a two-stream network to extract spatial, spectral, and temporal features. Besides, SST-EmotionNet adopts attention mechanisms to improve its performance on EEG emotion recognition.

Results

We compare our model with 5 baseline models on the SEED dataset. Table 1 presents the average ACC and STD of these models for EEG emotion recognition. HCNN uses a hierarchical CNN architecture to classify emotions but only considers the spatial information of EEG signals, reaching 88.60% classification accuracy. BiHDM (Li et al. 2020) applies four directed RNNs to obtain a deep representation of all the EEG electrodes' signals, reaching 93.12% classification accuracy. RGNN considers the biological topology among different brain regions, reaching 94.24% classification accuracy. 4D-CRNN takes 4D DE feature maps containing spatial, spectral, and temporal information as inputs, reaching 94.74% classification accuracy. SST-EmotionNet uses a two-stream network with attention mechanisms, reaching 96.02% classification accuracy. However, the data size of each input sample of SST-EmotionNet is about 4 times larger than that of 4D-aNN. Compared with the baseline models, the proposed 4D-aNN achieves state-of-the-art performance on the SEED dataset under intra-subject splitting. The average ACC over all subjects is 96.10%. The performances on each subject are shown in Fig. 7, and there are 9 subjects (#5, #6, #8, #9, #10, #11, #12, #13, and #15) whose performances are better than the average ACC. Specifically, to make a fair comparison with 4D-CRNN, we conduct experiments on 4D-aNN (DE) and 4D-aNN (PSD), which represent the 4D-aNN taking only DE features as inputs and only PSD features as inputs, respectively. The accuracy of 4D-aNN (DE) exceeds that of 4D-CRNN by 0.65%, indicating the superiority of the proposed 4D-aNN. When compared with 4D-aNN (DE) and 4D-aNN (PSD), 4D-aNN displays the best performance, which indicates the effectiveness of combining different features.

Table 1 The performance (average ACC and STD (%)) of the compared models on SEED.

Model                                ACC (%)    STD (%)
HCNN (Li et al. 2018)                88.60      2.60
BiHDM (Li et al. 2020)               93.12      6.06
RGNN (Zhong et al. 2020)             94.24      5.95
4D-CRNN (Shen et al. 2020)           94.74      2.32
SST-EmotionNet (Jia et al. 2020)     96.02      2.17
4D-aNN                               96.10      2.61
4D-aNN (DE)                          95.39      3.05
4D-aNN (PSD)                         90.49      7.97

Fig. 7 The performance of 4D-aNN on each subject. In the SEED dataset, 3 experiments are conducted for each subject. We evaluate the
performance of each experiment and also present the average classification accuracy for each subject.
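For completeness, a schematic of the evaluation protocol described in the Settings subsection (five-fold cross-validation per experiment, then averaging over the 3 experiments of each subject and finally over subjects) is sketched below; `train_and_score` is a hypothetical stand-in for fitting 4D-aNN on one fold and returning its test accuracy.

```python
import numpy as np
from sklearn.model_selection import KFold

def evaluate_subject(experiments, train_and_score, n_splits=5):
    """experiments: the 3 (samples, labels) recording sessions of one subject.
    Each session is scored by five-fold cross-validation; the subject's ACC and STD
    are the mean and standard deviation over its 3 sessions."""
    session_acc = []
    for samples, labels in experiments:
        folds = [train_and_score(samples[tr], labels[tr], samples[te], labels[te])
                 for tr, te in KFold(n_splits=n_splits, shuffle=True).split(samples)]
        session_acc.append(np.mean(folds))
    return np.mean(session_acc), np.std(session_acc)

# final reported performance: average the per-subject ACC and STD over all 15 subjects
# accs, stds = zip(*(evaluate_subject(e, train_and_score) for e in all_subjects))
# print(np.mean(accs), np.mean(stds))
```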

To verify the importance of the attention mechanisms in our model, we conduct an additional experiment for ablation studies on the SEED dataset. The experiment ablates the spatial, spectral, and temporal attention mechanisms. We evaluate the performance of 4D-aNN when the spatial, spectral, temporal, and all attention mechanisms are ablated, respectively. As shown in Fig. 8, when one of the attention mechanisms is ablated, the classification accuracy decreases. 4D-aNN without the spectral attention mechanism decreases by 0.63%, 4D-aNN without the spatial attention mechanism decreases by 0.47%, and 4D-aNN without the temporal attention mechanism decreases by 1.19%. In particular, 4D-aNN without all the attention mechanisms decreases by 2.17%, which is the worst among the models used for comparison. In conclusion, the results indicate that the attention mechanisms contribute to EEG emotion recognition through their ability to capture the discriminative local patterns in the spatial, spectral, and temporal domains.

Fig. 8 Ablation studies on different input features and attention modules of 4D-aNN. "−" denotes the ablation of certain attention modules.

In particular, to explore the critical brain regions for different emotions, we separately depict the electrode activity heatmaps in Fig. 9. We draw the heatmaps using Grad-CAM++ (Chattopadhay et al. 2018), based on the experimental results of subject #15. Grad-CAM++ uses the feature maps of the last convolutional layer and the class scores of the classifier to generate heatmaps. The heatmaps are able to explain which input regions are important for predictions. In this work, the size of each heatmap is 19 × 19, which is the same as the 2D sparse map. The elements in the heatmaps represent the contributions of the corresponding brain regions to the recognition of the target emotions. From Fig. 9, we can observe distinct distributions of important brain regions for different emotions: channels FC5, FC3, and C5 are important for the recognition of positive emotions, channels CP5, CP3, and CP1 are important for the recognition of neutral emotions, and channels PO7, PO5, and P3 are important for the recognition of negative emotions for subject #15. In particular, the critical brain regions could vary with different subjects, time, and emotions, so the attention mechanisms that enable 4D-aNN to adaptively capture discriminative patterns make sense for EEG emotion recognition.

Fig. 9 The electrode activity heatmaps based on the experimental results of subject #15. Parts (a), (b), and (c) correspond to positive, neutral, and negative emotions, respectively. Dark red regions denote more significant contributions to the recognition of the corresponding emotions.
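As an illustration of how such electrode heatmaps can be generated, the sketch below computes a plain Grad-CAM map from a chosen convolutional layer (a simplification of the Grad-CAM++ of Chattopadhay et al. 2018 that the paper actually uses); the choice of `target_layer` and the upsampling back to the 19 × 19 grid are assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, class_idx):
    """Plain Grad-CAM heatmap for one input sample x (batch size 1).
    model maps x to class logits; target_layer is the convolutional layer to explain."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    logits = model(x)                            # forward pass stores the layer activations
    model.zero_grad()
    logits[0, class_idx].backward()              # backward pass stores the layer gradients
    h1.remove(); h2.remove()
    a, g = feats["a"][0], grads["a"][0]          # (C, H, W) activations and gradients
    weights = g.mean(dim=(1, 2))                 # global-average-pooled gradients per map
    cam = F.relu((weights[:, None, None] * a).sum(0))
    cam = F.interpolate(cam[None, None], size=(19, 19), mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()   # 19x19 map matching the sparse grid
```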

Discussion

We conduct several experiments to investigate the use of 4D-aNN, which fuses spatial-spectral-temporal information, and the effectiveness of the attention mechanisms on different domains for EEG emotion classification. In this section, we discuss three noteworthy points.

First, to deal with the spatial-spectral information, we apply an attention-based CNN which consists of a CNN network, a spectral attention module, and a spatial attention module. The CNN network first extracts the spatial-spectral representation from the inputs. Then, the spectral attention mechanism is applied to each spectral feature to explore the importance of different frequency bands and features. Besides, the spatial attention mechanism is applied to each 2D feature map to adaptively capture the critical brain regions. The critical brain regions and frequency bands could vary with different individuals, emotions, and time, so the ability of the attention modules to capture discriminative patterns improves the performance of 4D-aNN.

Second, to explore the temporal dependencies in the 4D spatial-spectral-temporal representations, we utilize an attention-based bidirectional LSTM. The bidirectional LSTM extracts high-level representations from the outputs of the attention-based CNN. Different from traditional approaches that only use the output of the last node of an LSTM for classification or other applications, we consider the outputs of all the nodes with the temporal attention mechanism. The temporal attention mechanism adaptively assigns weights to different temporal slices so that the dynamic content of emotions in the 4D representations can be captured better.

Third, to address the importance of the attention mechanisms, we conduct ablation studies on different attention modules. 4D-aNN without the spatial, spectral, and temporal attention mechanism decreases by 0.47%, 0.63%, and 1.19% on classification accuracy, respectively. In particular, 4D-aNN without all the attention mechanisms decreases by 2.17%, which is the worst among the models in comparison. The experimental results demonstrate the effectiveness of the attention mechanisms in adaptively capturing discriminative patterns.

Conclusion

In this paper, we propose the 4D-aNN model for EEG emotion recognition. The 4D-aNN takes 4D spatial-spectral-temporal representations containing the spatial, spectral, and temporal information of EEG signals as inputs. We integrate attention mechanisms into the CNN module and the bidirectional LSTM module. The CNN module deals with the spatial and spectral information of EEG signals, while the spatial and spectral attention mechanisms capture critical brain regions and frequency bands adaptively. The bidirectional LSTM module extracts temporal dependencies from the outputs of the CNN module, while the temporal attention mechanism explores the importance of different temporal slices. The experiments on the SEED dataset demonstrate better performance than all baselines. In particular, the ablation studies on different attention modules show the effectiveness of the attention mechanisms in our model for EEG emotion recognition.

References

Abbass SKGHA, Tan KC, Al-Mamun A, Thakor N, Bezerianos A, Li J (2018) Spatio–Spectral Representation Learning for Electroencephalographic Gait-Pattern Classification. IEEE T Neur Sys Reh 26:1858-1867. doi:10.1109/TNSRE.2018.2864119
Alhagry S, Fahmy AA, El-Khoribi RA (2017) Emotion recognition based on EEG using LSTM recurrent neural network. International Journal of Advanced Computer Science and Applications 8:335-358. doi:10.14569/IJACSA.2017.081046
Bamdad M, Zarshenas H, Auais MA (2015) Application of BCI systems in neurorehabilitation: a scoping review. Disability and Rehabilitation: Assistive Technology 10:355-364. doi:10.3109/17483107.2014.961569
Blankertz B et al. (2016) The Berlin brain-computer interface: progress beyond communication and control. Front Neurosci-Switz 10:530. doi:10.3389/fnins.2016.00530
Brittona JC, Phan KL, Taylor SF, Welsh RC, Berridge KC, Liberzon I (2006) Neural correlates of social and nonsocial emotions: An fMRI study. Neuroimage 31:397-409. doi:10.1016/j.neuroimage.2005.11.027
Chattopadhay A, Sarkar A, Howlader P, Balasubramanian VN (2018) Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. Paper presented at the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12-15 March 2018
Craik A, He Y, Contreras-Vidal JL (2019) Deep learning for electroencephalogram (EEG) classification tasks: a review. J Neural Eng 16:031001. doi:10.1088/1741-2552/ab0ab5
Dolan RJ (2002) Emotion, cognition, and behavior. Science 298:1191–1194. doi:10.1126/science.1076358
Duan R-N, Zhu J-Y, Lu B-L (2013) Differential entropy feature for EEG-based emotion classification. Paper presented at the 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), San Diego, CA, USA, 6-8 Nov. 2013
Figueiredo GR, Ripka WL, Romaneli EFR, Ulbricht L (2019) Attentional bias for emotional faces in depressed and nondepressed individuals: an eye-tracking study. Paper presented at the 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 23-27 July 2019

Fiorinia L, Mancioppi G, Semeraro F, Fujita H, Cavallo F (2020) Unsupervised emotional state classification through physiological parameters for social robotics applications. Knowledge-Based Systems 190. doi:10.1016/j.knosys.2019.105217
Jenke R, Peer A, Buss M (2014) Feature extraction and selection for emotion recognition from EEG. IEEE Transactions on Affective Computing 5:327-339. doi:10.1109/TAFFC.2014.2339834
Jia Z, Lin Y, Cai X, Chen H, Gou H, Wang J (2020) SST-EmotionNet: Spatial-Spectral-Temporal based Attention 3D Dense Network for EEG Emotion Recognition. In: Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA. Association for Computing Machinery, pp 2909–2917. doi:10.1145/3394171.3413724
Jiaxin Ma, Tang H, Zheng W-L, Lu B-L (2019) Emotion recognition using multimodal residual LSTM network. In: Proceedings of the 27th ACM International Conference on Multimedia, Nice, France. Association for Computing Machinery, New York, NY, USA, pp 176–183. doi:10.1145/3343031.3350871
Katsigiannis S, Ramzan N (2017) Dreamer: A database for emotion recognition through EEG and ECG signals from wireless low-cost off-the-shelf devices. IEEE J Biomed Health 22:98-107. doi:10.1109/JBHI.2017.2688239
Kim M-K, Kim M, Oh E, Kim S-P (2013) A review on the computational methods for emotional state estimation from the human EEG. Comput Math Method M 2013. doi:10.1155/2013/573734
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. Curran Associates, Inc., pp 1097-1105
Li J, Zhang Z, He H (2018) Hierarchical convolutional neural networks for EEG-based emotion recognition. Cogn Comput 10:368–380. doi:10.1007/s12559-017-9533-x
Li M, Lu B-L (2009) Emotion classification based on gamma-band EEG. Paper presented at the 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Minneapolis, MN, USA, 3-6 Sept. 2009
Li Y et al. (2020) A Novel Bi-hemispheric Discrepancy Model for EEG Emotion Recognition. IEEE T Cogn Dev Syst:1-1. doi:10.1109/TCDS.2020.2999337
Lotfia E, Akbarzadeh-T M-R (2014) Practical emotional neural networks. Neural Networks 59:61-72. doi:10.1016/j.neunet.2014.06.012
Mühl C, Nijholt BAA, Chanel G (2014) A survey of affective brain computer interfaces: principles, state-of-the-art, and challenges. Brain-Computer Interfaces 1:66-84. doi:10.1080/2326263X.2014.912881
Pfurtscheller G et al. (2010) The hybrid BCI. Front Neurosci-Switz 4:3. doi:10.3389/fnpro.2010.00003
Shen F, Dai G, Lin G, Zhang J, Kong W, Zeng H (2020) EEG-based emotion recognition using 4D convolutional recurrent neural network. Cogn Neurodynamics 14:815–828. doi:10.1007/s11571-020-09634-1
Shi L-C, Jiao Y-Y, Lu B-L (2013) Differential entropy feature for EEG-based vigilance estimation. Paper presented at the 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan, 3-7 July 2013
Song T, Zheng W, Song P, Cui Z (2020) EEG Emotion Recognition Using Dynamical Graph Convolutional Neural Networks. IEEE Transactions on Affective Computing 11:532-541. doi:10.1109/TAFFC.2018.2817622
Tao W, Li C, Song R, Cheng J, Liu Y, Wan F, Chen X (2020) EEG-based Emotion Recognition via Channel-wise Attention and Self Attention. IEEE Transactions on Affective Computing:1-1. doi:10.1109/TAFFC.2020.3025777
Woo S, Park J, Lee J-Y, Kweon IS (2018) CBAM: Convolutional block attention module. In: Computer Vision – ECCV 2018. Springer International Publishing, Cham. doi:10.1007/978-3-030-01234-2_1
Yan J, Zheng W, Xu Q, Lu G, Li H, Wang B (2016) Sparse kernel reduced-rank regression for bimodal emotion recognition from facial expression and speech. IEEE Transactions on Multimedia 18:1319-1329. doi:10.1109/TMM.2016.2557721
Yang Y, Wu Q, Fu Y, Chen X (2018a) Continuous Convolutional Neural Network with 3D Input for EEG-Based Emotion Recognition. In: Cheng L, Leung ACS, Ozawa S (eds) Neural Information Processing. Springer International Publishing, pp 433-433. doi:10.1007/978-3-030-04239-4_39
Yang Y, Wu QMJ, Zheng W-L, Lu B-L (2018b) EEG-based emotion recognition using hierarchical network with subnetwork nodes. IEEE T Cogn Dev Syst 10:408-419. doi:10.1109/TCDS.2017.2685338
Zheng W-L, Lu B-L (2015) Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development 7:162-175. doi:10.1109/TAMD.2015.2431497
Zheng W-L, Zhu J-Y, Lu B-L (2017) Identifying stable patterns over time for emotion recognition from EEG. IEEE Transactions on Affective Computing 10:417-429. doi:10.1109/TAFFC.2017.2712143
Zhong P, Wang D, Miao C (2020) EEG-Based Emotion Recognition Using Regularized Graph Neural Networks. IEEE Transactions on Affective Computing:1-1. doi:10.1109/TAFFC.2020.2994159
