0% found this document useful (0 votes)
36 views13 pages

Bridge Graph Attention Based Graph Convolution Network With Multi-Scale Transformer for EEG Emotion Recognition

The document presents a novel approach for EEG emotion recognition using a Bridge Graph Attention-based Graph Convolution Network (BGAGCN) combined with a Multi-Scale Transformer (MT) to address the over-smoothing problem prevalent in deep graph networks. The proposed model enhances feature representation by effectively capturing both local and global dependencies in EEG data, achieving state-of-the-art accuracy across multiple benchmark datasets. This method leverages the unique spatial characteristics of EEG signals at different timescales, providing a robust solution for emotion recognition tasks.

Uploaded by

cowaket871
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
36 views13 pages

Bridge Graph Attention Based Graph Convolution Network With Multi-Scale Transformer for EEG Emotion Recognition

The document presents a novel approach for EEG emotion recognition using a Bridge Graph Attention-based Graph Convolution Network (BGAGCN) combined with a Multi-Scale Transformer (MT) to address the over-smoothing problem prevalent in deep graph networks. The proposed model enhances feature representation by effectively capturing both local and global dependencies in EEG data, achieving state-of-the-art accuracy across multiple benchmark datasets. This method leverages the unique spatial characteristics of EEG signals at different timescales, providing a robust solution for emotion recognition tasks.

Uploaded by

cowaket871
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

2042 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 15, NO.

4, OCTOBER-DECEMBER 2024

Bridge Graph Attention Based Graph Convolution


Network With Multi-Scale Transformer for EEG
Emotion Recognition
Huachao Yan , Kailing Guo , Member, IEEE, Xiaofen Xing , Member, IEEE,
and Xiangmin Xu , Senior Member, IEEE

Abstract—In multichannel electroencephalograph (EEG) emo- Index Terms—EEG, emotion recognition, graph attention
tion recognition, most graph-based studies employ shallow graph network, multi-scale transformer, over-smoothing.
model for spatial characteristics learning due to node over-
smoothing caused by an increase in network depth. To address I. INTRODUCTION
over-smoothing, we propose the bridge graph attention-based
graph convolution network (BGAGCN). It bridges previous graph MOTION is a fundamental aspect of human cognition, and
convolution layers to attention coefficients of the final layer by
adaptively combining each graph convolution output based on the
graph attention network, thereby enhancing feature distinctive-
E the emotional recognition can enhance human-computer
interaction and collaboration [1]. Affective computing aims to
ness. Considering that graph-based network primarily focus on enable computers to understand emotional states like humans.
local EEG channel relationships, we introduce a transformer for Compared to extraction methods external information, such as
global dependency. Inspired by the neuroscience finding that neural facial expressions [2], [3], [4] and body postures [5], the obvious
activities of different timescales reflect distinct spatial connectivi- advantage of physiological signals (e.g., electrocardiogram
ties, we modify the transformer to a multi-scale transformer (MT)
by applying multi-head attention to multichannel EEG signals after
(ECG) [6], electroencephalogram (EEG) [7], electrodermal
1D convolutions at different scales. MT learns spatial features more activity (EDA) [8]), is that these signals can directly reflect
elaborately to enhance feature representation ability. By combining the potential human nervous system with more reliable
BGAGCN and MT, our model BGAGCN-MT achieves state-of-the- characteristics. Among these physiological signals, EEG signal
art accuracy under subject-dependent and subject-independent is particularly suitable for affective computing due to its
protocols across three benchmark EEG emotion datasets (SEED,
SEED-IV and DREAMER). Notably, our model effectively ad-
temporal fine-grained resolution of cognitive psychological
dresses over-smoothing in graph neural networks and provides an processes [9]. EEG signal is defined as the general reaction of
efficient solution to learning spatial relationships of EEG features the electrophysiological activities of neurons on the surface of
at different scales. the cerebral cortex and scalp. Generally, a set of EEG signals
are collected by placing multiple electrodes according to the
Manuscript received 29 November 2023; revised 17 February 2024; accepted 10–20 system [10], enabling the simultaneous capture of EEG
20 February 2024. Date of publication 30 April 2024; date of current version 18
November 2024. This work was supported in part by the Fundamental Research
signals from different regions of the brain.
Funds for Central Universities, SCUT, under Grant 2023ZYGXZR013, in part A typical task of multichannel EEG emotion recognition
by the Basic and Applied Basic Research Foundation of Guangzhou under comprises two main parts: feature extraction and classification.
Grant 202201010681, in part by the Fundamental Research Funds for Central
Universities, SCUT, under Grant 2023ZYGXZR086, in part by the National Key
For each EEG channel, the most used method of feature ex-
R&D Program of China under Grant 2022YFB4500600, in part by the National traction is to decompose the signal into five frequency bands
Natural Science Foundation of China under Grant 61802131, in part by the (δ (1–4 Hz), θ (4–8 Hz), α (8–14 Hz), β (14–30 Hz), and γ
Science and Technology Project of Guangdong under Grant 2022B0101010003,
in part by the Natural Science Foundation of Guangdong Province, China,
(30–50 Hz) [11]), and then extract features from each frequency
under Grant 2020A1515010781 and Grant 2019B010154003, in part by the band. Some studies [12], [13] have found that differential en-
Guangzhou Key Laboratory of Body Data Science, under Grant 201605030011, tropy (DE) and power spectral density (PSD) is closely associ-
in part by the Science and Technology Project of Zhongshan, under Grant
2019AG024, in part by the Guangdong Provincial Key Laboratory of Human
ated with emotional processes, making them commonly used for
Digital Twin, under Grant 2022B1212010004. Recommended for acceptance EEG emotion recognition. Traditional methods usually utilize
by Z. Zhang. (Corresponding author: Kailing Guo.) classical classifiers like support vector machine (SVM) and deep
Huachao Yan and Xiaofen Xing are with the South China University of
Technology, Guangzhou 510641, China.
belief network (DBN) [13] to classify the extracted features.
Kailing Guo is with the South China University of Technology, Guangzhou With the development of deep learning, feature extraction and
510641, China, and also with Pazhou Lab, Guangzhou 510335, China (e-mail: classification are merged into one process through neural net-
[email protected]).
Xiangmin Xu is with the South China University of Technology, Guangzhou
works [14], based on EEG signals or pre-extracted features.
510641, China, also with Pazhou Lab, Guangzhou 510335, China, and also with Inspired by the collaborative relationship between brain re-
the Institute of Artiffcal Intelligence, Heifei Comprehensive National Science gions, an increasing number of studies [1] are focusing on
Center, Hefei 230088, China.
Our code is available at https://github.com/LogzZ.
mapping the spatial relationships between EEG channels. Re-
Digital Object Identifier 10.1109/TAFFC.2024.3394873 current neural network (RNN) [15], [16], convolution neural

1949-3045 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
YAN et al.: BRIDGE GRAPH ATTENTION BASED GRAPH CONVOLUTION NETWORK WITH MULTI-SCALE TRANSFORMER 2043

network (CNN) [17], [18], [19], and graph neural network to construct the graph. Liu et al. [38] generalize GraphSAGE
(GNN) [20], [21], [22] has received much attention for their to multiple traffic graphs by incorporating their temporal cor-
spatial relationship modeling ability. Some studies [15], [16], relation. DropEdges [37] randomly drop a percentage of edges,
[23] predefined sequences of EEG electrodes from different which is a classic strategy for solving over-smoothing. Some
placing directions and input them into RNN models to extract the studies improved DropEdges by setting the dropping probability
spatial structure of multichannel EEG. However, reshaping the with attention mechanisms [39] and edge importance [40]. How-
electrodes for RNN will disrupt the original spatial structure of ever, applying DropNodes or DropEdges to graph-based EEG
EEG channels. CNN [18], [19] can preserve the spatial structure emotion recognition will lose the interpretability of brain con-
of EEG channels when extracting spatial features from multi- nectivity. Motivated by CNN, Li et al. [25] computed the prox-
channel EEG data. However, CNN requires mapping the spatial imity similarity of point clouds to build a graph and introduced
distribution of EEG channels to a matrix and assigning values skip connections into GCN, partially mitigating over-smoothing.
to non-existent EEG channels, which may introduce irrelevant However, feature maps at different layers have different impor-
noise. Additionally, according to the findings in neuroimaging, tance [41]. The skip connection lacks discriminability across
different brain regions often exhibit unstructured relationships different depths of the network.
with each other under different emotional states [24]. To tackle the above issue, this paper proposes bridge graph
Graph-based methods are able to cope with unstructured attention based GCN, namely, BGAGCN for EEG emotion
data directly and hold promise in describing the intrinsic spa- recognition. BGAGCN consists of two parts: the deep graph
tial relationships between different nodes in the graph [25]. convolution path and the bridge graph attention (BGAT) block.
For multichannel EEG emotion recognition, many studies con- Skip layer connection is performed in the deep graph convolution
struct the adjacency matrix from different aspects, e.g., Gaus- path, which allows the model to have a deeper depth for learning
sian kernel [26], [27], correlations [28], [29], and traceability high-order topological representations from multichannel EEG.
method [30]. However, these studies ignore the functional ac- The BGAT block bridges previous graph convolution layers to
tivities of the brain. Inspired by neuroscience, Song et al. [31] attention coefficients of the final graph convolution layer by
and Ye et al. [22] constructed adjacency matrices according adaptively combining each graph convolution output based on
to functional regions of the brain and showed better results. graph attention network (GAT). BGAGCN captures the vary-
In terms of utilizing the adjacency matrix, earlier study [20] ing importance of spatial features at different depths of GCN
considered the advantages of flexibility, which used the dynamic and enriches node information of the final layer to overcome
strategy to update the adjacency matrix, and there are also some over-smoothing. Since the activities of the brain include both
follow-ups [22], [32]. Recently, considering the high stability regional and global activity, similar to previous studies [22],
and low computational complexity of static adjacency matrix, [31], this paper also uses multi-graph data to simulate these two
some studies [28], [29], [30] employed static adjacency matrix types of activities of the brain.
in GCN for spatial learning of multichannel EEG and achieved Graph-based methods can be considered as building the de-
better performance. pendency among EEG channels through spatial relationships of
However, the above graph-based methods predominantly em- local regions in the graph space. However, during the generation
ployed shallow GCN to investigate the spatial relationships of emotional EEG, neurons in the brain are stimulated to gen-
among EEG channels, and this will raise the issue of insufficient erate neural oscillation across the whole cerebral cortex [42],
representational ability. If simply increasing the depth of GCN, which means that multichannel EEG also has non-local charac-
more neighboring nodes are taken into account, which can lead teristics. Transformer [43] is capable of establishing dense corre-
to the over-smoothing problem [33]. lations to explore non-local characteristics among tokens and has
In EEG emotion recognition, most studies attempted to ad- achieved state-of-the-art (SOTA) results in image classification
dress the insufficient representation ability by combining shal- and natural language processing. Recently, the transformer is
low GCN with other models. Zhang et al. [34] and Lin et also employed for constructing global dependencies of EEG
al. [35] connected a graph convolution layer and multiple 1D in different domains such as the time domain [44], frequency
convolution layers to construct deeper models for EEG emotion domain [45], and spatial domain [46], [47]. Among them, Sun
recognition. Ye et al. [22] proposed a hierarchical dynamic GCN et al. [46] combined transformer after graph learning to build
and paralleled it with two 1D CNN models to achieve improved global spatial dependency of multichannel EEG for emotion
performance. However, the issue of over-smoothing in graph recognition, which demonstrated that transformer and graph
methods for EEG emotion recognition remains unresolved. In learning are complementary to each other. But these studies
fact, the essence of graph convolution can be seen as a localized have all overlooked the multi-scale EEG impact on the spatial
filter, repeatedly using this method can mix neighbor nodes relationship of multichannel EEG.
and make the graph more confusing. This especially holds for Inspired by the finding of neuroscience that brain activity at
small-scale datasets with fewer nodes [25] like EEG datasets. different timescales shows distinct spatial characteristics [48],
Some studies have proposed solutions to deal with over- we propose multi-scale transformer (MT) in our model to rem-
smoothing. GraphSAGE [36] aggregates neighbor nodes to edy this gap. In MT, we regard each EEG channel in each sample
downsample nodes, which can be viewed as dropping nodes as a token and introduce multi-scale convolution to act on each
to mitigate over-smoothing [37]. GraphSAGE has been applied token for feature extraction. Subsequently, multi-head attention
on fake news assessment by using inclusive relationship of text (MHA) is employed to establish global spatial dependency

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
2044 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 15, NO. 4, OCTOBER-DECEMBER 2024

within different scales among all EEG channels. MT captures To address the sequencing issues, CNN is introduced as an
different spatial information from all the EEG channels with alternative method to learn the spatial features among multichan-
different scales after graph-based learning. nel EEG. The first step of this method is to map the multichannel
The contributions are summarized as follows: EEG into a matrix, and then employ 2D convolution to extract
1) The proposed BGAGCN treats information of different spatial features. Shen et al. [19] proposed a compact mapping of
depth in GCN for integration distinctly, which outperforms the 62 EEG channels into an 8×9 matrix for CNN learning. Li
other GNN methods in mitigating over-smoothing by pro- et al. [18] mapped the 62 EEG channels into a bigger mapping
ducing more distinguishable feature in the final layer of size of 20×20 matrix to cope with information loss caused by
GCN. pooling in deep CNN. Furthermore, Cui et al. [17] mapped the
2) Combining BGAGCN and MT, our model BGAGCN-MT 32 EEG channels into an 8×8 matrix and calculate the difference
has a strong representation capability that learns global of paired EEG channels to obtain the difference matrix, CNN
spatial dependency at different times-scales, which sur- is finally employed for capturing asymmetric features of the
passes the SOTA EEG emotion recognition methods. brain from the difference matrix. Although CNN can preserve
3) The feature extracted at different convolution scales in MT the spatial structure of multichannel EEG, it should be noted that
exhibits distinct spatial representations, which is consis- during the mapping process, nonexistent EEG channels need to
tent with the discovery in neuroscience. be zeroed or interpolated. This will introduce spatial noise that
is irrelevant to the actual spatial structure of multichannel EEG.

B. Graph Learning of EEG Emotion Recognition


II. RELATED WORKS
GNN has shown promising performance in handling non-
A. Regular Spatial Learning of Multichannel EEG Emotion
euclidean data, such as social networks [52], and bioprotein [53].
Recognition
In EEG emotion recognition, the GCN can reveal the intrinsic
The EEG-based method is an effective way for capturing topological relationship among EEG channels.
intrinsic emotional information from the brain. Feature ex- Constructing the adjacency matrix is a crucial part of the GCN.
traction and classification are two components of EEG emo- Song et al. [26] employed Gaussian kernel function to define
tion recognition. The extracted features from the EEG signal the adjacency matrix of GCN. Pearson correlation coefficient
can be divided into three types: time domain features (e.g., (PPC) [28], and euclidean distance [29], are utilized to measure
first difference and zero-crossing rate) [49], frequency domain the similarity between EEG channels to build an adjacency
features (e.g., PSD) [13], and time-frequency domain features matrix. Asadzaeh et al. [30] employed the result of standard-
(e.g., DE, wavelet transform (WT), short-time Fourier transform ized low-resolution electromagnetic tomography (sLORETA)
(STFT)) [50]. In fact, most studies [12], [13] commonly divided to construct spatial relationships between EEG channels. But
raw EEG data into five frequency bands (δ, θ, α, β, and γ) and these studies overlook the relationship among functional brain
then extracted relevant features from each band. Time-frequency regions of the brain. According to the finding of neuroimag-
domain features like DE and PSD features are widely employed ing brain activities that contain both global and local con-
in classification tasks due to their remarkable capability in nectivity [54]. Song et al. [31] and Ye et al. [22] construct
representing emotional EEG characteristics. the global and regional adjacency matrices to represent the
Recently, deep learning methods, which extract and classify global and functional connectivity of the brain, which achieved
features through neural networks, are popular in EEG emotion better results. Considering the benefits of adaptive learning,
recognition. Due to the remarkable achievements of RNN and early study [26] has used dynamic updating of the adjacency
CNN in natural language processing and computer vision, some matrix for representation learning, and subsequent studies [22],
studies have attempted to investigate the applicability of these [34] have explored it further. Compared to dynamic adjacency
methods for representing spatial structure of multichannel EEG matrices, static adjacency matrices have higher stability and
data. Zhang et al. [51] predefined two electrode sequences from lower computational complexity, many studies [28], [29], [30]
two placing directions and input them into a two-layer RNN have noticed the advantage of static adjacency matrices and
to extract the spatial structure of multichannel EEG. Then, employed them to define the spatial relationship of multichannel
Li et al. [15], [16] considered functional hemispheres of the EEG for emotion recognition. However, the above studies all
brain and predefined two electrode sequences within each hemi- employed one- or two-layer GCN models for spatial learning
sphere as inputs for the RNN to capture functional asymmetry. of multichannel EEG, which will cause a lack of representation
Li et al. [23] further designed a hierarchical model to learn the re- ability.
gion and global spatial structure of the brain. In this method, the Some studies attempt to combine GCN with other networks
electrode sequences in each functional region are predefined and to improve model performance. Lin et al. [35] employed 1D
fed into RNN to learn local spatial relationships. These locally convolution to capture the internal features of each EEG channel,
learned features are then sequenced into another RNN to obtain the convolved EEG channels are then used to build GCN for
the global spatial structure. However, using RNN-based methods learning inter-channel EEG features. Zhang et al. [34] stacked
to learn the spatial feature requires sequencing the multiple EEG a single-layer GCN and multiple 1D convolution layers to learn
channels, which damages the original spatial structure. topological and abstract features, and combined with BLS for

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
YAN et al.: BRIDGE GRAPH ATTENTION BASED GRAPH CONVOLUTION NETWORK WITH MULTI-SCALE TRANSFORMER 2045

Fig. 1. The framework of BGAGCN-MT for multichannel EEG emotion recognition. BGAGCN-MT consists of two main parts. The first part is the BGAGCN,
which aims to deepen the graph network while overcoming node over-smoothing. The second part is the MT, which is designed to learn non-local spatial features
and simulate brain activity at different scales with different spatial characteristics to improve EEG emotion recognition.

strengthening the feature representations. Du et al. [55] em- of multichannel EEG and improve the accuracy of emotion
ployed four different depths of 1D CNN for extracting multi- recognition.
dimensional features of EEG channel, features under all di- However, the above studies constructs spatial dependence
mensions were concatenated and then fed into a single-layer only at a single scale of EEG, which overlooks the impact of
GCN for learning spatial relationships of multichannel EEG temporal scales on spatial characteristics.
data. Although the studies mentioned above achieved decent
performance by employing shallow GCNs or combining them
with other networks, the issue of over-smoothing in deep GCN III. THE PROPOSED MODEL
has been overlooked. Inspired by the attention mechanism, we In this section, we will introduce the proposed model, includ-
consider the important information of each layer in GCN and ing the construction of multigraph, BGAGCN, and MT. The
reinforce the importance of emotion-related nodes overcome the overall framework of our model can be seen in Fig. 1.
over-smoothing problem.

C. Transformer A. Preliminaries and Preprocessing

Basically, a transformer is composed of multiple encoders and In order to record EEG signal for emotion recognition, emo-
decoders. Both encoder and decoder have the same structure tion induction experiments composed of multiple trials are con-
which includes multi-head self-attention layer, add and nor- ducted for each subject. The recorded EEG signal from each trial
malize layer, and feed-forward layer. Transformer is originally is decomposed into different frequency bands. For each band, the
designed to solve the sequential-related problem of RNN in entire EEG signal is divided into one-second segments, and then
natural language processing (NLP). Vaswani et al. [43] first pro- feature extraction methods, such as DE and PSD, are applied to
posed the transformer model based on the multi-head attention each segment.
mechanism, which effectively addressed the non-parallelization The extracted features from a set of EEG signals in one
problem in the NLP model. Benefiting from transformer’s ability trial are of the size C × F × T , where C is the number of
for building dense dependencies between tokens, transformer electrodes (channels), F denotes the number of frequency bands,
has also achieved great success in image classification [56] and and T is the number of segments. To obtain sufficient samples
object detection [57]. for training, the feature in each trial is split into non-overlap
In EEG emotion recognition, some studies also use trans- segments with the temporal length of D. The final segment is
formers in different domains for classification tasks. Gong discarded if its length is less than D. We denote each segment
et al. [45] employed a transformer after spatial and spectral is X ∈ RC×F ×D . When only a single frequency band of feature
learning to construct temporal global dependence of EEG. is utilized, the size of input feature is C × 1 × D and can
Wang et al. [47] proposed a hierarchical transformer to cap- be further reshaped into the size of C × D by squeezing the
ture both channel-level and region-level spatial dependen- frequency dimension. When all the frequency bands are used,
cies from multichannel EEG. Ma et al. [44] employed three we flatten the last dimensions of an input feature and obtain
transformers to perceive global dependencies of multichan- X ∈ RC×FD . Subsequently, X is right multiplied by a learnable
nel EEG from multi-domains including spatial, spectral, and transformation matrix to fuse the emotional information among
frequency domains. Sun et al. [46] applied a transformer different frequency bands:
after dual branch graph-based learning, which demonstrates
that transformer can enhance the graph-based spatial feature  = XP + B,
X (1)

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
2046 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 15, NO. 4, OCTOBER-DECEMBER 2024

Fig. 2. The region partition for 62-channel EEG and 14-channel EEG. The
EEG channels with the same color represent the same region.

 ∈ RC×D is denoted as an input sample of our model, P ∈


X
RF D×D is a a learnable transformation matrix and B ∈ RC×D
represents the bias matrix.

B. Construction of Mutligraph
According to neuroscience [54], the activity of the brain
can be regarded as two types of circuits: global dynamic and Fig. 3. The framework of the BGAGCN. GCL is the abbreviation of the graph
convolution layer.
regional functional connectivity. Inspired by this, we propose a
multigraph method for global and local parts to model the spatial
relationships of EEG channels.
For each sample, we construct a global graph GG = (V, EG ), and jth column of AR is defined as follows:
 ij
where V = {v 1 , v 2 , . . . , v C } represents the set of nodes (EEG aG , v i and v j are in the same region,
 and EG denotes the set aij = (5)
channels), i.e., v i is the ith row of X, R 0, otherwise.
of edges {(v i , v j )|i, j ∈ (1, C)}. Suppose AG ∈ RC×C is the
adjacency matrix for GG . The Gaussian kernel is commonly C. Bridge Graph Attention Based GCN
used for defining adjacency matrix but may result in noisy
To deepen the GCN and solve the node over-smoothing prob-
connections. It is typically believed that the connection between
lem at the same time, we design BGAGCN. As shown in Fig. 3,
two variables is stronger when they are accompanied by a high
BGAGCN consists of two parts: deep graph convolution path
correlation and a small spatial distance. Therefore, we modify
(DGCP) and bridge graph attention (BGAT) block.
the Gaussian kernel method by incorporating the use of PPC
Given the adjacency matrix A of a graph G, The normalized
and Manhattan Distance (MD) to filter out noisy connections. 1 1
Laplacian matrix can be expressed as L = In − D− 2 AD− 2 ,
PPC provides correlation information of two variables, and MD
In ∈ RC×C is an identity matrix, and D ∈ RC×C is degree
measures the distance between two vectors. The element in the
matrix. Since the operation for the global graph and region func-
ith row and jth column of AG is defined as follows:
tional graph are the same, we abuse L to denote the normalized
  
exp − ||v i − v j ||22 /2σ 2 , ρi,j ≥ ρτ , mi,j ≤ mτ , Laplacian matrix of either of them in the following.
ij
aG = The deep graph convolution path contains L1 graph convo-
0, otherwise, lution layers. K order Chebyshev polynomials [58] are em-
(2)
ployed in each convolution layer. Given the input of the lth
where σ is kernel size, and ρτ and mdτ are correlation and
Chebyshev graph convolution layer, the corresponding output
distance thresholds, respectively. The mathematical definitions
Ĥl+1 ∈ RC×D is given by:
of PCC and MD are given as follows: K−1

cov(v i , v j ) Ĥl+1 = σ1  Ĥl Wl ,
Tk (L) (6)
ρi,j = , (3) k
σv i σv j k=0

mi,j = v i − v j 1 , (4) where Wkl denotes a learnable transformation matrix, σ1 (·)


 is the Cheby-
denotes the ReLU activation function, and Tk (L)
where cov(v i , v j ) is covariance between v i and v j , and σvi and shev polynomial of order k evaluated at the scaled Laplacian
σvj are the standard deviations of v i and v j , respectively.  Here Tk (x) is computed by the stable recurrence relation
L.
We also define the region functional graph GR = (V, ER ) for Tk (x) = 2xTk−1 (x) − Tk−2 (x), where T0 = 1 and T1 = x, and
each sample, V is the set of nodes and ER is the set of edges. The  = 2L/λmax − In , in which λmax denotes the largest eigenvalue
L
detailed information can be seen in Fig. 2. For SEED, we divide of Laplace matrix L. For simplification, we use F(·, ·) to denote
62-channel system into 10 regions. Liksewise, for 14-channel
graph convolution operation and Ĥl+1 can be represented by:
system in Dreamer, we set its region label the same as the 62-
channel system and obtain 8 regions. The element in the ith row Ĥl+1 = F(Ĥl , Wkl ). (7)

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
YAN et al.: BRIDGE GRAPH ATTENTION BASED GRAPH CONVOLUTION NETWORK WITH MULTI-SCALE TRANSFORMER 2047

Then, we adopt graph residual learning to enable training of


deep graph network:
Hl+1 = F(Ĥl , Wkl ) + Ĥl . (8)
Although the skip connection can deepen the depth of the net-
work, the over-smoothing problem remains unresolved. Since
over-smoothing makes node representations become nearly in-
distinguishable as the number of layers increases, motivated
by [41], we enhance the feature node-wisely with attention
mechanism by bridging rich information from previous layers.
GAT [59] is first employed in the BGAT block to obtain the
hidden attention features. The output of each graph convolution Fig. 4. The 1D convolution on each EEG channel and attention mechanism
between EEG channels in MT.
layer and graph residual learning is taken as the input of GAT.
For Ĥl+1 , the ith row corresponds to the ith node, denoted its
transposition by v̂ l+1
i . In the following, we ignore the superscript
of v̂ l+1
i for simplification. In GAT, a shared linear transformation obtain:

weight matrix W ∈ RD ×D is applied to every node to embed Ẑ = HL1 +1  BA , (13)
the feature into the dimension of D , and then a single-layer feed
forward network parameterized by a weight vector a ∈ R2D is

where  denotes Hadamard product.
utilized for the shared attention mechanism. The coefficients To distinguish the weighted graph convolution features of
computed by the attention mechanism can be expressed as: global and region functional graphs, we denote them with ẐG
and ẐR , respectively. The final output of BGAGCN is the sum
exp σ2 a [Wv̂ i Wv̂ j ] of ẐG and ẐR :
ξij = 
, (9)
k∈Ni exp σ2 a [Wv̂ i Wv̂ k ] Z = ẐG + ẐR . (14)
where represents concatenation operation, σ2 (·) is
LeakyReLU activate function with negative input slope D. Multi-Scale Transformer
0.2, and Ni is the first order neighborhood of node i in the The purpose of the MT is to establish connections among non-
graph. To stabilize the learning process, multi-head attention is local EEG channels out of graph while taking into account the
applied. Suppose the number of attention heads is H. The final spatial characteristics at different scales of multichannel EEG.
output of GAT for v̂ i is MT is composed of positional encoding (PE), L2 encoders,
⎛ ⎞ each encoder includes multi-scale convolution, MHA layers, an
H

v̂ i = σ2 ⎝ h
ξij Wh v̂ j ⎠ , (10) add and normalize layer, and a feed-forward layer. PE is applied
h=1 j∈Ni only before the first encoder. Multi-scale convolution path is
composed of three 1D convolution paths with different kernel
h
where ξij is the attention coefficient calculated in the hth self- sizes. 1D convolution can capture comprehensive information
attention head, and Wh represents the corresponding linear around the current point in time and does not introduce heavy
transformation matrix. The output feature matrix of the lth GAT calculations like RNN, and different convolution kernels can
layer Hl+1
A ∈R
C×HD 
is composed of the feature vectors of enrich the information on features at different scales Therefore,
all the nodes, i.e., Hl+1
A = [v̂ 1l+1 , v̂ 2l+1 , . . . , v̂ C
l+1 T
] . Then, the we employ multiple 1D convolution paths for extracting multi-
feature of each channel is squeezed by linear transformation scale features for each EEG channel.

l
parameterized by a weight matrix WA ∈ RHD ×Da for the lth Fig. 4 depicts the convolution for the short-scale path and
layer, and the squeezed features are obtained by attention mechanisms acting between the EEG channels. We
use ks ∈ R1×3 , km ∈ R1×5 and kl ∈ R1×7 to denote the kernel
SlA = Hl+1 l
A WA . (11) sizes for short, medium, and long kernel sizes are denoted
Batch normalization (BN) is applied to each channel of the as ConvS, ConvM, ConvL, respectively, as shown in Fig. 1.
features for stable training. Then, we integrate the squeezed For simplification, we use kt to denote one of the kernels
features of all the L1 layers by summation. ReLU and linear (ks , km , kl ). The output from a certain convolution path can
transformation are applied to the summed features, and the be expressed as:
integrated features are treated as the attention weights of BGAT:  t = BN(Conv(Z, kt )),
Z (15)
L
 1
where Conv denotes applying 1D convolution on each row of
BA = σ 1 BN(SiA ) WA , (12)
the matrix and BN denotes batch normalization. We apply zero
i=1
padding operation to maintain the output of different convolution
where WA ∈ R Da ×D
is the weight matrix to recover the feature paths with the same shape. We treat each row of Z  t as a token
dimension to D. The attention weights BA are employed to in the transformer and the rest part of the MT is in line with the
weight the output of the deep graph convolution path and we regular transformer [43]. Three weight matrices WQt , WKt ,

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
2048 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 15, NO. 4, OCTOBER-DECEMBER 2024

and WV t of the size D × Dm are defined in the self-attention IV. EXPERIMENTS


mechanism to obtain the matrices of query Qt , key Kt , and
A. Dataset
value Vt , respectively. We have:
 t )WQt , In this section, we conduct extensive experiments on the
Qt = LN(Z (16) SEED [12], SEED-IV [60], and DREAMER [61] datasets to
 t )WKt ,
Kt = LN(Z (17) evaluate the performance of the proposed model.
SEED: SEED dataset is developed by the BCMI laboratory
 t )WV t ,
Vt = LN(Z (18) at SJTU [12]. The dataset includes 62-channel EEG signals
collected from 15 subjects (7 males and 8 females). Each subject
where LN(·) is layer normalization operation.
participates in three different periods corresponding to three
Each attention head can be expressed as:
 sessions. In each session, 15 movie clips are presented to induce

H t = Attention(Qt , Kt , Vt ) = softmax Q√ t Kt
Vt .
three types of emotional states (positive, negative, and neutral),
D i.e., five movie clips for each emotional state. Each movie clip
(19) corresponds to one EEG signal recording trial. SEED extracted
For MHA, we denote the attention result of the ith head as and released the five types of EEG features from 5 frequency
 h . To ensure uniformity in dimension across these paths, we
H bands (δ, θ, α, β, and γ). In this paper, we use the most commonly
t
maintain an identical number HA of heads for the multi-head used EEG emotion recognition features DE and PSD.
attention in each path. Each output of a single attention head is SEED-IV : Similar to the SEED dataset, SEED-IV [60]
concatenated along the feature dimension, and then the output of contains 15 subjects who participated in three sessions with
MHA is obtained by multiplying the concatenated matrix with 24 trials each. For each session, there are six movie clips for
a transformation matrix Wt ∈ RHA Dm ×Dc : each emotion and our emotional states are recorded including
⎛ ⎞ neutral, happy, sad, and fear. We also use the released DE and
HA
Zt = ⎝  h ⎠ Wt ,
H (20)
PSD features in our experiments.
t
DREAM ER: DREAMER, developed by scholars at
h=1
UWS [61], is a multimodal dataset that contains emotional EEG
By summing up the attention results of all the scales, we obtain and ECG data. The EEG signals are collected from 14 channels
the integrated multi-scale MHA result: with a sample rate of 128 Hz. We only use the EEG data in our
Z = Zs + Zm + Zl . (21) work. There are 23 participants (14 males and 9 females). For
each subject, 18 movie clips are used to induce emotional states
where Zs , Zm and Zl are the same size C × Dc . Then Z is fed such that there is a total of 414 trials. Unlike SEED, DREAMER
into add and normalized layer, and feed-forward layer to obtain assesses emotions through three dimensions: valence, arousal,
output. After L2 encoder learning, we obtain the final output of and dominance, each labeled with five levels of self-assessment.
MT, denoted by Z . Following prior works [31], [34], we split the levels into two
classes: the first two ratings for low arousal, valence, or domi-
E. Classification nance, and the last three for high arousal, valence, or dominance.
The learning criteria in our model is to provide the final DREAMER releases the EEG of subjects in the baseline state
emotion recognition results from the high-order representations and in the stimulated state. Following [31], [34], we only used
of multichannel EEG. Initially, we reshape the extracted features the last 60 s EEG signal of each trial for feature extraction.
into one-dimensional vectors and subsequently input them into
a multi-layer perceptron (MLP). The MLP consists of three B. Experimental Protocol
fully connected layers, with each layer being followed by a BN
layer and ReLU activation function, except for the last layer. Two types of experiments are conducted in this paper: subject-
Afterward, the Softmax function is applied to obtain the classi- dependent and subject-independent experiments.
fication probabilities, with the highest probability indicating the In subject-dependent experiments, the data from different
final classification of the input samples. We trained our model trials of the same subjects are used for training and testing.
by minimizing the cross-entropy loss between the classification Following [34], for each subject of the first session in SEED, the
results and the labels. The cross-entropy function is expressed first nine trials are used as the training data, and the remaining
as: six trials are used as the testing data. For SEED-IV, we also
N M use the first 2/3 number of trials of each subject in the first
1  session for training and leave the remaining trials for testing.
LCE = (−yn,m log(pn,m )−(1−yn,m log(pn,m ))),
N n=1 m=1 For the SEED and SEED-IV datasets, the final accuracy is
(22) derived by first obtaining the accuracy for each subject from
where pn,m represents the probability that the nth sample be- the last two sessions and then averaging these individual subject
longs to class m and yn,m ∈ {0, 1}, N is the total number of accuracies across all subjects to yield the final accuracy. For
input samples, and M represents the total number of classes. DREAMER, we follow previous works [26], [31], [34] and
After training, the classification score of an EEG sequence is use leave-one-trial-out cross-validation to evaluate the model
the average classification score of all its segments. performance.

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
YAN et al.: BRIDGE GRAPH ATTENTION BASED GRAPH CONVOLUTION NETWORK WITH MULTI-SCALE TRANSFORMER 2049

TABLE I
COMPARISONS OF THE RESULTS (ACC(%) / STD(%)) OF DIFFERNET METHODS FOR SUBJECT-DEPENDENT EXPERIMENTS ON SEED DATASET

For subject-independent experiments in three datasets, the can be seen from the result that all methods achieve the highest
training and testing data are both obtained from different sub- accuracy when taking all bands as input. It is worth noting
jects. that BGAGCN-MT achieves the new SOTA result (96.82%)
We apply leave-one-subject-out cross-validation as described on SEED when all frequency bands are taken as input and are
in [16], [67] to evaluate the performance of the proposed model. significantly outperforms the second-best method V-IAG.
Specifically, the data from one subject are used for testing and 2) Subject-Independent Experiments: To demonstrate the
the data from the remaining subjects are used for training. This generalization ability of BGAGCN-MT, we conduct subject-
process is repeated for each subject, and the results are averaged independent experiments on SEED. The results are reported
over all subjects. in Table II. All the compared methods in subject-dependent
experiments are used for subject-independent experiments if
C. Implementation Details the results have been reported in the corresponding papers.
We add four transfer learning methods (subspace alignment
The number of graph convolution layers L1 is set to 9, and the (SA) [64], transfer component analysis (TCA) [63], prototyp-
Chebyshev kernel size K is set to 3. The number of self-attention ical representation-based pairwise learning (PR-PL) [65], and
heads in BGAT is set to 3. The number of attention heads HA unsupervised dynamic domain adaptation (UDDA) [66]) for
and layers L2 in MT are set to 3 and 6. Adam is utilized for performance comparison.
optimization. The batch size is set to 32, the learning rate is In Table II, BGAGCN-MT achieves the highest accuracy of
0.001 and the training epochs are 150. Following [26], [34], [46], 89.66% with the standard deviation of 4.72% when using all fre-
we use mean accuracy (ACC) and standard deviation (STD) to quency band features as input. In the δ band, our model performs
evaluate the performance. Our model is trained with PyTorch on slightly worse than RGNN, but in the higher frequency bands (θ,
a Geforce RTX 2080Ti GPU. α, β, and γ bands), it performs much better than the second best
as well as the other compared methods. We can also observe
D. Experiment Results on SEED that all the methods perform better in the higher frequency
1) Subject-Dependent Experiment: We compare BGAGCN- bands. This indicates that higher frequency bands contain more
MT with two baseline methods (SVM and DBN) [13], two subject-invariant information, which is also consistent with the
deep learning methods without graph structure (bi-hemisphere fact that the higher frequency bands are associated with more
domain adversarial neural network (BiDANN) [16], regional general cognitive processes [68]. BGAGCN-MT outperforms
to global spatial-temporal neural network (R2G-STNN) [23]), the other methods, even better than the latest transfer learning
and seven graph-based methods (graph convolution neural net- methods (PR-PL [65] and UDDA [66]), which suggests that it
work (GCNN) [26], dynamic GCNN (DGCNN) [26], regular- is able to learn more subject-invariant representations.
ized graph neural networks (RGNN) [21], GCBNet+BLS [34], 3) Performance Comparison With Different Features: To
variational instance-adaptive graph (V-IAG) [31], multi-Scale evaluate the impact of different features on our model, we
feature reconstruction graph convolutional network (MSFR- compare DE with another popular feature PSD. We present
GCN) [62], and graph-based multi-task self-supervised learning the results in Table III. We can see that, when replacing the
(GMSS) [27]). input feature from DE to PSD, all the methods degenerate.
The subject-dependent recognition results on SEED are However, BGAGCN-MT still outperforms the other methods.
shown in Table I. BGAGCN-MT outperforms the other methods This shows that the superiority of BGAGCN-MT is invariant to
when a single frequency band is used except for the θ band. input features.
However, it is the second best and the accuracy is significantly
higher than the other methods. The accuracy on α, β, and γ bands
E. Experiment Results on SEED-IV
are much better than that on δ and θ bands for all the methods.
It can be inferred that the high-frequency component contains We conduct both subject-dependent and subject-independent
more relevant emotional information, which is consistent with experiments on SEED-IV and report the results in Table IV.
previous studies [12], [16]. Nevertheless, the low-frequency We carefully choose the eleven models from two groups for
component still carries some useful emotional information. This further comparison, including two baseline methods (SVM [13],

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
2050 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 15, NO. 4, OCTOBER-DECEMBER 2024

TABLE II
COMPARISONS OF THE RESULTS (ACC(%) / STD(%)) OF DIFFERNET METHODS FOR SUBJECT-INDEPENDENT EXPERIMENTS ON SEED DATASET

TABLE III
THE RESULTS (ACC(%) / STD(%)) OF SUBJECT-DEPENDENT EXPERIMENT
WITH DIFFERENT TYPES OF INPUT FEATURES ON SEED DATASET

Fig. 5. Confusion matrices on SEED dataset. (a) and (b) are the results of
subject-dependent and subject-independent experiments, respectively.
TABLE IV
COMPARISONS OF THE RESULTS (ACC(%) / STD(%)) OF DIFFERNET METHODS
FOR SUBJECT-DEPENDENT AND SUBJECT-INDEPENDENT EXPERIMENT ON
SEED-IV

Fig. 6. Confusion matrices on SEED-IV dataset. (a) and (b) are the results of
subject-dependent and subject-independent experiments, respectively.

emotion, indicating that it is much easier to recognize than


SA [64]) and nine deep learning methods (attention-LSTM the neutral and sad emotions for each subject. The results of
(A-LSTM) [69], DGCNN [26], BiDANN [16], BiHDM [15], subject-independent experiments in Fig. 5(b) are similar to those
RGNN [21], MWACN [71], PR-PL [65], HVF2N-DBR [70] and of subject-dependent experiments, which implies that happy
UDDA [66]). BGAGCN-MT significantly surpasses the other emotion contains more subject-invariant information. Our model
methods, reaching an accuracy of 82.86% for subject-dependent is comparable to or outperforms the SOTA methods MSFR-
and 75.78% for subject-independent experiments. Note that GCN [62], GMSS [27], V-IAG [31] on most emotions, and the
compared to SEED, SEED-IV is a much larger dataset which performance gains for sad emotion are significant.
with more emotional categories and samples. By leveraging the For the SEED-IV dataset, the accuracy of subject-dependent
advantages of deep graph learning and MT, our model also works experiments on the SEED-IV dataset is 84.48% for neutral
well on more complex datasets. We also perform experiments emotion, while the lowest accuracy is 80.76% for happy emotion
on our model with different frequency bands and features. The (Fig. 6(a)). This observation is different from the results on
results, shown in Table I of the Appendix, are consistent with SEED, suggesting that the presence of more types of emotions
previous results. potentially introduces more interference for each subject. In
addition, happy emotion is more likely to be confused with fear,
which may indicate that each subject produces similar emotional
F. Performance Comparison on Different Emotions manners in happy and fear emotions. In Fig. 6, for subject-
For deeper understanding of the classification performance of independent experiments on SEED-IV, happy and neutral emo-
BGAGCN-MT, the confusion matrices of our model on SEED tions are much easier to distinguish than sad emotions, which
and SEED-IV are shown in Figs. 5 and 6, respectively. is consistent with Fig. 5(b). In subject-independent experiments
For the SEED dataset, the accuracy of subject-dependent on SEED and SEED-IV, happy and neutral emotions exhibit a
experiments for all three types of emotional states is above smaller decline in accuracy compared to other emotions. This
95%. Specifically, the accuracy is up to 97.97% for the happy might indicate that happy and neutral emotion contains more

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
YAN et al.: BRIDGE GRAPH ATTENTION BASED GRAPH CONVOLUTION NETWORK WITH MULTI-SCALE TRANSFORMER 2051

TABLE V TABLE VI
COMPARISONS OF THE RESULTS (ACC(%) / STD(%)) OF DIFFERNET METHODS COMPARISONS OF THE RESULTS OF AVERAGE EUDCLIDEAN DISTANCE (AED)
IN SUBJECT-DEPENDENT EXPERIMENT WITH PSD FEATURES ON DREAMER FOR ALL SAMPLES IN SEED FROM EACH NODE TO THE CENTER AFTER T-SNE
FOR DIFFERENT MODELS

TABLE VII
generalized emotional patterns. Compared to SOTA methods THE RESULTS (ACC(%) / STD(%)) OF ABLATION STUDIES IN
GMSS [27], MSFR-GCN [62], our model achieves comparable SUBJECT-DEPENDENT AND SUBJECT-INDEPENDENT EXPERIMENTS ON SEED
AND SEED-IV
or best performance on most emotions, and the superiority on
happy emotion is significant.

G. Experiment Results on DREAMER


The results of the three binary classification tasks on
DREAMER with PSD are given in Table V. Two baseline
methods (SVM [13], group sparse canonical correlation analysis
(GSCCA) [72]), and four graph-based methods (graph regu-
larized sparse linear discriminant analysis (GraphSLDA) [73], TABLE VIII
DGCNN [26], GCBNet+BLS [34], and V-IAG [31]) are se- THE RESULTS (ACC(%) / STD(%)) OF ABLATION STUDIES IN
lected as the competitors. Note that valance, arousal, and domi- SUBJECT-DEPENDENT AND SUBJECT-INDEPENDENT EXPERIMENTS WITH
GLOBAL OR REGION GRAPHS
nance describe affective states differently from discrete emotion
classification, but BGAGCN-MT still outperforms the other
comparison methods for all the classification tasks. Our model
has the best accuracy of 93.92%, 94.60%, and 94.75% in the
dimensions of valence, arousal, and dominance, respectively.
This improvement is by 1.1%, 1.51%, and 5.55%, respectively,
compared to the second-best methods. This demonstrates that
BGAGCN-MT is suitable for learning the intrinsic affective state
of EEG. We further conduct experiments on different frequency
bands and features and the results are also consistent with the SEED and SEED-IV. The results are listed in Table VII, in which
previous sections (refer to Table II in the Appendix). BGAGCN+Rtrans denotes BGAGCN with regular transformer,
i.e., full model without multi-scale convolution. Compared to
DGCP, w/ BGAGCN achieves significant improvement. This
H. Ablation Studies
shows that the proposed BGAT block can effectively enhance
1) The Effect of Mitigating Over-Smoothing: To gain features’ discriminative ability and improve the performance of
a clearer understanding of BGAT’s effect, we select the model. The results of w/ BGAGCN are lower than that of w/
five models from DropNodes (GraphSAGE [36] and BGAGCN+Rtrans. This means regular transformer has the abil-
SimSparseGCN [74]), DropEdges (ResGCN+Dropedge [37] ity to learn non-local relationships after graph-based learning for
and ResGCN+CruvDrop [75]), and skip connections (Res- performance improvement. Comparing w/ BGAGCN+Rtrans
GCN) [25] as the competitors for mitigating over-smoothing. and the full model, removing the multi-scale convolution lead
We remove MT from our model and conduct a comparison to performance degradation, which indicates the distinct spatial
by substituting BGAGCN with the aforementioned models. characteristics at different scales of multichannel EEG can be
For BGAGCN, the average euclidean distances of features Z effectively captured by the proposed MT.
for all samples in SEED from each node to the center after To study the effect of graph construction method, we also
t-distributed stochastic neighbor embedding (t-SNE) [76] are conduct ablation experiments by separately taking global and
calculated, and the results are shown in Table VI. The average regional graphs as input, denoted by BGAGCN+MT-R and
euclidean distance and accuracy reaches 0.51 and 94.15% for BGAGCN+MT-G, respectively. The results are shown in Ta-
BGAGCN, respectively, surpassing the current SOTA methods ble VIII.
for mitigating over-smoothing. The features Z are visualized Both BGAGCN+MT-R and BGAGCN+MT-G exhibit declin-
with t-SNE in Fig. 1 of the Appendix. ing accuracy compared to the full model, implying that com-
2) Module Analysis in BGAGCN-MT: To analyze the con- prehensively considering both global and regional functional
tribution of different modules in the proposed model compre- connectivity of the brain can enhance the performance of EEG
hensively, we perform ablation studies on our model using the emotion recognition.

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
2052 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 15, NO. 4, OCTOBER-DECEMBER 2024

Fig. 7. Three types of region partition methods of 62 EEG channels. containing


7 regions, 10 regions, and 17 regions.

TABLE IX Fig. 8. Visualization of topographic maps of all EEG channels before and after
THE RESULTS OF DIFFERENT PARTITION METHODS IN SUBJECT-DEPENDENT convolution.
AND SUBJECT-INDEPENDENT EXPERIMENTS ON SEED

characteristics, which are consistent with existing neuroscience


discovery [48].

V. CONCLUSION
In this article, we propose a novel multichannel EEG emotion
3) Performance Comparison With Different Subgraphs: The
recognition model based on graph-learning methods and the
human brain consists of various functional regions that be-
transformer. We design the bridge graph attention based graph
come active during different emotional processes [14], different
convolution network to improve the importance of the nodes
partition methods of functional regions will lead to different
in the graph, addressing the over-smoothing problem caused
subgraphs. Following [31], we divide 62 EEG channels into 7
by increasing the depth of the graph model. Moreover, the
regions, 10 regions, and 17 regions (illustrated in Fig. 7) to study
multi-scale transformer is designed to learn non-local spatial
the impact of different partitioning methods on the performance
features among EEG channels and utilze spatial characteristics
of our model. The results are shown in Table IX. The 10 regions
at EEG different scales to improve emotion recognition ability.
partition method consistently achieves the highest accuracy
Comprehensive experiments are conducted on different types
when using different features as input. The results imply that
of datasets. The results indicate that our model successfully
more partitions may introduce more interference, leading to per-
overcomes the over-smoothing problem and effectively extracts
formance degradation. Fewer brain regions also show accuracy
distinct spatial characteristics at different scales of EEG. Our
reduction. This may be due to inadequate functional activity
model provides an effective approach for information mining in
connections.
multichannel EEG emotion recognition.
I. Analysis of Multi-Scale Convolution in Transformer
REFERENCES
The introduction of multi-scale convolution in MT is inspired
by neuroscience priors. To elucidate its relationship with exist- [1] Y. Wang et al., “A systematic review on affective computing: Emotion
models, databases, and recent advances,” Inf. Fusion, vol. 83/84, pp. 19–52,
ing neurophysiological discoveries, we analyze feature maps 2022.
before and after convolution. Multi-scale entropy (MSE), a [2] Y. Li, G. Lu, J. Li, Z. Zhang, and D. Zhang, “Facial expression recognition
well-established tool for exploring EEG patterns in various in the wild using multi-level features and attention mechanisms,” IEEE
Trans. Affect. Comput., vol. 14, no. 1, pp. 451–462, Jan.-Mar. 2023.
scenarios [48], is employed in this analysis. MSEs for correctly [3] Y. Li, Z. Zhang, B. Chen, G. Lu, and D. Zhang, “Deep margin-sensitive
categorized samples in the last encoder at three different scales representation learning for cross-domain facial expression recognition,”
both before and after convolution are calculated. The scales IEEE Trans. Multimedia, vol. 25, pp. 1359–1373, 2023.
[4] Y. Li, J. Huang, S. Lu, Z. Zhang, and G. Lu, “Cross-domain facial
of MSE align with the sizes of the convolution kernels. The expression recognition via contrastive warm up and complexity-aware
calculated MSE values are averaged for each EEG channel, and self-training,” IEEE Trans. Image. Process., vol. 32, pp. 5438–5450,
the results are visualized by the topographic map, as shown in 2023.
[5] A. Kleinsmith and N. Bianchi-Berthouze, “Affective body expression
Fig. 8. The after convolution results reveal enhanced functional perception and recognition: A survey,” IEEE Trans. Affect. Comput., vol. 4,
connectivity in specific brain regions. For short scales, connec- no. 1, pp. 15–33, Jan.-Mar. 2013.
tions in prefrontal regions are strengthened, with fewer changes [6] R. Harper and J. Southern, “A Bayesian deep learning framework for
end-to-end prediction of emotion from heartbeat,” IEEE Trans. Affect.
in the left parietal and right parietal lobes. For medium scales, the Comput., vol. 13, no. 2, pp. 985–991, Apr.-Jun. 2022.
prefrontal lobes exhibit more obvious strengthening, and internal [7] X. Zhang et al., “Fusing of electroencephalogram and eye movement with
connections among left parietal regions show increased strength group sparse canonical correlation analysis for anxiety detection,” IEEE
Trans. Affect. Comput., vol. 13, no. 2, pp. 958–971, Apr.-Jun. 2022.
and extent of connectivity. Large scales highlight enhanced [8] J. Shukla, M. Barreda-Angeles, J. Oliver, G. C. Nandi, and D. Puig,
strength and extent of internal connections in the prefrontal “Feature extraction and selection for emotion recognition from electro-
lobes, along with improved internal connectivity among the dermal activity,” IEEE Trans. Affect. Comput., vol. 12, no. 4, pp. 857–869,
Oct.-Dec. 2021.
both right and left parietal lobes regions. These findings indicate [9] X. Li et al., “EEG based emotion recognition: A tutorial and review,” ACM
that brain activity at different timescales shows distinct spatial Comput. Surv., vol. 55, no. 4, pp. 1–57, 2022.

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
YAN et al.: BRIDGE GRAPH ATTENTION BASED GRAPH CONVOLUTION NETWORK WITH MULTI-SCALE TRANSFORMER 2053

[10] R. Oostenvelda and P. Praamstrac, “The five percent electrode system [32] R. Jenke, A. Peer, and M. Buss, “Feature extraction and selection for
for high-resolution EEG and ERP measurements,” Clin. Neurophysiol., emotion recognition from EEG,” IEEE Trans. Affect. Comput., vol. 5, no. 3,
vol. 112, pp. 713–719, 2001. pp. 327–339, Jul.-Sep. 2014.
[11] B. Garcia-Martinez, A. Martinez-Rodrigo, R. Alcaraz, and A. Fernandez- [33] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional
Caballero, “A review on nonlinear methods using electroencephalographic networks for semi-supervised learning,” in Proc. 32nd AAAI Conf. Artif.
recordings for emotion recognition,” IEEE Trans. Affect. Comput., vol. 12, Intell., vol. 32, 2018, pp. 3538–3545.
no. 3, pp. 801–820, Jul.-Sep. 2021. [34] T. Zhang, X. Wang, X. Xu, and C. L. P. Chen, “GCB-Net: Graph convo-
[12] R. Duan, J. Zhu, and B. Lu, “Differential entropy feature for EEG-based lutional broad network and its application in emotion recognition,” IEEE
emotion classification,” in Proc. 6th Int. IEEE/EMBS Conf. Neural Eng., Trans. Affect. Comput., vol. 13, no. 1, pp. 379–388, Jan.-Mar. 2022.
2013, pp. 81–84. [35] X. Lin, J. Chen, W. Ma, W. Tang, and Y. Wang, “EEG emotion recognition
[13] W. Zheng and B. Lu, “Investigating critical frequency bands and using improved graph neural network with channel selection,” Comput.
channels for EEG-based emotion recognition with deep neural net- Methods Programs Biomed., vol. 231, 2023, Art. no. 107380.
works,” IEEE Trans. Auton. Ment. Develop., vol. 7, no. 3, pp. 162–175, [36] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation
Sep. 2015. learning on large graphs,” in Proc. 30th Conf. Neural Inf. Process. Syst.,
[14] S. M. Alarcao and M. J. Fonseca, “Emotions recognition using EEG sig- 2017, pp. 1025–1035.
nals: A survey,” IEEE Trans. Affect. Comput., vol. 10, no. 3, pp. 374–393, [37] Y. Rong, W. Huang, T. Xu, and J. Huang, “DropEdge: Towards deep graph
Jul.–Sep. 2019. convolutional networks on node classification,” in Proc. 6th Int. Conf.
[15] Y. Li et al., “A novel bi-hemispheric discrepancy model for EEG emotion Learn. Representations, 2020, pp. 1–11.
recognition,” IEEE Trans. Cogn. Develop. Syst., vol. 13, no. 2, pp. 354– [38] T. Liu, A. Jiang, J. Zhou, M. Li, and H. K. Kwan, “GraphSAGE-based
367, Jun. 2021. dynamic spatial–temporal graph convolutional network for traffic predic-
[16] Y. Li, W. Zheng, Y. Zong, Z. Cui, T. Zhang, and X. Zhou, “A tion,” IEEE Trans. Intell. Transp. Syst., vol. 24, no. 10, pp. 11210–11224,
Bi-hemisphere domain adversarial neural network model for EEG Oct. 2023.
emotion recognition,” IEEE Trans. Affect. Comput., vol. 12, no. 2, [39] Y. Liu, Y. Deng, J. Su, R. Wang, and C. Li, “Multiple input branches
pp. 494–504, Apr.-Jun. 2021. shift graph convolutional network with dropedge for skeleton-based action
[17] H. Cui, A. Liu, X. Zhang, X. Chen, K. Wang, and X. Chen, “EEG- recognition,” in Proc. 21st Int. Conf. Image Anal. Process., 2022, pp. 584–
based emotion recognition using an end-to-end regional-asymmetric 596.
convolutional neural network,” Knowl. Based Syst., vol. 205, 2020, [40] C. Duong, L. Zhang, and C.-T. Lu, “HateNet: A graph convolutional
Art. no. 106243. network approach to hate speech detection,” in Proc. Int. Conf. Big Data
[18] J. Li, Z. Zhang, and H. He, “Hierarchical convolutional neural networks Smart Comput., 2022, pp. 5698–5707.
for EEG-based emotion recognition,” Cogn. Comput., vol. 10, no. 2, [41] Y. Zhao, J. Chen, Z. Zhang, and R. Zhang, “BA-Net: Bridge attention for
pp. 368–380, 2017. deep convolutional neural networks,” in Proc. Eur. Conf. Comput. Vis.,
[19] F. Shen, G. Dai, G. Lin, J. Zhang, W. Kong, and H. Zeng, “EEG-based 2022, pp. 297–312.
emotion recognition using 4D convolutional recurrent neural network,” [42] J. D. Semedo et al., “Feedforward and feedback interactions between visual
Cogn. Neurodynamics, vol. 14, no. 6, pp. 815–828, 2020. cortical areas use different population activity patterns,” Nat. Commun.,
[20] Z. Yin, M. Zhao, Y. Wang, J. Yang, and J. Zhang, “Recognition of emotions vol. 13, no. 1, 2022, Art. no. 1099.
using multimodal physiological signals and an ensemble deep learning [43] A. Vaswani et al., “Attention is all you need,” in Proc. 31st Conf. Neural
model,” Comput. Methods Programs Biomed., vol. 140, pp. 93–110, pp. Inf. Process. Syst., 2017, pp. 6000–6010.
2017. [44] Y. Ma, Y. Song, and F. Gao, “A novel hybrid CNN-transformer model for
[21] P. Zhong, D. Wang, and C. Miao, “EEG-based emotion recognition using EEG motor imagery classification,” in Proc. Int. J. Conf. Neural Netw.,
regularized graph neural networks,” IEEE Trans. Affect. Comput., vol. 13, 2022, pp. 1–8.
no. 3, pp. 1290–1301, Jul.-Sep. 2022. [45] L. Gong, M. Li, T. Zhang, and W. Chen, “EEG emotion recognition
[22] M. Ye, C. L. P. Chen, and T. Zhang, “Hierarchical dynamic graph con- using attention-based convolutional transformer neural network,” Biomed.
volutional network with interpretability for EEG-based emotion recogni- Signal Process. Control, vol. 84, 2023, Art. no. 104835.
tion,” IEEE Trans. Neural Netw. Learn. Syst., early access, Dec. 9, 2022, [46] M. Sun, W. Cui, S. Yu, H. Han, B. Hu, and Y. Li, “A dual-branch dynamic
doi: 10.1109/TNNLS.2022.3225855. graph convolution based adaptive transformer feature fusion network for
[23] L. Yang, Z. Wenming, W. Lei, Z. Yuan, and C. Zhen, “From regional to EEG emotion recognition,” IEEE Trans. Affect. Comput., vol. 13, no. 4,
global brain: A novel hierarchical spatial-temporal neural network model pp. 2218–2228, Oct.-Dec. 2022.
for EEG emotion recognition,” IEEE Trans. Affect. Comput., vol. 13, no. 2, [47] Z. Wang, Y. Wang, C. Hu, Z. Yin, and Y. Song, “Transformers for EEG-
pp. 568–578, Apr.-Jun. 2022. based emotion recognition: A hierarchical spatial information learning
[24] R. J. Davidson, H. Abercrombie, J. B. Nitschke, and K. Putnam, “Regional model,” IEEE Sens. J., vol. 22, no. 5, pp. 4359–4368, Mar. 2022.
brain function, emotion and disorders of emotion,” Curr. Opin. Neurobiol., [48] S. Li and X.-J. Wang, “Hierarchical timescales in the neocortex: Mathe-
vol. 9, no. 2, pp. 228–234, 1999. matical mechanism and biological insights,” Proc. Nat. Acad. Sci. USA,
[25] G. Li, M. Muller, A. Thabet, and B. Ghanem, “DeepGCNs: Can GCNs vol. 119, no. 6, 2022, Art. no. e2110274119.
go as deep as CNNs?,” in Proc. IEEE 17th Int. Conf. Comput. Vis., 2019, [49] H. Cui, A. Liu, X. Zhang, X. Chen, J. Liu, and X. Chen, “EEG-based
pp. 9267–9276. subject-independent emotion recognition using gated recurrent unit and
[26] T. Song, W. Zheng, P. Song, and Z. Cui, “EEG emotion recognition minimum class confusion,” IEEE Trans. Affect. Comput., vol. 14, no. 4,
using dynamical graph convolutional neural networks,” IEEE Trans. Affect. pp. 2740–2750, Oct.-Dec. 2023.
Comput., vol. 11, no. 3, pp. 532–541, Jul.-Sep. 2020. [50] W. Zheng, J. Zhu, and B. Lu, “Identifying stable patterns over time for
[27] Y. Li et al., “GMSS: Graph-based multi-task self-supervised learning for emotion recognition from EEG,” IEEE Trans. Affect. Comput., vol. 10,
EEG emotion recognition,” IEEE Trans. Affect. Comput., vol. 14, no. 3, no. 3, pp. 417–429, Jul.-Sep. 2019.
pp. 2512–2525, Jul.-Sep. 2023, doi: 10.1109/TAFFC.2022.3170428. [51] T. Zhang, W. Zheng, Z. Cui, Y. Zong, and Y. Li, “Spatial-temporal recurrent
[28] M. Li, M. Qiu, W. Kong, L. Zhu, and Y. Ding, “Fusion graph representation neural network for emotion recognition,” IEEE Trans. Cybern., vol. 49,
of EEG for emotion recognition,” Sensors (Basel), vol. 23, no. 3, 2023, no. 3, pp. 839–847, Mar. 2019.
Art. no. 1404. [52] J. Yu, H. Yin, J. Li, M. Gao, Z. Huang, and L. Cui, “Enhancing social
[29] T. Chen, Y. Guo, S. Hao, and R. Hong, “Exploring self-attention graph recommendation with adversarial graph convolutional networks,” IEEE
pooling with EEG-based topological structure and soft label for depression Trans. Knowl. Data. Eng., vol. 34, no. 8, pp. 3727–3739, Aug. 2022.
detection,” IEEE Trans. Affect. Comput., vol. 28, no. 4, pp. 2016–2118, [53] M. Zitnik and J. Leskovec, “Predicting multicellular function through
Oct.-Dec. 2022. multi-layer tissue networks,” Bioinformatics, vol. 33, no. 14, pp. i190–
[30] S. Asadzadeh, T. Y. Rezaii, S. Beheshti, and S. Meshgini, “Accu- i198, 2017.
rate emotion recognition utilizing extracted EEG sources as graph [54] M. Rubinov and O. Sporns, “Complex network measures of brain connec-
neural network nodes,” Cogn. Comput., vol. 15, pp. 176–189, tivity: Uses and interpretations,” Neuroimage, vol. 52, no. 3, pp. 1059–
2023. 1069, 2010.
[31] T. Song et al., “Variational instance-adaptive graph for EEG emotion [55] G. Du et al., “A multi-dimensional graph convolution network for
recognition,” IEEE Trans. Affect. Comput., vol. 14, no. 1, pp. 343–356, EEG emotion recognition,” IEEE Trans. Instrum. Meas., vol. 71, 2022,
Jan.-Mar. 2023. Art. no. 2518311.

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.
2054 IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 15, NO. 4, OCTOBER-DECEMBER 2024

[56] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, and X. Zhai, Huachao Yan is currently working toward the PhD
“An image is worth 16x16 words: Transformers for image recognition at degree with the School of Electronic and Information
scale,” in Proc. 8th Int. Conf. Learn. Representations, 2020, pp. 1–22. Engineering, South China University of Technology.
[57] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. His research interests include EEG emotion recogni-
Zagoruyko, “End-to-end object detection with transformers,” in Proc. Eur. tion and affective computing.
Conf. Comput. Vis., 2020, pp. 213–229.
[58] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural
networks on graphs with fast localized spectral filtering,” in Proc. 30th
Conf. Neural Inf. Process. Syst., 2017, pp. 3844–3852.
[59] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio,
“Graph attention networks,” in Proc. 6th Int. Conf. Learn. Representations,
2018, pp. 1–12.
[60] W. L. Zheng, W. Liu, Y. Lu, B. L. Lu, and A. Cichocki, “EmotionMeter:
A multimodal framework for recognizing human emotions,” IEEE Trans.
Cybern., vol. 49, no. 3, pp. 1110–1122, Mar. 2019. Kailing Guo (Member, IEEE) received the PhD de-
[61] S. Katsigiannis and N. Ramzan, “DREAMER: A database for emotion gree from the South China University of Technology,
recognition through EEG and ECG signals from wireless low-cost off-the- Guangzhou, China. He is currently an associate pro-
shelf devices,” IEEE J. Biomed. Health Inform., vol. 22, no. 1, pp. 98–107, fessor with the School of Electronic and Information
Jan. 2018. Engineering, South China University of Technology.
[62] D. Pan et al., “MSFR-GCN: A multi-scale feature reconstruction graph His research interests include low-rank and sparse
convolutional network for eeg emotion and cognition recognition,” IEEE learning, deep learning optimization and model com-
Trans. Neural Syst. Rehabil. Eng., vol. 31, pp. 3245–3254, 2023. pression, multimodal human data processing.
[63] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via
transfer component analysis,” IEEE Trans. Neural Netw., vol. 22, no. 2,
pp. 199–210, Feb. 2011.
[64] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised
visual domain adaptation using subspace alignment,” in Proc. IEEE Int.
Conf. Comput. Vis., 2013, pp. 2960–2967.
[65] R. Zhou et al., “PR-PL: A novel prototypical representation based Xiaofen Xing (Member, IEEE) received the BS, MS,
pairwise learning framework for emotion recognition using EEG sig- and PhD degrees from the South China University
nals,” IEEE Trans. Affect. Comput., early access, Jun. 23, 2023, of Technology, Guangzhou, China, in 2001, 2004,
doi: 10.1109/TAFFC.2023.3288118. and 2013, respectively. Since 2017, she has been
[66] Z. Li et al., “Dynamic domain adaptation for class-aware cross-subject and an associate professor with the School of Electronic
cross-session EEG emotion recognition,” IEEE J. Biomed. Health Inform., and Information Engineering, South China University
vol. 26, no. 12, pp. 5964–5973, 2022. of Technology. Her main research interests include
[67] W. Zheng and B. Lu, “Personalizing EEG-based affective models with speech emotion analysis, image/video processing,
transfer learning,” in Proc. 34th Int. Joint Conf. Artif. Intell., 2016, and human computer interaction.
pp. 2732–2738.
[68] K. Yang, L. Tong, J. Shu, N. Zhuang, B. Yan, and Y. Zeng, “High gamma
band EEG closely related to emotion: Evidence from functional network,”
Front. Hum. Neurosci., vol. 14, 2020, Art. no. 89.
[69] T. Song, W. Zheng, C. Lu, Y. Zong, X. Zhang, and Z. Cui, “MPED: A multi-
modal physiological emotion database for discrete emotion recognition,” Xiangmin Xu (Senior Member, IEEE) received the
IEEE Access, vol. 7, pp. 12177–12191, 2019. PhD degree from the South China University of Tech-
[70] W. Guo, G. Xu, and Y. Wang, “Horizontal and vertical features fusion nology, Guangzhou, China. He is currently a full pro-
network based on different brain regions for emotion recognition,” Knowl. fessor with the School of Electronic and Information
Based Syst., vol. 247, 2022, Art. no. 108819. Engineering and the School of Future Technology,
[71] L. Zhu et al., “Multisource wasserstein adaptation coding network for EEG South China University of Technology. His recent
emotion recognition,” Biomed. Signal Process. Control, vol. 76, 2022, research focuses on image/video processing, human–
Art. no. 103687. computer interaction, computer vision, and machine
[72] W. Zheng, “Multichannel EEG-based emotion recognition via group learning.
sparse canonical correlation analysis,” IEEE Trans. Cogn. Develop. Syst.,
vol. 9, no. 3, pp. 281–290, Sep. 2017.
[73] L. Yang, Z. Wenming, C. Zhen, and Z. Xiaoyan, “A novel graph regularized
sparse linear discriminant analysis model for EEG emotion recognition,”
in Proc. 23 rd Int. Conf. Neural Inf. Process., 2016, pp. 175–182.
[74] G. Wu, S. Lin, Y. Zhuang, and J. Qiao, “Alleviating over-smoothing
via graph sparsification based on vertex feature similarity,” Appl. Intell.,
vol. 53, no. 17, pp. 20223–20238, 2023.
[75] Y. Liu et al., “CurvDrop: A RICCI curvature based approach to prevent
graph neural networks from over-smoothing and over-squashing,” in Proc.
ACM Web Conf., 2023, pp. 221–230.
[76] L. V. d. Maaten and G. Hinton, “Visualizing data using T-SNE,” J. Mach.
Learn. Res., vol. 90, no. 11, pp. 2579–2605, 2008.

Authorized licensed use limited to: JADAVPUR UNIVERSITY. Downloaded on January 11,2025 at 09:34:30 UTC from IEEE Xplore. Restrictions apply.

You might also like