Flexibly Utilizing Syntactic Knowledge in Aspect-Based Sentiment Analysis
Keywords: Aspect-based sentiment analysis; BERT; Syntax representation; Syntax-guided transformer

Aspect-based sentiment analysis (ABSA) refers to ascertaining the propensity of sentiment expressed in a text towards a particular aspect. While previous models have utilized dependency graphs and GNNs to facilitate information exchange, they face challenges such as smoothing of aspect representations and a gap between word-based dependency graphs and subword-based BERT. Taking into account the above deficiencies, we argue for a new approach called SRE-BERT that flexibly utilizes syntactic knowledge to enhance aspect representations by relying on syntax representations. First, we propose a syntax representation encoder to acquire the syntactic vector of each token. Then, we devise a syntax-guided transformer that employs syntax representations to compute multi-head attention, thereby enabling direct syntactic interaction between any two tokens. Finally, the token-level vectors derived from the syntax-guided transformer are employed to enhance the semantic representations obtained by BERT. In addition, we introduce a Masked POS Label Prediction (MPLP) method to pre-train the syntax encoder. Extensive experiments have been conducted on datasets covering three distinct domains, and the results indicate that our SRE-BERT outperforms the second-ranked model by 1.97%, 1.55%, and 1.20% on the Rest14, Lap14, and Twitter datasets, respectively.
1. Introduction
Aspect-based sentiment analysis (ABSA) focuses on determining the emotion conveyed in a text towards designated objects.
Consider this review ‘‘The screen on this phone is too dark, but the shape is exquisite’’, it is evident that the commentator is not
very satisfied with the ‘‘screen’’, but is satisfied with the ‘‘shape’’.
Most previous research (Bao, Lambert, & Badia, 2019; Fan, Feng, & Zhao, 2018a; Gu, Zhang, Hou, & Song, 2018; He, Lee, Ng, &
Dahlmeier, 2018; Li, Liu, & Zhou, 2018a; Ma, Li, Zhang, & Wang, 2017; Majumder et al., 2018; Ruder, Ghaffari, & Breslin, 2016; Tay,
Tuan, & Hui, 2018; Wang, Huang, Zhu, & Zhao, 2016a; Wang et al., 2018) has employed attention-based strategies in monitoring
contextual words associated with designated objects. These studies commonly rely only on the semantic expression of the words
without using syntactic information, which may hinder the precise allocation of attention. For example, in Fig. 1, the grammatical
structures of the two sentences on the left are completely the same, yet the sentiment polarities are completely opposite. This requires the model to give high attention to opinion words that occupy the same grammatical position in both sentences but carry completely different sentiments, which is difficult to achieve by relying on the semantic expression of words alone. Similarly, the sentiment polarities of the two sentences on the right in Fig. 1 are consistent, but the aspect terms come from different domains, which is also difficult for attention-based models to handle.

Fig. 1. Two examples of syntactic structures. The two sentences in (a) have exactly the same syntactic structure but exactly opposite sentiment polarities for the same aspect, and the two sentences from different domains in (b) have exactly the same syntactic structure. It can be observed that the syntactic relationships between words indicate which words exert a predominant influence on the sentiment towards the designated aspects.
Recent researchers (Chen, Teng, Wang, & Zhang, 2022; Huang & Carley, 2019; Jin, Zhao, Zhang, Liu, & Yu, 2023; Li et al.,
2021; Liang, Wei, Mao, Wang, & He, 2022; Luo, Wu, Zhou, Zhang, & Wang, 2020; Pang, Xue, Yan, Huang, & Feng, 2021; Song,
Park, & Shin, 2019; Sun, Zhang, Mensah, Mao, & Liu, 2019; Tang, Ji, Li, & Zhou, 2020; Wang, Shen, Yang, Quan, & Wang, 2020;
Zhang, Li, & Song, 2019; Zhao, Liu, Zhang, Guo, & Chen, 2021) worked on using grammar information within ABSA. These studies
typically adopt parsers for gaining the dependency graph, then use GNN to make the words close in distance on the dependency
graph exchange information with each other, and finally obtain the aspect-based expression.
However, GNN-based methods often have several problems that are difficult to handle:
• The dependency graph represents the grammatical relationships between words, so a GNN built on it can only produce word-level representations. Meanwhile, BERT (Devlin, Chang, Lee, & Toutanova, 2018) is based on sub-words. These GNN-based models usually take the mean vector of a word's related sub-words as the vector of the word, but this may weaken the features of the sub-words.
• The performance of a GNN depends heavily on its number of layers. Too few layers may fail to capture word features that are grammatically distant; too many layers result in over-smoothing of features and significant performance degradation (Li, Han, & Wu, 2018).
To tackle the above concerns, we present a syntax representation enhanced BERT (named SRE-BERT) to utilize grammar
knowledge more finely in ABSA. Specifically, we first assign part-of-speech (POS) labels to each BERT token based on the
correspondence between words and their BERT tokens, thus extending grammar information to the sub-word level. Then, we design
a syntax representation encoder that utilizes the relative position relationship of POS labels to obtain the syntax representation of
each token. We propose a Masked POS Label Prediction (MPLP) method to pre-train this syntax encoder. Meanwhile, we use BERT
to acquire the semantic representation of every token. Next, we feed both the syntax representation and the semantic representation
of each token into a syntax-guided transformer to produce the syntax-guided aspect-aware representation. Finally, we concatenate
the syntax-guided aspect-aware representation and the semantic-based aspect-aware representation to obtain the final aspect-based
representation. Fig. 2 illustrates the complete framework.
Briefly stated, our study outlines the subsequent advancements:
• We design a syntax representation encoder, and pre-train it using our proposed Masked POS Label Prediction (MPLP) task to
obtain token-level syntax information, which enables better collaboration with BERT.
• We design a syntax-guided transformer that utilizes the syntax representation of each token to compute multi-head attention, allowing any two tokens to establish a syntactic relationship, overcoming the layer-number limitation of GNNs, and guiding the model to obtain more accurate contextual attention.
• We perform comprehensive experiments on three distinct ABSA datasets, and the findings indicate that our SRE-BERT exhibits superior utilization of syntactic knowledge in ABSA.
The organization of this paper is as follows. Section 3 Research Objectives elucidates the research objectives of this article.
Section 4 Preliminaries explains the formal definition of the ABSA task and the computation process of multi-headed attention.
Section 5 Methodology presents the specific details of our SRE-BERT. Section 6 Experimentation describes the experimental setup.
Section 7 Results and Analysis discusses the experimental results. Section 8 Discussion discusses the theoretical and practical
implications of this paper. Section 9 Conclusion concludes the paper.
Fig. 2. The complete framework of the syntax representation enhanced BERT (SRE-BERT). It first obtains the syntax representations of BERT tokens from a pre-trained syntax representation encoder, and then feeds these syntax representations into a syntax-guided transformer to compute the multi-head attention of the tokens. Finally, the semantic representations obtained from the syntax-guided transformer and from BERT are used together for the ABSA task.
2. Related work
In recent times, the utilization of neural networks has gained traction in addressing the ABSA task, including attention-oriented approaches (Fan et al., 2018a; Ma et al., 2017; Meškelė & Frasincar, 2020; Tang, Qin, & Liu, 2016; Wang, Huang, Zhu, & Zhao, 2016b; Yang, Zhang, Jiang, & Li, 2019), GNN-based (Graph Neural Network) approaches (Chen et al., 2022; Li et al., 2021; Liang et al., 2022; Lu, Li, & Wei, 2022; Pang et al., 2021; Wang et al., 2020; Zhang et al., 2019), and PLM-based (Pretrained Language Model) approaches (Lengkeek, van der Knaap, & Frasincar, 2023; Zhang et al., 2022; Zhu, Kuang, & Zhang, 2023). Next, we will focus on introducing attention-oriented approaches and GNN-based approaches.
The connections between different words and aspect terms in the text are not balanced. Typically, only a subset of words exhibit
sentiment orientation towards aspect terms. Attention-oriented approaches attempt to capture this imbalance by applying different
levels of attention to different words, thereby better capturing sentiment information. The work by Wang et al. (2016b) focuses
on enhancing the attention structure of LSTM. By introducing an attention mechanism into LSTM, their model is able to pay more
attention to information related to sentiment words. Tay et al. (2018) introduce concepts of recursive convolution and recursive
correlation into a differentiable neural attention framework. This approach effectively analyzes the correlation between words
and targets, further improving the accuracy and effectiveness of sentiment analysis. He et al. (2018) propose leveraging syntactic
knowledge to better understand sentence structure and semantic relationships, thereby capturing the importance of sentiment
words and contextual information. Li, Liu, and Zhou (2018b) design a multi-level attention structure that incorporates positional
information to generate target-aware representations for each word. This approach fully considers the positional information of
words in the text, allowing the model to better distinguish sentiment words at different positions and improve the precision
of sentiment analysis. Fan, Feng, and Zhao (2018b) develop an attention network with different granularity levels to capture
the interaction between aspects and context. By utilizing multiple levels of attention mechanisms, the model can better capture
the relationship between aspects and context, thus enhancing the expressive power of sentiment analysis. Yang et al. (2019)
simultaneously focuses on goal-oriented attention and contextual attention to learn more effective contextual expressions. Meškelė
and Frasincar (2020) employs bidirectional contextual attention to calculate the influence of each word on the aspect-oriented sentiment.

Fig. 3. Comparison of the structure of a GNN and the structure of the syntax-guided transformer presented in this paper. The left side shows the GNN structure, which can be viewed as a transformer in which many elements of the attention matrix are set to zero because many words lie beyond the layer limit of the GNN. In contrast, the attention matrix of the syntax-guided transformer on the right is denser and richer, as each attention value is calculated from the syntactic vectors of the words.
The attention mechanism that only utilizes the semantic information of individual words fails to fully exploit the structural
information within sentences, making it difficult to handle long texts. When a sentence contains multiple opinion words with
different sentiments, the attention mechanism may capture incorrect opinion words for aspect terms, resulting in erroneous attention
distributions.
Text is a sequence composed of individual words arranged from left to right, but this order does not necessarily fully reflect the
relationships between words. For example, in this sequence, a given aspect term may be far away from its corresponding opinion
word. Therefore, some researchers have attempted to explore textual structural information beyond the sequence to strengthen the
connection between aspect terms and their opinion words. As a result, dependency graphs and Graph Neural Networks (GNNs) have
been naturally introduced into this line of research.
Among these approaches, the combination of GNN models and sentence dependency trees has performed well. Wang et al. (2020) employs dependency types as features rather than simple dependency connections. Tang et al. (2020) uses both the
GCN (Graph convolutional network) and the transformer to obtain contextual expression of words and fuses the two expressions to
obtain aspect-aware expressions. Li et al. (2021) employs two GCNs to extract syntax and semantic features and then fuses these
features for producing aspect-oriented perceptual representations. Pang et al. (2021) uses GCN to make the syntactic and semantic
features of sentences interact with each other to finally obtain aspect-aware expressions. Chen et al. (2022) attempts to derive
potential opinion tree models from the attention scores as an alternative to explicit dependency trees. Liang et al. (2022) introduces
a bilingual syntax-aware GNN that utilizes both constituency and dependency tree dual syntactic information. This model effectively
captures contextual perception by considering cross-aspect relationships. Lu et al. (2022) proposes a heterogeneous graph neural
network that simultaneously encodes multiple prior knowledge, such as dependency graphs and sentiment dictionaries. However,
these methods are limited by the structure of GNN, which makes it hard to model words with distant grammatical relations. In this
article, we present a technique to first obtain syntax representations and then directly model the grammatical relations between
arbitrary words by a syntax-guided transformer.
Although GNNs have reduced the distance between aspect terms and opinion words in some cases, enhancing their information
interaction, the structure of GNNs themselves also leads to certain issues. On one hand, having too many layers in a GNN can result
in feature smoothing, while on the other hand, having too few layers can lead to insufficient interaction between words.
From a structural perspective, GNNs can be seen as transformers that incorporate prior knowledge. However, this prior knowledge
can only reflect the syntactic relationships between words within a local scope. The objective of this paper is to use pre-trained
word syntax vectors to more accurately model the syntactic relationships between arbitrary words, in order to obtain more precise
aspect-level representations.
3. Research objectives
Most existing models that utilize grammatical knowledge in ABSA tasks combine dependency graphs with GNN for research.
However, they face two main issues: (1) the dependency graph of the text reflects the syntactic relationships between words, and
the GNN built on this basis can only obtain word-level representations. When these models need to be combined with BERT, there
is a gap between word-level representations and the representations generated by BERT’s tokens. (2) The performance of GNN is
highly influenced by its number of layers: too many layers result in smoothed representations, while too few layers limit information
transfer to a small scope.
To address these issues, our study focuses on the following research objectives:
1. Explore how to obtain token-level grammatical knowledge to eliminate the gap encountered when using BERT.
2. Investigate the use of syntax-guided transformers to establish syntactic connections between any two tokens, without being
limited by the number of GNN layers.
3. Evaluate the performance of our model on datasets from different domains.
GNN is essentially a type of transformer that incorporates prior knowledge. In GNN-based ABSA models, the prior knowledge
refers to the syntax relationships, specifically the dependency relationships between words.
As shown in Fig. 3, many values in the attention matrices of each layer in GNN are set to zero because there are no direct
dependency relationships between these words. However, this leads to the information flow of GNN being limited to nodes within
a certain number of layers, and no information can be obtained for nodes beyond that limit.
To address the limitation of GNN layers while preserving the prior syntax knowledge of transformers, we attempt to generate a
syntax representation vector for each word in order to establish direct syntactic connections between any two words.
To achieve this, we use a syntax representation encoder to obtain the syntax representation vectors for each word, and based on
them, compute a dense attention matrix to replace the sparse attention matrix in GNN.
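The contrast can be sketched in a few lines of PyTorch (toy tensors; this illustrates the idea rather than the authors' implementation): a GNN layer masks the attention scores with the dependency adjacency matrix, whereas the syntax-guided transformer keeps a dense score matrix.

```python
import torch
import torch.nn.functional as F

n, d = 6, 16                          # 6 words, 16-dim features (toy sizes)
h = torch.randn(n, d)                 # word features
adj = torch.eye(n)                    # dependency adjacency with self-loops
adj[0, 3] = adj[3, 0] = 1.0           # one example dependency edge

scores = h @ h.t() / d ** 0.5         # raw pairwise attention scores

# GNN-style: scores between words without a dependency edge are masked out,
# so information only flows along (short chains of) dependency edges.
sparse_attn = F.softmax(scores.masked_fill(adj == 0, float("-inf")), dim=-1)

# Syntax-guided transformer style: every word pair keeps a (syntax-derived)
# attention weight, so distant words can interact within a single layer.
dense_attn = F.softmax(scores, dim=-1)
```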
If we consider GNN as a transformer that incorporates syntactic prior knowledge, it is evident that the syntactic prior knowledge
is sparse and not sufficiently accurate due to the limitation of GNN layers. Therefore, this paper aims to explore a transformer that
injects dense and rich syntactic knowledge beyond GNN. Our approach is capable of overcoming the limitation of GNN layers and
the loss of information in the dependency graph.
As shown in Fig. 3, we replace GNN with a syntax-guided transformer. Specifically, the differences between them are as follows:
• The attention matrix in GNN is relatively sparse with many zeros, resulting in the loss of a significant amount of syntactic
information and only allowing for the acquisition of local word information. In contrast, the attention matrix in the
syntax-guided transformer is denser and contains all token’s syntactic information.
• The performance of GNN is limited by its number of layers, while the syntax-guided transformer is almost unaffected by the number of layers. The specific results can be found in Section 7.4 (Effect of the structure of the syntax-guided transformer).
• The syntactic knowledge in GNN is obtained from the dependency graph, reflecting the syntactic relationships between words,
but may not necessarily adapt to the ABSA task. In contrast, the syntactic knowledge in the syntax-guided transformer is
obtained through pre-training and participation in the ABSA task, making it more closely aligned with the requirements of the
ABSA task.
4. Preliminaries
The multi-headed attention mechanism plays a crucial role in the Transformer model, enabling direct interaction of feature
information between elements at any position in the input sequence.
Provided an input sequence $s = \{t_1, t_2, \dots, t_n\}$, we first acquire the Query $Q_k \in \mathbb{R}^{n_{heads} \times 1 \times d^q}$, Key $K_k \in \mathbb{R}^{n_{heads} \times n \times d^k}$, and Value $V_k \in \mathbb{R}^{n_{heads} \times n \times d^v}$ corresponding to $t_k$. Here, $n_{heads}$ stands for the number of attention heads, while $d^q$ and $d^k$ are equal.

Next, we calculate the output $z_k$ of the multi-headed attention layer corresponding to $t_k$ using the following method:

$$z_k^i = \mathrm{softmax}\left(\frac{Q_k^i (K_k^i)^T}{\sqrt{d^k}}\right) V_k^i, \quad i = 1, \dots, n_{heads} \tag{1}$$
$$z_k = \mathrm{Linear}\left(\mathrm{Concat}\left(z_k^1, z_k^2, \dots, z_k^{n_{heads}}\right)\right) \tag{2}$$

For the sake of convenience, we summarize the above process with the following equation:

$$z_k = \mathrm{MultiHeadAtten}\left(Q_k, K_k, V_k\right) \tag{3}$$
Through the above process, we have demonstrated the computation process of the multi-headed attention layer from the
perspective of an element in the input sequence.
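To make Eqs. (1)–(3) concrete, a minimal per-element computation might look as follows (toy dimensions; the projection layer stands in for the Linear in Eq. (2)):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, n_heads, d_q = 8, 4, 16             # sequence length, heads, per-head dim
d_v = d_q

Q_k = torch.randn(n_heads, 1, d_q)     # Query of a single element t_k
K_k = torch.randn(n_heads, n, d_q)     # Keys over the whole sequence
V_k = torch.randn(n_heads, n, d_v)     # Values over the whole sequence

# Eq. (1): scaled dot-product attention, computed independently for each head.
attn = F.softmax(Q_k @ K_k.transpose(-2, -1) / d_q ** 0.5, dim=-1)   # (heads, 1, n)
z_heads = attn @ V_k                                                  # (heads, 1, d_v)

# Eq. (2): concatenate the heads and project with a linear layer.
linear = nn.Linear(n_heads * d_v, n_heads * d_v)
z_k = linear(z_heads.transpose(0, 1).reshape(1, -1))                  # (1, heads * d_v)
```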
5. Methodology
The overall architecture of our SRE-BERT, as depicted in Fig. 2, comprises four distinct constituents: a syntax representation encoder, a semantic representation encoder, a syntax-guided transformer, and the final aspect-based representation.
Specifically, we first use the syntax representation encoder to obtain the syntax representation of every token in the input text, and the semantic representation encoder to obtain the semantic representation of every token. Then, in the syntax-guided transformer, we use the syntax representation of every token as the Query and Key, and the semantic representation as the Value, to calculate the syntax-guided semantic representation of each token. Finally, we concatenate the syntax-guided semantic representation corresponding to the aspect term with the semantic representation to acquire the aspect-based representation.
Our syntax representation encoder consists of one basic syntax-aware layer and three layers of high-level syntax-aware
layers, aiming to generate the syntax representation vectors for the input text. Both types of syntax-aware layers employ a
transformer encoder structure. It is worth noting that the basic syntax-aware layer utilizes multi-headed attention, while the
high-level syntax-aware layers adopt multi-headed self-attention.
Specifically, we first tokenize the input text into a BERT token sequence and attempt to obtain the POS label for each BERT token.
Then, the basic syntax-aware layer utilizes the POS label of each token and the relative distance between tokens to derive the syntax information of each token. Finally, the high-level syntax-aware layers allow the syntax information of tokens to interact further, resulting in the syntax representation vector of each token.
To pre-train the syntax representation encoder, we propose the masked POS label prediction (MPLP) task. The relevant details are discussed at the end of this section (Section 5.1.3).
Extended POS Labels. Given the input text $T$, we first use a parser¹ to obtain the word sequence $T^s = \{t^s_1, t^s_2, \dots, t^s_n\}$ and the corresponding basic POS sequence $P^s = \{p^s_1, p^s_2, \dots, p^s_n\}$, where $p^s_i$ denotes the basic POS of $t^s_i$. Specifically, the basic POS here refers to tags such as ''ADJ'', ''PRON'' and so on. Then, a sequence of BERT tokens $T^b = \{t^b_1, t^b_2, \dots, t^b_m\}$ is obtained using the tokenizer of BERT. Note that since the BERT tokenizer may divide a word into several sub-words, its segmentation differs from that of the parser, so $m$ and $n$ may not be equal.
We then assign each BERT token an extended POS label. Specifically, we establish a mapping between all elements in $T^s$ and all elements in $T^b$ based on the position of each BERT token and of each basic token in the original text $T$. In this mapping, each BERT token corresponds to only one basic token, but a basic token may contain multiple BERT tokens. For a BERT token $t^b_i$, let its corresponding basic token $t^s_j$ have basic POS $p^s_j$; if $t^b_i$ is the first of the BERT tokens contained in $t^s_j$, then the POS label of $t^b_i$ is marked as ''$p^s_j$_B'', otherwise as ''$p^s_j$_I''. For example, if $p^s_j$ is ''ADJ'', then the POS label of $t^b_i$ is marked as ''ADJ_B'' or ''ADJ_I''.
In this way, we obtain the POS label of each BERT token, and we use $P^b = \{p^b_1, p^b_2, \dots, p^b_m\}$ to denote the sequence of POS labels of $T^b$.
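A minimal sketch of this extension step, using spaCy for the basic POS tags and the HuggingFace BERT tokenizer for the sub-word alignment (the concrete tooling here is an assumption; the paper only specifies a parser and the BERT tokenizer):

```python
import spacy
from transformers import BertTokenizerFast

nlp = spacy.load("en_core_web_sm")                 # any POS tagger would do
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")

doc = nlp("The screen on this phone is too dark")
words = [w.text for w in doc]
basic_pos = [w.pos_ for w in doc]                  # basic POS per word, e.g. "ADJ"

enc = tok(words, is_split_into_words=True, add_special_tokens=False)
ext_pos, prev = [], None
for wid in enc.word_ids():                         # word index of each sub-word token
    suffix = "_B" if wid != prev else "_I"         # first sub-word of a word gets _B
    ext_pos.append(basic_pos[wid] + suffix)
    prev = wid

print(list(zip(tok.convert_ids_to_tokens(enc["input_ids"]), ext_pos)))
```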
POS Embedding Matrix. We construct a POS embedding matrix $E^p \in \mathbb{R}^{n^p \times d^p}$, where $n^p$ denotes the total number of extended POS label types and $d^p$ denotes the dimension of the extended POS label embeddings. By looking up $E^p$, we obtain the POS embedding matrix $E^{pb} \in \mathbb{R}^{m \times d^p}$ corresponding to $P^b$.

Relative Position Embedding Matrix. In a text, the influence of one word on another is strongly related to the relative position of the two words. We construct a relative position matrix $E^{rp} \in \mathbb{R}^{n^{rp} \times d^p}$, where $n^{rp}$ is the number of relative positions. For convenience of calculation, this dimension is consistent with the dimension of the POS embedding.
1 Available at https://spacy.io/.
Fig. 4. The procedure for computing the multi-head attention of the 𝑘th POS label in the input POS label sequence.
Basic Syntax Representation. By feeding the relative position embeddings and POS embeddings of the tokens into the basic syntax-aware layer, we obtain the basic syntax representations of the tokens.

We denote the embedding of relative position $i$ by $e^{rp}_i \in \mathbb{R}^{1 \times d^p}$, and the POS embedding corresponding to $p^b_k$ by $e^{pb}_k \in \mathbb{R}^{1 \times d^p}$. Then for $p^b_k$, its corresponding Query $Q_k$ is calculated as:

$$Q_k = \left(e^{pb}_k \oplus e^{rp}_0\right)\left[W^Q_1, \dots, W^Q_{n_{heads}}\right] \tag{4}$$

where $\oplus$ denotes element-wise summation, $W^Q_i \in \mathbb{R}^{d^p \times d^q}$ is a trainable parameter matrix, $n_{heads}$ refers to the number of attention heads, and $Q_k \in \mathbb{R}^{n_{heads} \times 1 \times d^q}$ is the Query corresponding to $p^b_k$.

For $p^b_k$, the influence of the other POS labels in $P^b$ depends on the type of those POS labels and on their position relative to $p^b_k$. Thus, the Key of $p^b_k$ is obtained as:

$$K_k = \left(E^{pb} \oplus E^{rp}_k\right)\left[W^K_1, \dots, W^K_{n_{heads}}\right] \tag{5}$$

where $E^{rp}_k = \{e^{rp}_{1-k}, e^{rp}_{2-k}, \dots, e^{rp}_0, \dots, e^{rp}_{m-k}\}$ denotes the relative position embeddings of all POS labels in $P^b$ with respect to $p^b_k$, $W^K_i \in \mathbb{R}^{d^p \times d^q}$ is a weight matrix, and $K_k \in \mathbb{R}^{n_{heads} \times m \times d^q}$ is the Key corresponding to $p^b_k$.

Next, the Value corresponding to $p^b_k$ is obtained according to the following equation:

$$V_k = \left(E^{pb} \oplus E^{rp}_k\right)\left[W^V_1, \dots, W^V_{n_{heads}}\right] \tag{6}$$

where $W^V_i \in \mathbb{R}^{d^p \times d^v}$ is a trainable parameter matrix with $d^v = \frac{d^p}{n_{heads}}$, and $V_k \in \mathbb{R}^{n_{heads} \times m \times d^v}$ is the Value corresponding to $p^b_k$.

After calculating $Q_k$, $K_k$ and $V_k$ (see Fig. 4), we can calculate the output $z_k$ of the multi-head attention layer. Then, by feeding $z_k$ into the FFN, the basic syntax representation $h_k$ corresponding to $p^b_k$ is obtained. We use $H = \{h_1, h_2, \dots, h_m\}$ to denote the basic syntax representations of the text $T$.
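A sketch of how the Query of Eq. (4) and the Key of Eq. (5) could be assembled from POS and relative-position embeddings (dimensions follow Section 6.2; the offset-indexing scheme and module names are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

d_p, n_heads = 300, 6                 # embedding dim and heads (Section 6.2)
d_q = d_p // n_heads
n_pos_types, n_rel = 38, 1024         # POS label types; relative-position range (assumed)
zero = n_rel // 2                     # index used for relative offset 0 (assumed scheme)

pos_emb = nn.Embedding(n_pos_types, d_p)           # E^p
rel_emb = nn.Embedding(n_rel, d_p)                 # E^rp, offsets shifted by `zero`
W_Q = nn.Linear(d_p, n_heads * d_q, bias=False)
W_K = nn.Linear(d_p, n_heads * d_q, bias=False)

m = 12                                             # length of the POS label sequence
p_b = torch.randint(0, n_pos_types, (m,))          # extended POS label ids
k = 3                                              # the k-th label

# Eq. (4): element-wise sum of the POS embedding of p^b_k and the embedding
# of relative offset 0, projected into one Query per attention head.
q_in = pos_emb(p_b[k]) + rel_emb(torch.tensor(zero))
Q_k = W_Q(q_in).view(n_heads, 1, d_q)

# Eq. (5): Keys use the POS embeddings of *all* labels, each summed with the
# embedding of its offset relative to p^b_k (Eq. (6) for the Value is analogous).
offsets = torch.arange(m) - k + zero
K_k = W_K(pos_emb(p_b) + rel_emb(offsets)).view(m, n_heads, d_q).permute(1, 0, 2)
```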
5.1.3. Pre-training
We design a masked POS label prediction (MPLP) approach to pre-train our syntax representation encoder (see Fig. 5).
Specifically, we first process the input POS label sequence 𝑃 𝑏 by adopting the following strategy:
1. For any POS label in 𝑃 𝑏 , there is a 15% probability that it will be marked as ‘‘masked POS’’.
2. Any POS label marked as ‘‘masked POS’’ is replaced by ‘‘[MASK]’’ at an 80% likelihood, by another POS label at a 10%
likelihood, and kept unchanged at a 10% likelihood.
By feeding the masked POS label sequence into the syntax representation encoder, we obtain the output $O^{syn\_mask} = \{O^{syn\_mask}_1, \dots, O^{syn\_mask}_m\}$, and then calculate the prediction probability distribution of each POS label using the following equation:

$$p^{syn\_mask}_k = \mathrm{softmax}\left(\mathrm{Linear}\left(O^{syn\_mask}_k\right)\right), \quad k = 1, 2, \dots, m \tag{7}$$

Consequently, we compute the loss:

$$\mathcal{L}^{syn\_mask} = \frac{1}{m}\sum_{k=1}^{m} -\log\left(y^{syn\_mask}_k \cdot \left(p^{syn\_mask}_k\right)^T\right) \tag{8}$$

where $y^{syn\_mask}_k$ is the one-hot vector of $p^b_k$.
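A minimal sketch of the MPLP corruption strategy and the objective of Eqs. (7)–(8) (the encoder output is replaced by random logits here, since only the masking and loss logic is being illustrated; the mask label id is an assumption):

```python
import torch
import torch.nn.functional as F

n_pos_types, mask_id = 38, 0                 # assume id 0 denotes the "[MASK]" label

def mask_pos_sequence(p_b, mask_prob=0.15):
    """Corrupt a POS label id sequence with the 15% / 80-10-10 scheme above."""
    p_masked = p_b.clone()
    chosen = torch.rand_like(p_b, dtype=torch.float) < mask_prob
    roll = torch.rand_like(p_b, dtype=torch.float)
    p_masked[chosen & (roll < 0.8)] = mask_id                  # 80%: "[MASK]"
    rand_ids = torch.randint(1, n_pos_types, p_b.shape)
    swap = chosen & (roll >= 0.8) & (roll < 0.9)
    p_masked[swap] = rand_ids[swap]                            # 10%: random POS label
    return p_masked, chosen                                    # remaining 10%: unchanged

p_b = torch.randint(1, n_pos_types, (16,))       # a toy POS label id sequence
p_masked, chosen = mask_pos_sequence(p_b)

logits = torch.randn(16, n_pos_types)            # stand-in for Linear(O^{syn_mask}_k)
loss = F.cross_entropy(logits, p_b)              # Eqs. (7)-(8): softmax + NLL over all m labels
```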
We use BERT as the semantic representation encoder to calculate the semantic representation of the text $T$:

$$O^{sem} = \mathrm{BERT}\left(T^b\right) \tag{9}$$

As mentioned before, $T^b$ is the BERT token sequence of the text $T$, and $O^{sem} = \{O^{sem}_1, \dots, O^{sem}_m\}$ denotes the semantic representation of the text $T$.
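For Eq. (9), the token-level semantic representations can be produced by any standard BERT implementation; for instance, with HuggingFace `transformers` (shown purely as an illustration of the interface, not the authors' code):

```python
from transformers import BertModel, BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

enc = tok("The screen on this phone is too dark", return_tensors="pt")
O_sem = bert(**enc).last_hidden_state      # (1, m, 768): one vector per BERT token
```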
Human understanding of each word in a text is highly influenced by the syntactic relationships among words. For instance,
when a noun follows an adjective, that adjective is likely to contain an emotional tendency toward the noun. This suggests that
the distribution of human attention in comprehending words should be strongly correlated with the syntactic relationships between
words.
Therefore, we employ syntax representation to compute the attention distribution among words and semantic representation
to convey the semantic features of the words. Our syntax-guided transformer adopts the transformer encoder architecture, but in
computing multi-head attention, the source of 𝑄𝑢𝑒𝑟𝑦 and 𝐾𝑒𝑦 is the syntax representation 𝑂𝑠𝑦𝑛 , and the source of 𝑉 𝑎𝑙𝑢𝑒 is the
semantic representation 𝑂𝑠𝑒𝑚 .
Specifically, in each layer of the syntax-guided transformer, the multi-headed attention is computed as follows:

$$Q^l_k = O^{syn}_k \left[W^{Q^l}_1, \dots, W^{Q^l}_{n_{heads}}\right] \tag{10}$$

$$K^l_k = O^{syn}_k \left[W^{K^l}_1, \dots, W^{K^l}_{n_{heads}}\right] \tag{11}$$

$$V^l_k = O^{l-1}_k \left[W^{V^l}_1, \dots, W^{V^l}_{n_{heads}}\right] \tag{12}$$

$$z^l_k = \mathrm{MultiHeadAtten}\left(Q^l_k, K^l_k, V^l_k\right), \quad k = 1, \dots, m \tag{13}$$

where $Q^l_k$, $K^l_k$ and $V^l_k$ denote the Query, Key and Value of the $k$th token at the $l$th layer of the syntax-guided transformer, respectively, $O^{l-1}_k$ is the output corresponding to the $k$th token at the $(l-1)$th layer, and $O^0_k$ is $O^{sem}_k$. It can be seen that the attention distribution of every layer relies on the syntax-aware representation $O^{syn}$.

In brief, by feeding $O^{syn}$ and $O^{sem}$ into the syntax-guided transformer, we obtain the syntax-guided semantic representation $R^{syn\_sem} = \{R^{syn\_sem}_1, \dots, R^{syn\_sem}_m\}$.
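One layer of this computation can be sketched as follows: the syntax representations supply Query and Key, while the previous layer's (initially semantic) representations supply Value. The single-matrix projections and shapes below are simplifying assumptions, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

m, d, n_heads = 10, 768, 12
d_h = d // n_heads

O_syn = torch.randn(1, m, d)      # syntax representations (fixed across layers)
O_prev = torch.randn(1, m, d)     # previous layer's output; layer 1 uses O_sem

W_Q, W_K, W_V = (nn.Linear(d, d, bias=False) for _ in range(3))

def split_heads(x):               # (1, m, d) -> (1, n_heads, m, d_h)
    return x.view(1, m, n_heads, d_h).transpose(1, 2)

Q = split_heads(W_Q(O_syn))       # Eq. (10): Query from the syntax representation
K = split_heads(W_K(O_syn))       # Eq. (11): Key from the syntax representation
V = split_heads(W_V(O_prev))      # Eq. (12): Value from the semantic representation

attn = F.softmax(Q @ K.transpose(-2, -1) / d_h ** 0.5, dim=-1)   # syntax-derived attention
Z = (attn @ V).transpose(1, 2).reshape(1, m, d)                   # Eq. (13), heads merged
```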
We denote by $I^a$ the indices of all tokens belonging to aspect $a$ in $T^b$. The syntax-guided aspect-aware representation is obtained by average pooling:

$$R^{syn\_sem}_a = \mathrm{AveragePooling}_{i \in I^a}\left(R^{syn\_sem}_i\right) \tag{14}$$

Since the perception based on syntactic relationships does not completely reflect the meaning of the text, we simultaneously acquire the semantic-based aspect-aware representation:

$$R^{sem}_a = \mathrm{AveragePooling}_{i \in I^a}\left(O^{sem}_i\right) \tag{15}$$

The two aspect-aware representations are then concatenated into the final aspect-based representation $R_a$, which is fed into a linear layer followed by a softmax to produce the sentiment probability distribution $p$, and the model is trained with the cross-entropy loss:

$$\mathcal{L} = -\log\left(y \cdot p^T\right) \tag{18}$$

where $y$ is the one-hot vector of the gold sentiment label.
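Putting the final steps together, a hedged sketch of the aspect pooling of Eqs. (14)–(15), the concatenation into $R_a$, and the cross-entropy objective of Eq. (18) (the three-way linear classifier is an assumed detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

m, d = 10, 768
R_syn_sem = torch.randn(m, d)            # syntax-guided semantic representations
O_sem = torch.randn(m, d)                # BERT semantic representations
aspect_idx = torch.tensor([4, 5])        # I^a: token indices of the aspect term

R_a_syn = R_syn_sem[aspect_idx].mean(dim=0)    # Eq. (14): average pooling
R_a_sem = O_sem[aspect_idx].mean(dim=0)        # Eq. (15): average pooling

R_a = torch.cat([R_a_syn, R_a_sem], dim=-1)    # final aspect-based representation

classifier = nn.Linear(2 * d, 3)               # 3 polarities: positive / negative / neutral
logits = classifier(R_a)
p = F.softmax(logits, dim=-1)                  # sentiment probability distribution

gold = torch.tensor([1])                       # index of the gold polarity (toy value)
loss = F.cross_entropy(logits.unsqueeze(0), gold)   # Eq. (18): cross-entropy
```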
Table 1
Details of the ABSA datasets.
Dataset Positive Negative Neutral
Rest14(Train) 2164 807 637
Rest14(Test) 728 196 196
Lap14(Train) 994 870 464
Lap14(Test) 337 127 168
Twitter(Train) 1561 1560 3127
Twitter(Test) 173 173 346
Algorithm 1 SRE-BERT
Input: text 𝑇 , aspect term 𝑎
1: for 𝑒𝑝𝑜𝑐ℎ = 1 to 𝑁 do
2: produce BERT tokens 𝑇 𝑏 and POS labels 𝑃 𝑏 for 𝑇
3: obtain syntax representation 𝑂𝑠𝑦𝑛 by sending 𝑃 𝑏 into the syntax representation encoder
4: obtain semantic representation 𝑂𝑠𝑒𝑚 by sending 𝑇 𝑏 into the semantic representation encoder
5: obtain syntax-guided semantic representation 𝑅𝑠𝑦𝑛_𝑠𝑒𝑚 by sending 𝑂𝑠𝑦𝑛 and 𝑂𝑠𝑒𝑚 into the syntax-guided transformer
6: obtain syntax-guided aspect-aware representation $R^{syn\_sem}_a$ and semantic-based aspect-aware representation $R^{sem}_a$ through average pooling
7: obtain aspect-based representation $R_a$ by concatenating $R^{syn\_sem}_a$ with $R^{sem}_a$
8: calculate the loss using Eq. (17) and Eq. (18)
9: update parameters
10: end for
6. Experimentation
6.1. Datasets
MPLP Datasets. To pre-train the syntax representation encoder, we extracted 110,000 reviews from Yelp and Amazon
reviews, and then converted these reviews into 110,000 POS label sequences according to the method described in Section 5.1.1.
Of these 110,000 POS label sequences, 100,000 are used as the training set and the remaining 10,000 as the test set. The entire training set
consists of 3,481,849 POS labels, and the entire test set consists of 353,028 POS labels. These comments involve public opinions on
various things such as restaurants, supermarkets, computers, clothes, and so on. The writing styles of these comments vary greatly
and can be seen as encompassing a wide range of grammatical structures.
ABSA Datasets. To observe the performance of SRE-BERT in aspect sentiment classification, we experimented with it using
datasets from three different domains, Rest14, Lap14 (Pontiki et al., 2014) and Twitter (Dong et al., 2014). These three datasets
are widely used in ABSA tasks. Rest14 contains customer comments of various aspects of restaurants, Lap14 contains consumer
comments of various aspects of electronic products, and Twitter contains comments from netizens on various subjects on Twitter.
Compared to product or business reviews, the language used in Twitter posts is usually less formal, with more casual word choices, and often includes various abbreviations or omitted words. As a result, understanding the sentiment orientation of Twitter posts
is more challenging. The label for each sample is either ‘‘positive’’, ‘‘negative’’, or ‘‘neutral’’, as shown in Table 1, which illustrates
the sample distribution for all datasets.
6.2. Settings
In our experiments, the syntax representation encoder contains one basic syntax-aware layer and three high-level syntax-aware layers, and every layer has 6 attention heads. The dimensions of the POS label embedding and the relative position embedding are both 300, and there are 38 types of POS labels in total.
During the MPLP phase, each batch contains 100 instances, each POS label has a 15% chance of being masked, the learning rate is 0.0005, and the dropout rate is 0.1.
The syntax-guided transformer has three layers in total, with 12 attention heads per layer.
When training SRE-BERT, each batch contains 32 instances, the learning rate is 0.00005, and the dropout rate is 0.1.
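The settings listed above can be collected into a single configuration sketch (all values are taken from this section; anything the paper does not state is left out):

```python
config = {
    # syntax representation encoder
    "basic_syntax_layers": 1,
    "high_level_syntax_layers": 3,
    "syntax_attention_heads": 6,
    "pos_embedding_dim": 300,
    "relative_position_dim": 300,
    "num_pos_label_types": 38,
    # MPLP pre-training
    "mplp_batch_size": 100,
    "mplp_mask_prob": 0.15,
    "mplp_learning_rate": 5e-4,
    "mplp_dropout": 0.1,
    # syntax-guided transformer
    "sgt_layers": 3,
    "sgt_attention_heads": 12,
    # SRE-BERT training
    "batch_size": 32,
    "learning_rate": 5e-5,
    "dropout": 0.1,
}
```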
We primarily compared our SRE-BERT with three types of methods: attention-based methods, GNN-based methods, and other
methods:
A. Attention-based methods:
Table 2
Overall performance of all models.
Model Rest14 Lap14 Twitter
Acc F1 Acc F1 Acc F1
TD-LSTM 75.37 64.51 68.25 65.96 – –
ATAE-LSTM 78.60 67.02 68.88 63.93 – –
MemNet 79.60 69.60 70.60 65.20 71.50 69.90
IAN 79.30 70.10 72.10 67.40 72.50 70.80
RAM 80.23 70.80 74.49 71.35 69.36 67.30
MGAN 81.25 71.94 75.39 72.47 72.54 70.81
GCAE 77.43 66.24 71.03 64.43 – –
TNet 80.79 70.84 76.01 71.47 73.00 71.40
ASGCN 80.77 72.02 75.55 71.05 71.10 69.50
R-GAT+BERT 86.51 80.98 78.32 74.21 76.01 74.49
DGEDT+BERT 86.16 80.26 79.74 75.59 77.89 75.46
DM-GCN+BERT 85.98 79.89 78.79 75.11 76.30 75.22
dotGCN 86.16 80.49 81.03 78.10 78.11 77.00
BiSyn-GAT+ 86.51 80.72 80.06 77.02 77.10 76.10
Our SRE-BERT 88.21 83.22 82.29 78.68 79.05 77.61
• TD-LSTM initially computes the features of the word sequence preceding and following the target. It subsequently combines
these two sets of features to make predictions (Tang, Qin, Feng, & Liu, 2016) .
• ATAE-LSTM integrates the embeddings of both the aspect term and each word, followed by employing an attention mechanism
to gather sentiment information that is dependent on the target (Wang et al., 2016b) .
• IAN suggests the utilization of mutual attention for obtaining a target description through an interactive approach (Ma et al.,
2017).
• RAM employs various attention mechanisms in pursuit of effectively synthesizing crucial features within complex sentence
structures (Chen, Sun, Bing, & Yang, 2017).
• MGAN employs a multi-grain attention strategy that incorporates both fine-grained and coarse-grained attention to capture
target-context interactions at the word level in the context of ABSA (Fan et al., 2018a).
B. GNN-based methods:
• ASGCN utilizes GCN to acquire insights into the interdependencies among words, and then obtains the perceptual expression of the aspect term (Zhang et al., 2019).
• R-GAT+BERT explores grammatical relationships between the aspect term and each word by employing dependency pattern
features instead of simple word dependency relationships (Wang et al., 2020).
• DGEDT+BERT combines GCN and transformer to derive sentence expressions and then fuses these two expressions to generate
aspect-aware expressions (Tang et al., 2020).
• DM-GCN+BERT enables the interaction between grammar and semantic expressions using GCN, ultimately obtaining aspect-
aware expressions (Pang et al., 2021) .
• dotGCN introduces a discrete latent tree in place of the dependency tree derived from the parser. It establishes connections between attention scores and syntactic distances with respect to the aspect in context, and induces trees from the attention scores (Chen et al., 2022).
• BiSyn-GAT+ proposes a bilingual syntax-aware GNN that leverages both constituency and dependency tree dual syntactic
information for modeling the contextual perception with the cross-aspect relationships (Liang et al., 2022).
C. Other methods:
• MemNet generates sentence representations that are closely tied to the target by utilizing jump attention mechanisms (Tang,
Qin, & Liu, 2016).
• GCAE employs CNNs enhanced with control signal techniques to tackle the ABSA challenge (Xue & Li, 2018).
• TNet presents a CNN-based solution for extracting features and maintaining the impact of prior RNN-based approaches by
exploiting contextual information and local semantic information (Li, Bing, Lam, & Shi, 2018).
It can be observed from Table 2 that SRE-BERT demonstrates superior performance compared to the baseline models across three distinct datasets, indicating the effectiveness of SRE-BERT. Furthermore, our findings reveal that models incorporating BERT consistently outperform those that do not, demonstrating the strong ability of large pre-trained language models in capturing semantic expressions. Among the models utilizing dependency graphs and GNNs, DGEDT+BERT exhibits superior performance in
Fig. 6. Effects of various mask probabilities on syntax representation encoder and SRE-BERT in MPLP.
comparison to R-GAT+BERT, as the former integrates both syntax and semantics information of text through the use of the
Transformer and GCN, whereas the latter solely employs GNN. The BiSyn-GAT+ model outperforms the R-GAT+BERT model in
modeling sentiment features solely based on grammar information. This is because the former utilizes both constituency trees and
dependency trees to enhance the interaction, while the latter only relies on dependency trees. This indicates that there is additional
grammar knowledge beyond dependency relationships that can be leveraged. On the other hand, the dotGCN model surpasses all
the GNN-based models mentioned above by replacing the dependency tree with a discrete latent tree derived from attention scores.
This demonstrates the potential of modeling text syntax structure using grammar knowledge beyond dependency relationships.
In the process of pre-training the syntax representation encoder with the MPLP dataset, masking part of the POS labels is designed to enhance the interaction between each POS label and its context. Different mask probabilities train different syntax representation encoders, which in turn impact the performance of SRE-BERT on ABSA.
We experimentally explored the effect of setting different mask probabilities on the syntax representation encoder. Furthermore, we explored the effect of syntax representation encoders pre-trained with different mask probabilities on SRE-BERT.
As Fig. 6 shows, when the mask probability is set to 0.05, the syntax representation encoder has the highest accuracy in predicting
the POS labels that are masked. After the mask probability exceeds 0.15, the prediction ability of the syntax representation encoder starts to decrease continuously, because masking too many POS labels seriously damages the syntactic structure of the text and prevents the encoder from modeling syntactic relations effectively.
However, in ABSA, the syntax representation encoder with the highest predictive ability does not necessarily help SRE-BERT to
obtain the best performance in ABSA. From Fig. 6, we can observe that the syntax representation encoder pre-trained with a mask
probability of 0.05 does not help SRE-BERT as much as the one with a mask probability of 0.15. This is because when the mask probability is too small, the resulting syntax representations are not robust enough.
A larger quantity of text is likely to contain more syntactic forms. We experimentally investigate the impact of different sizes of the MPLP dataset on the pre-trained syntax representation encoder. Moreover, we explore the influence of these syntax representation encoders on SRE-BERT.
As shown in Fig. 7, the larger the MPLP dataset is, the stronger the predictive ability of the pre-trained syntax representation
encoder is, and the greater the improvement in SRE-BERT. This is because when the MPLP dataset is small, it contains a limited
number of texts, and the syntax representation encoder cannot learn sufficient grammar knowledge. In contrast, when the MPLP
dataset is large, it includes texts with various types of grammatical structures, enabling the syntax representation encoder to receive
thorough training.
In several different datasets, the performance of SRE-BERT consistently improves as the MPLP dataset scales up. This indicates
the wide applicability of syntactic knowledge across different domains.
Fig. 7. Effects of various sizes of MPLP datasets on syntax representation encoder and SRE-BERT.
Fig. 8. Effects of layers and quantity of attention heads within the syntax-guided transformer on SRE-BERT.
The syntax-guided transformer utilizes syntactic representation to compute multi-headed attention for integrating semantic
information. Consequently, we explore the effects of the number of layers and attention heads in the syntax-guided transformer.
As depicted in Fig. 8, the optimal performance is achieved at 3 layers, after that, it hardly changes anymore. This indicates that
a 3-layer syntax-guided transformer has sufficiently utilized syntax representation to integrate semantic information.
Meanwhile, the greater the quantity of attention heads, the higher the performance of SRE-BERT, indicating that multiple
attention heads are more conducive to fusing semantic information utilizing syntax representation.
In the ablation study, we investigated separately the results of using the syntax-guided aspect-aware representation $R^{syn\_sem}_a$ alone and the results of using the semantic-based aspect-aware representation $R^{sem}_a$ alone.
In addition, to explore the differences between token-level syntax information and word-level syntax information, we also
attempted to use word-level POS labels as a replacement for token-level extended POS labels, denoted as ‘‘w/o EPL’’ (without
extended POS labels). It is worth noting that, to maintain consistency between the syntax representation encoder and the semantic
representation encoder, in ‘‘w/o EPL’’, we performed average pooling on token-level semantic vectors to acquire word-level semantic
vectors.
Table 3
Ablation study.
Model Rest14 Lap14 Twitter
Acc F1 Acc F1 Acc F1
SRE-BERT 88.21 83.22 82.29 78.68 79.05 77.61
$R^{syn\_sem}_a$ 84.20 75.36 79.94 76.54 76.30 75.00
$R^{sem}_a$ 85.80 78.95 78.53 74.07 75.87 74.53
w/o EPL 86.60 81.05 81.17 77.95 78.18 76.73
w/o MPLP 85.98 78.87 79.11 74.32 76.30 75.22
Finally, to investigate the impact of the MPLP task on the pre-trained syntax representation encoder, we attempted to remove
the MPLP task, referred to as ‘‘w/o MPLP’’ (without MPLP).
As shown in Table 3, $R^{syn\_sem}_a$ performs weaker than $R^{sem}_a$ on Rest14 and stronger than $R^{sem}_a$ on Lap14 and Twitter, while employing a combination of both works best. This indicates that combining syntactic-vector-enhanced semantic representations with the semantic representations gained directly from BERT can yield more comprehensive contextual information.
The performance of ‘‘w/o EPL’’ is inferior to SRE-BERT, indicating that token-level feature exchange is more precise and accurate
than word-level feature exchange. This is because BERT itself is pretrained based on sub-words, and pooling the average values
of token-level features into word-level features can result in semantic information decay. However, the effectiveness of ''w/o EPL'' surpasses $R^{sem}_a$, which might be due to the fact that even word-level grammatical knowledge can enhance the semantic expression of words.
The performance of ''w/o MPLP'' declines heavily compared to SRE-BERT, indicating that using the MPLP task to pre-train the syntax representation encoder enhances its ability to model syntactic structures.
We selected two cases to visually exemplify the feature fusion capability of GNN-based models and our SRE-BERT on samples whose aspect terms and opinion words are syntactically far apart. Figs. 11 and 12 respectively illustrate the dependency
relationships and shortest dependency paths between aspect terms and opinion words for the two cases. It can be observed that in
these cases, the shortest distance between aspect terms and opinion words exceeds 2, which is the number of layers typically set
for most GNN-based ABSA models.
Furthermore, to gain a clearer understanding of which tokens are critical in a sentence, the following formula is utilized to calculate the importance of each token to the aspect-based representation $R_a$:

$$I\left(t_i\right) = \frac{\left\|R_a - R_{a \setminus t_i}\right\|}{\sum_{i=1}^{m}\left\|R_a - R_{a \setminus t_i}\right\|} \tag{19}$$

where $m$ is the total number of tokens contained in the input text, $t_i$ represents the $i$th token, $R_a$ is the aspect-based representation, $R_{a \setminus t_i}$ represents the aspect-based representation obtained by replacing $t_i$ with ''[MASK]'', and $I\left(t_i\right)$ represents the importance of the $i$th token to $R_a$.
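In practice, Eq. (19) can be computed by re-encoding the text once per token with that token replaced by ''[MASK]''; `encode_aspect` below is a hypothetical wrapper around the full SRE-BERT forward pass:

```python
import torch

def token_importance(tokens, aspect_idx, encode_aspect, mask_token="[MASK]"):
    """Eq. (19): importance of each token to the aspect-based representation R_a."""
    R_a = encode_aspect(tokens, aspect_idx)                   # R_a for the full sentence
    dists = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]   # replace t_i with [MASK]
        R_a_wo_i = encode_aspect(masked, aspect_idx)
        dists.append(torch.norm(R_a - R_a_wo_i))              # ||R_a - R_{a\t_i}||
    dists = torch.stack(dists)
    return dists / dists.sum()                                # normalize over all m tokens
```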
The visualizations in Figs. 9 and 10 represent two cases, demonstrating the importance of each token to the representation 𝑅𝑎 . To
better illustrate the differences between our SRE-BERT and GNN-based approaches, we also compare the results with R-GAT+BERT
and DM-GCN+BERT.
Case 1 ‘‘After numerous attempts of trying (including setting the clock in BIOS setup directly), I gave up(I am a techie)’’.
In this text, the aspect term is ‘‘clock in BIOS setup’’ and the true sentiment polarity is ‘‘negative’’. Fig. 11 shows the dependency
graph of this sentence and the shortest dependency path between aspect terms and opinion words. It can be observed that the
syntactic distance between aspect terms and opinion words exceeds 5, indicating that it is challenging for GNN to facilitate feature
interaction between them.
From Fig. 9, it can be seen that our SRE-BERT captured the tokens ‘‘numerous’’ and ‘‘gave up’’ and made correct predictions.
On the other hand, R-GAT+BERT incorrectly captured the token ‘‘directly’’ and made an incorrect prediction. This may be because
the number of GNN layers in R-GAT+BERT is set to 2, but the aspect term is distant from the tokens with sentiment tendency in
the dependency graph, resulting in the inability to convey sentiment information to the aspect term.
Similarly, DM-GCN+BERT captured the token ''trying'' but made an incorrect prediction. Again, in order to achieve the best GNN performance, DM-GCN+BERT also sets the number of GNN layers to 2, which essentially limits the effectiveness of the GNN in the model.
Case 2 ‘‘The best thing about this laptop is the price along with some of the newer features’’.
In this text, the aspect term is ‘‘price’’ and the true sentiment polarity is ‘‘positive’’. Fig. 12 shows the dependency graph of this
sentence. It can be observed that the syntactic distance between ‘‘price’’ and ‘‘best’’ is 3, exceeding the optimal number of layers
for typical GNN-based models.
Fig. 10 demonstrates our SRE-BERT captured the token ‘‘best’’ and made correct predictions.
Unfortunately, R-GAT+BERT incorrectly captured the token ‘‘the’’ and made an incorrect prediction. This could still be due to
errors caused by the limited layers of GNN, as the tokens ‘‘best’’ and ‘‘price’’ are relatively far apart in the dependency graph.
Fig. 9. Visualization of the importance level of each token in case 1. The aspect term is ‘‘clock in BIOS setup’’ and the true emotional tendency is ‘‘negative’’.
R-GAT+BERT predicts ‘‘positive’’, DM-GCN+BERT predicts ‘‘positive’’, and our SRE-BERT predicts ‘‘negative’’.
Fig. 10. Visualization of the importance level of each token in case 2. The aspect term is ‘‘price’’ and true emotional tendency is ‘‘positive’’. R-GAT+BERT
predicts ‘‘neutral’’, DM-GCN+BERT predicts ‘‘positive’’, and our SRE-BERT predicts ‘‘positive’’.
DM-GCN+BERT captured the ‘‘best’’ and ‘‘laptop’’ and made the correct prediction. This could be due to the use of semantic
representation in addition to GNN in DM-GCN+BERT, which reduces the impact of limitations imposed by GNN compared to
R-GAT+BERT that only utilizes GNN.
8. Discussion
Grammar knowledge can be seen as the rules formulated by humans when creating language. With these rules, humans can
capture various information contained in texts more quickly. Against this background, many ABSA models attempt to use dependency
graphs combined with GNN to achieve more efficient communication between words. However, these models often face two issues:
there is a gap between word-based dependency graphs and token-based pre-trained language models like BERT, and the inherent
structural problem of GNN limits the communication between words within a certain number of layers. To be able to use grammar
knowledge more flexibly, we propose SRE-BERT, a method that directly generates token-level syntax vectors to model syntax
relationships instead of using dependency graphs and GNN. Specifically, we first extend the part-of-speech tags from word level to
token level, and then calculate the syntax vector of each token based on the relative position and part-of-speech tags between tokens.
In this process, an MPLP task is proposed for the pre-training of the syntax expression encoder. Then, we use token-level syntax
vectors to calculate multi-head attention to directly model the syntax relationships between any tokens. Finally, the syntax-guided
semantic representation obtained and the semantic representation obtained from BERT are combined for ABSA tasks.
Compared with previous studies that utilize syntactic knowledge to improve the performance of ABSA models (Chen et al.,
2022; Liang et al., 2022; Pang et al., 2021; Tang et al., 2020; Wang et al., 2020; Zhang et al., 2019), our study has the following
implications:
• Theoretically, our research explores the possibility of establishing syntactic connections between any two tokens directly in the
text. Previous studies have always relied on dependency graphs combined with GNN when using syntactic knowledge, which
undoubtedly restricts the scope of information transmission to a small range and hinders communication between distant
words. However, the syntactic structure of text is complex and diverse, which often leaves GNN-based ABSA models helpless
in dealing with situations where there is a considerable distance between aspects and other words. We directly use the syntactic
vectors of each token to construct their syntactic relationships, allowing even distant tokens to be considered in their syntactic
connections instead of abruptly severing them.
• By extending syntactic relationships to the token level, we facilitate the alignment between syntactic knowledge and pre-trained
language models. Previous research has only modeled word-level syntactic relationships using dependency graphs based on
word relationships, which undoubtedly creates a gap with token-based pre-trained language models like BERT. We extend
syntactic relationships to the token level, providing a new approach for the compatibility between syntactic knowledge and
BERT.
• The training data for aspect-based sentiment analysis (ABSA) tasks itself is quite limited. Our proposed MPLP task directly
uses a large amount of unlabeled text to pre-train syntactic vectors, which undoubtedly significantly reduces the cost of using
syntactic knowledge.
Fig. 11. Dependency graph and shortest dependency paths between aspect terms and opinion words for case 1.
Fig. 12. Dependency graph and shortest dependency paths between aspect terms and opinion words for case 2.
9. Conclusion
Existing GNN-based aspect sentiment analysis models are limited by the inherent structure of GNNs and are powerless in
cases where the grammatical distance between two words exceeds the number of GNN layers. To overcome this drawback, we
propose a syntax representation enhanced BERT (SRE-BERT), which uses an elaborate syntactic encoder to obtain token-level syntax
representation and subsequently feeds them into a syntax-guided transformer for computing multi-head attention, thereby allowing
any two BERT tokens to be directly connected syntactically. Finally, we fuse the semantic representation generated by syntax-guided
transformer with the semantic representation generated by BERT for the ABSA task. Three sets of domain-specific data were used
to test the efficacy of our approach, and the experimental results validate its effectiveness.
In general, this paper explores the possibility of using syntax-guided transformer with syntax vectors as a replacement for GNN
and dependency graphs, aiming to enrich the syntactic information representation instead of relying on simplistic syntax information.
Our future research will focus on the following two aspects:
• Firstly, we will further investigate alternative methods for obtaining syntactic representations. In this paper, we utilize part-of-
speech tags and relative distances as inputs and pre-train the word’s syntax representation through the MPLP task. However,
this may not be the optimal approach.
• Secondly, we will explore how to better integrate syntax representations and semantic representations to obtain more accurate
aspect-level representations. In this paper, we use syntax representations as Query and Key, semantic representations as Value,
and compute the syntax-guided semantic representation, which is then concatenated with the semantic representation obtained
from BERT. We believe that a more in-depth fusion of these two aspects may lead to improved results.
CRediT authorship contribution statement

Xiaosai Huang: Conceptualization, Methodology, Software, Writing – original draft. Jing Li: Supervision, Conceptualization, Funding acquisition, Resources. Jia Wu: Conceptualization, Resources, Supervision, Writing – review & editing. Jun Chang: Data curation, Software, Validation. Donghua Liu: Resources, Visualization. Kai Zhu: Writing – review & editing.
Data availability
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant No. 62372335 and the ARC Project
DP230100899.
References
Bao, L., Lambert, P., & Badia, T. (2019). Attention and lexicon regularized LSTM for aspect-based sentiment analysis. In Proceedings of the 57th annual
meeting of the association for computational linguistics: Student research workshop (pp. 253–259). Florence, Italy: Association for Computational Linguistics,
http://dx.doi.org/10.18653/v1/P19-2035, Retrieved from https://aclanthology.org/P19-2035.
Chen, P., Sun, Z., Bing, L., & Yang, W. (2017). Recurrent attention network on memory for aspect sentiment analysis. In Proceedings of the 2017 conference on
empirical methods in natural language processing (pp. 452–461). Copenhagen, Denmark: Association for Computational Linguistics, http://dx.doi.org/10.18653/
v1/D17-1047, Retrieved from https://aclanthology.org/D17-1047.
Chen, C., Teng, Z., Wang, Z., & Zhang, Y. (2022). Discrete opinion tree induction for aspect-based sentiment analysis. In Proceedings of the 60th annual
meeting of the association for computational linguistics (Volume 1: Long Papers) (pp. 2051–2064). Dublin, Ireland: Association for Computational Linguistics,
http://dx.doi.org/10.18653/v1/2022.acl-long.145, Retrieved from https://aclanthology.org/2022.acl-long.145.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805.
Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., & Xu, K. (2014). Adaptive recursive neural network for target-dependent Twitter sentiment classification. In
Proceedings of the 52nd annual meeting of the association for computational linguistics (Volume 2: Short Papers) (pp. 49–54). Baltimore, Maryland: Association
for Computational Linguistics, http://dx.doi.org/10.3115/v1/P14-2009, Retrieved from https://aclanthology.org/P14-2009.
Fan, F., Feng, Y., & Zhao, D. (2018a). Multi-grained attention network for aspect-level sentiment classification. In Proceedings of the 2018 conference on empirical
methods in natural language processing (pp. 3433–3442). Brussels, Belgium: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/D18-1380,
Retrieved from https://aclanthology.org/D18-1380.
Fan, F., Feng, Y., & Zhao, D. (2018b). Multi-grained attention network for aspect-level sentiment classification. In Proceedings of the 2018 conference on empirical
methods in natural language processing (pp. 3433–3442).
Gu, S., Zhang, L., Hou, Y., & Song, Y. (2018). A position-aware bidirectional attention network for aspect-level sentiment analysis. In Proceedings of the 27th
international conference on computational linguistics (pp. 774–784). Santa Fe, New Mexico, USA: Association for Computational Linguistics, Retrieved from
https://aclanthology.org/C18-1066.
He, R., Lee, W. S., Ng, H. T., & Dahlmeier, D. (2018). Effective attention modeling for aspect-level sentiment classification. In Proceedings of the 27th international
conference on computational linguistics (pp. 1121–1131). Santa Fe, New Mexico, USA: Association for Computational Linguistics.
Huang, B., & Carley, K. (2019). Syntax-aware aspect level sentiment classification with graph attention networks. In Proceedings of the 2019 conference on empirical
methods in natural language processing and the 9th international joint conference on natural language processing (pp. 5469–5477). Hong Kong, China: Association
for Computational Linguistics, http://dx.doi.org/10.18653/v1/D19-1549, Retrieved from https://aclanthology.org/D19-1549.
Jin, W., Zhao, B., Zhang, L., Liu, C., & Yu, H. (2023). Back to common sense: Oxford dictionary descriptive knowledge augmentation for aspect-based sentiment
analysis. Information Processing & Management, 60(3), Article 103260.
Lengkeek, M., van der Knaap, F., & Frasincar, F. (2023). Leveraging hierarchical language models for aspect-based sentiment analysis on financial data. Information
Processing & Management, 60(5), Article 103435.
Li, X., Bing, L., Lam, W., & Shi, B. (2018). Transformation networks for target-oriented sentiment classification. In Proceedings of the 56th annual meeting
of the association for computational linguistics (Volume 1: Long Papers) (pp. 946–956). Melbourne, Australia: Association for Computational Linguistics,
http://dx.doi.org/10.18653/v1/P18-1087, Retrieved from https://aclanthology.org/P18-1087.
Li, R., Chen, H., Feng, F., Ma, Z., Wang, X., & Hovy, E. (2021). Dual graph convolutional networks for aspect-based sentiment analysis. In Proceedings
of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume
1: Long Papers) (pp. 6319–6329). Online: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/2021.acl-long.494, Retrieved from
https://aclanthology.org/2021.acl-long.494.
Li, Q., Han, Z., & Wu, X.-M. (2018). Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-second AAAI conference on artificial
intelligence.
Li, L., Liu, Y., & Zhou, A. (2018a). Hierarchical attention based position-aware network for aspect-level sentiment analysis. In Proceedings of the 22nd conference on
computational natural language learning (pp. 181–189). Brussels, Belgium: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/K18-1018,
Retrieved from https://aclanthology.org/K18-1018.
Li, L., Liu, Y., & Zhou, A. (2018b). Hierarchical attention based position-aware network for aspect-level sentiment analysis. In Proceedings of the 22nd conference
on computational natural language learning (pp. 181–189).
Liang, S., Wei, W., Mao, X.-L., Wang, F., & He, Z. (2022). BiSyn-GAT+: Bi-syntax aware graph attention network for aspect-based sentiment analysis. In Findings of the association for computational linguistics: ACL 2022 (pp. 1835–1848). Dublin, Ireland: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/2022.findings-acl.144, Retrieved from https://aclanthology.org/2022.findings-acl.144.
Lu, G., Li, J., & Wei, J. (2022). Aspect sentiment analysis with heterogeneous graph neural networks. Information Processing & Management, 59(4), Article 102953.
Luo, X., Wu, J., Zhou, C., Zhang, X., & Wang, Y. (2020). Deep semantic network representation. In 2020 IEEE international conference on data mining (pp.
1154–1159). IEEE.
Ma, D., Li, S., Zhang, X., & Wang, H. (2017). Interactive attention networks for aspect-level sentiment classification. In Proceedings of the 26th international joint
conference on artificial intelligence (pp. 4068–4074).
Majumder, N., Poria, S., Gelbukh, A., Akhtar, M. S., Cambria, E., & Ekbal, A. (2018). IARM: Inter-aspect relation modeling with memory networks in aspect-based
sentiment analysis. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 3402–3411). Brussels, Belgium: Association
for Computational Linguistics, http://dx.doi.org/10.18653/v1/D18-1377, Retrieved from https://aclanthology.org/D18-1377.
Meškelė, D., & Frasincar, F. (2020). ALDONAr: A hybrid solution for sentence-level aspect-based sentiment analysis using a lexicalized domain ontology and a
regularized neural attention model. Information Processing & Management, 57(3), Article 102211.
Pang, S., Xue, Y., Yan, Z., Huang, W., & Feng, J. (2021). Dynamic and multi-channel graph convolutional networks for aspect-based sentiment analysis. In Findings of the association for computational linguistics: ACL-IJCNLP 2021 (pp. 2627–2636). Online: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/2021.findings-acl.232, Retrieved from https://aclanthology.org/2021.findings-acl.232.
Pontiki, M., Galanis, D., Pavlopoulos, J., Papageorgiou, H., Androutsopoulos, I., & Manandhar, S. (2014). SemEval-2014 task 4: Aspect based sentiment
analysis. In Proceedings of the 8th international workshop on semantic evaluation (pp. 27–35). Dublin, Ireland: Association for Computational Linguistics,
http://dx.doi.org/10.3115/v1/S14-2004, Retrieved from https://aclanthology.org/S14-2004.
Ruder, S., Ghaffari, P., & Breslin, J. G. (2016). A hierarchical model of reviews for aspect-based sentiment analysis. In Proceedings of the 2016 conference on empirical
methods in natural language processing (pp. 999–1005). Austin, Texas: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/D16-1103,
Retrieved from https://aclanthology.org/D16-1103.
Song, M., Park, H., & Shin, K.-s. (2019). Attention-based long short-term memory network using sentiment lexicon embedding for aspect-level sentiment analysis
in Korean. Information Processing & Management, 56(3), 637–653.
Sun, K., Zhang, R., Mensah, S., Mao, Y., & Liu, X. (2019). Aspect-level sentiment analysis via convolution over dependency tree. In Proceedings of the 2019
conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (pp. 5679–5688). Hong
Kong, China: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/D19-1569, Retrieved from https://aclanthology.org/D19-1569.
Tang, H., Ji, D., Li, C., & Zhou, Q. (2020). Dependency graph enhanced dual-transformer structure for aspect-based sentiment classification. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 6578–6588). Online: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/2020.acl-main.588, Retrieved from https://aclanthology.org/2020.acl-main.588.
Tang, D., Qin, B., Feng, X., & Liu, T. (2016). Effective LSTMs for target-dependent sentiment classification. In Proceedings of COLING 2016, the 26th
international conference on computational linguistics: Technical papers (pp. 3298–3307). Osaka, Japan: The COLING 2016 Organizing Committee, Retrieved
from https://aclanthology.org/C16-1311.
Tang, D., Qin, B., & Liu, T. (2016). Aspect level sentiment classification with deep memory network. In Proceedings of the 2016 conference on empirical methods
in natural language processing (pp. 214–224). Austin, Texas: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/D16-1021, Retrieved
from https://aclanthology.org/D16-1021.
Tay, Y., Tuan, L. A., & Hui, S. C. (2018). Learning to attend via word-aspect associative fusion for aspect-based sentiment analysis. In Proceedings of the AAAI
conference on artificial intelligence: vol. 32, (no. 1).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. In Advances in neural information processing
systems: vol. 30.
Wang, Y., Huang, M., Zhu, X., & Zhao, L. (2016a). Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical
methods in natural language processing (pp. 606–615). Austin, Texas: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/D16-1058,
Retrieved from https://aclanthology.org/D16-1058.
Wang, Y., Huang, M., Zhu, X., & Zhao, L. (2016b). Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical
methods in natural language processing (pp. 606–615).
Wang, J., Li, J., Li, S., Kang, Y., Zhang, M., Si, L., et al. (2018). Aspect sentiment classification with both word-level and clause-level attention networks. In
IJCAI, Vol. 2018 (pp. 4439–4445).
Wang, K., Shen, W., Yang, Y., Quan, X., & Wang, R. (2020). Relational graph attention network for aspect-based sentiment analysis. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 3229–3238). Online: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/2020.acl-main.295, Retrieved from https://aclanthology.org/2020.acl-main.295.
Xue, W., & Li, T. (2018). Aspect based sentiment analysis with gated convolutional networks. In Proceedings of the 56th annual meeting of the association for computational linguistics (Volume 1: Long Papers) (pp. 2514–2523). Melbourne, Australia: Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/P18-1234, Retrieved from https://aclanthology.org/P18-1234.
Yang, C., Zhang, H., Jiang, B., & Li, K. (2019). Aspect-based sentiment analysis with alternating coattention networks. Information Processing & Management,
56(3), 463–478.
Zhang, C., Li, Q., & Song, D. (2019). Aspect-based sentiment classification with aspect-specific graph convolutional networks. In Proceedings of the 2019 conference
on empirical methods in natural language processing and the 9th international joint conference on natural language processing (pp. 4568–4578). Hong Kong, China:
Association for Computational Linguistics, http://dx.doi.org/10.18653/v1/D19-1464, Retrieved from https://aclanthology.org/D19-1464.
Zhang, K., Zhang, K., Zhang, M., Zhao, H., Liu, Q., Wu, W., et al. (2022). Incorporating dynamic semantics into pre-trained language model for aspect-based
sentiment analysis. In Findings of the association for computational linguistics: ACL 2022 (pp. 3599–3610). Dublin, Ireland: Association for Computational
Linguistics, http://dx.doi.org/10.18653/v1/2022.findings-acl.285, Retrieved from https://aclanthology.org/2022.findings-acl.285.
Zhao, L., Liu, Y., Zhang, M., Guo, T., & Chen, L. (2021). Modeling label-wise syntax for fine-grained sentiment analysis of reviews via memory-based neural
model. Information Processing & Management, 58(5), Article 102641.
Zhu, X., Kuang, Z., & Zhang, L. (2023). A prompt model with combined semantic refinement for aspect sentiment analysis. Information Processing & Management,
60(5), Article 103462.