Let Your Graph Do the Talking: Encoding Structured Data for LLMs

Bryan Perozzi¹, Bahare Fatemi¹, Dustin Zelle¹, Anton Tsitsulin¹, Mehran Kazemi¹, Rami Al-Rfou², Jonathan Halcrow¹

¹Google Research  ²Waymo Research. Correspondence to: Bryan Perozzi <[email protected]>.

arXiv:2402.05862v1 [cs.LG] 8 Feb 2024

Abstract

How can we best encode structured data into sequential form for use in large language models (LLMs)? In this work, we introduce a parameter-efficient method to explicitly represent structured data for LLMs. Our method, GraphToken, learns an encoding function to extend prompts with explicit structured information. Unlike other work which focuses on limited domains (e.g., knowledge graph representation), our work is the first effort focused on the general encoding of structured data to be used for various reasoning tasks. We show that explicitly representing the graph structure allows significant improvements to graph reasoning tasks. Specifically, we see across-the-board improvements of up to 73 percentage points on node-, edge-, and graph-level tasks from the GraphQA benchmark.

[Figure 1: a frozen LLM answering "Is there a cycle in this graph?" from a graph G and question Q under two encodings: (a) a fixed text encoding Text(G); (b) the learned GraphToken(G) encoding.]

Figure 1. Graph encoding options for a frozen LLM. (a) Fixed encoding, e.g., (Fatemi et al., 2024; Wang et al., 2023b; Stechly et al., 2023). (b) This work proposes GraphToken, a learned graph prompt function to explicitly encode graphs in a parameter-efficient way.

1. Introduction

There has been an explosion of recent excitement around using LLMs (Vaswani et al., 2017; Devlin et al., 2018; Radford et al., 2018; Raffel et al., 2020; Brown et al., 2020; Touvron et al., 2023; Zhao et al., 2023) to represent, process, and analyze textual data. These models typically take sequential text as their input, but recent work has extended inputs to spatial and temporal modalities (e.g., images (Chen et al., 2022) and video (Arnab et al., 2021)).

Despite their success, current realizations of LLMs have noticeable problems, including a tendency to generate outputs which are untrue or unsupported by their prompt, commonly referred to as hallucinations (Wang et al., 2023a). Another intimately related issue is the problem of freshness, where the knowledge required to answer a query exists only after an LLM's training date (Vu et al., 2023). One mitigation for these problems is the enrichment of the prompt with additional factual and fresh data. As Kadavath et al. (2022) showed, when LLMs are supplied with new and supporting information, they are capable of adapting their parametric beliefs to effectively incorporate new evidence.

An automatic way to enrich the input context of an LLM with factual and fresh information is Retrieval Augmented Generation (RAG) (Khandelwal et al., 2019; Guu et al., 2020). RAG works by augmenting the prompt with additional relevant, factual, and fresh information, drawn from sources such as web searches or private databases. Often this information is in the form of structured data: data that has complex dependencies between different, discrete parts of the whole. For example, private relational databases, social networks, and molecules all have relational information between their discrete data items.

Structured data is ubiquitous in the real world, it surrounds our daily lives, and understanding how to represent this data optimally for inclusion in LLMs is a crucial research question. The predominant mode of encoding structured data for LLMs is to use various types of hand-crafted, text-based serialization (Guo et al., 2023; Wang et al., 2023b; Stechly et al., 2023); see Figure 1 (a). This approach can impose significant decoding complexity for the language model: from any text description, the model must first correctly decode and understand the structure before it can utilize the information. Recently, Fatemi et al. (2024) demonstrated that pure text representations of structured data are insufficient for graph reasoning with LLMs. They show that LLMs are not able to utilize structure efficiently when posed with common reasoning tasks that are easily answered by classical graph algorithms. This highlights the need to explore better and more expressive ways of representing structured data to an LLM.

In this paper, we propose GraphToken (Figure 1 (b)), a parameter-efficient method for representing structured data for LLMs. Pre-training LLMs on text corpora closely related to the desired reasoning task can enhance performance, but it can be computationally expensive, particularly for larger models. Additionally, fine-tuning requires domain-specific data and human expertise, further increasing costs. Inspired by recent advancements in parameter-efficient fine-tuning (Lester et al., 2021; Xu et al., 2023), our method, GraphToken, learns an encoding function that generates fine-tuned soft-token prompts. The soft-token prompt extends a textual prompt with explicit GraphToken-encoded structural information, allowing us to train only a trivial number of GraphToken parameters compared to the total LLM parameter budget.

Our work is the first to develop parameter-efficient encoders specifically for general reasoning tasks on structured data. We demonstrate that explicitly representing structure leads to significant improvement on the comprehensive GraphQA benchmark (Fatemi et al., 2024).

Our Contributions. We propose the following innovations:
• GraphToken, a novel parameter-efficient encoder for structured data inclusion in LLMs.
• Extensive experiments on various graph reasoning tasks showing that our method significantly improves LLM capabilities.
• Analysis demonstrating that the GraphToken encoder generalizes to both unseen tasks and graphs.

2. Background

We introduce the related work in LLMs, prompting methods, Graph Neural Networks (GNNs), graph encoders, and graph models combined with LLMs.

2.1. Large Language Models

Pre-Trained Large Language Models (LLMs): Language models (Rosenfeld, 2000; Zhao et al., 2023) are probabilistic models that assign probabilities to sequences of words by breaking the probability of a sequence into the product of the probabilities of the next tokens given the previous ones. While earlier models were mainly based on N-gram models (Jurafsky, 2021), newer models adopted neural approaches with the advent of distributed word representations (Bengio et al., 2000; Mikolov et al., 2013). The increased power offered by neural language models and the increase in model and dataset sizes have led to a new learning paradigm where large language models (LLMs) are pre-trained in an unsupervised way on massive amounts of textual data and are used as base (foundation) models (Devlin et al., 2018; Radford et al., 2019). For each downstream application, the base model is fine-tuned on small amounts of task-specific data to adapt it to the task.

Parameter-Efficient Fine-Tuning: With the rapid growth in the number of parameters of state-of-the-art LLMs (Achiam et al., 2023; Team et al., 2023), fine-tuning for each downstream task has become prohibitively expensive in both time and resources. The goal of parameter-efficient fine-tuning (PEFT) (Xu et al., 2023) is to adapt models to new tasks by updating only a small number of (possibly new) parameters. There are a few dominant PEFT approaches:
• Adapter-based approaches (Houlsby et al., 2019; He et al., 2021) hold the LLM parameters frozen and add new trainable parameters to various parts of the model, with the main differentiating factor between approaches being where the adapter parameters are added.
• LoRA and its variants (Hu et al., 2021; Edalati et al., 2022; Valipour et al., 2022) similarly hold the LLM parameters frozen and add new trainable parameters; however, these trainable parameters are added to the frozen LLM parameters such that the fine-tuned LLM is identical in architecture to the initial LLM, but with only those added parameters updated.
• Partial fine-tuning and partial masking approaches (Zhao et al., 2020; Zaken et al., 2021) only fine-tune or mask a subset of the LLM parameters; no new parameters are introduced.
• Finally, soft-prompt approaches (Li & Liang, 2021; Lester et al., 2021) prepend tokens with learnable parameters to the beginning of the LLM input or to the beginning of every LLM layer; like adapter-based and LoRA approaches, they hold the actual LLM parameters frozen.

Our work falls under the umbrella of soft-prompt approaches but can be extended to other PEFT approaches as well. Most relevant to our work is Levine et al. (2022), where the input is fed into a separate trainable neural network to produce the soft-prompt. We extend this to encoding structured data input via a GNN to produce the LLM soft-prompt.
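To make the soft-prompt mechanism concrete, the following is a minimal NumPy sketch of the idea from Lester et al. (2021) that GraphToken builds on. All sizes and names here are illustrative assumptions, not the paper's implementation:

```python
# A minimal sketch (not the paper's implementation) of soft prompting: a small
# matrix of trainable vectors is prepended to the frozen LLM's input embeddings.
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, D_MODEL, NUM_SOFT_TOKENS = 1_000, 64, 10   # tiny illustrative sizes

# Frozen piece: the LLM's token embedding table (never updated).
token_embeddings = rng.normal(size=(VOCAB_SIZE, D_MODEL)).astype(np.float32)

# Trainable piece: the soft prompt, shared across all problem instances.
soft_prompt = rng.normal(scale=0.02, size=(NUM_SOFT_TOKENS, D_MODEL)).astype(np.float32)

def build_llm_input(token_ids: np.ndarray) -> np.ndarray:
    """Prepend the learned soft-prompt vectors to the embedded text prompt."""
    text_embeds = token_embeddings[token_ids]          # (seq_len, d_model), frozen
    return np.concatenate([soft_prompt, text_embeds])  # (k + seq_len, d_model)

# During tuning, gradients of the task loss flow only into `soft_prompt`.
# GraphToken replaces this *static* matrix with the output of a graph encoder,
# so the prepended vectors change with every input graph.
example_ids = np.array([17, 433, 901])   # stand-in token ids for a question
print(build_llm_input(example_ids).shape)  # (13, 64)
```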


2.2. Graph Encoding with Neural Networks

The field of graph representation learning (Chami et al., 2022) seeks ways to represent structured data (i.e., discrete and relational) in a continuous domain, typically for use in a downstream machine learning task. While seminal work like DeepWalk (Perozzi et al., 2014) popularized the node embedding problem, later work utilized GNNs to generalize and learn representations of the entire graph (graph embeddings). Many approaches to learning graph representations (node or graph embeddings) have followed (Tsitsulin et al., 2018; Xie et al., 2022).

2.3. Graphs and LLMs

The confluence of graph representation learning and reasoning with LLMs is a rapidly growing field of research: like language, structured data surrounds us, but, unlike LLM input, it is not sequential. Some of the first graphs in this literature are knowledge graphs, as in (Agarwal et al., 2020), where the retrieval corpus of a retrieval LLM is augmented with text-encoded knowledge graphs. Ye et al. (2023) utilize instruction fine-tuned LLMs for node classification. Similarly, Chen et al. (2023b) leverage LLMs to enhance graph learning models by incorporating rich text attributes. Wang et al. (2023b) showed that language models demonstrate preliminary abilities for graph reasoning tasks. Later, Fatemi et al. (2024) proposed GraphQA, a comprehensive benchmark to systematically evaluate models for graph reasoning tasks, finding that graph reasoning tasks are currently difficult and that scaling laws do not seem to apply. Finally, while there is a growing body of work in pre-training, fine-tuning, and prompt-tuning with GNNs by themselves (Fang et al., 2023; Liu et al., 2023), that research, though conceptually similar, differs crucially from ours: GNN-based approaches lack the textual understanding capabilities that are central to the integration of LLMs with graph learning and reasoning.

3. GraphToken

When considering how to pass structured data to an LLM, there are largely two families of options: (1) encoding it as lexical tokens for LLM embedding, as in (Fatemi et al., 2024), or (2) encoding it directly to a continuous representation via a neural network, skipping any LLM token embedding. While representing a graph as a sequence of lexical tokens has benefits in terms of interpretability, there is often no clear choice of the order in which to sequentially write the structured data. We believe a text encoding of structured data prohibits rich, concise, and expressive representations. Instead, our method eschews representing a graph in text in favor of directly producing, using a GNN as an encoder, the continuous representations for the LLM input. We refer to these new graph-encoder-learned soft-tokens in the LLM embedding space as "graph tokens."

To maintain the reasoning and language capabilities of the LLM, we freeze its parameters and teach the graph encoder to align its output representations with the LLM embedding space: we learn only the parameters of the graph encoder during the training process. This reduces computational requirements significantly (graph encoder parameters constitute a trivial sum compared to the LLM). During our tests, the LLM is prompted with the output of the graph encoder and a task about the graph, for example: 'Does this graph contain a cycle?'. As such, the quality of the results is purely a function of how well the graph encoder represents the answer to the task and how well the LLM interprets that output.

3.1. Architecture

An overview of the architecture is provided in Figure 2. At a high level, our model has only two components. First, the graph encoder takes a graph as input and outputs a fixed number of token embeddings. These tokens are then prepended to the sequence of initial token embeddings in the prompt for an LLM, which is then decoded to produce an answer as text.

Graph Encoder. GNN models range from simple averaging methods to complex models with multi-headed attention, so there is a wide variety of possible graph representations. We suspect that some of these representations are better suited to be consumed by an LLM. Therefore, we conducted a thorough study that includes several well-known graph encoder choices in Section 4.2. Our graph encoder takes the relational structure of the graph as input, using some form of graph positional encoding as node features (either learned, Laplacian, or a combination thereof); see Section 4.2.2 for details. Next, we apply a GNN to obtain a representation of the graph, which we read out using one of a few different techniques depending on the task:
• For graph-level tasks (e.g., cycle check), we do global pooling for readout, taking the mean or sum of the representations over all of the nodes.
• For node-level tasks (e.g., node degree), we separately output the representation of each node. This can optionally be concatenated with a graph-level pooling.
• For edge-level tasks (e.g., edge existence), we use a global representation or the two node-level representations concatenated.

We note that the exact readout option used (e.g., mean or sum pooling) is a hyper-parameter chosen during model selection. Whichever the readout technique, this representation is then projected onto the space of tokens used by the LLM with a final dense layer.
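The following is a minimal plain-NumPy sketch of the readout and projection step just described. The GNN that produces the per-node states is abstracted away, and all dimensions and names are illustrative assumptions rather than the paper's TF-GNN code:

```python
# An illustrative sketch of readout options and the final dense projection into
# the LLM token space (not the implementation used in the paper).
import numpy as np

rng = np.random.default_rng(0)
D_GNN, D_MODEL = 128, 4096           # illustrative sizes

# Final dense layer: maps encoder output into the LLM's token-embedding space.
W_proj = rng.normal(scale=0.02, size=(D_GNN, D_MODEL)).astype(np.float32)
b_proj = np.zeros(D_MODEL, dtype=np.float32)

def readout(node_states, task_level, query_nodes=(), mode="mean"):
    """Turn per-node GNN states (num_nodes, d_gnn) into projected 'graph tokens'."""
    pooled = node_states.mean(axis=0) if mode == "mean" else node_states.sum(axis=0)
    if task_level == "graph":        # e.g. cycle check: global pooling
        tokens = pooled[None, :]
    elif task_level == "node":       # e.g. node degree: one token per queried node
        tokens = node_states[list(query_nodes)]
    elif task_level == "edge":       # e.g. edge existence: global representation
        tokens = pooled[None, :]     # (concatenating the two endpoint states is
                                     #  the other option mentioned in the text)
    else:
        raise ValueError(task_level)
    return tokens @ W_proj + b_proj  # (num_tokens, d_model): ready to prepend

node_states = rng.normal(size=(8, D_GNN)).astype(np.float32)   # toy 8-node graph
print(readout(node_states, "graph").shape)         # (1, 4096)
print(readout(node_states, "node", (0, 3)).shape)  # (2, 4096)
```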


[Figure 2: the GraphToken pipeline. A graph is passed through the GraphToken encoder (feature/positional encoding, graph convolutions, readout) to produce graph tokens, while the question "Is there a cycle in this graph?" is tokenized and embedded through the frozen LLM's embedding lookup; the frozen LLM consumes both and outputs "Yes, there is a cycle in this graph."]

Figure 2. A visual overview of the architecture of GraphToken. The framework takes a graph and a corresponding question as input. The graph encoder takes the graph and generates graph tokens. The question is tokenized and embedded into question tokens. A frozen LLM leverages the graph and question tokens to generate an answer.

LLM. For the experiments in the paper we use PaLM 2 (Anil et al., 2023); however, our method generalizes to nearly any LLM in use today. For our purposes, any language model which can accept a sequence of token embeddings and produce text is acceptable, so long as it is possible to compute a gradient with respect to part of the input sequence.

3.2. Training procedure

Our training procedure is very similar to that used by soft prompting methods (Lester et al., 2021). The training input consists of triples (G, T, A), where G is a graph structure, T is a statement or question describing the task (e.g., 'Does this graph contain a cycle?' for cycle check), and A is the ground truth answer ('Yes, there exists a cycle in this graph.').

In the forward pass, we compute the augmented query Q = E(G) || T(T), concatenating the GraphToken encoding of the graph, E(G), with the initial embedding of the task textual representation, T(T).

We train by optimizing the final LLM perplexity (total log-likelihood), L(A | Q), of the expected answer A with respect to the augmented query Q. We minimize this loss, back-propagating the gradient of L with respect to E(G) to the parameters of the GraphToken encoder, keeping all LLM parameters frozen. We use the Lion optimizer (Chen et al., 2023a) with a learning rate α = 0.05.
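Schematically, one training step looks as follows. The graph encoder, the frozen LLM embedding, and the answer log-likelihood are hypothetical callables standing in for the real models; only the Lion update rule (Chen et al., 2023a) and the α = 0.05 learning rate come from the text, and weight decay is omitted:

```python
# A schematic sketch of one GraphToken training step; `graph_encoder`,
# `llm_embed`, `llm_answer_log_likelihood`, and `grad_fn` are hypothetical
# stand-ins for the frozen LLM and the trainable encoder.
import numpy as np

LEARNING_RATE, BETA1, BETA2 = 0.05, 0.9, 0.99

def train_step(params, momentum, graph, task_text, answer_text,
               graph_encoder, llm_embed, llm_answer_log_likelihood, grad_fn):
    # Forward pass: Q = E(G) || T(T).
    graph_tokens = graph_encoder(params, graph)          # E(G): learned soft tokens
    text_tokens = llm_embed(task_text)                   # T(T): frozen embeddings
    query = np.concatenate([graph_tokens, text_tokens])  # augmented query Q

    # Loss: negative log-likelihood of the answer A given Q (LLM stays frozen).
    loss = -llm_answer_log_likelihood(answer_text, query)

    # Backward pass: gradients w.r.t. the encoder parameters only.
    grads = grad_fn(loss, params)

    # Lion update: step in the sign of an interpolated momentum/gradient direction.
    new_params, new_momentum = {}, {}
    for name, g in grads.items():
        direction = np.sign(BETA1 * momentum[name] + (1.0 - BETA1) * g)
        new_params[name] = params[name] - LEARNING_RATE * direction
        new_momentum[name] = BETA2 * momentum[name] + (1.0 - BETA2) * g
    return new_params, new_momentum, loss
```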
4. Experiments

In this section, we summarize the key experiments conducted with GraphToken. We begin by highlighting some of the most exciting results from our analysis:
• R1: GraphToken demonstrates superior performance compared to established baselines across a comprehensive range of graph reasoning tasks, including graph-level, node-level, and edge-level tasks.
• R2: The performance of different graph convolution architectures varies across tasks. This highlights the importance of carefully choosing the right architecture for the specific graph reasoning problem at hand.
• R3: By intentionally breaking equivariance, we enhance GraphToken's graph reasoning capabilities.

Datasets. We conduct our experiments on the graph reasoning tasks proposed in GraphQA (Fatemi et al., 2024). This dataset presents multiple graph reasoning problems with different difficulty levels. These tasks can be categorized as follows (a small worked example follows below):
• Graph-level: node count (counting the number of nodes in a graph), edge count (counting the number of edges in a graph), cycle check (determining whether a graph contains a cycle), and triangle counting (counting the number of triangles in a graph).
• Node-level: node degree (calculating the degree of a given node in a graph), and connected nodes (finding all the nodes that are connected to a given node in a graph).
• Edge-level: reachability (finding if there is a path from one node to another), edge existence (determining whether a given edge exists in a graph), and shortest path (finding the length of the shortest path from one node to another).

Setting. We implemented GraphToken in TensorFlow (Abadi et al., 2015) using the TF-GNN library (Ferludin et al., 2023). The LLM used in our experiments is the instruction-fine-tuned Flan (Chung et al., 2022) checkpoint of PaLM 2 S (Anil et al., 2023). Experiments were carried out on Google TPUv3 and TPUv5e (Jouppi et al., 2017).
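As an illustration of the task semantics listed above (and not the benchmark's own data-generation code), the ground-truth answers for a toy graph can be computed with networkx:

```python
# Illustrative ground-truth computation for the GraphQA task types on a toy graph.
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])  # four nodes, one triangle

answers = {
    "node count": G.number_of_nodes(),
    "edge count": G.number_of_edges(),
    "cycle check": not nx.is_forest(G),
    "triangle counting": sum(nx.triangles(G).values()) // 3,
    "node degree (node 2)": G.degree(2),
    "connected nodes (node 2)": sorted(G.neighbors(2)),
    "reachability (0 -> 3)": nx.has_path(G, 0, 3),
    "edge existence (0, 3)": G.has_edge(0, 3),
    "shortest path (0 -> 3)": nx.shortest_path_length(G, 0, 3),
}
for task, answer in answers.items():
    print(f"{task}: {answer}")
```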


Table 1. Results comparing GraphToken against prompt engineering and soft prompting on graph reasoning tasks from the GraphQA-Test benchmark (Fatemi et al., 2024), by simple accuracy. We see that GraphToken substantially improves LLM performance on all graph-, node-, and edge-level tasks. The best result for each task is highlighted in bold and the second best is underlined.

Graph Tasks | Node Tasks | Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
ZERO-SHOT 0.217 0.124 0.760 0.015 0.140 0.147 0.849 0.445 0.115
ZERO-COT 0.146 0.094 0.323 0.127 0.104 0.088 0.735 0.335 0.336
FEW-SHOT 0.253 0.120 0.374 0.030 0.174 0.124 0.794 0.368 0.227
COT 0.276 0.128 0.580 0.081 0.292 0.131 0.452 0.428 0.386
COT-BAG 0.269 0.125 0.521 0.081 0.280 0.158 0.452 0.373 0.404
SOFT-PROMPT 0.056 0.018 0.832 0.162 0.098 0.068 0.838 0.544 0.462
GraphToken 0.996 0.426 0.956 0.348 0.962 0.264 0.932 0.738 0.638

Model selection was performed by evaluating performance on GraphQA-Train.

4.1. Experiment 1: GraphToken Performance

In this experiment, we rigorously evaluate the performance of GraphToken against the following comprehensive set of established baselines:
• ZERO-SHOT. In this approach, the model is given a task description and immediately asked to produce the desired output. No additional examples or demonstrations are provided.
• FEW-SHOT. This approach provides the model with a few examples of the task and their desired outputs (Brown et al., 2020). Unlike traditional training, these examples are included directly in the prompt, allowing the model to learn and adapt during inference.
• COT. Chain-of-thought (CoT) prompting (Wei et al., 2022) provides examples each showing step-by-step reasoning, teaching the LLM to generate its own thought processes for tackling new tasks.
• ZERO-COT. Zero-shot CoT (Kojima et al., 2022) builds upon chain-of-thought prompting by eliminating the need for training examples. The LLM generates its own step-by-step reasoning process using a simple trigger phrase like "Let's think step by step".
• COT-BAG. BAG prompting (Wang et al., 2023b) extends COT to improve the performance of LLMs on graph-related tasks by appending "Let's construct a graph with the nodes and edges first" to the prompt.
• SOFT-PROMPT. This approach uses the standard soft prompt tuning of Lester et al. (2021). It optimizes a global static prompt which is shared across problem instances to improve task performance. Unlike our proposed method, it does not have access to the graph information, making the results of this approach equivalent to those of a majority classifier.

Results. The results of this experiment, summarized in Table 1, demonstrate that GraphToken significantly outperforms existing methods on all graph-, node-, and edge-level tasks. While SOFT-PROMPT achieves the second best score on some tasks, this is mainly due to its ability to predict majority labels. For example, 82% of the questions in cycle check are about existent cycles. Similarly, 54% of the questions in edge existence are about non-existent edges.

4.2. Experiment 2: Encoder Design

From the results in Table 1, we can see that graph encoders can significantly improve an LLM's capability on graph reasoning tasks. However, the choice of graph encoder has a significant effect on task performance. Here we study how different architecture choices affect the quality of the graph representation for a language model's use, including the choice of graph convolution, the features available to the network, and the hyper-parameters.

4.2.1. Choice: Graph Convolution

This experiment investigates the impact of graph convolution choice on the performance of GraphToken. We evaluate the following diverse set of encoders:
• Graph Convolutional Network (GCN): One of the earliest GNNs, this model does mean pooling of neighbor features, followed by a non-linear transformation (Kipf & Welling, 2017).
• Message Passing Neural Network (MPNN): A generalization of the GCN, this model allows for more flexible aggregation of neighbor features and adds further non-linear feature transformations (Gilmer et al., 2017).
• Graph Isomorphism Network (GIN): A GNN designed specifically to maximize the expressiveness of the model with respect to a classic graph isomorphism test (Xu et al., 2018).
• Multi-Head Attention (Graph Transformer, MHA): This GNN adapts transformer-style attention, allowing it to learn different ways of passing messages (based on the attention mask) (Dwivedi & Bresson, 2021).
• Heterogeneous Graph Transformer (HGT): Another adoption of transformer-style attention (we note that it can be applied to non-heterogeneous graphs as well) (Hu et al., 2020).
• Linear Aggregation: In addition to the popular encoders from the literature, we also evaluated a family of models which use linear aggregation of features, as this has been shown to work surprisingly well on a number of tasks (Bojchevski et al., 2020).
  - Node Set: This model simply pools all the node features in the graph together.
  - Edge Set: This model simply pools all the edge features together (edge features are defined as the concatenation of the features of an edge's two nodes).

Setting. The experimental setup is similar to the experiment in Section 4.1. Again, GraphQA-Train performance was used for model selection, and we report the corresponding model's results on GraphQA-Test.

Result. Table 2 shows the results for each model on GraphQA-Test. In general, we see that there is no one model that consistently dominates across graph encoding tasks. Instead, we see that different graph encoder architectures have different strengths and weaknesses.

There is one notable pattern, however: the simple linear GNN models perform quite strongly at their respective counting tasks (i.e., NodeSet does well at node count, EdgeSet does well at edge count). However, models with non-linear effects are still capable on these tasks (e.g., MHA does well at node count, and MPNN does well on edge count).


Table 2. Study of individual graph encoder performance on GraphQA-Test tasks. Note that there is 'no free lunch' here: no single encoder examined dominates across all tasks.

Graph Tasks | Node Tasks | Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
Non-linear:
GCN 0.746 0.056 0.964 0.208 0.264 0.264 0.918 0.68 0.604
GIN 0.704 0.052 0.898 0.194 0.252 0.18 0.902 0.65 0.586
MPNN 0.792 0.368 0.956 0.348 0.962 0.25 0.934 0.648 0.638
HGT 0.252 0.084 0.934 0.234 0.266 0.184 0.944 0.718 0.6
MHA 0.912 0.264 0.962 0.266 0.552 0.244 0.932 0.738 0.608
Linear:
Node Set 0.996 0.080 0.948 0.198 0.19 0.118 0.942 0.596 0.568
Edge Set 0.618 0.426 0.964 0.228 0.22 0.096 0.904 0.592 0.568

4.2.2. Choice: GNN Features

Recently, positional node encodings (Wang et al., 2022; Dwivedi et al., 2023; Lim et al., 2023) were proposed to enhance the expressivity of GNNs. On the other hand, in molecular modeling it has recently been shown that non-equivariant encoders can match or exceed the quality of equivariant ones (Wang et al., 2023c). This raises a more general question: do GNNs need to be equivariant in order to generalize, especially with extremely powerful decoders such as LLMs?

We investigate this question by testing the graph reasoning capability of GraphToken with three distinct node featurization settings (a small sketch of these featurizations appears at the end of this subsection):
• LPE: Laplacian positional encodings using the normalized Laplacian matrix, as in (Dwivedi et al., 2023).
• IDX: a unique identity encoding designed to break the GNN's equivariance.
• LPE+IDX: a concatenation of the above two strategies.

Setting. The experimental setup is similar to Section 4.2. Node features of dimensionality d = 4 are evaluated for the LPE and IDX featurizations. Models using LPE+IDX contain node features of size d = 8.

Result. The outcome of this experiment is shown in Figure 3, where we plot the difference of all models from the SOFT-PROMPT baseline (Lester et al., 2021) when evaluated on GraphQA-Test. The core result is that learned positional embeddings (Fig. 3b) generally outperform Laplacian positional embeddings (Fig. 3a) for most encoders and most tasks. These results show that breaking equivariance surprisingly adds additional capabilities for graph reasoning when powerful decoders are present. Some additional observations include:
• Counting Tasks. Learned features seem to provide essential lift for basic counting tasks (node count, edge count, and node degree) in many encoders.
• Combination. In some very interesting cases of task and encoder, the combination of both types of features greatly improved model performance (Fig. 3c). For example, GCN and NodeSet significantly improved at the node count task.
• Linear models. NodeSet (an encoder which does not use the graph edges) generally benefited from spectral features, as they added previously unseen information about the graph structure.

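As referenced in Section 4.2.2, the following is a minimal NumPy sketch of the two featurizations: LPE from the eigenvectors of the symmetric normalized Laplacian, and IDX as a per-node identity table (randomly initialized here as a stand-in for a learned embedding). The convention of dropping the trivial eigenvector and the toy graph are illustrative assumptions, not taken from the paper's code:

```python
# Illustrative LPE / IDX node featurizations (not the paper's implementation).
import numpy as np

def laplacian_pe(adj, dim=4):
    """Laplacian positional encodings: eigenvectors of the symmetric normalized
    Laplacian for the smallest non-trivial eigenvalues (one common convention)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.clip(deg, 1.0, None))  # isolated nodes have zero rows anyway
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)                      # eigenvalues in ascending order
    return eigvecs[:, 1:dim + 1]                          # drop the trivial eigenvector

def idx_features(num_nodes, dim=4, seed=0):
    """'IDX' identity features: one vector per node index. Random values stand in
    for a learned table; giving every node its own row breaks permutation equivariance."""
    return np.random.default_rng(seed).normal(scale=0.1, size=(num_nodes, dim))

# Toy example: an 8-node cycle graph.
n = 8
adj = np.zeros((n, n))
for u in range(n):
    adj[u, (u + 1) % n] = adj[(u + 1) % n, u] = 1.0

lpe = laplacian_pe(adj, dim=4)                  # LPE,     d = 4
idx = idx_features(n, dim=4)                    # IDX,     d = 4
lpe_idx = np.concatenate([lpe, idx], axis=1)    # LPE+IDX, d = 8
print(lpe.shape, idx.shape, lpe_idx.shape)      # (8, 4) (8, 4) (8, 8)
```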

[Figure 3: per-task performance deltas versus the SOFT-PROMPT baseline for each encoder, under three featurizations: (a) Spectral Features (LPE), (b) Learned Features (IDX), (c) Learned and Spectral Features (LPE+IDX).]

Figure 3. Effect of varying the node features used in the graph encoder. Results shown are the performance difference from the SOFT-PROMPT baseline on GraphQA-Test. We see that breaking equivariance via learned features (Fig. 3b) generally improves model performance, but the combination of learned and spectral features (Fig. 3c) proves uniquely powerful for some encoders.

Table 3. Total number of parameters in the graph encoder.

Encoder Body Head
GCN 17,152 1.1 × 10^7
GIN 17,152 1.1 × 10^7
MPNN 83,968 1.1 × 10^7
HGT 198,788 1.1 × 10^7
MHA 101,376 1.1 × 10^7
Node Set 0 4.1 × 10^5
Edge Set 0 7.4 × 10^5

Figure 4. UMAP (McInnes et al., 2018) projection of GraphToken embeddings produced by two different encoders, colored by the diameter of a graph. We plot all 8-node graphs.

4.2.3. Parameter Usage in GraphToken

Setting. We consider the graph convolution evaluation from Section 4.2.1, using LPE features with dimensionality d = 4. The graph encoders have a latent space of size 128. We then project this into a prompt embedding with approximately 80,000 parameters in GraphToken.

Results. Table 3 shows the number of parameters used in the graph encoder. Here 'Body' refers to the number of parameters in the graph encoder itself, and 'Head' refers to the parameters in the transformation layer to the higher-dimensional LLM token space.

It is also insightful to consider the number of parameters used in each of the models. Table 3 specifies the total number of parameters used by each GNN architecture. We note that this size is dominated by the total number of parameters in the projection layer to the token space (roughly 11 million). Out of all non-linear architectures, attention-based ones (MHA and HGT) add the most encoder-based parameters. In general, the size of our graph encoder models varies from 17k to 199k parameters. This is significantly smaller than typical LLMs, which currently often contain tens or hundreds of billions of parameters. For example, the open-source Llama 2 language model scales from 7 billion to 70 billion parameters (Touvron et al., 2023), while the closed-source PaLM 1 model contains 540 billion parameters (Chowdhery et al., 2022). In light of this, we can see that GraphToken is highly parameter-efficient: it significantly improves the graph reasoning capability of an LLM while barely adding any parameters at all.

5. Discussion

So far we have examined the performance benefits of GraphToken and the design choices necessary when building a graph encoder. However, several questions remain: (1) what exactly are the encoders doing, and (2) does it generalize? In this section we seek to provide some insight (if not answers) into these questions, and lay the foundations for future work.

5.1. Graph Encoder Analysis

This section studies the properties learned by GraphToken's graph encoders by directly examining the representations they produce. We study both the in-distribution and out-of-distribution properties of these encoders.


Table 4. Predicting bipartiteness using graph encoders trained for different tasks, measured on all graphs with 8 nodes. Observe that graph encoders trained on cycle check and triangle counting generalize well to bipartiteness detection.

Graph Tasks | Node Tasks | Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
Non-linear:
GCN 53.82 53.28 55.46 50.00 50.00 54.64 50.00 48.48 51.60
GIN 51.09 53.27 52.74 51.91 53.26 53.57 51.36 52.17 52.18
MPNN 68.01 71.34 56.82 76.82 60.13 60.95 61.77 62.58 54.37
HGT 50.00 54.35 68.53 95.03 50.27 59.81 68.85 74.58 50.00
MHA 50.27 66.39 87.00 72.14 58.74 66.38 51.63 54.12 64.45
Linear:
Node Set 56.55 57.38 56.30 55.74 56.29 56.28 55.73 57.93 56.56
Edge Set 50.82 50.82 50.82 50.55 50.54 50.54 50.82 50.82 50.54

We consider 9 tasks in total: total number of edges; maximum node degree; graph diameter; number of triangles; average local clustering coefficient; largest core number; average shortest path length; testing planarity; and testing bipartiteness.

One benefit of studying graphs is data availability: for small-enough graphs, we can generate all possible graphs exhaustively using geng (McKay et al., 1981). The evaluation goes as follows. First, we train an encoder on a task from GraphQA (e.g., cycle check). Then, to evaluate the cross-task generalizability of the different encoders, we train a kNN classifier (or regressor) with k = 5 on the representations of (i) an exhaustive set of connected graphs with 8 nodes (called graph8c in Balcilar et al. (2021)) and (ii) an exhaustive set of tree graphs with 15 nodes. We note that because we are generating a large set of graphs (e.g., there are 11,117 graphs of size 8) and only trained on GraphQA-Train (1,000 instances), the vast majority of the graphs we are using here are unseen. As an illustration, a UMAP (McInnes et al., 2018) visualization of the embeddings for all 8-node graphs using two GNN encoders is presented in Figure 4.

Results. Since we present a lot of experiments and it is hard to cover them all, we focus here on the task of predicting whether a graph is bipartite and defer the rest to the Appendix. From basic graph theory we know that if there is a triangle or an odd cycle in a graph, it cannot be bipartite. Therefore, we expect the triangle counting and cycle check training objectives to perform well on this task. In Table 4 we can see that this is precisely what happens, with attention-based methods taking the lead. This is an interesting example of generalization from the graph encoders to a new task.

Overall, there is a significant performance gap between different graph encoders, with MPNN and attention-based ones generally being the best. We observe significant correlations in performance for in-distribution learning: for instance, GraphToken trained on edge count performs the best on edge count prediction. What is interesting is that node count performs comparably here. This suggests that graph encoders learn some universal features that are applicable to many different downstream tasks.

5.2. Future Work

This work opens up an exciting new avenue of exploration for reasoning with structured data and LLMs. Some potential directions that we consider particularly exciting include:
• This work just considers existing convolutions and measures their effectiveness. An obvious and essential next step is designing graph convolutions that best support LLMs in various graph reasoning tasks.
• Evaluating the usefulness of this approach for factual grounding. Can we improve the ability of an LLM to answer questions about the data using prompting over knowledge graphs? Could an LLM answer novel questions about a molecule given a GNN-produced representation of it?
• GraphToken improves performance with broken equivariance. Can this result inform other problems with very strong decoder models?
• This work examines how a GNN can be used to enhance LLMs, but what about the reverse? Can we use an LLM to interrogate a GNN to better explain its results or provide higher quality answers?

6. Conclusions

In this work we have studied the structured data encoding problem for LLMs. Our novel method, GraphToken, learns a graph embedding function through the gradients provided by a frozen LLM. GraphToken is especially well suited for projecting structured data into the latent 'prompt space' of an LLM. It is a parameter-efficient method, as it requires training only the graph encoder and does not update LLM parameters. Our extensive experimental analysis across 9 graph reasoning tasks shows that GraphToken greatly improves graph reasoning in LLMs: we observe up to a 73% improvement on the GraphQA benchmark.

There is still much to do! We believe that our approach is fundamental for adapting new structured data sources to LLMs (which are expensive and time consuming to train), and presents a very attractive way of improving fundamental problems in LLMs, including hallucinations, factuality, and freshness.


Acknowledgements

We thank Oleksandr Ferludin, Johannes Gasteiger, Silvio Lattanzi, Vahab Mirrokni and Jan Pfeifer for discussions about the work.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org. Cited on page 4.
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023. Cited on page 2.
Agarwal, O., Ge, H., Shakeri, S., and Al-Rfou, R. Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. arXiv preprint arXiv:2010.12688, 2020. Cited on page 3.
Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023. Cited on pages 3 and 4.
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., and Schmid, C. ViViT: A video vision transformer. In ICCV, 2021. Cited on page 1.
Balcilar, M., Héroux, P., Gauzere, B., Vasseur, P., Adam, S., and Honeine, P. Breaking the limits of message passing graph neural networks. In ICML, 2021. Cited on page 8.
Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gulcehre, C., Song, F., Ballard, A., Gilmer, J., Dahl, G., Vaswani, A., Allen, K., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., and Pascanu, R. Relational inductive biases, deep learning, and graph networks, 2018. Cited on page 13.
Bengio, Y., Ducharme, R., and Vincent, P. A neural probabilistic language model. NIPS, 2000. Cited on page 2.
Bojchevski, A., Gasteiger, J., Perozzi, B., Kapoor, A., Blais, M., Rózemberczki, B., Lukasik, M., and Günnemann, S. Scaling graph neural networks with approximate pagerank. In KDD, 2020. Cited on page 6.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 2020. Cited on pages 1 and 5.
Chami, I., Abu-El-Haija, S., Perozzi, B., Re, C., and Murphy, K. Machine learning on graphs: A model and comprehensive taxonomy. JMLR, 2022. Cited on page 3.
Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al. PaLI: A jointly-scaled multilingual language-image model. In ICLR, 2022. Cited on page 1.
Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., et al. Symbolic discovery of optimization algorithms. arXiv preprint arXiv:2302.06675, 2023a. Cited on page 4.
Chen, Z., Mao, H., Li, H., Jin, W., Wen, H., Wei, X., Wang, S., Yin, D., Fan, W., Liu, H., et al. Exploring the potential of large language models (LLMs) in learning on graphs. arXiv preprint arXiv:2307.03393, 2023b. Cited on page 3.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev, S., Michalewski, H., Garcia, X., Misra, V., Robinson, K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim, H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D., Agrawal, S., Omernick, M., Dai, A. M., Pillai, T. S., Pellat, M., Lewkowycz, A., Moreira, E., Child, R., Polozov, O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., and Fiedel, N. PaLM: Scaling language modeling with pathways, 2022. Cited on page 7.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. Cited on page 4.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. Cited on pages 1 and 2.


Dwivedi, V. P. and Bresson, X. A generalization of transformer networks to graphs, 2021. Cited on pages 6 and 13.
Dwivedi, V. P., Joshi, C. K., Luu, A. T., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. JMLR, 24(43):1–48, 2023. Cited on page 6.
Edalati, A., Tahaei, M., Kobyzev, I., Nia, V. P., Clark, J. J., and Rezagholizadeh, M. KronA: Parameter efficient tuning with Kronecker adapter. arXiv preprint arXiv:2212.10650, 2022. Cited on page 2.
Fang, T., Zhang, Y., Yang, Y., Wang, C., and Chen, L. Universal prompt tuning for graph neural networks, 2023. Cited on page 3.
Fatemi, B., Halcrow, J., and Perozzi, B. Talk like a graph: Encoding graphs for large language models. In ICLR, 2024. Cited on pages 1, 2, 3, 4, and 5.
Ferludin, O., Eigenwillig, A., Blais, M., Zelle, D., Pfeifer, J., Sanchez-Gonzalez, A., Li, W. L. S., Abu-El-Haija, S., Battaglia, P., Bulut, N., Halcrow, J., de Almeida, F. M. G., Gonnet, P., Jiang, L., Kothari, P., Lattanzi, S., Linhares, A., Mayer, B., Mirrokni, V., Palowitch, J., Paradkar, M., She, J., Tsitsulin, A., Villela, K., Wang, L., Wong, D., and Perozzi, B. TF-GNN: Graph neural networks in TensorFlow, 2023. Cited on pages 4 and 13.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry, 2017. Cited on page 5.
Guo, J., Du, L., and Liu, H. GPT4Graph: Can large language models understand graph structured data? An empirical evaluation and benchmarking. arXiv preprint arXiv:2305.15066, 2023. Cited on page 1.
Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. Retrieval augmented language model pre-training. In ICML, 2020. Cited on page 1.
He, R., Liu, L., Ye, H., Tan, Q., Ding, B., Cheng, L., Low, J.-W., Bing, L., and Si, L. On the effectiveness of adapter-based tuning for pretrained language model adaptation. arXiv preprint arXiv:2106.03164, 2021. Cited on page 2.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In ICML, 2019. Cited on page 2.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. Cited on page 2.
Hu, Z., Dong, Y., Wang, K., and Sun, Y. Heterogeneous graph transformer, 2020. Cited on page 6.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-l., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E., and Yoon, D. H. In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News, 2017. Cited on page 4.
Jurafsky, D. and Martin, J. H. N-gram language models. In Speech and Language Processing (3rd ed.), 2021. Cited on page 2.
Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022. Cited on page 1.
Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019. Cited on page 1.
Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks, 2017. Cited on page 5.
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. NeurIPS, 35:22199–22213, 2022. Cited on page 5.
Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning, 2021. Cited on pages 2, 4, 5, and 6.
Levine, Y., Dalmedigos, I., Ram, O., Zeldes, Y., Jannai, D., Muhlgay, D., Osin, Y., Lieber, O., Lenz, B., Shalev-Shwartz, S., Shashua, A., Leyton-Brown, K., and Shoham, Y. Standing on the shoulders of giant frozen language models, 2022. Cited on page 2.


Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021. Cited on page 2.
Lim, D., Robinson, J., Zhao, L., Smidt, T., Sra, S., Maron, H., and Jegelka, S. Sign and basis invariant networks for spectral graph representation learning. In ICLR, 2023. Cited on page 6.
Liu, Z., Yu, X., Fang, Y., and Zhang, X. GraphPrompt: Unifying pre-training and downstream tasks for graph neural networks. In Proceedings of the ACM Web Conference 2023, 2023. Cited on page 3.
McInnes, L., Healy, J., and Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018. Cited on pages 7 and 8.
McKay, B. D. et al. Practical graph isomorphism. 1981. Cited on page 8.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. Cited on page 2.
Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In KDD, 2014. Cited on page 3.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. 2018. Cited on page 1.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. Cited on page 2.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. Cited on page 1.
Rosenfeld, R. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278, 2000. Cited on page 2.
Stechly, K., Marquez, M., and Kambhampati, S. GPT-4 doesn't know it's wrong: An analysis of iterative prompting for reasoning problems. arXiv preprint arXiv:2310.12397, 2023. Cited on page 1.
Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. Cited on page 2.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. Cited on pages 1 and 7.
Tsitsulin, A., Mottin, D., Karras, P., Bronstein, A., and Müller, E. SGR: Self-supervised spectral graph representation learning. arXiv preprint arXiv:1811.06237, 2018. Cited on page 3.
Valipour, M., Rezagholizadeh, M., Kobyzev, I., and Ghodsi, A. DyLoRA: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv preprint arXiv:2210.07558, 2022. Cited on page 2.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. NeurIPS, 30, 2017. Cited on page 1.
Vu, T., Iyyer, M., Wang, X., Constant, N., Wei, J., Wei, J., Tar, C., Sung, Y.-H., Zhou, D., Le, Q., and Luong, T. FreshLLMs: Refreshing large language models with search engine augmentation, 2023. Cited on page 1.
Wang, C., Liu, X., Yue, Y., Tang, X., Zhang, T., Jiayang, C., Yao, Y., Gao, W., Hu, X., Qi, Z., Wang, Y., Yang, L., Wang, J., Xie, X., Zhang, Z., and Zhang, Y. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity, 2023a. Cited on page 1.
Wang, H., Yin, H., Zhang, M., and Li, P. Equivariant and stable positional encoding for more powerful graph neural networks. In ICLR, 2022. Cited on page 6.
Wang, H., Feng, S., He, T., Tan, Z., Han, X., and Tsvetkov, Y. Can language models solve graph problems in natural language? In NeurIPS, 2023b. Cited on pages 1, 3, and 5.
Wang, Y., Elhag, A. A., Jaitly, N., Susskind, J. M., and Bautista, M. A. Generating molecular conformer fields. arXiv preprint arXiv:2311.17932, 2023c. Cited on page 6.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022. Cited on page 5.
Xie, Y., Xu, Z., Zhang, J., Wang, Z., and Ji, S. Self-supervised learning of graph neural networks: A unified review. IEEE TPAMI, 2022. Cited on page 3.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018. Cited on page 5.


Xu, L., Xie, H., Qin, S.-Z. J., Tao, X., and Wang, F. L. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148, 2023. Cited on page 2.
Ye, R., Zhang, C., Wang, R., Xu, S., and Zhang, Y. Natural language is all a graph needs. arXiv preprint arXiv:2308.07134, 2023. Cited on page 3.
Zaken, E. B., Ravfogel, S., and Goldberg, Y. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021. Cited on page 2.
Zhao, M., Lin, T., Mi, F., Jaggi, M., and Schütze, H. Masking as an efficient alternative to finetuning for pretrained language models. arXiv preprint arXiv:2004.12406, 2020. Cited on page 2.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023. Cited on pages 1 and 2.


A. Appendix
A.1. Graph Encoders
Notation. We briefly describe the notation we will use. The graph G = (V, E) has node set V and edge set E.
While we will only discuss simple graphs, everything discussed can be extended to heterogeneous graphs w.l.o.g. (Battaglia
et al., 2018; Ferludin et al., 2023).
Using the notation of Ferludin et al. (2023), a GNN has two primary operations. First, a next-state function (NextState) computes the hidden state $h_v$ of a node (or the state $m_{(u,v)}$ of an edge) given information from its neighbors and its previous state. Second, an aggregation function (EdgePool) pools the information from a node's immediate neighborhood into a fixed-size representation. More formally, the next state of a node is

$$h_v^{(i+1)} = \mathrm{NextState}_V^{(i+1)}\big(h_v^{(i)}, m_v^{(i+1)}\big).$$

The pooled messages $m_v^{(i+1)}$ are defined as follows:

$$m_{(u,v)}^{(i+1)} = \mathrm{NextState}_E^{(i+1)}\big(h_u^{(i)}, h_v^{(i)}, m_{(u,v)}^{(i)}\big),$$
$$m_v^{(i+1)} = \mathrm{EdgePool}^{(i+1)}\big(h_v^{(i)}, \{m_{(u,v)}^{(i+1)} \mid u \in N(v)\}\big).$$

Different realizations of the NextState and EdgePool functions can implement a wide variety of GNN operations. This can include powerful models which use Transformer-style attention instead of the provided graph edges (Dwivedi & Bresson, 2021).
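To make the template concrete, below is one deliberately simple NumPy realization of the NextState/EdgePool operations, with linear next-state functions, tanh non-linearities, and sum pooling. It is an illustrative sketch, not the TF-GNN implementation used in the paper:

```python
# A minimal message-passing layer following the NextState / EdgePool template.
import numpy as np

def gnn_layer(h, m, edges, w_msg, w_node):
    """One message-passing step.

    h: (num_nodes, d) node states; m: dict mapping a directed edge (u, v) to its
    (d,) message state; edges: list of directed (u, v) pairs;
    w_msg: (3d, d) and w_node: (2d, d) weight matrices.
    """
    # NextState_E: new message on every edge (u, v).
    new_m = {(u, v): np.tanh(np.concatenate([h[u], h[v], m[(u, v)]]) @ w_msg)
             for (u, v) in edges}
    # EdgePool: sum the messages arriving at each node v.
    pooled = np.zeros_like(h)
    for (u, v), msg in new_m.items():
        pooled[v] += msg
    # NextState_V: combine the previous node state with the pooled messages.
    new_h = np.tanh(np.concatenate([h, pooled], axis=1) @ w_node)
    return new_h, new_m

# Toy run on a 3-node path graph with d = 4.
rng = np.random.default_rng(0)
d, edges = 4, [(0, 1), (1, 0), (1, 2), (2, 1)]
h = rng.normal(size=(3, d))
m = {e: np.zeros(d) for e in edges}
w_msg, w_node = rng.normal(size=(3 * d, d)), rng.normal(size=(2 * d, d))
h, m = gnn_layer(h, m, edges, w_msg, w_node)
print(h.shape)  # (3, 4)
```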
The architecture of NodeSet and EdgeSet is shown in Figure 5. Other GNN models have graph convolutions before
node/edge states are read out.

[Figure 5: (a) Node Set architecture, (b) Edge Set architecture; in each, node or edge features pass through shared-weight MLPs, are pooled, and are projected to graph tokens.]

Figure 5. Figurative illustrations of set-based GNN architectures employed in the paper. We pool representations from either nodes or edges, transform them via an MLP with shared weights, pool, and project to the GraphToken space.
shared

A.2. Additional experiments


We present additional results for graph encoder analysis. Tables 5–15 present additional results on more graph properties, as
well as experiments on tree-structured graphs of size 15. In general, complete graph populations demonstrate significantly
better performance than trees – we can attribute that to the fact that GraphToken was trained on diverse sets of data, and
trees are somewhat out-of-distribution. Nevertheless, for all considered cases the best overall encoder model achieved better
results than naïve set encodings.
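The tables below report the kNN probing protocol of Section 5.1 (k = 5 on frozen encoder representations). A minimal scikit-learn sketch of such a probe, assuming the embeddings and graph-property labels have already been computed (the arrays here are random placeholders), is:

```python
# A small sketch of the kNN probe used for the graph encoder analysis; the
# embedding and label arrays are placeholders for real encoder outputs and a
# graph-property oracle.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

def probe(embeddings, labels, regression=False, k=5, seed=0):
    """Fit a kNN probe on graph embeddings and return its held-out score."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.25, random_state=seed)
    model = (KNeighborsRegressor if regression else KNeighborsClassifier)(n_neighbors=k)
    model.fit(x_tr, y_tr)
    return model.score(x_te, y_te)  # accuracy for classifiers, R^2 for regressors

# Stand-ins: embeddings of the 11,117 connected 8-node graphs from a trained
# encoder, plus a bipartiteness label per graph.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(11_117, 128))      # placeholder encoder outputs
is_bipartite = rng.integers(0, 2, size=11_117)   # placeholder property labels
print(f"bipartiteness probe score: {probe(embeddings, is_bipartite):.3f}")
```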


Table 5. Average local clustering coefficient MSE measured on all connected graphs with 8 nodes. We highlight the best performance per
training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 1.62 1.67 2.12 4.49 4.49 1.73 4.49 16.57 3.75
Non-linear

GIN 2.18 2.29 2.45 2.60 2.44 2.31 3.73 2.88 3.37
MPNN 1.03 0.95 1.38 0.81 1.50 1.34 1.68 1.87 1.47
HGT 2.63 2.25 2.08 1.23 2.49 2.17 1.90 1.62 2.52
MHA 2.69 1.01 1.23 0.96 1.56 1.25 2.08 1.59 1.29
Linear

Node Set 2.59 2.56 2.59 2.59 2.58 2.60 2.58 2.58 2.56
Edge Set 2.22 2.22 2.22 2.22 2.24 2.23 2.22 2.22 2.23

Table 6. Degree accuracy on all connected graphs with 8 nodes. We highlight the best performance per training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 57.46 56.65 52.46 40.09 40.09 57.42 40.09 15.73 40.26
Non-linear

GIN 56.86 56.30 54.55 48.75 55.59 57.56 40.14 50.81 44.83
MPNN 69.45 69.60 67.19 71.84 64.56 67.62 61.37 58.66 63.18
HGT 55.20 55.70 56.54 60.17 56.62 57.65 58.02 59.06 55.46
MHA 54.86 64.33 62.86 65.63 61.67 63.22 56.98 61.60 63.97
Linear

Node Set 54.66 54.91 54.98 55.06 54.78 54.64 54.50 54.94 54.72
Edge Set 63.48 63.37 63.07 63.55 63.08 63.37 63.47 63.06 63.44

Table 7. Diameter Accuracy on all connected graphs with 8 nodes. We highlight the best performance per training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 66.86 67.81 66.70 37.37 37.37 68.91 37.37 52.13 55.13
Non-linear

GIN 66.06 64.87 63.97 61.09 64.98 66.43 37.80 60.65 54.82
MPNN 76.92 76.86 73.63 78.33 74.78 77.18 74.42 69.56 76.23
HGT 63.97 65.24 66.88 70.45 65.30 68.45 69.64 68.97 66.04
MHA 63.76 74.17 76.00 74.03 73.50 74.71 68.45 69.32 72.95
Linear

Node Set 67.28 67.24 67.01 66.97 66.81 67.19 67.09 66.87 66.79
Edge Set 66.99 66.51 66.63 66.83 66.65 67.02 66.60 66.93 66.90

Table 8. k-Core Accuracy on all connected graphs with 8 nodes. We highlight the best performance per training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 69.49 69.15 66.61 58.33 58.33 69.16 58.33 25.18 61.55
Non-linear

GIN 68.03 65.98 64.85 62.67 66.74 67.84 58.84 63.34 59.08
MPNN 87.42 87.54 81.81 88.63 80.30 83.48 80.08 71.01 82.05
HGT 63.92 65.29 67.00 70.01 65.44 67.32 68.35 70.08 65.13
MHA 64.30 80.80 73.49 80.81 76.98 78.83 69.43 74.21 75.92
Linear

Node Set 68.23 68.74 68.50 68.71 68.07 67.99 68.85 68.17 68.70
Edge Set 66.30 65.78 65.58 66.15 65.76 65.91 65.94 65.77 65.71

Table 9. #edges Accuracy on all connected graphs with 8 nodes. We highlight the best performance per training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 38.91 39.19 35.94 11.60 11.60 40.24 11.60 2.19 14.58
Non-linear

GIN 38.13 37.33 36.57 31.66 37.74 38.34 11.88 31.45 25.92
MPNN 86.58 86.72 53.15 84.56 52.12 66.01 50.70 41.96 59.95
HGT 35.63 37.45 38.23 40.39 37.14 37.80 39.68 39.74 36.86
MHA 35.85 55.32 45.04 53.52 47.89 49.44 39.69 42.84 46.17
Linear

Node Set 40.06 40.14 39.40 40.15 39.97 39.72 39.88 39.79 39.89
Edge Set 37.93 38.11 38.05 37.92 38.05 37.67 37.64 37.82 37.91


Table 10. Planarity AUC on all connected graphs with 8 nodes. We highlight the best performance per training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 74.18 73.76 72.61 50.00 50.00 74.74 50.00 50.00 49.44
Non-linear

GIN 77.35 73.00 72.06 69.37 74.86 75.85 50.73 68.97 61.58
MPNN 86.14 86.52 84.16 86.64 83.74 85.17 84.32 77.84 85.55
HGT 69.24 71.41 71.02 74.07 71.47 72.20 72.20 73.59 71.55
MHA 69.96 80.87 78.35 80.46 81.53 81.21 74.98 78.29 80.58
Linear

Node Set 78.41 78.76 78.86 78.82 78.18 78.54 78.72 78.76 78.78
Edge Set 72.17 71.64 72.06 72.20 71.93 72.11 72.01 72.27 72.01

Table 11. Shortest path MSE on all connected graphs with 8 nodes. We highlight the best performance per training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 2.27 2.24 2.31 6.07 6.07 2.06 6.07 11.09 3.75
Non-linear

GIN 2.57 2.77 2.83 2.93 2.52 2.54 4.84 3.09 3.61
MPNN 0.29 0.29 0.76 0.31 0.71 0.49 0.75 1.58 0.51
HGT 3.03 2.64 2.27 1.60 2.60 2.14 1.80 1.95 2.81
MHA 3.04 0.71 0.95 0.78 1.01 0.74 1.74 1.55 1.05
Linear

Node Set 2.35 2.35 2.35 2.36 2.36 2.35 2.34 2.36 2.34
Edge Set 2.99 2.99 2.99 2.99 2.97 2.97 2.99 2.99 2.99

Table 12. # of triangles MSE on all connected graphs with 8 nodes. We highlight the best performance per training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 132.94 129.03 164.53 316.07 316.07 127.17 316.07 690.03 293.53
Non-linear

GIN 152.13 168.35 182.95 201.64 169.71 156.16 251.23 200.45 251.65
MPNN 8.33 7.51 32.08 4.56 51.90 27.18 51.04 124.89 41.73
HGT 191.14 170.71 165.88 126.92 172.84 160.29 156.10 136.22 175.45
MHA 197.36 30.27 96.56 27.10 59.58 52.42 138.48 80.22 60.72
Linear

Node Set 167.81 168.72 167.33 167.40 167.90 167.96 168.57 169.38 166.13
Edge Set 181.44 181.21 181.18 181.32 180.86 179.44 181.08 181.68 181.40

Table 13. Degree Accuracy on all trees with 15 nodes. We highlight the best performance per training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 53.57 55.15 55.24 25.91 25.91 54.86 25.91 11.08 36.51
Non-linear

GIN 60.35 58.79 56.36 55.11 59.88 68.04 42.01 66.72 55.25
MPNN 79.37 78.36 59.18 72.35 62.38 65.90 57.37 57.33 58.45
HGT 54.88 55.33 55.34 58.65 54.33 58.84 57.27 57.43 55.34
MHA 59.17 61.61 60.38 57.18 54.99 61.00 52.29 58.56 53.95
Linear

Node Set 65.64 66.32 65.93 66.10 66.13 65.95 66.28 66.22 65.82
Edge Set 69.59 69.87 69.44 69.40 69.86 69.56 69.32 69.55 69.66

Table 14. Diameter Accuracy on all trees with 15 nodes. We highlight the best performance per training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 50.77 50.36 49.54 25.97 25.97 50.01 25.97 6.77 26.64
Non-linear

GIN 58.29 54.44 52.24 49.41 51.47 59.62 24.11 58.77 46.27
MPNN 54.24 54.68 54.97 59.29 67.65 63.80 54.13 52.05 59.48
HGT 57.15 54.88 54.90 57.58 57.05 65.22 54.51 58.70 53.07
MHA 53.95 56.63 60.41 54.62 53.39 56.07 52.85 55.17 51.70
Linear

Node Set 61.89 62.68 62.74 62.36 61.99 61.93 62.34 62.49 62.40
Edge Set 56.57 56.19 56.27 56.83 56.25 56.53 56.31 56.72 56.84


Table 15. Shortest path MSE on all trees with 15 nodes. We highlight the best performance per training task in columns.
Graph Tasks Node Tasks Edge Tasks
Method Node count Edge count Cycle check Triangle counting Node degree Connected nodes Reachability Edge existence Shortest path
GCN 12.95 12.31 12.62 26.17 26.17 12.22 26.17 49.78 21.71
Non-linear

GIN 9.57 10.69 11.32 11.88 11.03 8.37 19.35 9.76 14.39
MPNN 4.19 4.54 9.82 4.92 6.87 6.10 11.06 12.10 11.01
HGT 10.57 10.96 11.65 9.09 12.56 8.17 10.76 9.26 10.98
MHA 10.49 9.88 9.51 11.22 12.75 10.52 13.31 10.09 12.78
Linear

Node Set 10.20 10.05 10.13 10.11 10.17 10.21 10.07 10.18 10.03
Edge Set 9.92 9.87 9.92 9.93 9.88 9.88 10.01 9.91 9.87

