Figure 1: (a) HetG examples: an academic graph and a review graph. (b) Challenges of graph neural network for HetG: C1 - sampling heterogeneous neighbors (for node a in this case, node colors denote different types); C2 - encoding heterogeneous contents; C3 - aggregating heterogeneous neighbors.

Graph neural networks such as GCN [12], GraphSAGE [7], and GAT [31] employ the convolutional operator, the LSTM architecture, and the self-attention mechanism, respectively, to aggregate feature information of neighboring nodes. The advances and applications of GNNs are largely concentrated on homogeneous graphs. Current state-of-the-art GNNs have not well solved the following challenges faced for HetG, which we address in this paper.

• (C1) Many nodes in HetG may not connect to all types of neighbors. In addition, the number of neighboring nodes varies from node to node. For example, in Figure 1(a), any author node has no direct connection to a venue node. Meanwhile, in Figure 1(b), node a has 5 direct neighbors while node c only has 2. Most existing GNNs only aggregate feature information of direct (first-order) neighboring nodes, and the feature propagation process may weaken the effect of farther neighbors. Moreover, the embedding generation of a "hub" node is impaired by weakly correlated ("noise") neighbors, and the embedding of a "cold-start" node is not sufficiently represented due to limited neighbor information. Thus challenge 1 is: how to sample heterogeneous neighbors that are strongly correlated to embedding generation for each node in HetG, as indicated by C1 in Figure 1(b)?

• (C2) A node in HetG can carry unstructured heterogeneous contents, e.g., attributes, text, or image. In addition, the content associated with different types of nodes can be different. For example, in Figure 1(b), type-1 nodes (e.g., b or c) contain attributes and text content, type-2 nodes (e.g., f or g) carry attributes and image, and type-k nodes (e.g., d or e) are associated with text and image. The direct concatenation operation or linear transformation used by current GNNs cannot model "deep" interactions among node heterogeneous contents. Moreover, it is not applicable to use the same feature transformation function for all node types, as their contents vary from each other. Thus challenge 2 is: how to design a node content encoder that addresses the content heterogeneity of different nodes in HetG, as indicated by C2 in Figure 1(b)?

• (C3) Different types of neighbors contribute differently to the node embeddings in HetG. For example, in the academic graph of Figure 1(a), author and paper neighbors should have more influence on the embedding of an author node, as a venue node contains diverse topics and thus has a more general embedding. Most current GNNs mainly focus on homogeneous graphs and do not consider node type impact. Thus challenge 3 is: how to aggregate feature information of heterogeneous neighbors by considering the impacts of different node types, as indicated by C3 in Figure 1(b)?

To solve these challenges, we propose HetGNN, a heterogeneous graph neural network model for representation learning in HetG. First, we design a random walk with restart based strategy to sample fixed-size, strongly correlated heterogeneous neighbors of each node in HetG and group them according to node types. Next, we design a heterogeneous graph neural network architecture with two modules to aggregate the feature information of the neighbors sampled in the previous step. The first module employs a recurrent neural network to encode "deep" feature interactions of heterogeneous contents and obtains the content embedding of each node. The second module utilizes another recurrent neural network to aggregate the content embeddings of different neighboring groups, which are further combined by an attention mechanism that measures the different impacts of heterogeneous node types and obtains the ultimate node embedding. Finally, we leverage a graph context loss and a mini-batch gradient descent procedure to train the model. To summarize, the main contributions of our work are:

• We formalize the problem of heterogeneous graph representation learning, which involves both graph structure heterogeneity and node content heterogeneity.

• We propose an innovative heterogeneous graph neural network model, i.e., HetGNN, for representation learning on HetG. HetGNN is able to capture both structure and content heterogeneity and is useful for both transductive and inductive tasks. Table 1 summarizes the key advantages of HetGNN compared to a number of recent models, including homogeneous, heterogeneous, and attributed graph models as well as graph neural network models.

• We conduct extensive experiments on several public datasets, and our results demonstrate the superior performance of HetGNN over state-of-the-art baselines on numerous graph mining tasks, including link prediction, recommendation, node classification & clustering, and inductive node classification & clustering.
2 PROBLEM DEFINITION

In this section, we introduce the concept of content-associated heterogeneous graphs that will be used in the paper and then formally define the problem of heterogeneous graph representation learning.

Definition 2.1. Content-associated Heterogeneous Graphs. A content-associated heterogeneous graph (C-HetG) is defined as a graph G = (V, E, O_V, R_E) with multiple types of nodes V and links E. O_V and R_E represent the set of object types and that of relation types, respectively. In addition, each node is associated with heterogeneous contents, e.g., attributes, text, or image.

The academic graph in Figure 1(a) is a C-HetG. The node types O_V include author, paper, and venue. The link types R_E include author-write-paper, paper-cite-paper, and paper-publish-venue. Besides, the author or venue node is associated with the paper abstracts written by the author or included in the venue, and the paper node contains abstract, references, as well as venue. The bipartite review graph in Figure 1(a) is also a C-HetG as |O_V| + |R_E| ≥ 3, where O_V includes user and item, and the relation R_E indicates review behavior. The user node is associated with the reviews written by the user, and the item node contains title, description, and picture.

Problem 1. Heterogeneous Graph Representation Learning. Given a C-HetG G = (V, E, O_V, R_E) with node content set C, the task is to design a model F_Θ with parameters Θ to learn d-dimensional embeddings E ∈ R^{|V|×d} (d ≪ |V|) that are able to encode both heterogeneous structural closeness and heterogeneous unstructured contents among them. The node embeddings can be utilized in various graph mining tasks, such as link prediction, recommendation, multi-label classification, and node clustering.
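To make Definition 2.1 and Problem 1 concrete, the sketch below shows one minimal way a C-HetG could be held in memory. The container name CHetG, its methods, and the toy nodes are illustrative assumptions rather than anything prescribed by the paper; the node and link types follow the academic graph of Figure 1(a).

```python
from collections import defaultdict

class CHetG:
    """Minimal container for a content-associated heterogeneous graph (Definition 2.1):
    typed nodes, typed links, and per-node heterogeneous content."""
    def __init__(self):
        self.node_type = {}               # node id -> object type in O_V
        self.content = {}                 # node id -> dict of content items (text, image, attributes, ...)
        self.adj = defaultdict(list)      # node id -> list of (relation type in R_E, neighbor id)

    def add_node(self, nid, ntype, content):
        self.node_type[nid] = ntype
        self.content[nid] = content

    def add_edge(self, src, dst, rtype):
        self.adj[src].append((rtype, dst))
        self.adj[dst].append((rtype, src))    # store links in both directions for neighbor access

# Tiny academic C-HetG in the spirit of Figure 1(a): author/paper/venue nodes, write/publish links.
g = CHetG()
g.add_node("a1", "author", {"text": "abstracts of a1's papers"})
g.add_node("p1", "paper", {"text": "abstract of p1"})
g.add_node("v1", "venue", {"text": "abstracts of papers published in v1"})
g.add_edge("a1", "p1", "write")
g.add_edge("p1", "v1", "publish")
```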
3 HetGNN

In this section, we formally present HetGNN to resolve the three challenges described in Section 1. HetGNN consists of four parts: (1) sampling heterogeneous neighbors; (2) encoding node heterogeneous contents; (3) aggregating heterogeneous neighbors; (4) formulating the objective and designing the model training procedure. Figure 2 illustrates the framework of HetGNN.

3.1 Sampling Heterogeneous Neighbors (C1)

The key idea of most graph neural networks (GNNs) is to aggregate feature information from a node's direct (first-order) neighbors, such as GraphSAGE [7] or GAT [31]. However, directly applying these approaches to heterogeneous graphs may raise several issues:

• They cannot directly capture feature information from different types of neighbors. For example, authors do not directly connect to other author or venue neighbors in Fig. 1(a), which could lead to insufficient representation.

• They are weakened by varying neighbor sizes. Some authors write many papers while others have only a few papers in the academic graph. Some items are reviewed by many users while others receive little feedback in the review graph. The embedding of a "hub" node could be impaired by weakly correlated neighbors, and a "cold-start" node embedding may not be sufficiently represented.

• They are not suitable for aggregating heterogeneous neighbors, which have different content features. Heterogeneous neighbors may require different feature transformations to deal with different feature types and dimensions.

In light of these issues, and to solve challenge C1, we design a heterogeneous neighbors sampling strategy based on random walk with restart (RWR). It contains two consecutive steps (a code sketch follows at the end of this subsection):

• Step-1: Sampling fixed length RWR. We start a random walk from node v ∈ V. The walk iteratively travels to the neighbors of the current node or returns to the starting node with a probability p. RWR runs until it successfully collects a fixed number of nodes, denoted as RWR(v). Note that the numbers of different types of nodes in RWR(v) are constrained to ensure that all node types are sampled for v.

• Step-2: Grouping different types of neighbors. For each node type t, we select the top k_t nodes from RWR(v) according to frequency and take them as the set of t-type correlated neighbors of node v.

This strategy is able to avoid the aforementioned issues because: (1) RWR collects all types of neighbors for each node; (2) the sampled neighbor size of each node is fixed and the most frequently visited neighbors are selected; (3) neighbors of the same type (having the same content features) are grouped such that type-based aggregation can be deployed. Next, we design a heterogeneous graph neural network architecture with two modules to aggregate the feature information of the sampled heterogeneous neighbors for each node.
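The two-step procedure above can be sketched as follows, reusing the illustrative CHetG container from Section 2. The restart probability p, the walk budget, and the per-type counts k_t are the quantities named in Step-1 and Step-2, but the concrete default values, and the omission of Step-1's per-type constraint, are simplifications for readability rather than the paper's exact implementation.

```python
import random
from collections import Counter, defaultdict

def sample_het_neighbors(g, v, walk_len=100, p=0.5, k_per_type=None):
    """Step-1: random walk with restart (RWR) from v until walk_len neighbor visits are collected.
    Step-2: group the visited nodes by type and keep the top-k_t most frequent ones per type."""
    k_per_type = k_per_type or {"author": 10, "paper": 10, "venue": 3}   # illustrative budgets
    if not g.adj[v]:
        return {t: [] for t in k_per_type}        # isolated node: nothing to sample
    cur, visits = v, []
    while len(visits) < walk_len:
        if random.random() < p or not g.adj[cur]:
            cur = v                               # return to the starting node with probability p
        else:
            cur = random.choice(g.adj[cur])[1]    # otherwise travel to a random neighbor
            if cur != v:
                visits.append(cur)                # record the visit (RWR(v))
    freq = defaultdict(Counter)                   # node type -> visit counts
    for u in visits:
        freq[g.node_type[u]][u] += 1
    return {t: [u for u, _ in freq[t].most_common(k)] for t, k in k_per_type.items()}

# e.g. sample_het_neighbors(g, "a1") -> {"author": [...], "paper": [...], "venue": [...]}
```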
3.2 Encoding Heterogeneous Contents (C2)

To solve challenge C2, we design a module to extract the heterogeneous contents C_v from node v ∈ V and encode them as a fixed-size embedding via a neural network f_1. Specifically, we denote the feature representation of the i-th content in C_v as x_i ∈ R^{d_f×1} (d_f: content feature dimension). Note that x_i can be pre-trained using different techniques w.r.t. different types of contents. For example, we can utilize Par2Vec [13] to pre-train text content or employ CNNs [17] to pre-train image content. Unlike previous models [7, 31] that concatenate different content features directly or linearly transform them into a unified vector, we design a new architecture based on the bi-directional LSTM (Bi-LSTM) [9] to capture "deep" feature interactions and obtain larger expressive capability. Formally, the content embedding of v is computed as follows:

    f_1(v) = \frac{\sum_{i \in C_v} \left[ \overrightarrow{\mathrm{LSTM}}\{\mathcal{FC}_{\theta_x}(x_i)\} \oplus \overleftarrow{\mathrm{LSTM}}\{\mathcal{FC}_{\theta_x}(x_i)\} \right]}{|C_v|}    (1)

where f_1(v) ∈ R^{d×1} (d: content embedding dimension), FC_{θ_x} denotes the feature transformer, which can be the identity (no transformation), a fully connected neural network with parameters θ_x, etc., and the ⊕ operator denotes concatenation. The LSTM is formulated as:

    z_i = \sigma\big(U_z \mathcal{FC}_{\theta_x}(x_i) + W_z h_{i-1} + b_z\big)
    f_i = \sigma\big(U_f \mathcal{FC}_{\theta_x}(x_i) + W_f h_{i-1} + b_f\big)
    o_i = \sigma\big(U_o \mathcal{FC}_{\theta_x}(x_i) + W_o h_{i-1} + b_o\big)    (2)
    \hat{c}_i = \tanh\big(U_c \mathcal{FC}_{\theta_x}(x_i) + W_c h_{i-1} + b_c\big)
    c_i = f_i \circ c_{i-1} + z_i \circ \hat{c}_i
    h_i = \tanh(c_i) \circ o_i
where h_i ∈ R^{(d/2)×1} is the output hidden state of the i-th content, ◦ denotes the Hadamard product, U_j ∈ R^{(d/2)×d_f}, W_j ∈ R^{(d/2)×(d/2)}, and b_j ∈ R^{(d/2)×1} (j ∈ {z, f, o, c}) are learnable parameters, and z_i, f_i, and o_i are the input gate vector, forget gate vector, and output gate vector of the i-th content feature, respectively. To be more specific, the above architecture first uses different FC layers to transform different content features, then employs the Bi-LSTM to capture "deep" feature interactions and accumulate the expression capability of all content features, and finally utilizes a mean pooling layer over all hidden states to obtain the general content embedding of v, as illustrated in Figure 2(b). Note that the Bi-LSTM operates on an unordered content set C_v, which is inspired by previous work [7] for aggregating unordered neighbors. Besides, we use different Bi-LSTMs to aggregate content features for different types of nodes, as their contents vary from each other.

Figure 2: (a) The overall architecture of HetGNN: it first samples fixed-size heterogeneous neighbors for each node (node a in this case), next encodes each node's content embedding via NN-1, then aggregates the content embeddings of the sampled heterogeneous neighbors through NN-2 and NN-3, and finally optimizes the model via a graph context loss; (b) NN-1: node heterogeneous contents encoder; (c) NN-2: type-based neighbors aggregator; (d) NN-3: heterogeneous types combination.

There are three main advantages of this encoding architecture: (1) it has a concise structure with relatively low complexity (fewer parameters), making model implementation and tuning relatively easy; (2) it is capable of fusing the heterogeneous content information, leading to strong expression capability; (3) it is flexible to add extra content features, making model extension convenient.
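A minimal PyTorch sketch of this encoder (NN-1, Eq. (1)) is given below: pre-trained content features go through an FC transformer, a Bi-LSTM over the unordered content set, and mean pooling. The single shared FC layer and the layer sizes are simplifying assumptions for brevity; as described above, the full model uses different FC transformers for different content features and different Bi-LSTMs for different node types.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """NN-1 sketch (Eq. (1)): transform each pre-trained content feature with an FC layer,
    run a Bi-LSTM over the (unordered) content set, and mean-pool the hidden states."""
    def __init__(self, d_f=128, d=128):
        super().__init__()
        self.fc = nn.Linear(d_f, d_f)                              # feature transformer FC_{theta_x}
        self.bilstm = nn.LSTM(d_f, d // 2, bidirectional=True, batch_first=True)

    def forward(self, x):                                          # x: [num_contents, d_f] for one node
        h, _ = self.bilstm(self.fc(x).unsqueeze(0))                # h: [1, num_contents, d]
        return h.mean(dim=1).squeeze(0)                            # f_1(v): [d]

# e.g. f1_v = ContentEncoder()(torch.randn(3, 128))   # three content features of one node
```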
3.3 Aggregating Heterogeneous Neighbors (C3)

To aggregate the content embeddings (obtained in Section 3.2) of heterogeneous neighbors for each node and solve challenge C3, we design another module, which is a type-based neural network. It includes two consecutive steps: (1) same type neighbors aggregation; (2) types combination.

3.3.1 Same Type Neighbors Aggregation.
In Section 3.1, we use the RWR-based strategy to sample fixed-size neighbor sets of different node types for each node. Accordingly, we denote the t-type sampled neighbor set of v ∈ V as N_t(v). Then, we employ a neural network f_2^t to aggregate the content embeddings of v′ ∈ N_t(v). Formally, the aggregated t-type neighbor embedding for v is formulated as follows:

    f_2^t(v) = \mathcal{AG}^t_{v' \in N_t(v)}\big[f_1(v')\big]    (3)

where f_2^t(v) ∈ R^{d×1} (d: aggregated content embedding dimension), f_1(v′) is the content embedding of v′ generated by the module in Section 3.2, and AG^t is the t-type neighbors aggregator, which can be a fully connected neural network, a convolutional neural network, a recurrent neural network, etc. In this work, we use the Bi-LSTM since it yields better performance in practice. Thus we re-formulate f_2^t(v) as follows:

    f_2^t(v) = \frac{\sum_{v' \in N_t(v)} \left[ \overrightarrow{\mathrm{LSTM}}\{f_1(v')\} \oplus \overleftarrow{\mathrm{LSTM}}\{f_1(v')\} \right]}{|N_t(v)|}    (4)

where the LSTM module has the same formulation as Eq. (2) except for its input and parameter set. That is, we employ a Bi-LSTM to aggregate the content embeddings of all t-type neighbors and use the average over all hidden states to represent the general aggregated embedding, as illustrated in Figure 2(c). We use different Bi-LSTMs to distinguish different node types for neighbor aggregation. Note that the Bi-LSTM operates on an unordered neighbor set, which is inspired by GraphSAGE [7].
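A matching sketch of the type-based neighbor aggregator (NN-2, Eq. (4)) follows; the class name and dimensions are illustrative, and one instance with its own parameters would be kept for each node type t.

```python
import torch
import torch.nn as nn

class TypeNeighborAggregator(nn.Module):
    """NN-2 sketch (Eq. (4)): Bi-LSTM over the content embeddings f_1(v') of the sampled
    t-type neighbors N_t(v), followed by mean pooling over all hidden states."""
    def __init__(self, d=128):
        super().__init__()
        self.bilstm = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)

    def forward(self, neighbor_embs):                    # [k_t, d] content embeddings of N_t(v)
        h, _ = self.bilstm(neighbor_embs.unsqueeze(0))   # [1, k_t, d]
        return h.mean(dim=1).squeeze(0)                  # f_2^t(v): [d]
```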
3.3.2 Types Combination.
The previous step generates |O_V| (O_V: set of node types in the graph) aggregated embeddings for node v. To combine these type-based neighbor embeddings with v's content embedding, we employ the attention mechanism [31]. The motivation is that different types of neighbors will make different contributions to the final representation of v. Thus the output embedding is formulated as:

    \mathcal{E}_v = \alpha^{v,v} f_1(v) + \sum_{t \in O_V} \alpha^{v,t} f_2^t(v)    (5)
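A sketch of this combination step (NN-3, Eq. (5)) is shown below. The excerpt does not include the formula for the attention weights α, so the scoring function used here (a linear layer with LeakyReLU over the concatenation of each candidate embedding with f_1(v), normalized by a softmax) is only a placeholder assumption in the spirit of [31].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TypesCombiner(nn.Module):
    """NN-3 sketch (Eq. (5)): mix the node's own content embedding f_1(v) with the |O_V|
    type-based neighbor embeddings f_2^t(v) using attention weights alpha that sum to 1."""
    def __init__(self, d=128):
        super().__init__()
        self.attn = nn.Linear(2 * d, 1)            # scores [candidate || f_1(v)]; placeholder form

    def forward(self, f1_v, f2_list):              # f1_v: [d]; f2_list: list of [d] tensors, one per type
        cands = torch.stack([f1_v] + f2_list)      # [|O_V| + 1, d]
        pair = torch.cat([cands, f1_v.expand_as(cands)], dim=1)
        alpha = torch.softmax(F.leaky_relu(self.attn(pair)), dim=0)   # [|O_V| + 1, 1]
        return (alpha * cands).sum(dim=0)          # E_v: [d]
```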
... noise distribution w.r.t. the t-type nodes. In this model, we set M = 1, as it makes little impact when M > 1. Thus Eq. (9) degenerates to the cross entropy loss:

    \log \sigma(\mathcal{E}_{v_c} \cdot \mathcal{E}_v) + \log \sigma(-\mathcal{E}_{v_{c'}} \cdot \mathcal{E}_v)    (10)

In other words, for each context node v_c of v, we sample a negative node v_{c'} according to P_t(v_{c'}). Therefore, we can reformulate the objective o_1 in Eq. (7) as follows:

    o_2 = \sum_{\langle v, v_c, v_{c'} \rangle \in T_{walk}} \log \sigma(\mathcal{E}_{v_c} \cdot \mathcal{E}_v) + \log \sigma(-\mathcal{E}_{v_{c'}} \cdot \mathcal{E}_v)    (11)

² http://jmcauley.ucsd.edu/data/amazon/index.html
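The objective in Eq. (11) can be sketched as a standard negative-sampling cross-entropy loss over the walk triples ⟨v, v_c, v_c′⟩ in T_walk. In the snippet below, E stands for the HetGNN output embeddings of the nodes in a mini-batch, indexed by integer ids; this tensor layout, and minimizing the negative of o_2 with a mini-batch optimizer, are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn.functional as F

def graph_context_loss(E, center, context, negative):
    """Negative of Eq. (11): for walk triples <v, v_c, v_c'>, pull the context embedding
    toward E_v and push the sampled negative away (log-sigmoid / cross-entropy form)."""
    e_v, e_c, e_neg = E[center], E[context], E[negative]   # each: [batch, d]
    pos = F.logsigmoid((e_c * e_v).sum(dim=1))             # log sigma(E_vc . E_v)
    neg = F.logsigmoid(-(e_neg * e_v).sum(dim=1))          # log sigma(-E_vc' . E_v)
    return -(pos + neg).mean()                             # minimize with mini-batch gradient descent

# e.g. loss = graph_context_loss(E, v_idx, vc_idx, vneg_idx); loss.backward()
```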
Table 2: Datasets used in this work.

Data                | Node                                              | Edge
Academic I (A-I)    | # author: 160,713; # paper: 111,409; # venue: 150 | # author-paper: 295,103; # paper-paper: 138,464; # paper-venue: 111,409
Academic II (A-II)  | # author: 28,646; # paper: 21,044; # venue: 18    | # author-paper: 69,311; # paper-paper: 46,931; # paper-venue: 21,044
Movies Review (R-I) | # user: 18,340; # item: 56,361                    | # user-item: 629,125
CDs Review (R-II)   | # user: 16,844; # item: 106,892                   | # user-item: 555,050

Table 3: Link prediction results. Split notation in data denotes train/test data split years or ratios.

Data split          | Metric | MP2V [4] | ASNE [15] | SHNE [34] | GSAGE [7] | GAT [31] | HetGNN
A-I 2003 (type-1)   | AUC    | 0.636    | 0.683     | 0.696     | 0.694     | 0.701    | 0.714
A-I 2003 (type-1)   | F1     | 0.435    | 0.584     | 0.597     | 0.586     | 0.606    | 0.620
A-I 2003 (type-2)   | AUC    | 0.790    | 0.794     | 0.781     | 0.790     | 0.821    | 0.837
A-I 2003 (type-2)   | F1     | 0.743    | 0.774     | 0.755     | 0.746     | 0.792    | 0.815
A-I 2002 (type-1)   | AUC    | 0.626    | 0.667     | 0.688     | 0.681     | 0.691    | 0.710
A-I 2002 (type-1)   | F1     | 0.412    | 0.554     | 0.590     | 0.567     | 0.589    | 0.615
A-I 2002 (type-2)   | AUC    | 0.808    | 0.782     | 0.795     | 0.806     | 0.837    | 0.851
A-I 2002 (type-2)   | F1     | 0.770    | 0.753     | 0.761     | 0.772     | 0.816    | 0.828
A-II 2013 (type-1)  | AUC    | 0.596    | 0.689     | 0.683     | 0.695     | 0.678    | 0.717
A-II 2013 (type-1)  | F1     | 0.348    | 0.643     | 0.639     | 0.615     | 0.613    | 0.669
A-II 2013 (type-2)  | AUC    | 0.712    | 0.721     | 0.695     | 0.714     | 0.732    | 0.767
A-II 2013 (type-2)  | F1     | 0.647    | 0.713     | 0.674     | 0.664     | 0.705    | 0.754
A-II 2012 (type-1)  | AUC    | 0.586    | 0.671     | 0.672     | 0.676     | 0.655    | 0.701
A-II 2012 (type-1)  | F1     | 0.318    | 0.615     | 0.612     | 0.573     | 0.560    | 0.642
A-II 2012 (type-2)  | AUC    | 0.724    | 0.726     | 0.706     | 0.739     | 0.750    | 0.775
A-II 2012 (type-2)  | F1     | 0.664    | 0.737     | 0.692     | 0.706     | 0.715    | 0.757
R-I 5:5             | AUC    | 0.634    | 0.623     | 0.651     | 0.661     | 0.683    | 0.749
R-I 5:5             | F1     | 0.445    | 0.551     | 0.586     | 0.542     | 0.665    | 0.735
R-I 7:3             | AUC    | 0.701    | 0.656     | 0.695     | 0.716     | 0.706    | 0.787
R-I 7:3             | F1     | 0.595    | 0.613     | 0.660     | 0.688     | 0.702    | 0.776
R-II 5:5            | AUC    | 0.678    | 0.655     | 0.685     | 0.677     | 0.712    | 0.736
R-II 5:5            | F1     | 0.541    | 0.582     | 0.593     | 0.565     | 0.659    | 0.701
R-II 7:3            | AUC    | 0.737    | 0.695     | 0.728     | 0.721     | 0.742    | 0.772
R-II 7:3            | F1     | 0.660    | 0.648     | 0.685     | 0.653     | 0.713    | 0.749

... Table 2 (see Section A.2 in the supplement for details of these datasets). Note that HetGNN is flexible enough to be applied to other HetG.

4.1.2 Baselines.
We use five baselines, including the heterogeneous graph embedding model metapath2vec [4] (denoted MP2V), the attributed graph models ASNE [15] and SHNE [34], as well as the graph neural network models GraphSAGE [7] (denoted GSAGE) and GAT [31] (see Section A.3 in the supplement for the detailed settings of these baseline methods).

4.1.3 Reproducibility.
For the proposed model, the embedding dimension is set to 128. The size of the sampled neighbor set (in Section 3.1) equals 23 (10, 10, 3 for the author, paper, and venue neighbor groups, respectively) in the academic data. This value equals 20 (10, 10 for the user and item neighbor groups, respectively) in the review data. We use Par2Vec [19] and a CNN [17] to pre-train text and image features, respectively. Besides, DeepWalk [20] is employed to pre-train node embeddings. The nodes in the academic data are associated with text (paper abstract) features and pre-trained node embeddings, while the nodes in the review data include text (item description) and image (item picture) features as well as pre-trained node embeddings. Section A.4 of the supplement contains more detailed settings. We employ PyTorch³ to implement HetGNN and conduct experiments on GPU. Code is available at: https://github.com/chuxuzhang/KDD2019_HetGNN.

³ https://pytorch.org/

4.2 Applications
4.2.1 Link Prediction (RQ1-1).
Which links will happen in the future? To answer RQ1-1, we design experiments to evaluate HetGNN on several link prediction tasks.

Setting. Unlike previous work [6] that randomly samples a portion of links for training and uses the remaining ones for evaluation, we consider a more practical setting that splits training and test data sequentially. Specifically, first, the graph of the training data is utilized to learn node embeddings, and the corresponding links are used to train a binary logistic classifier. Then, the test relations together with an equal number of random negative (non-connected) links are used to evaluate the trained classifier. In addition, only new links among nodes in the training data are considered, and duplicated links are removed from evaluation. The link embedding is formed by the element-wise multiplication of the embeddings of the two edge nodes. We use AUC and F1 scores as evaluation metrics. In the academic data, we consider two types of links: (type-1) collaboration between two authors and (type-2) citation between an author and a paper. The data before Ts (split year) is training data, otherwise test data. Ts of the A-I data is set to 2003 and 2002. The value for the A-II data is set to 2013 and 2012. In the review data, we consider user-item review links and divide training/test data sequentially. The train/test ratio (in terms of review number) is set to 7:3 and 5:5 for both the R-I and R-II data.
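A sketch of this evaluation protocol: link features are the element-wise product of the two node embeddings, a binary logistic classifier is trained on the training links, and AUC/F1 are computed on the test links (with their random negative counterparts). Function and array names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

def evaluate_link_prediction(emb, train_pairs, train_y, test_pairs, test_y):
    """Link embedding = element-wise product of the two node embeddings; train a binary
    logistic classifier on training links and report AUC / F1 on test links."""
    def link_feats(pairs):
        return np.array([emb[u] * emb[v] for u, v in pairs])
    clf = LogisticRegression(max_iter=1000).fit(link_feats(train_pairs), train_y)
    prob = clf.predict_proba(link_feats(test_pairs))[:, 1]
    return roc_auc_score(test_y, prob), f1_score(test_y, (prob > 0.5).astype(int))
```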
Result. The performances of all models are reported in Table 3, where the best results are highlighted in bold. According to this table: (a) the best baselines in most cases are attributed graph embedding methods or graph neural network models, showing that incorporating node attributes or employing a deep neural network generates desirable node embeddings for link prediction; (b) HetGNN outperforms all baselines in all cases, especially on the review data. The relative improvements (%) over the best baselines range from 1.5% to 5.6% and from 3.4% to 10.5% for the academic data and review data, respectively. This demonstrates that the proposed heterogeneous graph neural network framework is effective and obtains better node embeddings (than the baselines) for link prediction.

4.2.2 Recommendation (RQ1-2).
Which nodes should be recommended to the target node? To answer RQ1-2, we design an experiment to evaluate HetGNN on the personalized node recommendation task.

Setting. The concept of node recommendation is similar to link prediction apart from the experimental settings and evaluation metrics. To distinguish it from the previous link prediction task, we evaluate venue recommendation (author-venue link) performance in the academic data. Specifically, the graph of the training data is utilized to learn node embeddings. The ground truth of recommendation is based on the author's appearance (having papers) in a venue of the test data. The preference score is defined as the inner product between the embeddings of the two nodes. We use Recall (Rec), Precision (Pre), and F1 scores of the top-k recommendation list as the evaluation metrics. In addition, duplicated author-venue pairs are removed from evaluation. The reported score is the average value over all evaluated authors. As in the link prediction task, the train/test split year Ts for the A-I data is set to 2003 and 2002. The value for the A-II data is set to 2013 and 2012. Besides, k is set to 5 and 3 for the two datasets, respectively.
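A minimal sketch of this protocol: preference scores are inner products of the author and venue embeddings, and Recall/Precision/F1 are computed over the resulting top-k list. Names and shapes are illustrative.

```python
import numpy as np

def recommend_topk(author_emb, venue_embs, venue_ids, k=5):
    """Rank candidate venues for one author by inner-product preference score."""
    scores = venue_embs @ author_emb            # [num_venues]
    top = np.argsort(-scores)[:k]
    return [venue_ids[i] for i in top]

def rec_pre_f1(recommended, ground_truth, k):
    """Recall / Precision / F1 of a top-k list against the author's true venues in the test data."""
    hits = len(set(recommended) & set(ground_truth))
    rec = hits / max(len(ground_truth), 1)
    pre = hits / k
    return rec, pre, (2 * rec * pre / (rec + pre) if rec + pre > 0 else 0.0)
```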
Table 4: Recommendation results.

Data split | Metric | MP2V [4] | ASNE [15] | SHNE [34] | GSAGE [7] | GAT [31] | HetGNN
A-I 2002   | Rec    | 0.144    | 0.152     | 0.279     | 0.231     | 0.274    | 0.293
A-I 2002   | Pre    | 0.046    | 0.050     | 0.086     | 0.073     | 0.087    | 0.093
A-I 2002   | F1     | 0.070    | 0.075     | 0.134     | 0.112     | 0.132    | 0.141
A-II 2013  | Rec    | 0.516    | 0.419     | 0.608     | 0.540     | 0.568    | 0.625
A-II 2013  | Pre    | 0.207    | 0.174     | 0.241     | 0.219     | 0.230    | 0.252
A-II 2013  | F1     | 0.295    | 0.333     | 0.345     | 0.312     | 0.327    | 0.359
A-II 2012  | Rec    | 0.468    | 0.382     | 0.552     | 0.512     | 0.518    | 0.606
A-II 2012  | Pre    | 0.204    | 0.171     | 0.233     | 0.224     | 0.227    | 0.264
A-II 2012  | F1     | 0.284    | 0.236     | 0.327     | 0.312     | 0.316    | 0.368

Result. The results of the different models are reported in Table 4. The best results are highlighted in bold. According to this table, the best baselines are attributed graph embedding methods or graph neural network models in most cases. In addition, HetGNN performs best in all cases. The relative improvements (%) over the best baseline range from 2.8% to 16.0%, showing that HetGNN is effective and can learn better node embeddings (than the baselines) for node recommendation.

4.2.3 Classification and Clustering (RQ1-3).
Which class/cluster does this node belong to? To answer RQ1-3, we design experiments to evaluate HetGNN on multi-label classification and node clustering tasks.

Setting. Similar to metapath2vec [4], we match authors in the A-II dataset with four selected research domains, i.e., Data Mining (DM), Computer Vision (CV), Natural Language Processing (NLP), and Database (DB). Specifically, we choose three top venues⁴ for each area. Each author is labeled with the area of the majority of his/her publications (authors without papers in these venues are excluded from evaluation). The node embeddings are learned from the full dataset. For the multi-label classification task, the learned node embeddings are used as the input to a logistic regression classifier. Besides, the size (ratio) of the training data is set to 10% and 30%, and the remaining nodes are used for testing. We use both Micro-F1 and Macro-F1 as evaluation metrics. For the node clustering task, the learned node embeddings are used as the input to a clustering model. Here we employ the k-means algorithm to cluster the data and evaluate the clustering performance in terms of normalized mutual information (NMI) and adjusted rand index (ARI).

⁴ DM: KDD, WSDM, ICDM. CV: CVPR, ICCV, ECCV. NLP: ACL, EMNLP, NAACL. DB: SIGMOD, VLDB, ICDE.
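A sketch of this evaluation pipeline with scikit-learn: a logistic regression classifier for Macro/Micro-F1 and k-means with NMI/ARI for clustering. The train/test split helper and the default solver settings are assumptions made for a self-contained example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score, normalized_mutual_info_score, adjusted_rand_score
from sklearn.model_selection import train_test_split

def classify_and_cluster(emb, labels, train_ratio=0.1, n_clusters=4, seed=0):
    """Feed learned author embeddings to a logistic regression classifier (Macro/Micro-F1)
    and to k-means (NMI / ARI), mirroring the setting described above."""
    X_tr, X_te, y_tr, y_te = train_test_split(emb, labels, train_size=train_ratio, random_state=seed)
    pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
    macro = f1_score(y_te, pred, average="macro")
    micro = f1_score(y_te, pred, average="micro")
    clusters = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(emb)
    return macro, micro, normalized_mutual_info_score(labels, clusters), adjusted_rand_score(labels, clusters)
```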
Table 5: Multi-label classification (MC) and node clustering (NC) results. Percentage denotes the training data ratio.

Task     | Metric   | MP2V [4] | ASNE [15] | SHNE [34] | GSAGE [7] | GAT [31] | HetGNN
MC (10%) | Macro-F1 | 0.972    | 0.965     | 0.939     | 0.978     | 0.962    | 0.978
MC (10%) | Micro-F1 | 0.973    | 0.967     | 0.940     | 0.978     | 0.963    | 0.979
MC (30%) | Macro-F1 | 0.975    | 0.969     | 0.939     | 0.979     | 0.965    | 0.981
MC (30%) | Micro-F1 | 0.975    | 0.970     | 0.941     | 0.980     | 0.965    | 0.982
NC       | NMI      | 0.894    | 0.854     | 0.776     | 0.914     | 0.845    | 0.901
NC       | ARI      | 0.933    | 0.898     | 0.813     | 0.945     | 0.882    | 0.932

Result. Table 5 reports the results of all methods, where the best results are highlighted in bold. It can be seen that: (1) most models perform well on multi-label classification and obtain large Macro-F1 and Micro-F1 scores (over 0.95). This is reasonable since the authors of the four selected domains are quite different from each other; (2) despite (1), HetGNN achieves the best performance or is comparable to the best method on the multi-label classification and node clustering tasks, showing that HetGNN can learn effective node embeddings for these tasks.

Figure 3: Author embeddings visualization of four selected domains in academic data.

Furthermore, we employ the TensorFlow embedding projector to visualize the author embeddings of the four domains, as shown in Figure 3. For each area, we randomly sample 100 authors. It is easy to see that the embeddings of authors in the same class cluster closely and can be well distinguished from the others in both 2D and 3D visualizations, demonstrating the effectiveness of the learned node embeddings.

Table 6: Inductive multi-label classification (IMC) and node clustering (INC) results. Percentage is the training data ratio.

Task      | Metric   | GSAGE [7] | GAT [31] | HetGNN
IMC (10%) | Macro-F1 | 0.938     | 0.954    | 0.962
IMC (10%) | Micro-F1 | 0.945     | 0.958    | 0.965
IMC (30%) | Macro-F1 | 0.949     | 0.956    | 0.964
IMC (30%) | Micro-F1 | 0.955     | 0.960    | 0.968
INC       | NMI      | 0.714     | 0.765    | 0.840
INC       | ARI      | 0.764     | 0.803    | 0.894

4.2.4 Inductive Classification and Clustering (RQ2).
Which class/cluster does a new node belong to? To answer RQ2, we design experiments to evaluate HetGNN on inductive multi-label classification and inductive node clustering tasks.

Setting. The setting of this task is similar to the previous node classification and clustering tasks, except that we use the new node embeddings as the model input. Specifically, first, we use the training data (A-II dataset, train/test split year = 2013) to train the model.
Then, we employ the learned model to infer the embeddings of all new nodes in the test data. Finally, we use the inferred new node embeddings as the input to the classification and clustering models.

Result. Table 6 reports the performance of the graph neural network models, where the best results are highlighted in bold. According to this table: (1) all methods perform well on inductive multi-label classification, for the reason described in the previous task; however, HetGNN still achieves the best performance; (2) the result of HetGNN is better than the others for inductive node clustering. The average relative improvements (%) over GSAGE and GAT are 17.3% and 10.6%, respectively. This shows that the learned HetGNN model is effective for inferring new node embeddings.

Figure 6: Impact of sampled neighbor size (panels: Link Prediction (Type-1), Link Prediction (Type-2), Node Recommendation).
... utilized in various graph mining tasks. For example, inspired by word2vec [19], Perozzi et al. [20] developed the innovative DeepWalk, which introduces the node-context concept in graphs (in analogy to word-context) and feeds a set of random walks over the graph (in analogy to "sentences") to SkipGram so as to obtain node embeddings. Later, to address graph structure heterogeneity, Dong et al. [4] introduced metapath-guided walks and proposed metapath2vec for representation learning in HetG. Further, attributed graph embedding models [14, 15, 34] have been proposed to leverage both graph structure and node attributes for learning node embeddings. Besides those methods, many other approaches have been proposed [1, 18, 21, 28, 32], such as NetMF [21], which learns node embeddings via matrix factorization, and NetRA [32], which uses adversarially regularized autoencoders to learn node embeddings, and so on.

Graph neural networks. Recently, with the advent of deep learning, graph neural networks (GNNs) [5, 7, 12, 16, 24, 31] have gained a lot of attention. Unlike previous graph embedding models, the key idea behind GNNs is to aggregate feature information from a node's local neighbors via neural networks. For example, GraphSAGE [7] uses neural networks, e.g., LSTM, to aggregate neighbors' feature information. Besides, GAT [31] employs a self-attention mechanism to measure the impacts of different neighbors and combines them to obtain node embeddings. Moreover, some task-dependent approaches, e.g., GEM [16] for malicious account detection, have been proposed to obtain better node embeddings for specific tasks.

6 CONCLUSION

In this paper, we introduced the problem of heterogeneous graph representation learning and proposed a heterogeneous graph neural network model, i.e., HetGNN, to address this problem. HetGNN jointly considered node heterogeneous contents encoding, type-based neighbors aggregation, and heterogeneous types combination. In the training stage, a graph context loss and a mini-batch gradient descent procedure were employed to learn the model parameters. Extensive experiments on various graph mining tasks, i.e., link prediction, recommendation, node classification & clustering, and inductive node classification & clustering, demonstrated that HetGNN can outperform state-of-the-art methods.

ACKNOWLEDGMENTS

This work is supported by the CCDC Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053 (Network Science CTA) and the National Science Foundation (NSF) grant IIS-1447795.

REFERENCES
[1] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. 2015. Heterogeneous network embedding via deep architectures. In KDD. 119–128.
[2] Ting Chen and Yizhou Sun. 2017. Task-Guided and Path-Augmented Heterogeneous Network Embedding for Author Identification. In WSDM. 295–304.
[3] Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. 2018. A survey on network embedding. TKDE (2018).
[4] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable Representation Learning for Heterogeneous Networks. In KDD. 135–144.
[5] Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. 2018. Large-scale learnable graph convolutional networks. In KDD. 1416–1424.
[6] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD. 855–864.
[7] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NIPS. 1024–1034.
[8] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW. 507–517.
[9] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[10] Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S Yu. 2018. Leveraging meta-path based context for top-n recommendation with a neural co-attention model. In KDD. 1531–1540.
[11] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[12] Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
[13] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML. 1188–1196.
[14] Jundong Li, Harsh Dani, Xia Hu, Jiliang Tang, Yi Chang, and Huan Liu. 2017. Attributed network embedding for learning in a dynamic environment. In CIKM. 387–396.
[15] Lizi Liao, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2018. Attributed social network embedding. TKDE 30, 12 (2018), 2257–2270.
[16] Ziqi Liu, Chaochao Chen, Xinxing Yang, Jun Zhou, Xiaolong Li, and Le Song. 2018. Heterogeneous Graph Neural Networks for Malicious Account Detection. In CIKM. 2077–2085.
[17] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In CVPR. 3431–3440.
[18] Jianxin Ma, Peng Cui, Xiao Wang, and Wenwu Zhu. 2018. Hierarchical Taxonomy Aware Network Embedding. In KDD. 1920–1929.
[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. 3111–3119.
[20] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In KDD. 701–710.
[21] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In WSDM. 459–467.
[22] Meng Qu, Jian Tang, and Jiawei Han. 2018. Curriculum Learning for Heterogeneous Star Network Embedding via Deep Reinforcement Learning. In WSDM. 468–476.
[23] Xiang Ren, Jialu Liu, Xiao Yu, Urvashi Khandelwal, Quanquan Gu, Lidan Wang, and Jiawei Han. 2014. ClusCite: Effective citation recommendation by information network-based clustering. In KDD. 821–830.
[24] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In ESWC. 593–607.
[25] Yizhou Sun, Jiawei Han, Charu C Aggarwal, and Nitesh V Chawla. 2012. When will it happen?: Relationship prediction in heterogeneous information networks. In WSDM. 663–672.
[26] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. 2011. PathSim: Meta path-based top-k similarity search in heterogeneous information networks. VLDB 4, 11 (2011), 992–1003.
[27] Yizhou Sun, Brandon Norick, Jiawei Han, Xifeng Yan, Philip Yu, and Xiao Yu. 2012. PathSelClus: Integrating Meta-Path Selection with User-Guided Object Clustering in Heterogeneous Information Networks. In KDD. 1348–1356.
[28] Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. PTE: Predictive text embedding through large-scale heterogeneous text networks. In KDD. 1165–1174.
[29] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In WWW. 1067–1077.
[30] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: Extraction and mining of academic social networks. In KDD. 990–998.
[31] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In ICLR.
[32] Wenchao Yu, Cheng Zheng, Wei Cheng, Charu C Aggarwal, Dongjin Song, Bo Zong, Haifeng Chen, and Wei Wang. 2018. Learning Deep Network Representations with Adversarially Regularized Autoencoders. In KDD. 2663–2671.
[33] Chuxu Zhang, Chao Huang, Lu Yu, Xiangliang Zhang, and Nitesh V Chawla. 2018. Camel: Content-Aware and Meta-path Augmented Metric Learning for Author Identification. In WWW. 709–718.
[34] Chuxu Zhang, Ananthram Swami, and Nitesh V Chawla. 2019. SHNE: Representation Learning for Semantic-Associated Heterogeneous Networks. In WSDM. 690–698.
[35] Chuxu Zhang, Lu Yu, Xiangliang Zhang, and Nitesh V Chawla. 2018. Task-Guided and Semantic-Aware Ranking for Academic Author-Paper Correlation Inference. In IJCAI. 3641–3647.
[36] Yizhou Zhang, Yun Xiong, Xiangnan Kong, Shanshan Li, Jinhong Mi, and Yangyong Zhu. 2018. Deep Collective Classification in Heterogeneous Information Networks. In WWW. 399–408.
... besides the "latent" feature, we use the same content features as HetGNN and concatenate them as general attribute features. For SHNE, we utilize the paper abstract and item description (text sequence length = 100) as the input for deep semantic encoding (i.e., LSTM) in the two datasets, respectively. Besides, the walk sampling setting is the same as for MP2V. For GraphSAGE and GAT, we use the same input features (concatenated as a general feature) and the same sampled neighbor set for each node as HetGNN.

• Software & Hardware. We employ PyTorch⁸ to implement HetGNN and run it on a server with GPU machines. Code is available at: https://github.com/chuxuzhang/KDD2019_HetGNN.

⁸ https://pytorch.org/

A.5 Model Variants Description

In Section 4.3.1, we propose three model variants to conduct ablation study experiments. These models are:

• No-Neigh. This variant does not consider neighbor influence and uses the heterogeneous contents encoding f_1(v) (Section 3.2) to represent the embedding of node v ∈ V. That is, it removes the heterogeneous neighbors aggregation module (Section 3.3) of HetGNN.

• Content-FC. This variant replaces the heterogeneous content encoder (Bi-LSTM) of HetGNN with a fully connected neural network (FC). That is, the concatenated content feature is fed to an FC layer to get the content embedding. The other modules are the same as in HetGNN.

• Type-FC. This variant replaces the types combination module (attention) of HetGNN with an FC. That is, the concatenated embedding of the different neighbor groups (types) is fed to an FC layer to get the aggregated embedding. The other modules are the same as in HetGNN.

Besides, the training procedures of all model variants are the same as for HetGNN.

A.6 Hyper-parameters Sensitivity Setup

In Section 4.3.2, we conduct experiments on the A-II dataset (train/test split year = 2013) to study the impacts of two hyper-parameters: the embedding dimension d and the sampled neighbor size for each node. We investigate a specific parameter by changing its value while fixing the others. Specifically, when fixing the sampled neighbor size (i.e., 23), we set different embedding dimensions d (i.e., 8, 16, 32, 64, 128, 256) for HetGNN and evaluate its performance for each dimension. Besides, when fixing the embedding dimension (i.e., 128), we set different sizes of the sampled neighbor set (i.e., 6, 12, 17, 23, 28, 34) for each node and evaluate HetGNN's performance for each size. The constitutions of the different neighbor groups (types) for the aforementioned sizes are: 6 = 2 (author) + 2 (paper) + 2 (venue), 12 = 5 (author) + 5 (paper) + 2 (venue), 17 = 7 (author) + 7 (paper) + 3 (venue), 23 = 10 (author) + 10 (paper) + 3 (venue), 28 = 12 (author) + 12 (paper) + 4 (venue), and 34 = 15 (author) + 15 (paper) + 4 (venue).
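For reference, the sweep described above can be written down as a small configuration; the constant names below are illustrative.

```python
# Neighbor-set sizes with their per-type constitutions (author + paper + venue), and the
# embedding dimensions tested while the other hyper-parameter is kept fixed (Section A.6).
NEIGHBOR_SIZE_SWEEP = {
    6:  {"author": 2,  "paper": 2,  "venue": 2},
    12: {"author": 5,  "paper": 5,  "venue": 2},
    17: {"author": 7,  "paper": 7,  "venue": 3},
    23: {"author": 10, "paper": 10, "venue": 3},   # default used in the main experiments
    28: {"author": 12, "paper": 12, "venue": 4},
    34: {"author": 15, "paper": 15, "venue": 4},
}
EMBED_DIM_SWEEP = [8, 16, 32, 64, 128, 256]        # with the neighbor size fixed at 23
```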