
Joint Latent Space Model for Social Networks with Multivariate Attributes

Selena Shuo Wang*, Department of Psychology
Subhadeep Paul, Department of Statistics
Paul De Boeck, Department of Psychology
The Ohio State University

arXiv:1910.12128v2 [stat.AP] 1 Feb 2021

Abstract
In many application problems in social, behavioral, and economic sciences, researchers often have data on a social network among a group of individuals along with high-dimensional multivariate measurements for each individual. To analyze such networked data structures, we propose a joint Attribute and Person Latent Space Model (APLSM) that summarizes information from the social network and the multiple attribute measurements in a person-attribute joint latent space. We develop a Variational Bayesian Expectation-Maximization estimation algorithm to estimate the posterior distribution of the attribute and person locations in the joint latent space. This methodology allows for effective integration, informative visualization, and prediction of social networks and high-dimensional attribute measurements. Using APLSM, we explore the inner workings of the French financial elites based on their social networks and their career, political views, and social status. We observe a division in the social circles of the French elites in accordance with the differences in their individual characteristics.

Keywords: High-dimensional Covariates, Multimodal Networks, Social Networks, Latent Space Models

*This research is partially supported by a grant from the National Science Foundation under Grant No. DMS 1830547. The authors would like to thank Prof. Vishesh Karwa of Temple University, Prof. Srijan Sengupta of North Carolina State University and Prof. Jessica Logan of The Ohio State University for discussions that helped in conceptualizing the statistical models.

1 Introduction

Understanding interactions among sets of entities, often represented as complex networks, is a central research task in many data-intensive scientific fields, including Statistics, Machine learning, Social sciences, Biology, Psychology, and Economics (Watts and Strogatz, 1998; Barabási and Albert, 1999; Albert and Barabási, 2002; Jackson et al., 2008; Girvan and Newman, 2002; Shmulevich et al., 2002; Bickel and Chen, 2009; Bullmore and Sporns, 2009; Rubinov and Sporns, 2010; Carrington et al., 2005; Borgatti et al., 2009; Wasserman et al., 1994; Lazer, 2011). However, a majority of methodological and applied
studies have only considered interactions of one type among a set of entities of the same
type. More recent studies have pointed to the heterogeneous and multimodal nature of
such interactions, whereby a complex networked system is composed of multiple types of
interactions among entities that themselves are of multiple types (Kivelä et al., 2014; Boc-
caletti et al., 2014; Mucha et al., 2010; Paul and Chen, 2016; Paul et al., 2020b; Sengupta
and Chen, 2015; Sun et al., 2009; Liu et al., 2014; Ferrara et al., 2014; He et al., 2014;
Nickel et al., 2016; Paul et al., 2020a).
Social relationships are known to affect individual behaviors and outcomes including
dementia (Fratiglioni et al., 2000), decision making (Kim and Srivastava, 2007), adolescent
smoking (Mercken et al., 2010), and online behavior choices (Kwon et al., 2014). At the same
time, individual attributes, such as race, age, and gender, can affect whether and how people
form friendships or romantic partnerships (Dean et al., 2017; McPherson et al., 2001). The
effect of social relationships on individual behaviors is observed through disparities in the
outcomes across different individuals when their friendship structures differ. The reverse is also observed through disparities in the friendship structures when individuals' attributes
differ. Therefore, flexible joint modeling of social relationships and individual behaviors
and attributes is needed to investigate their interrelationships effectively.
A number of popular models for social networks have been extended to incorporate
nodal covariates in the literature, including, exponential random graph models (ERGMs)
(Lusher et al., 2013), stochastic blockmodels (Mele et al., 2019; Sweet, 2015) and latent
space models (Fosdick and Hoff, 2015; Krivitsky and Handcock, 2008; Hoff et al., 2002;
Austin et al., 2013). In these models, the social network links are treated as dependent
variables, and the effects of nodal covariates on the probability of network ties are sub-
sequently estimated. Alternatively, social influence models use the node-level attributes
as the dependent variables and estimate the effects of the social network on the attributes
(Dorans and Drasgow, 1978; Leenders, 2002; Robins et al., 2001; Sweet and Adhikari, 2020;
Frank et al., 2004; Fujimoto et al., 2013; Shalizi and Thomas, 2011; VanderWeele, 2011;
VanderWeele and An, 2013; Bramoullé et al., 2009; Goldsmith-Pinkham and Imbens, 2013).
A different approach is to develop a joint modeling framework where different types

of data are integrated by jointly modeling them as the dependent variables. In network
science, the joint modeling framework has been proposed to model multi-view or multiplex
networks, where multiple types of relations are observed on the same set of nodes, using
stochastic block models (Barbillon et al., 2015; Kéfi et al., 2016), and latent space models
(Gollini and Murphy, 2016; Arroyo et al., 2019; Salter-Townshend and McCormick, 2017;
D’Angelo et al., 2018; Zhang et al., 2020). In these models, latent variables are used to
explain the probability of a node being connected with other nodes in multiple types of
relations. When dependencies can be assumed across different layers, the common node
representation is a flexible framework to summarize multiple types of information.
In many cases, high-dimensional multivariate covariates with complex latent structures
are available in addition to a connected network. In particular, often a social network is
observed along with individuals’ attributes or behavioral outcomes. In this case, two types
of relations, the social network relations and various types of individual attributes, are
observed among two types of nodes, the person nodes, and the attribute nodes. To jointly
model such data, a multivariate normal distribution was fitted by Fosdick and Hoff (2015)
to the latent variables from the social network and the observed covariates. This work
is in spirit the closest to our proposed model. However, it restricts the covariates to be
normally distributed, and it does not take into account the multiple latent dimensions of the
covariates. A dynamic version of Fosdick and Hoff (2015) can be found in Guhaniyogi et al.
(2020) with possibilities to accommodate both categorical and continuous attributes. The
most important distinction of our model from this line of work is that we use a second set of
latent variables, the latent attribute variables, in addition to the latent person variables to
summarize the information associated with each attribute. This modeling framework allows for a joint latent space in which two types of nodes, rather than one, interact. Other
related recent works that jointly model heterogeneous networks with stochastic block models include Huang et al. (2020) and Sengupta and Chen (2015).
In this paper, we propose a joint latent space model for heterogeneous and multi-
modal networks with multiple types of relations among multiple types of nodes. The
proposed Attribute Person Latent Space Model (APLSM) merges information from the
social network and the multivariate covariates by assuming that the probabilities of a
node being connected with other same-type and different-type nodes are explained by la-
tent variables. This model has a wide range of applications. For example, in computer science, it is of interest to summarize relational data (e.g., likes and followers in social media) with other user information such as personalities, health outcomes, online behavior
choices, etc. In economics and business, it is of interest to summarize consumer informa-

tion with their geographic networks and social networks. We demonstrate APLSM with
a data set on the French financial elites (Kadushin, 1995), available for download from http://moreno.ss.uci.edu/data.html#ffe. To fit the APLSM, we propose a Variational Bayesian Expectation-Maximization (VBEM) algorithm (Blei et al., 2017). As an
intermediate step we also develop a VBEM algorithm for fitting latent space models to bi-
partite networks. The variational methods enable the models to be fitted to large networks
with high dimensional attributes, while our simulations show the accuracy of the methods.
The remainder of this paper is organized as follows. In section 2, we introduce latent
space models for bipartite networks and develop the variational inference approach to
estimate the model. In section 3, we introduce the joint latent space model for the social
network and the multivariate covariates and extend the variational inference method to the
joint model. In section 4, we assess the performance of the estimators with a simulation
study, and in section 5, we apply the proposed methodology to the French financial elite
data. Finally, in section 6, we summarize our findings.

2 Latent Space Models


Development of Latent Space Models (LSM) for social networks can be traced back
to Hoff et al. (2002)’s latent distance and latent projection models. Both models assume
that nodes can be positioned in a D-dimensional latent space and that the probability of
an edge between two nodes depends on their closeness. The closer the two nodes are, the more likely they are to form a connection. Conditional on these latent positions, the probability of two
nodes forming an edge is independent of all the other edges in the network. In the latent
distance model, the Euclidean distances are used to describe the relationships between the
nodes; whereas, in the latent projection model, scaled vector products are used to describe
the relationships between nodes.
Let $\boldsymbol{Y_I}$ denote the $N \times N$ adjacency matrix of the social network among $N$ individuals. The $(i, j)$th element of the matrix $\boldsymbol{Y_I}$, denoted $y^I_{ij}$, is 1 if person $i$ and person $j$ are related, for $i, j \in \{1, 2, \ldots, N\}$ and $i \neq j$. Let $\boldsymbol{U}$ be an $N \times D$ matrix of latent person positions, each row of which is a $D$-dimensional vector $\boldsymbol{u_i} = (u_{i1}, u_{i2}, \ldots, u_{iD})$ indicating the latent position of person $i$ in the Euclidean space. The latent distance model for a binary social network $\boldsymbol{Y_I}$ can be written as:

$$Y^I_{ij} \mid (\boldsymbol{U}, \alpha_0) \sim \mathrm{Bernoulli}(g(\phi_{ij})), \qquad g(\phi_{ij}) = \frac{\exp(\alpha_0 - |\boldsymbol{u}_i - \boldsymbol{u}_j|^2)}{1 + \exp(\alpha_0 - |\boldsymbol{u}_i - \boldsymbol{u}_j|^2)}.$$

We assume $\boldsymbol{u_i}$ to be the latent person variable, $\boldsymbol{u_i} \overset{iid}{\sim} N(0, \lambda_0^2 \mathbf{I}_D)$, where $\alpha_0, \lambda_0^2$ are unknown parameters that need to be estimated and $\mathbf{I}_D$ is the $D$-dimensional identity matrix. The probability of an edge increases as the Euclidean distance between the two nodes decreases. Here, we use the squared Euclidean distances $|\boldsymbol{u_i} - \boldsymbol{u_j}|^2$ instead of the Euclidean distances, following Gollini and Murphy (2016), who showed that squared Euclidean distances are computationally more efficient and that the latent positions obtained using squared Euclidean distances are extremely similar to those obtained using Euclidean distances.
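To make the generative mechanism concrete, a minimal numpy sketch of sampling a network from this model is given below (all parameter values are illustrative assumptions, not estimates from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 2              # persons and latent dimensions (illustrative)
alpha0, lam0 = 1.0, 1.0   # intercept and prior scale (illustrative)

# Latent person positions u_i ~ N(0, lam0^2 I_D)
U = rng.normal(0.0, lam0, size=(N, D))

# Squared Euclidean distances |u_i - u_j|^2 for all pairs
sq_dist = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)

# Link probabilities g(phi_ij) = logistic(alpha0 - |u_i - u_j|^2)
P = 1.0 / (1.0 + np.exp(-(alpha0 - sq_dist)))

# Sample an undirected adjacency matrix Y_I without self-loops
upper = np.triu(rng.uniform(size=(N, N)) < P, k=1).astype(int)
Y_I = upper + upper.T
```

Nodes drawn close together in the latent space receive high link probabilities, which is exactly the distance dependence described above.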
Variations of the latent space model were further developed in Hoff (2005, 2008, 2009, 2018) and Ma et al. (2020). Extensions of the latent distance model include Handcock et al. (2007)'s latent position cluster model, which allows for the clustering of nodes based on the Euclidean distances. By replacing one Gaussian distribution with a mixture of Gaussians, Handcock et al. (2007) were able to account for possible latent community structure in the networks. Additional random sender and receiver effects were added to the latent position cluster model by Krivitsky et al. (2009). The latent space model has been further extended to accommodate multiple networks (Gollini and Murphy, 2016; Salter-Townshend and McCormick, 2017), dynamic networks (Sewell and Chen, 2015), and bipartite networks (Friel et al., 2016). The majority of the works on latent space models described above have utilized Bayesian estimation techniques, including Markov Chain Monte Carlo (MCMC) and Variational Inference (VI). Recently, Ma et al. (2020) proposed two algorithms based on nuclear norm penalization and projected gradient descent to fit the latent space model with statistical consistency guarantees.

2.1 Latent Space Model for Bipartite Networks


Data on the attributes of the individuals in the network can be seen as a bipartite
network with directed edges between two types of nodes. The development of the latent
space model for bipartite networks includes a bipartite version of the latent cluster random
effects model in the latentnet package (Krivitsky and Handcock, 2008). In addition, the
latent space model for a dynamic bipartite network was introduced by (Friel et al., 2016)
to study the interlocking directorates in Irish companies. In the rest of this section, we
introduce a Variational Bayesian EM algorithm for fitting a latent space model to binary
bipartite networks (BLSM).
Let $\boldsymbol{Y_{IA}}$ denote the $N \times M$ bipartite network, whose $(i, a)$th element $y^{IA}_{ia}$ is 1 if person $i$ has attribute $a$, for $i \in \{1, 2, \ldots, N\}$ and $a \in \{1, 2, \ldots, M\}$. Let $\boldsymbol{V}$ be an $M \times D$ matrix of latent attribute positions, each row of which is a $D$-dimensional vector $\boldsymbol{v_a} = (v_{a1}, v_{a2}, \ldots, v_{aD})$ indicating the latent position of attribute $a$ in the Euclidean space.

The latent distance model for the bipartite network $\boldsymbol{Y_{IA}}$ can be written as:

$$Y^{IA}_{ia} \mid (\boldsymbol{U}, \boldsymbol{V}, \alpha_1) \sim \mathrm{Bernoulli}(g(\phi_{ia})), \qquad g(\phi_{ia}) = \frac{\exp(\alpha_1 - |\boldsymbol{u}_i - \boldsymbol{v}_a|^2)}{1 + \exp(\alpha_1 - |\boldsymbol{u}_i - \boldsymbol{v}_a|^2)}.$$

We assume $\boldsymbol{u_i} \overset{iid}{\sim} N(0, \lambda_0^2 \mathbf{I}_D)$, $\boldsymbol{v_a} \overset{iid}{\sim} N(0, \lambda_1^2 \mathbf{I}_D)$, and $\alpha_1, \lambda_0$ and $\lambda_1$ to be unknown parameters. The parameter $\alpha_1$ accounts for the density of the bipartite network. The probability of a positive response increases as the Euclidean distance between the attribute node and the person node decreases.

2.2 Variational Bayesian EM for the Bipartite Network


We are interested in the posterior inference of the latent variables $\boldsymbol{u_i}$ and $\boldsymbol{v_a}$ following the distance model, conditioning on the observed bipartite network. The (conditional) posterior distribution is the ratio of the joint distribution of the observed data and unobserved latent variables to the observed data likelihood:

$$P(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y_{IA}}) = \frac{P(\boldsymbol{Y_{IA}} \mid \boldsymbol{U}, \boldsymbol{V})\, P(\boldsymbol{U}, \boldsymbol{V})}{P(\boldsymbol{Y_{IA}})}.$$
We can characterize the distribution of latent positions and thus obtain the point and inter-
val estimates by computing this posterior distribution. The variational inference algorithm
is commonly used to estimate latent variables whose posterior distribution is intractable
(Beal et al., 2003; Attias, 1999; Beal et al., 2006; Blei et al., 2017). In network analysis,
the variational approach has been proposed for the stochastic blockmodel (Daudin et al.,
2008; Celisse et al., 2012), the mixed-membership stochastic blockmodel (Airoldi et al.,
2008), the multi-layer stochastic blockmodel (Xu et al., 2014; Paul and Chen, 2016), the
dynamic stochastic blockmodel (Matias and Miele, 2016), the latent position cluster model
(Salter-Townshend and Murphy, 2013) and the multiple network latent space model (Gollini
and Murphy, 2016). Here, we propose a Variational Bayesian Expectation Maximization
(VBEM) algorithm to approximate the posterior of the person and the attribute latent
positions using the bipartite network. We propose a class of suitable variational posterior
distributions for the conditional distribution of (𝑼, 𝑽 |𝒀𝑰 𝑨 ) and obtain a distribution from
the class that minimizes the Kullback-Leibler (KL) divergence from the true but intractable
posterior.
We assign the following variational posterior distributions: $q(\boldsymbol{u_i}) = N(\tilde{\boldsymbol{u}}_i, \tilde{\Lambda}_0)$ and $q(\boldsymbol{v_a}) = N(\tilde{\boldsymbol{v}}_a, \tilde{\Lambda}_1)$, and set the joint distribution as

$$q(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y_{IA}}) = \prod_{i=1}^{N} q(\boldsymbol{u_i}) \prod_{a=1}^{M} q(\boldsymbol{v_a}),$$

where ũ𝑖 , Λ̃0 , ṽ𝑎 , Λ̃1 are the parameters of the variational distribution, known as variational
parameters.
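Both $q(\boldsymbol{u_i})$ and the prior $f(\boldsymbol{u_i}) = N(0, \lambda_0^2 \mathbf{I}_D)$ are Gaussian, so each KL term appearing in the ELBO below has a standard closed form (a textbook identity, stated here for reference):

$$\mathrm{KL}\left[q(\boldsymbol{u_i})\,\|\,f(\boldsymbol{u_i})\right] = \frac{1}{2}\left[\frac{\operatorname{tr}(\tilde{\Lambda}_0)}{\lambda_0^2} + \frac{\tilde{\boldsymbol{u}}_i^{T}\tilde{\boldsymbol{u}}_i}{\lambda_0^2} - D + D\log\lambda_0^2 - \log\det(\tilde{\Lambda}_0)\right].$$

Summing this expression over $i = 1, \ldots, N$ (and the analogous expression over $a = 1, \ldots, M$ for the $\boldsymbol{v_a}$) produces the $\lambda_0$ and $\lambda_1$ terms of Equation (1) below.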
We can estimate the variational parameters by minimizing the Kullback-Leibler (KL) divergence between the variational posterior $q(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y_{IA}})$ and the true posterior $f(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y_{IA}})$. Minimizing the KL divergence is equivalent to maximizing the following Evidence Lower Bound (ELBO) function (Blei et al., 2017) (see detailed derivations in the Supplementary Materials):

$$\begin{aligned}
\mathrm{ELBO} &= -\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y_{IA}})}\left[\log\frac{q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y_{IA}})}{p(\boldsymbol{U},\boldsymbol{V},\boldsymbol{Y_{IA}}\mid\alpha_1)}\right]\\
&= -\int q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y_{IA}})\,\log\frac{q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y_{IA}})}{f(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y_{IA}})}\,d(\boldsymbol{U},\boldsymbol{V},\alpha_1)\\
&= -\int \prod_{i=1}^{N}q(\boldsymbol{u_i})\prod_{a=1}^{M}q(\boldsymbol{v_a})\,\log\frac{\prod_{i=1}^{N}q(\boldsymbol{u_i})\prod_{a=1}^{M}q(\boldsymbol{v_a})}{f(\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V},\alpha_1)\prod_{i=1}^{N}f(\boldsymbol{u_i})\prod_{a=1}^{M}f(\boldsymbol{v_a})}\,d(\boldsymbol{U},\boldsymbol{V},\alpha_1)\\
&= -\sum_{i=1}^{N}\int q(\boldsymbol{u_i})\log\frac{q(\boldsymbol{u_i})}{f(\boldsymbol{u_i})}\,d\boldsymbol{u_i}-\sum_{a=1}^{M}\int q(\boldsymbol{v_a})\log\frac{q(\boldsymbol{v_a})}{f(\boldsymbol{v_a})}\,d\boldsymbol{v_a}\\
&\qquad+\int q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y_{IA}})\log f(\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V},\alpha_1)\,d(\boldsymbol{U},\boldsymbol{V},\alpha_1)\\
&= -\sum_{i=1}^{N}\mathrm{KL}[q(\boldsymbol{u_i})\,\|\,f(\boldsymbol{u_i})]-\sum_{a=1}^{M}\mathrm{KL}[q(\boldsymbol{v_a})\,\|\,f(\boldsymbol{v_a})]+\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V},\alpha_1\mid\boldsymbol{Y_{IA}})}[\log f(\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V},\alpha_1)]\\
&= -\frac{1}{2}\left(DN\log\lambda_0^2-N\log\det(\tilde{\Lambda}_0)\right)-\frac{N\operatorname{tr}(\tilde{\Lambda}_0)}{2\lambda_0^2}-\frac{\sum_{i=1}^{N}\tilde{\boldsymbol{u}}_i^{T}\tilde{\boldsymbol{u}}_i}{2\lambda_0^2}\\
&\qquad-\frac{1}{2}\left(DM\log\lambda_1^2-M\log\det(\tilde{\Lambda}_1)\right)-\frac{M\operatorname{tr}(\tilde{\Lambda}_1)}{2\lambda_1^2}-\frac{\sum_{a=1}^{M}\tilde{\boldsymbol{v}}_a^{T}\tilde{\boldsymbol{v}}_a}{2\lambda_1^2}+\frac{1}{2}(MD+ND)\\
&\qquad+\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y_{IA}})}[\log f(\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V})]. \hspace{3em} (1)
\end{aligned}$$

After applying Jensen's inequality (Jensen, 1906), a lower bound on $\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y_{IA}})}[\log f(\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V},\alpha_1)]$ is given by

$$\begin{aligned}
\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y_{IA}})}&[\log f(\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V},\alpha_1)]\\
&\geq \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia}\left[\tilde{\alpha}_1-\operatorname{tr}(\tilde{\Lambda}_0)-\operatorname{tr}(\tilde{\Lambda}_1)-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\right]\\
&\quad-\sum_{i=1}^{N}\sum_{a=1}^{M}\log\left(1+\frac{\exp(\tilde{\alpha}_1)}{\det(\mathbf{I}+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{1/2}}\exp\left(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(\mathbf{I}+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\right)\right).
\end{aligned}$$
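The bound is inexpensive to evaluate; the following numpy sketch computes it directly (dense matrices, assumed small $N$ and $M$; a sketch for intuition, not the authors' implementation):

```python
import numpy as np

def jensen_lower_bound(Y_IA, U_t, V_t, L0, L1, alpha1):
    """Evaluate the Jensen lower bound on E_q[log f(Y_IA | U, V, alpha1)].

    U_t: (N, D) variational means for persons; V_t: (M, D) for attributes.
    L0, L1: (D, D) variational covariances; alpha1: scalar intercept.
    """
    diff = U_t[:, None, :] - V_t[None, :, :]          # (N, M, D)
    sq = (diff ** 2).sum(axis=2)                      # (u_i - v_a)^T (u_i - v_a)
    # Term attached to the observed edges y_ia
    term1 = (Y_IA * (alpha1 - np.trace(L0) - np.trace(L1) - sq)).sum()
    # Log-normalizer with the covariance-inflated quadratic form
    S = np.eye(L0.shape[0]) + 2 * L0 + 2 * L1
    quad = np.einsum('nmd,de,nme->nm', diff, np.linalg.inv(S), diff)
    term2 = np.log1p(np.exp(alpha1) / np.sqrt(np.linalg.det(S))
                     * np.exp(-quad)).sum()
    return term1 - term2
```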
We use the Variational Expectation-Maximization algorithm (Jordan et al., 1999; Baum
et al., 1970; Dempster et al., 1977) to maximize the ELBO function. Following the vari-
ational EM algorithm, we replace the E step of the celebrated EM algorithm, where we

compute the expectation of the complete likelihood with respect to the conditional distri-
bution 𝑓 (𝑼, 𝑽 |𝒀𝑰 𝑨 ), with a VE step, where we compute the expectation with respect to the
best variational distribution (obtained by optimizing the ELBO function) at that iteration.
The detailed procedure is as follows. We start with the initial parameters $\Theta^{(0)} = \tilde{\alpha}_1^{(0)}$ and $\tilde{\boldsymbol{u}}_i^{(0)}, \tilde{\Lambda}_0^{(0)}, \tilde{\boldsymbol{v}}_a^{(0)}, \tilde{\Lambda}_1^{(0)}$, and then we iterate the following VE (variational expectation) and M (maximization) steps. During the VE step, we maximize $\mathrm{ELBO}(q(\boldsymbol{U}), q(\boldsymbol{V}), \Theta)$ with respect to the variational parameters $\tilde{\boldsymbol{u}}_i, \tilde{\boldsymbol{v}}_a, \tilde{\Lambda}_0$ and $\tilde{\Lambda}_1$ given the other model parameters and obtain $\mathrm{ELBO}(q^{*}(\boldsymbol{U}), q^{*}(\boldsymbol{V}), \Theta)$. During the M step, we fix $\tilde{\boldsymbol{u}}_i, \tilde{\boldsymbol{v}}_a, \tilde{\Lambda}_0$ and $\tilde{\Lambda}_1$ and maximize $\mathrm{ELBO}(q(\boldsymbol{U}), q(\boldsymbol{V}), \Theta)$ with respect to $\tilde{\alpha}_1$. To do this, we differentiate $\mathrm{ELBO}(q(\boldsymbol{U}), q(\boldsymbol{V}), \Theta)$ with respect to each variational parameter. We obtain closed-form update rules by setting the partial derivatives to zero while introducing first- and second-order Taylor series expansion approximations of the log functions in $\mathrm{ELBO}(q(\boldsymbol{U}), q(\boldsymbol{V}), \Theta)$ (see detailed derivations in the supplementary material). The Taylor series expansions are
commonly used in the variational approaches. For example, three first-order Taylor ex-
pansions were used by Salter-Townshend and Murphy (2013) to simplify the Euclidean
distance in the latent position cluster model, and first- and second-order Taylor expansions
were used by Gollini and Murphy (2016) to simplify the squared Euclidean distance in
LSM. Following the previous publications using Taylor expansions, we approximate the
three log functions in our ELBO(𝑞(U), q(V), Θ) function to find closed form update rules
for the variational parameters. Define the function

$$F_{IA} = \sum_{i=1}^{N}\sum_{a=1}^{M}\log\left(1+\frac{\exp(\tilde{\alpha}_1)}{\det(\mathbf{I}+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{1/2}}\exp\left(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(\mathbf{I}+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\right)\right).$$

The closed-form update rules for the $(t+1)$th iteration are as follows.

VE-step: Estimate $\tilde{\boldsymbol{u}}_i^{(t+1)}, \tilde{\boldsymbol{v}}_a^{(t+1)}, \tilde{\Lambda}_0^{(t+1)}$ and $\tilde{\Lambda}_1^{(t+1)}$ by maximizing $\mathrm{ELBO}(q(\boldsymbol{U}), q(\boldsymbol{V}), \Theta)$:

$$\tilde{\boldsymbol{u}}_i^{(t+1)} = \left[\left(\frac{1}{2\lambda_0^2}+\sum_{a=1}^{M}y_{ia}\right)\boldsymbol{I}+\frac{1}{2}\boldsymbol{H}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\right]^{-1}\left[\sum_{a=1}^{M}y_{ia}\tilde{\boldsymbol{v}}_a^{(t)}+\frac{1}{2}\boldsymbol{H}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\,\tilde{\boldsymbol{u}}_i^{(t)}-\frac{1}{2}\boldsymbol{G}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\right]$$

$$\tilde{\boldsymbol{v}}_a^{(t+1)} = \left[\left(\frac{1}{2\lambda_1^2}+\sum_{i=1}^{N}y_{ia}\right)\boldsymbol{I}-\frac{1}{2}\boldsymbol{H}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})\right]^{-1}\left[\sum_{i=1}^{N}y_{ia}\tilde{\boldsymbol{u}}_i^{(t)}-\frac{1}{2}\boldsymbol{G}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})\right]$$

$$\tilde{\Lambda}_0^{(t+1)} = \frac{N}{2}\left[\left(\frac{N}{2\lambda_0^2}+\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\right)\boldsymbol{I}+\boldsymbol{G}_{IA}(\tilde{\Lambda}_0^{(t)})\right]^{-1}$$

$$\tilde{\Lambda}_1^{(t+1)} = \frac{M}{2}\left[\left(\frac{M}{2\lambda_1^2}+\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\right)\boldsymbol{I}+\boldsymbol{G}_{IA}(\tilde{\Lambda}_1^{(t)})\right]^{-1}, \hspace{3em} (2)$$

where $\boldsymbol{G}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})$ and $\boldsymbol{G}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})$ are the partial derivatives (gradients) of $F_{IA}$ with respect to $\tilde{\boldsymbol{u}}_i$ and $\tilde{\boldsymbol{v}}_a$, evaluated at $\tilde{\boldsymbol{u}}_i^{(t)}$ and $\tilde{\boldsymbol{v}}_a^{(t)}$, respectively. In $\boldsymbol{G}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})$, the subscript $IA$ indicates that the gradient is of the function $F_{IA}$, and the argument $\tilde{\boldsymbol{u}}_i^{(t)}$ indicates that the gradient is with respect to $\tilde{\boldsymbol{u}}_i$, evaluated at $\tilde{\boldsymbol{u}}_i^{(t)}$. Similarly, $\boldsymbol{H}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})$ and $\boldsymbol{H}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})$ are the second-order partial derivatives of $F_{IA}$ with respect to $\tilde{\boldsymbol{u}}_i$ and $\tilde{\boldsymbol{v}}_a$, evaluated at $\tilde{\boldsymbol{u}}_i^{(t)}$ and $\tilde{\boldsymbol{v}}_a^{(t)}$, respectively.

M-step: Estimate $\tilde{\alpha}_1^{(t+1)}$ with the following closed-form solution:

$$\tilde{\alpha}_1^{(t+1)} = \frac{\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}^{IA}-g_{IA}(\tilde{\alpha}_1^{(t)})+\tilde{\alpha}_1^{(t)}\,h_{IA}(\tilde{\alpha}_1^{(t)})}{h_{IA}(\tilde{\alpha}_1^{(t)})}, \hspace{3em} (3)$$

where $g_{IA}(\tilde{\alpha}_1^{(t)})$ is the partial derivative (gradient) of $F_{IA}$ with respect to $\tilde{\alpha}_1$, evaluated at $\tilde{\alpha}_1^{(t)}$; and $h_{IA}(\tilde{\alpha}_1^{(t)})$ is the second-order partial derivative of $F_{IA}$ with respect to $\tilde{\alpha}_1$, evaluated at $\tilde{\alpha}_1^{(t)}$.
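The derivatives $g_{IA}$ and $h_{IA}$ are available in closed form, so the whole update can be written compactly; the numpy sketch below is one possible rendering (a sketch, not the authors' code):

```python
import numpy as np

def m_step_alpha1(Y_IA, U_t, V_t, L0, L1, alpha1):
    """One closed-form M-step update (Equation 3) for alpha_1.

    g and h are the first and second derivatives of F_IA with respect
    to alpha_1, evaluated at the current value.
    """
    S = np.eye(L0.shape[0]) + 2 * L0 + 2 * L1
    diff = U_t[:, None, :] - V_t[None, :, :]
    quad = np.einsum('nmd,de,nme->nm', diff, np.linalg.inv(S), diff)
    c = np.exp(-quad) / np.sqrt(np.linalg.det(S))  # coefficient of exp(alpha_1)
    p = c * np.exp(alpha1) / (1 + c * np.exp(alpha1))
    g = p.sum()                                    # g_IA(alpha_1)
    h = (p * (1 - p)).sum()                        # h_IA(alpha_1)
    return (Y_IA.sum() - g + alpha1 * h) / h
```

Viewed this way, Equation (3) is simply the Newton-Raphson step $\tilde{\alpha}_1^{(t)} + \big(\sum_{i,a} y_{ia}^{IA} - g_{IA}\big)/h_{IA}$ for the stationarity condition of the ELBO in $\tilde{\alpha}_1$, written as a single fraction.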

3 Joint Modeling of Social Network and Multivariate Attributes
Next we define the attributes and persons latent space model (APLSM), which uses a
single joint latent space to combine and summarize the information contained in both the
person-person social network 𝒀𝑰 , and person-attribute bipartite network formed by the high
dimensional covariates 𝒀𝑰 𝑨 . We assume that the persons and attributes can be positioned
in an attribute-person joint latent space, which is a subset of the 𝐷 dimensional Euclidean
space $\mathbb{R}^D$. The latent person position matrix $\boldsymbol{U}$ is a shared latent variable that simultaneously
affects both the social network and the multivariate covariates. In APLSM, we extend the
conditional independence assumption of LSM by assuming that the probability of two nodes
forming a connection in both matrices 𝒀𝑰 and 𝒀𝑰 𝑨 is independent of all other connections
given the latent positions of the two nodes involved. In the APLSM, the joint distribution
of the elements of the social network and the multivariate covariates can be written as

$$p(\boldsymbol{Y_I}, \boldsymbol{Y_{IA}} \mid \boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1) = \prod_{i=1}^{N}\prod_{j=1, j\neq i}^{N} p_1(y^{I}_{ij} \mid \theta^{I}_{ij}) \prod_{i=1}^{N}\prod_{a=1}^{M} p_2(y^{IA}_{ia} \mid \theta^{IA}_{ia}),$$

$$E(y^{I}_{ij} \mid \theta^{I}_{ij}) = g_1(\theta^{I}_{ij}), \qquad E(y^{IA}_{ia} \mid \theta^{IA}_{ia}) = g_2(\theta^{IA}_{ia}),$$

where the $g_i(\cdot)$ are link functions, and the $p_i(\cdot \mid \cdot)$ are parametric families of distributions suitable for the type of data in the matrices. We set the priors $\boldsymbol{u_i} \overset{iid}{\sim} N(0, \lambda_0^2 \mathbf{I}_D)$ and $\boldsymbol{v_a} \overset{iid}{\sim} N(0, \lambda_1^2 \mathbf{I}_D)$; $\alpha_0, \alpha_1, \lambda_0^2, \lambda_1^2$ are unknown parameters. Different levels of $\alpha_0$ and $\alpha_1$ account for different densities of the two networks. We set

$$\theta^{I}_{ij} = \alpha_0 - |\boldsymbol{u_i} - \boldsymbol{u_j}|^2, \qquad \theta^{IA}_{ia} = \alpha_1 - |\boldsymbol{u_i} - \boldsymbol{v_a}|^2. \hspace{3em} (4)$$

In Equation 4, squared Euclidean distance forms are assumed for $\theta^{I}_{ij}$ and $\theta^{IA}_{ia}$. If the data in $\boldsymbol{Y_I}$ and $\boldsymbol{Y_{IA}}$ are binary, then the link functions $g_1(\theta^{I}_{ij})$ and $g_2(\theta^{IA}_{ia})$ are logistic inverse link functions, i.e.,

$$g_1(\theta^{I}_{ij}) = \frac{\exp(\theta^{I}_{ij})}{1+\exp(\theta^{I}_{ij})}, \qquad g_2(\theta^{IA}_{ia}) = \frac{\exp(\theta^{IA}_{ia})}{1+\exp(\theta^{IA}_{ia})},$$

and the $p_i(\cdot \mid \cdot)$ are Bernoulli PDFs.
In the APLSM, we define an attribute-person joint latent space, a hypothetical multidi-
mensional space, in which the locations of the persons and the attributes follow predefined
geometric rules reflecting each node’s connection with another. While the latent person po-
sitions are more commonly seen in network science, the latent attribute positions describe
the latent traits of the attributes seen through the persons’ responses. The probability of
person 𝑖 and person 𝑗 forming a friendly connection depends on the distance of 𝒖 𝒊 and 𝒖 𝒋
in the joint latent space. The smaller the latent distance between person 𝑖 and 𝑗, the higher
the chance that person 𝑖 and person 𝑗 are friends. Similarly, the closer the latent positions
of attribute 𝑎 and person 𝑖 are, the more likely that person 𝑖 shows attribute 𝑎. The rela-
tionships among persons also retain the transitivity and reciprocity properties: if person 𝑖
and 𝑗 form a bond, and person 𝑖 and 𝑘 are also friends, then person 𝑗 befriending person 𝑖
(reciprocity), and befriending person 𝑘 (transitivity) are both more likely. The transitivity
property is preserved for relationships between persons and attributes: if persons 𝑖 and 𝑗
form a bond, and person 𝑖 indicates a presence for attribute 𝑎, then person 𝑗 indicating a
presence for attribute 𝑎 is also more likely.
While it is common for the edges in the friendship networks to be binary, the multivari-
ate covariates can be more general. If the data in $\boldsymbol{Y_{IA}}$ are on discrete numerical scales, they can be modeled with other parametric families. For example, we can use $g_2(\theta^{IA}_{ia}) = \exp(\theta^{IA}_{ia})$ as the Poisson inverse link function to model count data in $\boldsymbol{Y_{IA}}$, and thus $p_2(y^{IA}_{ia} \mid \theta^{IA}_{ia})$ becomes the PDF of the Poisson distribution. If the data in $\boldsymbol{Y_{IA}}$ are normally distributed,
then the link function is the identity link. Alternatively, we can model the presence (or
absence) of an edge separately from the weight of the edge (if it is present). For example,
a zero inflated normal distribution was used by Sewell and Chen (2016) to model weighted
edges, and the same goal was achieved by Agarwal and Xue (2020) using a combination of
Bernoulli distribution and a non-parametric weight distribution.
In a similar fashion, the APLSM can be used to handle weighted edges. A zero-inflated Poisson model for the distribution of $y^{IA}_{ia} \mid \theta^{IA}_{ia}$ can be written as:

$$p_3(y^{IA}_{ia} \mid \theta^{IA}_{ia}) = \left(1-\kappa(\theta^{IA}_{ia})\right)^{\mathbb{1}(y^{IA}_{ia}=0)} \times \left(\kappa(\theta^{IA}_{ia})\,\frac{\exp(-\gamma(\theta^{IA}_{ia}))\,\gamma(\theta^{IA}_{ia})^{\,y^{IA}_{ia}}}{y^{IA}_{ia}!}\right)^{\mathbb{1}(y^{IA}_{ia}\neq 0)}$$

$$\kappa(\theta^{IA}_{ia}) = \frac{\exp(\theta^{IA}_{ia})}{1+\exp(\theta^{IA}_{ia})}, \qquad \gamma(\theta^{IA}_{ia}) = \exp(\theta^{IA}_{ia}).$$
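A small numpy/scipy sketch of this zero-inflated edge likelihood, under the reconstruction above (a hypothetical helper, not the authors' code):

```python
import numpy as np
from scipy.special import gammaln

def zip_logpmf(y, theta):
    """Log-likelihood of a weighted edge under the zero-inflated Poisson model.

    kappa(theta) gates whether an edge is present; gamma(theta) is the
    Poisson rate for the weight of a present edge.
    """
    kappa = 1.0 / (1.0 + np.exp(-theta))   # logistic presence probability
    rate = np.exp(theta)                   # Poisson rate gamma(theta)
    log_pois = -rate + y * np.log(rate) - gammaln(y + 1)
    return np.where(y == 0,
                    np.log1p(-kappa),            # no edge observed
                    np.log(kappa) + log_pois)    # present edge with weight y
```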

The Variational Bayesian EM algorithm

The VBEM algorithm for fitting the APLSM can be derived following steps similar to Section 2.2. We start by noting that the posterior distribution of the APLSM is as follows:

$$P(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y_I}, \boldsymbol{Y_{IA}}) = \frac{P(\boldsymbol{Y_I}, \boldsymbol{Y_{IA}} \mid \boldsymbol{U}, \boldsymbol{V})\, P(\boldsymbol{U}, \boldsymbol{V})}{P(\boldsymbol{Y_I}, \boldsymbol{Y_{IA}})}.$$
We assign the following variational posterior distributions: $q(\boldsymbol{u_i}) = N(\tilde{\boldsymbol{u}}_i, \tilde{\Lambda}_0)$ and $q(\boldsymbol{v_a}) = N(\tilde{\boldsymbol{v}}_a, \tilde{\Lambda}_1)$. We set the joint distribution as

$$q(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y_I}, \boldsymbol{Y_{IA}}) = \prod_{i=1}^{N} q(\boldsymbol{u_i}) \prod_{a=1}^{M} q(\boldsymbol{v_a}),$$

where $\tilde{\boldsymbol{u}}_i, \tilde{\Lambda}_0, \tilde{\boldsymbol{v}}_a, \tilde{\Lambda}_1$ are the parameters of the distribution, known as variational parameters.
The Evidence Lower Bound (ELBO) function for the APLSM is (see detailed derivations in the Supplementary Materials)

$$\begin{aligned}
\mathrm{ELBO} &= -\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y_I},\boldsymbol{Y_{IA}})}\left[\log\frac{q(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y_I},\boldsymbol{Y_{IA}})}{p(\boldsymbol{U},\boldsymbol{V},\boldsymbol{Y_I},\boldsymbol{Y_{IA}}\mid\alpha_0,\alpha_1)}\right]\\
&= -\int \prod_{i=1}^{N}q(\boldsymbol{u_i})\prod_{a=1}^{M}q(\boldsymbol{v_a})\,\log\frac{\prod_{i=1}^{N}q(\boldsymbol{u_i})\prod_{a=1}^{M}q(\boldsymbol{v_a})}{f(\boldsymbol{Y_I},\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)\prod_{i=1}^{N}f(\boldsymbol{u_i})\prod_{a=1}^{M}f(\boldsymbol{v_a})}\,d(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)\\
&= -\sum_{i=1}^{N}\int q(\boldsymbol{u_i})\log\frac{q(\boldsymbol{u_i})}{f(\boldsymbol{u_i})}\,d\boldsymbol{u_i}-\sum_{a=1}^{M}\int q(\boldsymbol{v_a})\log\frac{q(\boldsymbol{v_a})}{f(\boldsymbol{v_a})}\,d\boldsymbol{v_a}\\
&\qquad+\int q(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y_I},\boldsymbol{Y_{IA}})\log f(\boldsymbol{Y_I},\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)\,d(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)\\
&= -\sum_{i=1}^{N}\mathrm{KL}[q(\boldsymbol{u_i})\,\|\,f(\boldsymbol{u_i})]-\sum_{a=1}^{M}\mathrm{KL}[q(\boldsymbol{v_a})\,\|\,f(\boldsymbol{v_a})]+\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1\mid\boldsymbol{Y_I},\boldsymbol{Y_{IA}})}[\log f(\boldsymbol{Y_I},\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)]\\
&= -\frac{1}{2}\left(DN\log\lambda_0^2-N\log\det(\tilde{\Lambda}_0)\right)-\frac{N\operatorname{tr}(\tilde{\Lambda}_0)}{2\lambda_0^2}-\frac{\sum_{i=1}^{N}\tilde{\boldsymbol{u}}_i^{T}\tilde{\boldsymbol{u}}_i}{2\lambda_0^2}\\
&\qquad-\frac{1}{2}\left(DM\log\lambda_1^2-M\log\det(\tilde{\Lambda}_1)\right)-\frac{M\operatorname{tr}(\tilde{\Lambda}_1)}{2\lambda_1^2}-\frac{\sum_{a=1}^{M}\tilde{\boldsymbol{v}}_a^{T}\tilde{\boldsymbol{v}}_a}{2\lambda_1^2}+\frac{1}{2}(MD+ND)\\
&\qquad+\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y_I},\boldsymbol{Y_{IA}})}[\log f(\boldsymbol{Y_I},\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V})]. \hspace{3em} (5)
\end{aligned}$$

After applying Jensen's inequality (Jensen, 1906), a lower bound on the third term is given by

$$\begin{aligned}
\mathbb{E}_{q(\boldsymbol{U},\boldsymbol{V}\mid\boldsymbol{Y_I},\boldsymbol{Y_{IA}})}&[\log f(\boldsymbol{Y_I},\boldsymbol{Y_{IA}}\mid\boldsymbol{U},\boldsymbol{V},\alpha_0,\alpha_1)]\\
&\geq \sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\left[\tilde{\alpha}_1-\operatorname{tr}(\tilde{\Lambda}_0)-\operatorname{tr}(\tilde{\Lambda}_1)-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\right]\\
&\quad+\sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N}y_{ij}\left[\tilde{\alpha}_0-2\operatorname{tr}(\tilde{\Lambda}_0)-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\right]\\
&\quad-\sum_{i=1}^{N}\sum_{a=1}^{M}\log\left(1+\frac{\exp(\tilde{\alpha}_1)}{\det(\mathbf{I}+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{1/2}}\exp\left(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)^{T}(\mathbf{I}+2\tilde{\Lambda}_0+2\tilde{\Lambda}_1)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{v}}_a)\right)\right)\\
&\quad-\sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N}\log\left(1+\frac{\exp(\tilde{\alpha}_0)}{\det(\mathbf{I}+4\tilde{\Lambda}_0)^{1/2}}\exp\left(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(\mathbf{I}+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\right)\right). \hspace{2em} (6)
\end{aligned}$$

In addition to $F_{IA}$, we also take into account

$$F_{I} = \sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N}\log\left(1+\frac{\exp(\tilde{\alpha}_0)}{\det(\mathbf{I}+4\tilde{\Lambda}_0)^{1/2}}\exp\left(-(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)^{T}(\mathbf{I}+4\tilde{\Lambda}_0)^{-1}(\tilde{\boldsymbol{u}}_i-\tilde{\boldsymbol{u}}_j)\right)\right).$$

The closed-form update rules for the $(t+1)$th iteration are as follows.

VE-step: Estimate $\tilde{\boldsymbol{u}}_i^{(t+1)}, \tilde{\boldsymbol{v}}_a^{(t+1)}, \tilde{\Lambda}_0^{(t+1)}$ and $\tilde{\Lambda}_1^{(t+1)}$ by maximizing $\mathrm{ELBO}(q(\boldsymbol{U}), q(\boldsymbol{V}), \Theta)$:

$$\begin{aligned}
\tilde{\boldsymbol{u}}_i^{(t+1)} ={}& \left[\left(\frac{1}{2\lambda_0^2}+\sum_{j=1, j\neq i}^{N}(y_{ij}+y_{ji})+\sum_{a=1}^{M}y_{ia}\right)\boldsymbol{I}+\frac{1}{2}\left(\boldsymbol{H}_{I}(\tilde{\boldsymbol{u}}_i^{(t)})+\boldsymbol{H}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\right)\right]^{-1} \hspace{2em} (7)\\
&\times\left[\sum_{j=1, j\neq i}^{N}(y_{ij}+y_{ji})\,\tilde{\boldsymbol{u}}_j+\sum_{a=1}^{M}y_{ia}\tilde{\boldsymbol{v}}_a^{(t)}-\frac{1}{2}\boldsymbol{G}_{I}(\tilde{\boldsymbol{u}}_i^{(t)})+\frac{1}{2}\left(\boldsymbol{H}_{I}(\tilde{\boldsymbol{u}}_i^{(t)})+\boldsymbol{H}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\right)\tilde{\boldsymbol{u}}_i^{(t)}-\frac{1}{2}\boldsymbol{G}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})\right] \hspace{2em} (8)
\end{aligned}$$

$$\tilde{\boldsymbol{v}}_a^{(t+1)} = \left[\left(\frac{1}{2\lambda_1^2}+\sum_{i=1}^{N}y_{ia}\right)\boldsymbol{I}-\frac{1}{2}\boldsymbol{H}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})\right]^{-1}\left[\sum_{i=1}^{N}y_{ia}\tilde{\boldsymbol{u}}_i^{(t)}-\frac{1}{2}\boldsymbol{G}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})\right]$$

$$\tilde{\Lambda}_0^{(t+1)} = \frac{N}{2}\left[\left(\frac{N}{2\lambda_0^2}+2\sum_{i=1}^{N}\sum_{j=1}^{N}y_{ij}+\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}\right)\boldsymbol{I}+\boldsymbol{G}_{I}(\tilde{\Lambda}_0^{(t)})+\boldsymbol{G}_{IA}(\tilde{\Lambda}_0^{(t)})\right]^{-1} \hspace{2em} (9)$$
Algorithm 1: VBEM Estimation procedure
Input: Network adjacency matrix $Y_I$, multivariate covariates matrix $Y_{IA}$, number of dimensions $D$
Result: Model parameters $\Theta = (\tilde{\alpha}_0, \tilde{\alpha}_1)$ and $\tilde{u}_i, \tilde{\Lambda}_0, \tilde{v}_a, \tilde{\Lambda}_1$
1: while $t < N_{\mathrm{iter}}$ and $KL_{\mathrm{dis}} < 0.999999$ do
2:   Compute estimates $\tilde{\alpha}_0$ and $\tilde{\alpha}_1$ using Equations 11 and 12
3:   Compute estimates $\tilde{\Lambda}_0$ and $\tilde{\Lambda}_1$ using Equations 9 and 10
4:   Compute estimates $\tilde{u}_i$ and $\tilde{v}_a$ using Equations 7 and 8
5:   Compute the estimate $KL$ using Equations 5 and 6
6:   $KL_{\mathrm{dis}} \leftarrow KL_t / KL_{t-1}$
7:   $t \leftarrow t + 1$
8: end while
9: return $[\tilde{\alpha}_0, \tilde{\alpha}_1, \tilde{u}_i, \tilde{\Lambda}_0, \tilde{v}_a, \tilde{\Lambda}_1, KL]$

" 𝑁 𝑀
! # −1
𝑀 𝑀 1 ∑︁ ∑︁
Λ̃1(𝑡+1) = + (𝑡)
𝑦𝑖𝑎 𝑰 + 𝑮 𝒊𝒂 ( Λ̃1 ) , (10)
2 2 𝜆21 𝑖=1 𝑎=1

where $\boldsymbol{G}_{I}(\tilde{\boldsymbol{u}}_i^{(t)})$, $\boldsymbol{G}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})$ and $\boldsymbol{G}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})$ are the partial derivatives (gradients) of $F_{I}$ and $F_{IA}$ with respect to $\tilde{\boldsymbol{u}}_i$ and $\tilde{\boldsymbol{v}}_a$, evaluated at $\tilde{\boldsymbol{u}}_i^{(t)}$ and $\tilde{\boldsymbol{v}}_a^{(t)}$, respectively. In $\boldsymbol{G}_{I}(\tilde{\boldsymbol{u}}_i^{(t)})$, the subscript $I$ indicates that the gradient is of the function $F_{I}$, and the argument $\tilde{\boldsymbol{u}}_i^{(t)}$ indicates that the gradient is with respect to $\tilde{\boldsymbol{u}}_i$, evaluated at $\tilde{\boldsymbol{u}}_i^{(t)}$. Similarly, $\boldsymbol{H}_{I}(\tilde{\boldsymbol{u}}_i^{(t)})$, $\boldsymbol{H}_{IA}(\tilde{\boldsymbol{u}}_i^{(t)})$ and $\boldsymbol{H}_{IA}(\tilde{\boldsymbol{v}}_a^{(t)})$ are the corresponding second-order partial derivatives of $F_{I}$ and $F_{IA}$, evaluated at $\tilde{\boldsymbol{u}}_i^{(t)}$ and $\tilde{\boldsymbol{v}}_a^{(t)}$.

M-step: Estimate $\tilde{\alpha}_0^{(t+1)}$ and $\tilde{\alpha}_1^{(t+1)}$ with the following update rules:

$$\tilde{\alpha}_0^{(t+1)} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{N}y_{ij}^{I}-g_{I}(\tilde{\alpha}_0^{(t)})+\tilde{\alpha}_0^{(t)}\,h_{I}(\tilde{\alpha}_0^{(t)})}{h_{I}(\tilde{\alpha}_0^{(t)})} \hspace{3em} (11)$$

$$\tilde{\alpha}_1^{(t+1)} = \frac{\sum_{i=1}^{N}\sum_{a=1}^{M}y_{ia}^{IA}-g_{IA}(\tilde{\alpha}_1^{(t)})+\tilde{\alpha}_1^{(t)}\,h_{IA}(\tilde{\alpha}_1^{(t)})}{h_{IA}(\tilde{\alpha}_1^{(t)})}, \hspace{3em} (12)$$

where $g_{I}(\tilde{\alpha}_0^{(t)})$ and $g_{IA}(\tilde{\alpha}_1^{(t)})$ are the partial derivatives (gradients) of $F_{I}$ and $F_{IA}$ with respect to $\tilde{\alpha}_0$ and $\tilde{\alpha}_1$, evaluated at $\tilde{\alpha}_0^{(t)}$ and $\tilde{\alpha}_1^{(t)}$; and $h_{I}(\tilde{\alpha}_0^{(t)})$ and $h_{IA}(\tilde{\alpha}_1^{(t)})$ are the second-order partial derivatives of $F_{I}$ and $F_{IA}$ with respect to $\tilde{\alpha}_0$ and $\tilde{\alpha}_1$, evaluated at $\tilde{\alpha}_0^{(t)}$ and $\tilde{\alpha}_1^{(t)}$.
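As an informal check of these closed-form updates, one can compare them on a toy data set against a generic optimizer applied to a simplified objective. The sketch below freezes the variational covariances, which reduces the bound to a penalized APLSM log-likelihood over the latent positions and intercepts; it is a naive baseline for such comparisons, not the authors' algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def fit_positions_naive(Y_I, Y_IA, D=2, lam=1.0):
    """Penalized maximum-likelihood fit of the APLSM means (toy check)."""
    N, M = Y_IA.shape

    def neg_objective(x):
        U = x[:N * D].reshape(N, D)
        V = x[N * D:N * D + M * D].reshape(M, D)
        a0, a1 = x[-2], x[-1]
        th_I = a0 - ((U[:, None] - U[None, :]) ** 2).sum(-1)
        th_IA = a1 - ((U[:, None] - V[None, :]) ** 2).sum(-1)
        off = ~np.eye(N, dtype=bool)                 # exclude self-loops
        ll = (Y_I * th_I - np.logaddexp(0, th_I))[off].sum()
        ll += (Y_IA * th_IA - np.logaddexp(0, th_IA)).sum()
        penalty = ((U ** 2).sum() + (V ** 2).sum()) / (2 * lam ** 2)
        return -(ll - penalty)

    x0 = 0.1 * np.random.default_rng(0).normal(size=N * D + M * D + 2)
    return minimize(neg_objective, x0, method='L-BFGS-B').x
```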

Figure 1: The distributions of the AAEs when 𝛼0 = 2 and 𝛼1 = 1.5.

4 Simulation Study
In this section, we conduct a simulation study to evaluate the proposed VBEM algo-
rithm’s performance for the APLSM. We compare APLSM with two baseline approaches,
namely LSM using variational inference proposed by Gollini and Murphy (2016), and LSM
for bipartite networks (BLSM) using variational inference developed in Section 2.2. We ex-
pect that with the inclusion of the information in 𝒀𝑰 𝑨 , the APLSM will exhibit stronger model recovery than the LSM for the social network. Similarly, we expect the APLSM to be more successful in model recovery than the BLSM due to the addition of information from 𝒀𝑰 .
To assess whether the proposed algorithm can recover the true link probabilities, we use
the average absolute error (AAE) between the true link probabilities and the estimated link
probabilities as a metric. The smaller the AAEs, the closer the estimated link probabilities
are to the true link probabilities, indicating better model recovery. To assess whether
the proposed algorithm can recover true 𝛼 values, we look at the differences between the
estimated values and the true values. If the differences are concentrated around mean
0 with a small variance, it will indicate a good recovery. Finally, we assess the fit of the
estimated latent positions by calculating the proportions of the pairwise distances based on

Figure 2: Distributions of the distances between the true and estimated 𝛼0 (left) and 𝛼1 (right) when 𝛼0 = 0.5 and 𝛼1 = 0.

the estimated latent positions to the pairwise distances based on the true latent positions.
Even though the true latent positions cannot be exactly recovered in any latent space
model due to unidentifiability, we expect a successful algorithm to be able to recover the
true distances Sewell and Chen (2015). If the estimated latent positions preserve the nodes’
relative positions in the latent space, we expect the proportion of the estimated pairwise
distances to the true pairwise distances to be close to 1.
The design of the simulation is as follows. We first generate data following Equation 4.
To do this, we set 𝜆0 and 𝜆1 to be 1 and the number of attributes and persons to be 50. We sample the latent positions 𝑼 and 𝑽 from the multivariate normal distributions using the above parameter values. We produce true link probabilities between persons and between attributes and
persons using different sets of 𝛼 values. Different 𝛼 values are associated with different
densities of the data. Then, we generate 𝒀𝑰 and 𝒀𝑰 𝑨 matrices. Each entry of the matrices
is independently generated from the Bernoulli distribution using the corresponding link
probability. We apply the APLSM with the VBEM estimator to the generated data and
obtain the latent positions’ posterior distributions and estimates for the fixed parameters.
We use the posterior means as the point estimates of the latent positions to obtain the
estimated probabilities. We also fit LSM to 𝒀𝑰 and BLSM to 𝒀𝑰 𝑨 . Using the posterior
means as point estimates of the latent positions, we obtain the estimated probabilities for
𝒀𝑰 and 𝒀𝑰 𝑨 , respectively. We repeat this process 200 times and report the results.
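One replicate of this design can be sketched in a few lines of numpy (the α values match Figure 1's setting; P_I_hat and P_IA_hat below are hypothetical names for the estimated probabilities produced by whichever model is fitted):

```python
import numpy as np

rng = np.random.default_rng(1)
N = M = 50
D, lam0, lam1 = 2, 1.0, 1.0
alpha0, alpha1 = 2.0, 1.5

# True latent positions and link probabilities (Equation 4)
U = rng.normal(0, lam0, (N, D))
V = rng.normal(0, lam1, (M, D))
P_I = 1 / (1 + np.exp(-(alpha0 - ((U[:, None] - U[None, :]) ** 2).sum(-1))))
P_IA = 1 / (1 + np.exp(-(alpha1 - ((U[:, None] - V[None, :]) ** 2).sum(-1))))

# Each entry generated independently from the Bernoulli distribution
Y_I = (rng.uniform(size=P_I.shape) < P_I).astype(int)
np.fill_diagonal(Y_I, 0)
Y_IA = (rng.uniform(size=P_IA.shape) < P_IA).astype(int)

def aae(P_true, P_hat):
    """Average absolute error between true and estimated link probabilities."""
    return np.abs(P_true - P_hat).mean()
```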

Figure 3: Distributions of pairwise distance ratios, comparing 𝒖˜𝒊 with 𝒖𝒊 (left) and 𝒗˜𝒂 with 𝒗 𝒂 (right)
when 𝛼0 = −1 and 𝛼1 = 0.5.

In Figure 1, we present the boxplots over 200 simulations of averages of the absolute
differences (AAEs) across entries in 𝒀𝑰 and 𝒀𝑰 𝑨 for APLSM, LSM and BLSM when 𝛼0 = 2
and 𝛼1 = 1.5. We see that the AAEs are generally smaller when an APLSM is fitted to the
data compared to AAEs when an LSM is fitted. Therefore there is a strong improvement in
model recovery using APLSM for 𝒀𝑰 compared with using LSM. Clearly, this improvement
results from the added attribute information that consolidates the latent person position
estimates. Similarly, the AAEs are smaller based on fitting the APLSM than AAEs based
on fitting the BLSM. Again, this suggests a substantial improvement in model recovery using APLSM for 𝒀𝑰 𝑨 compared with using BLSM. In this case, the added social network consolidates the latent person position estimates, resulting in better model recovery for APLSM.
In Figure 2, we present the distributions of the differences between the estimated 𝛼
values and the true 𝛼 values. The true 𝛼 values are .5 and 0 for 𝛼0 and 𝛼1 . The distribution
of the differences for each 𝛼 is centered around 0, indicating little bias in the 𝛼 estimates.
Both also are relatively narrow, implying that the estimated 𝛼 values are precise and close
to the true 𝛼 values.
Finally, in Figure 3, we compare the pairwise distances based on the estimated latent
positions with those based on the true latent positions. The distance between nodes 𝑖 and
𝑗 using estimated 𝒖˜ 𝒊 and 𝒖˜ 𝒋 should be close to the true distance between nodes 𝑖 and 𝑗
using 𝒖 𝒊 and 𝒖 𝒋 if the VBEM estimation algorithm successfully maintains and recovers the
relationship between nodes 𝑖 and 𝑗. As shown in both plots in Figure 3, the distributions
of the ratios of the estimated and true pairwise distances are narrow and centered around
1, implying satisfactory recoveries of the nodes’ relationships with each other through the

estimated latent positions.

5 Application
We apply APLSM to the French financial elite dataset collected by Kadushin and de Quillacq to study the friendship network among top financial elites in France during the last years of the Socialist government (Kadushin, 1995). We first introduce the data and then describe our results in detail.

Data
The data were collected through interviews with people who held leading positions in major financial institutions and frequently appeared in financial section press reports. The friendship information was collected by asking the interviewees to name their friends in the social context. Kadushin and de Quillacq then identified an inner circle of 28 elites from the initial sample based on their influence and their perceived eliteness by other participants. The resulting friendship network is a symmetric adjacency matrix.
The data also contains additional background information, including age, a complex set
of post-secondary education experiences, place of birth, political and cabinet careers, po-
litical party preference, religion, current residence, and club memberships. Two aspects of
the elites' "prestige" are whether the person is named in the social register and whether the person has a particle ("de") in front of either his (no woman was in the inner circle), his wife's, his mother's, or his children's names. Having "de" in the name is associated with nobility. Father's occupation is one of the variables used to reflect an elite's social class. A father's occupation is considered "high" if the father is in higher management, a
professional, an industrialist, or an investor. Unfortunately, upon communications with
the original author, we found that the coding procedures regarding some variables have
been lost, including Finance Ministry information, religion, etc. We end up with 13 binary
variables including information on education ("Science Po", "Polytechnique", "University"
and “Ecole Nationale d’Administration”), career (“Inspection General de Finance” and
“Cabinet”), class (“Social Register”, “Father Status”, “Particule”), politics (“Socialist”,
“Capitalist” and “Centrist”) and “Age” after excluding the lost or the unrelated infor-
mation, i.e., mason and location, which are not associated with the social network based
on Kadushin (1995) (location is not considered to be related to the social network after
adjusting for multiple comparisons). “Age” was converted into a binary variable following

Kadushin (1995), where a group of elites was considered of older age with an average birth
year of 1938. We will use ENA as an abbreviation for Ecole Nationale d’Administration.
The Science Po and the other educational variables warrant further explanation. The Science Po, or the Institut d'Etudes Politiques de Paris, prepares students for the entrance exam of the ENA. An alternative to the Science Po is the (Ecole) Polytechnique, a French military school whose graduates often enter one of the technical ministries. Both the Science Po and the Polytechnique are called Grandes Ecoles. A Grandes Ecoles education is highly respected in France, as it leads to membership in the ENA, from which the grands corps, the French civil service elites (including the Inspection General de Finance), recruit their members (Kadushin, 1995).

Previous Approaches
The authors in Kadushin (1995) first used multidimensional scaling to draw the friend-
ship network’s sociogram. Then they applied Quadratic Assignment Procedure regressions
and correlations to test each background variable’s association with the social network.
Based on the social network, two clusters were identified, which the authors called the left
and the right moieties. The dependence between the social network and background infor-
mation was understood through comparisons of the elites between the left and the right
moiety. The elites in the right moiety were found to have a higher social class (upper-class
parentage with high social standing), to be older (average birth year of 1929), and to have
fewer appointments in public offices. The left moiety elites were more likely to be ENA
graduates, grand corps members, cabinet members, treasury service members, socialists,
and younger (average birth year of 1938).
Using the APLSM, we construct a joint latent space that allows us to jointly model elites' friendship connections and their background information. We also replicate Kadushin (1995)'s left and right moieties, adding a simultaneous interpretation of the division in the elite circle. Furthermore, we observe an additional division within the left moiety using APLSM, which provides opportunities for new hypotheses.

Analyses and Findings


We fit the APLSM to the social network 𝒀𝑰 and the covariates 𝒀𝑰 𝑨 of the French
financial elites dataset. For comparison, we also fit the social network 𝒀𝑰 and the covariates
𝒀𝑰 𝑨 separately using LSM. We fit the latent space model to the friendship network using

Figure 4: The estimated latent person and attribute positions, 𝒖˜𝒊 and 𝒗˜𝒂 based on the BLSM (left) and
the APLSM (right). The white ellipses represent 80% approximate credible intervals for the ṽ𝑎 , and the
grey ellipses represent 80% approximate credible intervals for the ũ𝑖 . The latent positions of the attributes
Center, Right, Social register, ENA and Socialist are colored as blue, green, green, purple and red. The
black edges represent the friendship edges.

the variational inference proposed in Gollini and Murphy (2016). For the multivariate
covariates, we fit the bipartite latent space model (BLSM) using the variational inference
method we developed in Section 2.2.
In Figure 4, we present the estimated latent person and attribute positions, 𝒖˜ 𝒊 and 𝒗˜𝒂
from fitting the BLSM (left) and from fitting the APLSM (right). In Figure 5, we compare the resulting latent person positions using only the social network, 𝒀𝑰 (left), only
the multivariate covariates, 𝒀𝑰 𝑨 (middle), and both the social network and the multivariate
covariates, 𝒀𝑰 and 𝒀𝑰 𝑨 (right). An angular rotation of the latent friendship space is applied
to match the latent person positions in the joint latent space. A congruence coefficient of
0.96 was found between the two sets of latent positions. Used as a measure of dimension
similarity, a congruence coefficient of 0.95 and above indicates that the dimensions are
identical. A congruence coefficient of 0.91 was found between the latent person positions
using the BLSM and the latent person positions using the APLSM. As expected, the latent
person positions’ structure using APLSM more closely resembles the structure using LSM
than that using BLSM.

Figure 5: The estimated latent person positions 𝒖˜𝒊 in the friendship latent space using only the social
network 𝒀𝑰 (left), using only the multivariate covariates 𝒀𝑰 𝑨 (middle) and in the joint latent space using
both 𝒀𝑰 and 𝒀𝑰 𝑨 (right). The edges are the observed friendship connections. The grey ellipses represent
80% approximate credible intervals for the ũ𝑖 . The numbers represent the randomly assigned indices for
the French elites.

Figure 6: The ROC curves for 𝒀𝑰 and 𝒀𝑰 𝑨 using the APLSM

Assess Model Fit

To assess the model fit of APLSM to the data, we obtain the area under the receiver
operating characteristic curve (AUC) of predicting the presence or absence of a link from
the estimated link probabilities. The receiver operating characteristic curves (ROCs) and
the AUC values for 𝒀𝑰 and 𝒀𝑰 𝑨 are presented in Figure 6. The results show satisfactory
fit for both matrices. In addition, we assess whether the APLSM captures the structure of the social network, the dependencies among the attributes, and the persons' attribute information more generally.
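The AUC computation itself is routine; for instance, with scikit-learn (a sketch: Y is the observed matrix and P the matrix of estimated link probabilities; for the rectangular 𝒀𝑰 𝑨 all entries would be used rather than masking a diagonal):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def network_auc(Y, P):
    """AUC of predicting edge presence from estimated link probabilities,
    excluding the diagonal (no self-loops in the social network)."""
    mask = ~np.eye(Y.shape[0], dtype=bool)
    return roc_auc_score(Y[mask].ravel(), P[mask].ravel())
```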
In Figure 7, we assess whether the APLSM can capture the friendship structure in the
social network. In the left panel, we compare the elites' distances to the center with the
sum totals of their friendship connections. We plot the rankings of the distances to center
against the rankings of the total friendship connections. In this way, we assess whether the
APLSM preserves the elites’ social hierarchy without being distracted by the distribution
differences between Euclidean distances and friendship counts. A solid reference line is
drawn to illustrate the relationship between the two. The intercept of the solid line equals 𝑁, the highest ranking, and the slope equals −1. As the solid line roughly goes through
the center of the scatter plot, we know that the rankings of the distances to the center
mirror the rankings of the friendship counts in the opposite direction. The Spearman’s
rank correlation between the Euclidean distances and friendship counts is −0.53.
In the right panel of Figure 7, we present the box plot of elites' pairwise distances in the absence of friendship versus when they are friends. We can see that the pairwise distances
are generally smaller between elites who are friends with a median of 0.75, compared to
the pairwise distances between elites who are not friends with a median of 1.60. This
observation shows that pairwise distances between elites using the APLSM distinguish
between whether the elites are friends or not.
In Figure 8, we assess the APLSM’s ability to capture the dependencies among at-
tributes. In the left panel, we plot the rankings of attributes’ distances to center against
the rankings of the attribute sum scores. As can be seen, the two types of rankings roughly
follow the solid line with an intercept of 𝑀, the highest ranking, and a slope of −1, sug-
gesting a mirroring of the two rankings in the opposite direction. The Spearman’s rank
correlation between the attribute’s distances to center and attribute sum scores is −0.92.
This observation suggests that attributes’ distances to the center capture the overall re-
sponse rates of the attributes. In the right panel of Figure 8, we plot the rankings of the
pairwise distances against the rankings of the attribute correlation values. As can be seen,
the two types of rankings center around the dashed line with an intercept of 80, the highest

Figure 7: Assess the fit of the person latent positions. The left panel shows the rankings of nodes’ distances
to center against the rankings of the nodes’ total friendship connections. The solid line is a reference line
with an intercept of 𝑁 and a slope of −1. The right panel shows the pairwise distances between pairs of
nodes when they are friends versus when they are not friends.

ranking, and a slope of −1, suggesting a mirroring of the two rankings in the opposite
direction; the Spearman’s rank correlation between the two is −0.62. This result shows
that the pairwise distances between attributes using the APLSM capture the correlations
between the attributes.
Using Figure 9 (left), we assess whether the APLSM is able to differentiate the presence
of an attribute from the absence. We compare the pairwise distances between attributes
and persons when the attribute is present to the pairwise distances between persons and
attributes when the attribute is absent. We note that when the attributes are present,
the distances are generally smaller with a median of 1.13, than those when the attributes
are absent, with a median of 1.71. Therefore, the pairwise distances between persons and attributes using the APLSM distinguish the presence of attributes from their absence.

Redefine the Left and Right Moiety

To replicate the left and right moiety in Kadushin (1995), we apply K-means clustering
(Hartigan et al., 1979) to the latent person and attribute positions with 𝑘 = 2. The
algorithm partitions the latent positions into 𝑘 groups which minimize the sum of squares
within each group. The k-means clustering is performed with the kmeans function in the kknn R package, which implements the Hartigan-Wong algorithm (Hechenbichler and Schliep,

Figure 8: Assess the fit of the attribute latent positions. The left panel shows the rankings of attributes’
distances to center against the rankings of the attributes’ sum scores. The right panel shows the rankings
of attributes’ pairwise distances against the rankings of the attributes’ correlations.

2004). We run the K-means algorithm with 100 random starting positions and take the
solution which optimizes the objective function. The resulting two clusters are shown in
Figure 10 explaining 50.4% of the total variance in the data (defined as the proportion of
the within-cluster sum of squares to the total sum of squares). The two clusters contain
both attributes and persons.
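This clustering step is reproducible with any standard k-means routine; for example, with scikit-learn (100 random restarts, as in the text; U_hat and V_hat denote the fitted posterior-mean positions, assumed available from the APLSM fit):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stack person and attribute positions into one joint set of points
positions = np.vstack([U_hat, V_hat])      # shapes (N, D) and (M, D)
km = KMeans(n_clusters=2, n_init=100, random_state=0).fit(positions)
labels = km.labels_                        # joint person/attribute clusters

# Share of the total sum of squares explained by the clustering
total_ss = ((positions - positions.mean(axis=0)) ** 2).sum()
explained = 1.0 - km.inertia_ / total_ss   # inertia_ = within-cluster SS
```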
In Figure 10, we show the latent person and attribute positions colored by the resulting
two clusters. The solid lines represent the observed friendship connections between pairs
of elites. We can see that a more densely connected network can be found on the left side
than the right. Figure 9 (right) shows the probability of a pair of elites being friends in each
cluster and between the two clusters. The median probability of friendship connections in
the left cluster is roughly 0.29; the median probability in the right cluster is roughly 0.22;
while between the two clusters, the median probability is roughly 0.05. This result shows
that there is slightly stronger connectivity among elites in the left cluster than in the right
cluster. Also, there is a separation of French elites based on social connectivity into the
two clusters.
In Figure 10, we also see that the attributes Socialist, Inspection General de Finance, ENA,
Cabinet, University, Polytechnique and Center (Centrists) are found in the left (red) cluster,
suggesting that the elites in this cluster are more likely to obtain higher education, hold
top office positions and identify as socialists or centrists. On the other hand, attributes

Figure 9: (Left) The pairwise distances between attributes and persons when the attributes are present versus when the attributes are absent. (Right) The box plot showing the probability of a pair of elites being friends in the left and the right clusters and between the two clusters.

such as Right (Capitalist), Father Status, Social Register, Particule, Science Po, and Age
are found in the right (green) cluster, which suggest that the elites in the right cluster are
more likely to be of higher class (being present in the social register, having particule in
their names and fathers of higher status), older age and identify as politically right.
Two clusters based on APLSM largely correspond to the left and right moieties found
in Kadushin (1995). However, in contrast to Kadushin (1995), the two clusters here are
joint clusters with both attributes and persons. The addition of the attributes provides
more meaning to the division of the French elites. We see the division of the social circles based on party affiliations, class, education, and career. This finding was also observed in the previous study, though Kadushin (1995) compared the background information for elites between the left and right moieties after the analysis of the network (12.1% of the variance in only four attribute variables was explained through regression). Because we are able to use relevant attribute information in the clustering process, there is more confidence in our observation of the division. In addition, the approximate credible intervals of the person and attribute nodes using the APLSM are much smaller than those using the BLSM and the LSM. The drivers behind the division emerge automatically and systematically from the data.

Figure 10: The division of the French financial elites into two clusters. The left panel displays the two
clusters of elites with positions of key attributes. The right panel displays latent attribute positions. Both
the latent person and latent attribute positions are part of the joint latent space shown in Figure 4. The
ellipses represent 80% approximate credible intervals for the latent positions. The solid lines represent the
observed friendship connections between pairs of elites.

Career, Politics and Class

The APLSM captures how social relations between French elites and their career, pol-
itics, and class information are related. In the previous section, we have replicated the
division in the French financial elites between the left and right moiety. Though not men-
tioned in the previous study (Kadushin, 1995), we believe that the presence of a third cluster might be justified given the visible separation between the ENA attribute and the Polytechnique attribute in the left moiety and our prior knowledge about the French education system. Therefore, we run the k-means clustering algorithm with 𝑘 = 3. The algorithm is again implemented with the kmeans function in the kknn R package (Hechenbichler and Schliep, 2004) with 100 random starting positions. Approximately 63.9% of the total variance is explained by the three clusters, with 13.5% of the variance explained by the added cluster. As before, the resulting three clusters contain both attributes and persons.
Figure 11 displays the estimated latent attribute and person positions (same as before)
colored by the resulting three clusters. The additional (blue) cluster is found with the
Polytechnique and Center attributes, named the Polytechnique cluster. The red cluster,
part of the left moiety, is centered around ENA attributes, called the ENA cluster. The
Science Po attribute is now part of the ENA (red) cluster. The green cluster, part of the
right moiety, is again centered around high-class attributes, called the HighClass cluster.

Figure 11: The division of the French financial elites into three clusters. The left panel displays the three
clusters of elites with positions of key attributes. The right panel displays latent attribute positions. Both
the latent person and latent attribute positions are part of the joint latent space. The ellipses represent
80% approximate credible intervals for the latent positions. The solid lines represent the observed friendship
connections between pairs of elites.

As noted earlier, the Polytechnique, an alternative to the Science Po, indicates an alternative career path for the French financial elites.
We display the estimated probabilities of the presence of an attribute by elites in each of
the three clusters in Figure 12. For the ENA attribute, we observe clear distinctions among
the three clusters, with the highest probabilities found for elites in the ENA cluster and the
lowest probabilities found for elites in the HighClass cluster. For party affiliations (Socialist,
Centrist, and Capitalist), we observe higher probabilities of socialists being in the ENA
cluster than the Polytechnique cluster and extremely low probabilities of socialists being
in the HighClass cluster. We observe slightly higher probabilities of capitalists being in the
HighClass cluster than the ENA cluster, though both probabilities are much higher than
capitalists being in the Polytechnique cluster. We also observe much higher probabilities
of centrists being Polytechnique graduates than being ENA graduates or from a high-class
background. For class background and being part of the social register, we observe lower
probabilities of being in the social register for elites in the ENA and the Polytechnique
clusters than for elites in the HighClass cluster. This additional information can be used to generate new hypotheses for a further understanding of the French elite class.
Figure 12: Boxplots showing the probabilities of the feature attributes being indicated as positive in each of the three clusters.

Figure 13: Boxplot showing the posterior probability of an edge in the social network between elites within the same HighClass, ENA, and Polytechnique clusters and between each pair of clusters.

In Figure 13, we display the probability of a pair of elites being friends with a person from the same cluster and with someone from each of the other two clusters. The median probabilities of friendship connections within the HighClass, ENA, and Polytechnique clusters are 0.36, 0.32, and 0.36, respectively. The median probabilities between clusters are as follows: 0.06 between the HighClass cluster and the Polytechnique cluster, 0.02 between the HighClass cluster and the ENA cluster, and 0.11 between the ENA cluster and the Polytechnique cluster. Clearly, there is more intra-cluster connectivity than inter-cluster connectivity for all clusters. Among the between-cluster connections, the strongest connectivity is between elites in the ENA cluster and the Polytechnique cluster, and the weakest is between elites in the ENA cluster and the HighClass cluster, which makes sense because one obtains such a position either through one's education (ENA) or through the connections of one's family.
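As a minimal sketch of how such model-implied probabilities can be computed from a fitted APLSM (assuming estimates alpha0_hat, alpha1_hat, U_hat, and V_hat; all names here are hypothetical):

## Posterior probability of a friendship edge between elites i and j, and
## of elite i carrying attribute a, under the fitted latent space model:
##   logit P(y_ij = 1) = alpha0 - ||u_i - u_j||^2
##   logit P(y_ia = 1) = alpha1 - ||u_i - v_a||^2
edge_prob <- function(i, j) plogis(alpha0_hat - sum((U_hat[i, ] - U_hat[j, ])^2))
attr_prob <- function(i, a) plogis(alpha1_hat - sum((U_hat[i, ] - V_hat[a, ])^2))

Taking medians of edge_prob over pairs within and between the estimated clusters yields summaries of the kind reported above.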
The left panel of Figure 14 displays the latent person positions of socialist (red), centrist (blue), and capitalist (green) elites along with the latent positions of the attributes Socialist, Center, and Right. Elites with no party affiliation are shown as gray
circles in the joint latent space. As can be seen, elites with different party affiliations are
positioned near the corresponding attributes and are far apart in the joint latent space.
The right panel of Figure 14 displays the latent positions of ENA graduates (purple) and
non-ENA graduates (white) along with the attribute ENA. Again, we see a clear separation

Figure 14: The latent person positions colored by key attributes: party (left) and ENA (right). The left
panel displays the latent positions of socialist (red), centrist (blue) and capitalist (green) elites along with
the attributes Socialist, Center and Right. Four French elites indicate no party affiliations and are shown
as gray circles in the joint latent space. The right panel displays the latent positions of ENA graduates
(purple) along with the attribute ENA. Non-ENA graduates are shown as white circles in the joint latent
space. The ellipses represent 80% approximate credible intervals for the latent positions.

between ENA graduates and non-ENA graduates as the ENA graduates center around the
ENA attribute.

6 Discussion and conclusion


The APLSM outlined in this article constitutes a principled strategy for jointly analyzing social networks and high dimensional multivariate covariates. We have argued for, and presented evidence that, a joint analysis of friendships and individual outcomes is crucial to understanding human behaviors. In particular, using the APLSM, we have analyzed the French financial elite data, replicated the division of the elite circle into a left and a right moiety, and explained and consolidated this division using career, class, and political information. Furthermore, we identified an additional cluster in the joint latent space. We therefore believe that the APLSM can significantly help researchers understand how friendships and multivariate covariates are intertwined, and that it can inspire further model development in this area.

Supplementary Materials
The supplementary materials contain detailed derivations of the variational Bayesian EM algorithm for the proposed model, the APLSM, parts of which are used to estimate the BLSM.

References
Agarwal, A., and Xue, L. (2020), “Model-based clustering of nonparametric weighted networks with ap-
plication to water pollution analysis,” Technometrics, 62, 161–172.
Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008), “Mixed membership stochastic
blockmodels,” Journal of Machine Learning Research, 9, 1981–2014.
Albert, R., and Barabási, A.-L. (2002), “Statistical mechanics of complex networks,” Reviews of modern
physics, 74, 47.
Arroyo, J., Athreya, A., Cape, J., Chen, G., Priebe, C. E., and Vogelstein, J. T. (2019), “Inference for
multiple heterogeneous networks with a common invariant subspace,” arXiv preprint arXiv:1906.10026.
Attias, H. (1999), “Inferring parameters and structure of latent variable models by variational Bayes,”
in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, Morgan Kaufmann
Publishers Inc., pp. 21–30.
Austin, A., Linkletter, C., and Wu, Z. (2013), “Covariate-defined latent space random effects model,”
Social networks, 35, 338–346.
Barabási, A.-L., and Albert, R. (1999), “Emergence of scaling in random networks,” science, 286, 509–512.
Barbillon, P., Donnet, S., Lazega, E., and Bar-Hen, A. (2015), “Stochastic block models for multiplex
networks: an application to networks of researchers,” arXiv preprint arXiv:1501.06444.
Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970), “A maximization technique occurring in the
statistical analysis of probabilistic functions of Markov chains,” The annals of mathematical statistics,
41, 164–171.
Beal, M. J., and Ghahramani, Z. (2006), “Variational Bayesian learning of directed graphical models with hidden variables,” Bayesian Analysis, 1, 793–831.
Beal, M. J. (2003), Variational algorithms for approximate Bayesian inference, PhD thesis, University of London.
Bickel, P. J., and Chen, A. (2009), “A nonparametric view of network models and Newman–Girvan and
other modularities,” Proceedings of the National Academy of Sciences, 106, 21068–21073.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017), “Variational inference: A review for statisticians,”
Journal of the American Statistical Association, 112, 859–877.
Boccaletti, S., Bianconi, G., Criado, R., Del Genio, C. I., Gómez-Gardeñes, J., Romance, M., Sendina-
Nadal, I., Wang, Z., and Zanin, M. (2014), “The structure and dynamics of multilayer networks,” Physics
Reports, 544, 1–122.
Borgatti, S. P., Mehra, A., Brass, D. J., and Labianca, G. (2009), “Network analysis in the social sciences,”
science, 323, 892–895.

Bramoullé, Y., Djebbari, H., and Fortin, B. (2009), “Identification of peer effects through social networks,”
Journal of econometrics, 150, 41–55.
Bullmore, E., and Sporns, O. (2009), “Complex brain networks: graph theoretical analysis of structural
and functional systems,” Nature Reviews Neuroscience, 10, 186–198.
Carrington, P. J., Scott, J., and Wasserman, S. (2005), Models and methods in social network analysis,
vol. 28, Cambridge university press.
Celisse, A., Daudin, J. J., and Pierre, L. (2012), “Consistency of maximum-likelihood and variational
estimators in the stochastic block model,” Electronic Journal of Statistics, 6, 1847–1899.
D’Angelo, S., Alfò, M., and Murphy, T. B. (2018), “Node-specific effects in latent space modelling of
multidimensional networks,” in 49th Scientific meeting of the Italian Statistical Society.
Daudin, J. J., Picard, F., and Robin, S. (2008), “A mixture model for random graphs,” Stat Comput, 18,
173–183.
Dean, D. O., Bauer, D. J., and Prinstein, M. J. (2017), “Friendship Dissolution Within Social Networks
Modeled Through Multilevel Event History Analysis,” Multivariate behavioral research, 52, 271–289.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum likelihood from incomplete data via
the EM algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), 1–38.
Dorans, N., and Drasgow, F. (1978), “Alternative weighting schemes for linear prediction,” Organizational
Behavior and Human Performance, 21, 316–345.
Ferrara, E., Interdonato, R., and Tagarelli, A. (2014), “Online popularity and topical interests through the
lens of instagram,” in Proceedings of the 25th ACM conference on Hypertext and social media, ACM,
pp. 24–34.
Fosdick, B. K., and Hoff, P. D. (2015), “Testing and modeling dependencies between a network and nodal
attributes,” Journal of the American Statistical Association, 110, 1047–1056.
Frank, K. A., Zhao, Y., and Borman, K. (2004), “Social capital and the diffusion of innovations within
organizations: The case of computer technology in schools,” Sociology of education, 77, 148–171.
Fratiglioni, L., Wang, H.-X., Ericsson, K., Maytan, M., and Winblad, B. (2000), “Influence of social network
on occurrence of dementia: a community-based longitudinal study,” The lancet, 355, 1315–1319.
Friel, N., Rastelli, R., Wyse, J., and Raftery, A. E. (2016), “Interlocking directorates in Irish companies
using a latent space model for bipartite networks,” Proceedings of the National Academy of Sciences,
113, 6629–6634.
Fujimoto, K., Wang, P., and Valente, T. W. (2013), “The decomposed affiliation exposure model: A
network approach to segregating peer influences from crowds and organized sports,” Network Science,
1, 154–169.
Girvan, M., and Newman, M. E. (2002), “Community structure in social and biological networks,” Pro-
ceedings of the national academy of sciences, 99, 7821–7826.
Goldsmith-Pinkham, P., and Imbens, G. W. (2013), “Social networks and the identification of peer effects,”
Journal of Business & Economic Statistics, 31, 253–264.
Gollini, I., and Murphy, T. B. (2016), “Joint modeling of multiple network views,” Journal of Computational
and Graphical Statistics, 25, 246–265.
Guhaniyogi, R., and Rodriguez, A. (2020), “Joint modeling of longitudinal relational data and exogenous variables,” Bayesian Analysis, 15, 477–503.
Handcock, M. S., Raftery, A. E., and Tantrum, J. M. (2007), “Model-based clustering for social networks,”

J. Roy. Statist. Soc. Ser. A, 170, 301–354.
Hartigan, J. A., and Wong, M. A. (1979), “Algorithm AS 136: A k-means clustering algorithm,” Applied Statistics, 28, 100–108.
He, X., Kan, M.-Y., Xie, P., and Chen, X. (2014), “Comment-based multi-view clustering of web 2.0
items,” in Proceedings of the 23rd international conference on World wide web, ACM, pp. 771–782.
Hechenbichler, K., and Schliep, K. (2004), “Weighted k-nearest-neighbor techniques and ordinal classification,” Discussion Paper 399, SFB 386, Ludwig-Maximilians-Universität München.
Henderson, H. V., and Searle, S. R. (1981), “On deriving the inverse of a sum of matrices,” Siam Review,
23, 53–60.
Hoff, P. (2008), “Modeling homophily and stochastic equivalence in symmetric relational data,” in Advances
in neural information processing systems, pp. 657–664.
Hoff, P. D. (2005), “Bilinear mixed-effects models for dyadic data,” Journal of the american Statistical
association, 100, 286–295.
— (2009), “Multiplicative latent factor models for description and prediction of social networks,” Compu-
tational and mathematical organization theory, 15, 261.
— (2018), “Additive and multiplicative effects network models,” arXiv preprint arXiv:1807.08038.
Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002), “Latent space approaches to social network
analysis,” Journal of the american Statistical association, 97, 1090–1098.
Huang, W., Liu, Y., Chen, Y. et al. (2020), “Mixed membership stochastic blockmodels for heterogeneous
networks,” Bayesian Analysis, 15, 711–736.
Jackson, M. O. (2008), Social and economic networks, Princeton University Press.
Jensen, J. L. W. V. (1906), “Sur les fonctions convexes et les inégalités entre les valeurs moyennes,” Acta
mathematica, 30, 175–193.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999), “An introduction to variational
methods for graphical models,” Machine learning, 37, 183–233.
Kadushin, C. (1995), “Friendship among the French financial elite,” American Sociological Review, 60, 202–221.
Kéfi, S., Miele, V., Wieters, E. A., Navarrete, S. A., and Berlow, E. L. (2016), “How structured is the
entangled bank? The surprisingly simple organization of multiplex ecological networks leads to increased
persistence and resilience,” PLoS biology, 14, e1002527.
Kim, Y., and Srivastava, J. (2007), “Impact of social influence in e-commerce decision making,” in Pro-
ceedings of the ninth international conference on Electronic commerce, ACM, pp. 293–302.
Kivelä, M., Arenas, A., Barthelemy, M., Gleeson, J. P., Moreno, Y., and Porter, M. A. (2014), “Multilayer
networks,” Journal of Complex Networks, 2, 203–271.
Krivitsky, P. N., and Handcock, M. S. (2008), “Fitting position latent cluster models for social networks with latentnet,” Journal of Statistical Software, 24(5), 1–23.
Krivitsky, P. N., Handcock, M. S., Raftery, A. E., and Hoff, P. D. (2009), “Representing degree distribu-
tions, clustering, and homophily in social networks with latent cluster random effects models,” Social
networks, 31, 204–213.
Kwon, K. H., Stefanone, M. A., and Barnett, G. A. (2014), “Social network influence on online behavioral
choices: exploring group formation on social network sites,” American Behavioral Scientist, 58, 1345–
1360.
Lazer, D. (2011), “Networks in political science: Back to the future,” PS: Political Science & Politics, 44,

61–68.
Leenders, R. T. A. (2002), “Modeling social influence through network autocorrelation: constructing the
weight matrix,” Social networks, 24, 21–47.
Liu, X., Liu, W., Murata, T., and Wakita, K. (2014), “A framework for community detection in heteroge-
neous multi-relational networks,” Advances in Complex Systems, 17, 1450018.
Lusher, D., Koskinen, J., and Robins, G. (2013), Exponential random graph models for social networks:
Theory, methods, and applications, Cambridge University Press.
Ma, Z., Ma, Z., and Yuan, H. (2020), “Universal Latent Space Model Fitting for Large Networks with
Edge Covariates.” Journal of Machine Learning Research, 21, 1–67.
Matias, C., and Miele, V. (2016), “Statistical clustering of temporal networks through a dynamic stochastic
block model,” Journal of the Royal Statistical Society: Series B (Statistical Methodology).
McPherson, M., Smith-Lovin, L., and Cook, J. M. (2001), “Birds of a feather: Homophily in social net-
works,” Annual review of sociology, 27, 415–444.
Mele, A., Hao, L., Cape, J., and Priebe, C. E. (2019), “Spectral inference for large Stochastic Blockmodels
with nodal covariates,” arXiv preprint arXiv:1908.06438.
Mercken, L., Snijders, T. A., Steglich, C., Vartiainen, E., and De Vries, H. (2010), “Dynamics of adolescent
friendship networks and smoking behavior,” Social networks, 32, 72–81.
Mucha, P. J., Richardson, T., Macon, K., Porter, M. A., and Onnela, J. P. (2010), “Community structure
in time-dependent, multiscale, and multiplex networks,” Science, 328, 876–878.
Nickel, M., Murphy, K., Tresp, V., and Gabrilovich, E. (2016), “A review of relational machine learning
for knowledge graphs,” Proceedings of the IEEE, 104, 11–33.
Paul, S., and Chen, Y. (2016), “Consistent community detection in multi-relational data through restricted
multi-layer stochastic blockmodel,” Electronic Journal of Statistics, 10, 3807–3870.
Paul, S., and Chen, Y. (2020a), “A random effects stochastic block model for joint community detection in multiple networks with applications to neuroimaging,” Annals of Applied Statistics, 14, 993–1029.
— (2020b), “Spectral and matrix factorization methods for consistent community detection in multi-layer
networks,” The Annals of Statistics, 48, 230–250.
Robins, G., Pattison, P., and Elliott, P. (2001), “Network models for social influence processes,” Psychome-
trika, 66, 161–189.
Rubinov, M., and Sporns, O. (2010), “Complex network measures of brain connectivity: uses and inter-
pretations,” Neuroimage, 52, 1059–1069.
Salter-Townshend, M., and McCormick, T. H. (2017), “Latent space models for multiview network data,”
The annals of applied statistics, 11, 1217.
Salter-Townshend, M., and Murphy, T. B. (2013), “Variational Bayesian inference for the latent position
cluster model for network data,” Computational Statistics & Data Analysis, 57, 661–671.
Sengupta, S., and Chen, Y. (2015), “Spectral clustering in heterogeneous networks,” Statistica Sinica,
1081–1106.
Sewell, D. K., and Chen, Y. (2015), “Latent space models for dynamic networks,” Journal of the American Statistical Association, 110, 1646–1657.
— (2016), “Latent space models for dynamic networks with weighted edges,” Social Networks, 44, 105–116.
Shalizi, C. R., and Thomas, A. C. (2011), “Homophily and contagion are generically confounded in obser-

vational social network studies,” Sociological methods & research, 40, 211–239.
Shmulevich, I., Dougherty, E. R., Kim, S., and Zhang, W. (2002), “Probabilistic Boolean networks: a
rule-based uncertainty model for gene regulatory networks,” Bioinformatics, 18, 261–274.
Sun, Y., Yu, Y., and Han, J. (2009), “Ranking-based clustering of heterogeneous information networks with
star network schema,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge
discovery and data mining, ACM, pp. 797–806.
Sweet, T., and Adhikari, S. (2020), “A Latent Space Network Model for Social Influence,” Psychometrika,
1–24.
Sweet, T. M. (2015), “Incorporating covariates into stochastic blockmodels,” Journal of Educational and
Behavioral Statistics, 40, 635–664.
VanderWeele, T. J. (2011), “Sensitivity analysis for contagion effects in social networks,” Sociological
Methods & Research, 40, 240–255.
VanderWeele, T. J., and An, W. (2013), “Social networks and causal inference,” in Handbook of causal
analysis for social research, Springer, pp. 353–374.
Wasserman, S., and Faust, K. (1994), Social network analysis: Methods and applications, vol. 8, Cambridge University Press.
Watts, D. J., and Strogatz, S. H. (1998), “Collective dynamics of ‘small-world’ networks,” nature, 393, 440–442.
Xu, K. S., Kliger, M., and Hero Iii, A. O. (2014), “Adaptive evolutionary clustering,” Data Mining and
Knowledge Discovery, 28, 304–336.
Zhang, X., Xue, S., and Zhu, J. (2020), “A Flexible Latent Space Model for Multilayer Networks,” in
International Conference on Machine Learning, PMLR, pp. 11288–11297.

Supplementary Material for “Joint Latent Space Model for Social Networks with Multivariate Attributes”

7 The Estimation Procedure for APLSM


7.1 Derivation of KL Divergence
We set the variational parameters as $\tilde{\Theta} = \{\tilde{\alpha}_0, \tilde{\alpha}_1, \tilde{\boldsymbol{u}}_i, \tilde{\Lambda}_0, \tilde{\boldsymbol{v}}_a, \tilde{\Lambda}_1\}$, where $q(\boldsymbol{u}_i) = N(\tilde{\boldsymbol{u}}_i, \tilde{\Lambda}_0)$ and $q(\boldsymbol{v}_a) = N(\tilde{\boldsymbol{v}}_a, \tilde{\Lambda}_1)$. We set the variational posterior as:
\[
q(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA}) = \prod_{i=1}^{N} q(\boldsymbol{u}_i) \prod_{a=1}^{M} q(\boldsymbol{v}_a).
\]

The Kullback–Leibler divergence between the variational posterior and the true posterior is:
\[
\begin{aligned}
&\mathrm{KL}\bigl[q(\boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1 \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA}) \,\big\|\, f(\boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1 \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA})\bigr] \\
&\quad= \int q(\boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1 \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA}) \log \frac{q(\boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1 \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA})}{f(\boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1 \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA})} \, d(\boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1) \\
&\quad= \int \prod_{i=1}^{N} q(\boldsymbol{u}_i) \prod_{a=1}^{M} q(\boldsymbol{v}_a) \log \frac{\prod_{i=1}^{N} q(\boldsymbol{u}_i) \prod_{a=1}^{M} q(\boldsymbol{v}_a)}{f(\boldsymbol{Y}_I, \boldsymbol{Y}_{IA} \mid \boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1) \prod_{i=1}^{N} f(\boldsymbol{u}_i) \prod_{a=1}^{M} f(\boldsymbol{v}_a)} \, d(\boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1) \\
&\quad= \sum_{i=1}^{N} \int q(\boldsymbol{u}_i) \log \frac{q(\boldsymbol{u}_i)}{f(\boldsymbol{u}_i)} \, d\boldsymbol{u}_i + \sum_{a=1}^{M} \int q(\boldsymbol{v}_a) \log \frac{q(\boldsymbol{v}_a)}{f(\boldsymbol{v}_a)} \, d\boldsymbol{v}_a \\
&\qquad- \int q(\boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1 \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA}) \log f(\boldsymbol{Y}_I, \boldsymbol{Y}_{IA} \mid \boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1) \, d(\boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1) \\
&\quad= \sum_{i=1}^{N} \mathrm{KL}[q(\boldsymbol{u}_i) \,\|\, f(\boldsymbol{u}_i)] + \sum_{a=1}^{M} \mathrm{KL}[q(\boldsymbol{v}_a) \,\|\, f(\boldsymbol{v}_a)] - E_{q(\boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1 \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA})}\bigl[\log f(\boldsymbol{Y}_I, \boldsymbol{Y}_{IA} \mid \boldsymbol{U}, \boldsymbol{V}, \alpha_0, \alpha_1)\bigr],
\end{aligned}
\]
where the equalities hold up to an additive constant (the log marginal likelihood) that does not depend on the variational parameters.

Each of the components is calculated as follows:
\[
\begin{aligned}
\sum_{i=1}^{N} \mathrm{KL}[q(\boldsymbol{u}_i) \,\|\, f(\boldsymbol{u}_i)]
&= -\sum_{i=1}^{N} \int q(\boldsymbol{u}_i) \log \frac{f(\boldsymbol{u}_i)}{q(\boldsymbol{u}_i)} \, d\boldsymbol{u}_i \\
&= -\sum_{i=1}^{N} \int q(\boldsymbol{u}_i)\, \frac{1}{2} \Bigl( -D \log(\lambda_0^2) + \log(\det(\tilde{\Lambda}_0)) - \frac{1}{\lambda_0^2} \boldsymbol{u}_i^{T} \boldsymbol{u}_i + (\boldsymbol{u}_i - \tilde{\boldsymbol{u}}_i)^{T} \tilde{\Lambda}_0^{-1} (\boldsymbol{u}_i - \tilde{\boldsymbol{u}}_i) \Bigr) \, d\boldsymbol{u}_i \\
&= \frac{1}{2} \bigl( D N \log(\lambda_0^2) - N \log(\det(\tilde{\Lambda}_0)) \bigr) + \sum_{i=1}^{N} \Bigl( \frac{1}{2\lambda_0^2} E_{q(\boldsymbol{u}_i)}[\boldsymbol{u}_i^{T} \boldsymbol{u}_i] - \frac{1}{2} E_{q(\boldsymbol{u}_i)}\bigl[(\boldsymbol{u}_i - \tilde{\boldsymbol{u}}_i)^{T} \tilde{\Lambda}_0^{-1} (\boldsymbol{u}_i - \tilde{\boldsymbol{u}}_i)\bigr] \Bigr) \\
&= \frac{1}{2} \bigl( D N \log(\lambda_0^2) - N \log(\det(\tilde{\Lambda}_0)) \bigr) + \sum_{i=1}^{N} \frac{1}{2\lambda_0^2} \Bigl( \operatorname{tr}\bigl(\operatorname{Var}(\boldsymbol{u}_i)\bigr) + E_{q(\boldsymbol{u}_i)}[\boldsymbol{u}_i]^{T} E_{q(\boldsymbol{u}_i)}[\boldsymbol{u}_i] \Bigr) - \frac{1}{2} N D \\
&= \frac{1}{2} \bigl( D N \log(\lambda_0^2) - N \log(\det(\tilde{\Lambda}_0)) \bigr) + \frac{N \operatorname{tr}(\tilde{\Lambda}_0)}{2\lambda_0^2} + \frac{\sum_{i=1}^{N} \tilde{\boldsymbol{u}}_i^{T} \tilde{\boldsymbol{u}}_i}{2\lambda_0^2} - \frac{1}{2} N D
\end{aligned}
\]
\[
\sum_{a=1}^{M} \mathrm{KL}[q(\boldsymbol{v}_a) \,\|\, f(\boldsymbol{v}_a)]
= \frac{1}{2} \bigl( D M \log(\lambda_1^2) - M \log(\det(\tilde{\Lambda}_1)) \bigr) + \frac{M \operatorname{tr}(\tilde{\Lambda}_1)}{2\lambda_1^2} + \frac{\sum_{a=1}^{M} \tilde{\boldsymbol{v}}_a^{T} \tilde{\boldsymbol{v}}_a}{2\lambda_1^2} - \frac{1}{2} M D
\]
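As a quick numerical sanity check of the closed-form Gaussian KL term just derived (an illustrative sketch in R, not part of the estimation procedure itself), one can compare it against a Monte Carlo estimate for a single node:

## Sketch: Monte Carlo check of KL[q(u_i) || f(u_i)] for one node, with
## D = 2 and illustrative values of lambda0, u_tilde, and Lambda0.
library(MASS)                                 # for mvrnorm
D <- 2; lambda0 <- 2
u_tilde <- c(0.5, -1); Lambda0 <- diag(c(0.3, 0.6))
x <- mvrnorm(1e5, u_tilde, Lambda0)           # draws from q(u_i)
log_q <- -0.5 * log(det(2 * pi * Lambda0)) - 0.5 * mahalanobis(x, u_tilde, Lambda0)
log_f <- -0.5 * D * log(2 * pi * lambda0^2) - 0.5 * rowSums(x^2) / lambda0^2
mean(log_q - log_f)                           # Monte Carlo estimate of the KL
0.5 * (D * log(lambda0^2) - log(det(Lambda0))) +
  (sum(diag(Lambda0)) + sum(u_tilde^2)) / (2 * lambda0^2) - D / 2  # closed form

The two printed values agree up to Monte Carlo error.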

$E_{q(\boldsymbol{U},\boldsymbol{V} \mid \boldsymbol{Y}_I,\boldsymbol{Y}_{IA})}[\log f(\boldsymbol{Y}_I, \boldsymbol{Y}_{IA} \mid \boldsymbol{U}, \boldsymbol{V})]$ can be expanded into four components (below we abbreviate $E_{q(\boldsymbol{U},\boldsymbol{V} \mid \boldsymbol{Y}_I,\boldsymbol{Y}_{IA})}$ as $E_q$):
\[
\begin{aligned}
E_{q}[\log f(\boldsymbol{Y}_I, \boldsymbol{Y}_{IA} \mid \boldsymbol{U}, \boldsymbol{V})]
&= \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia}\, E_{q}\bigl[\alpha_1 - (\boldsymbol{u}_i - \boldsymbol{v}_a)^{T}(\boldsymbol{u}_i - \boldsymbol{v}_a)\bigr] \\
&\quad+ \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} y_{ij}\, E_{q}\bigl[\alpha_0 - (\boldsymbol{u}_i - \boldsymbol{u}_j)^{T}(\boldsymbol{u}_i - \boldsymbol{u}_j)\bigr] \\
&\quad- \sum_{i=1}^{N}\sum_{a=1}^{M} E_{q}\Bigl[\log\Bigl(1 + \exp\bigl(\alpha_1 - (\boldsymbol{u}_i - \boldsymbol{v}_a)^{T}(\boldsymbol{u}_i - \boldsymbol{v}_a)\bigr)\Bigr)\Bigr] \\
&\quad- \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} E_{q}\Bigl[\log\Bigl(1 + \exp\bigl(\alpha_0 - (\boldsymbol{u}_i - \boldsymbol{u}_j)^{T}(\boldsymbol{u}_i - \boldsymbol{u}_j)\bigr)\Bigr)\Bigr]
\end{aligned}
\]

The first two components of $E_q[\log f(\boldsymbol{Y}_I, \boldsymbol{Y}_{IA} \mid \boldsymbol{U}, \boldsymbol{V})]$ are calculated as follows:
\[
\begin{aligned}
&\sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} y_{ij}\, E_{q}\bigl[\alpha_0 - (\boldsymbol{u}_i - \boldsymbol{u}_j)^{T}(\boldsymbol{u}_i - \boldsymbol{u}_j)\bigr] \\
&= \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} y_{ij} \int \Bigl( \alpha_0 - (\boldsymbol{u}_i - \boldsymbol{u}_j)^{T}(\boldsymbol{u}_i - \boldsymbol{u}_j) \Bigr) q(\boldsymbol{u}_i)\, q(\boldsymbol{u}_j) \, d(\boldsymbol{u}_i, \boldsymbol{u}_j) \\
&= \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} y_{ij} \Bigl[ \tilde{\alpha}_0 - \int \sum_{d=1}^{D} (u_{id} - u_{jd})^2 \, q(\boldsymbol{u}_i)\, q(\boldsymbol{u}_j) \, d(\boldsymbol{u}_i, \boldsymbol{u}_j) \Bigr] \\
&= \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} y_{ij} \Bigl[ \tilde{\alpha}_0 - \sum_{d=1}^{D} \Bigl( \int u_{id}^2\, q(u_{id})\, du_{id} + \int u_{jd}^2\, q(u_{jd})\, du_{jd} - \int\!\!\int 2 u_{id} u_{jd}\, q(u_{id})\, q(u_{jd})\, du_{id}\, du_{jd} \Bigr) \Bigr] \\
&= \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} y_{ij} \Bigl[ \tilde{\alpha}_0 - \sum_{d=1}^{D} \bigl( E[u_{id}^2] + E[u_{jd}^2] - 2 E[u_{id}]\, E[u_{jd}] \bigr) \Bigr] \\
&= \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} y_{ij} \Bigl[ \tilde{\alpha}_0 - \sum_{d=1}^{D} \bigl( \operatorname{Var}[u_{id}] + E[u_{id}]^2 + \operatorname{Var}[u_{jd}] + E[u_{jd}]^2 - 2 E[u_{id}]\, E[u_{jd}] \bigr) \Bigr] \\
&= \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} y_{ij} \Bigl[ \tilde{\alpha}_0 - 2 \operatorname{tr}(\tilde{\Lambda}_0) - (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j) \Bigr]
\end{aligned}
\]
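The key identity in the last step, $E[(\boldsymbol{u}_i - \boldsymbol{u}_j)^{T}(\boldsymbol{u}_i - \boldsymbol{u}_j)] = 2\operatorname{tr}(\tilde{\Lambda}_0) + (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T}(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)$, can be checked numerically; a sketch in R with illustrative values:

## Sketch: numerical check of the expectation identity above.
library(MASS)
Lambda0 <- diag(c(0.3, 0.6))
ui <- c(1, 0); uj <- c(-0.5, 2)
d <- mvrnorm(1e5, ui, Lambda0) - mvrnorm(1e5, uj, Lambda0)  # draws of u_i - u_j
mean(rowSums(d^2))                                          # Monte Carlo estimate
2 * sum(diag(Lambda0)) + sum((ui - uj)^2)                   # closed form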

\[
\begin{aligned}
&\sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia}\, E_{q}\bigl[\alpha_1 - (\boldsymbol{u}_i - \boldsymbol{v}_a)^{T}(\boldsymbol{u}_i - \boldsymbol{v}_a)\bigr] \\
&= \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \int \Bigl( \alpha_1 - (\boldsymbol{u}_i - \boldsymbol{v}_a)^{T}(\boldsymbol{u}_i - \boldsymbol{v}_a) \Bigr) q(\boldsymbol{u}_i)\, q(\boldsymbol{v}_a) \, d(\boldsymbol{u}_i, \boldsymbol{v}_a) \\
&= \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \Bigl[ \tilde{\alpha}_1 - \int \sum_{d=1}^{D} (u_{id} - v_{ad})^2 \, q(\boldsymbol{u}_i)\, q(\boldsymbol{v}_a) \, d(\boldsymbol{u}_i, \boldsymbol{v}_a) \Bigr] \\
&= \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \Bigl[ \tilde{\alpha}_1 - \sum_{d=1}^{D} \bigl( E[u_{id}^2] + E[v_{ad}^2] - 2 E[u_{id}]\, E[v_{ad}] \bigr) \Bigr] \\
&= \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \Bigl[ \tilde{\alpha}_1 - \sum_{d=1}^{D} \bigl( \operatorname{Var}[u_{id}] + E[u_{id}]^2 + \operatorname{Var}[v_{ad}] + E[v_{ad}]^2 - 2 E[u_{id}]\, E[v_{ad}] \bigr) \Bigr] \\
&= \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \Bigl[ \tilde{\alpha}_1 - \operatorname{tr}(\tilde{\Lambda}_0) - \operatorname{tr}(\tilde{\Lambda}_1) - (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a) \Bigr]
\end{aligned}
\]

The last two expectations of the log functions can be bounded using Jensen's inequality (for the concave logarithm, $E[\log Z] \le \log E[Z]$), which yields a lower bound on $E_q[\log f(\boldsymbol{Y}_I, \boldsymbol{Y}_{IA} \mid \boldsymbol{U}, \boldsymbol{V})]$:
\[
\begin{aligned}
E_{q}[\log f(\boldsymbol{Y}_I, \boldsymbol{Y}_{IA} \mid \boldsymbol{U}, \boldsymbol{V})]
&\ge \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \Bigl[ \tilde{\alpha}_1 - \operatorname{tr}(\tilde{\Lambda}_0) - \operatorname{tr}(\tilde{\Lambda}_1) - (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a) \Bigr] \\
&\quad+ \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} y_{ij} \Bigl[ \tilde{\alpha}_0 - 2 \operatorname{tr}(\tilde{\Lambda}_0) - (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j) \Bigr] \\
&\quad- \sum_{i=1}^{N}\sum_{a=1}^{M} \log\Bigl(1 + E_{q}\bigl[\exp\bigl(\alpha_1 - (\boldsymbol{u}_i - \boldsymbol{v}_a)^{T}(\boldsymbol{u}_i - \boldsymbol{v}_a)\bigr)\bigr]\Bigr) \\
&\quad- \sum_{i=1}^{N}\sum_{j=1, j \neq i}^{N} \log\Bigl(1 + E_{q}\bigl[\exp\bigl(\alpha_0 - (\boldsymbol{u}_i - \boldsymbol{u}_j)^{T}(\boldsymbol{u}_i - \boldsymbol{u}_j)\bigr)\bigr]\Bigr)
\end{aligned}
\]

Recall that $\boldsymbol{u}_i$ and $\boldsymbol{u}_j$ are $D \times 1$ column vectors. Define $\mathbf{u} = \tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j$. Then we have $\boldsymbol{u}_i - \boldsymbol{u}_j \sim N(\mathbf{u}, 2\tilde{\Lambda}_0)$, where $\mathbf{u}$ is a $D \times 1$ vector and $\tilde{\Lambda}_0$ is a $D \times D$ positive semidefinite matrix. Further define $\mathbf{Z} = (2\tilde{\Lambda}_0)^{-1/2} \bigl( \boldsymbol{u}_i - \boldsymbol{u}_j - (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j) \bigr)$. Then clearly $\mathbf{Z}$ follows the $D$-dimensional multivariate standard normal distribution, with density $f_Z(\mathbf{z}) = (2\pi)^{-D/2} \exp(-\tfrac{1}{2} \mathbf{z}^{T} \mathbf{z})$. Consequently, we have $\boldsymbol{u}_i - \boldsymbol{u}_j = (2\tilde{\Lambda}_0)^{1/2} \mathbf{Z} + \mathbf{u}$.

Therefore, we can reparameterize
\[
\begin{aligned}
E_{q}\bigl[\exp\bigl(-(\boldsymbol{u}_i - \boldsymbol{u}_j)^{T}(\boldsymbol{u}_i - \boldsymbol{u}_j)\bigr)\bigr]
&= E_{q}\Bigl[\exp\Bigl( -\bigl(\mathbf{Z}^{T}(2\tilde{\Lambda}_0)^{1/2} + \mathbf{u}^{T}\bigr)\bigl((2\tilde{\Lambda}_0)^{1/2}\mathbf{Z} + \mathbf{u}\bigr) \Bigr)\Bigr] \\
&= E_{q}\Bigl[\exp\Bigl( -\mathbf{Z}^{T}(2\tilde{\Lambda}_0)\mathbf{Z} - 2\mathbf{Z}^{T}(2\tilde{\Lambda}_0)^{1/2}\mathbf{u} - \mathbf{u}^{T}\mathbf{u} \Bigr)\Bigr] \\
&= (2\pi)^{-D/2} \int \exp\Bigl( -\mathbf{Z}^{T}\bigl(2\tilde{\Lambda}_0 + \tfrac{1}{2}\mathbf{I}\bigr)\mathbf{Z} - 2\mathbf{Z}^{T}(2\tilde{\Lambda}_0)^{1/2}\mathbf{u} - \mathbf{u}^{T}\mathbf{u} \Bigr) \, d\mathbf{Z}.
\end{aligned}
\]
Now define $\mathbf{Q} = -\bigl(2\tilde{\Lambda}_0 + \tfrac{1}{2}\mathbf{I}\bigr)^{-1} (2\tilde{\Lambda}_0)^{1/2} \mathbf{u}$. Then the above integral becomes
\[
\begin{aligned}
&(2\pi)^{-D/2} \int \exp\Bigl( -(\mathbf{Z} - \mathbf{Q})^{T}\bigl(2\tilde{\Lambda}_0 + \tfrac{1}{2}\mathbf{I}\bigr)(\mathbf{Z} - \mathbf{Q}) - \mathbf{u}^{T}\mathbf{u} + \mathbf{u}^{T}\bigl(2\tilde{\Lambda}_0 + \tfrac{1}{2}\mathbf{I}\bigr)^{-1}(2\tilde{\Lambda}_0)\mathbf{u} \Bigr) \, d\mathbf{Z} \\
&= \exp\Bigl( -\mathbf{u}^{T}\mathbf{u} + \mathbf{u}^{T}\bigl(2\tilde{\Lambda}_0 + \tfrac{1}{2}\mathbf{I}\bigr)^{-1}(2\tilde{\Lambda}_0)\mathbf{u} \Bigr) \det(\mathbf{I} + 4\tilde{\Lambda}_0)^{-1/2} \\
&= \exp\Bigl( -\mathbf{u}^{T}\bigl(\mathbf{I} - (\mathbf{I} + 4\tilde{\Lambda}_0)^{-1}(4\tilde{\Lambda}_0)\bigr)\mathbf{u} \Bigr) \det(\mathbf{I} + 4\tilde{\Lambda}_0)^{-1/2} \\
&= \exp\Bigl( -\mathbf{u}^{T}(\mathbf{I} + 4\tilde{\Lambda}_0)^{-1}\mathbf{u} \Bigr) \det(\mathbf{I} + 4\tilde{\Lambda}_0)^{-1/2}.
\end{aligned}
\]
The last line follows since for any two invertible matrices $A$ and $B$ such that $A + B$ is also invertible, by Henderson and Searle (1981),
\[
(A + B)^{-1} = A^{-1} - A^{-1} B (I + A^{-1} B)^{-1} A^{-1}.
\]
Letting $A = \mathbf{I}$ and $B = 4\tilde{\Lambda}_0$ gives:
\[
E_{q}\bigl[\exp\bigl(-(\boldsymbol{u}_i - \boldsymbol{u}_j)^{T}(\boldsymbol{u}_i - \boldsymbol{u}_j)\bigr)\bigr]
= \exp\Bigl( -(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T} (\mathbf{I} + 4\tilde{\Lambda}_0)^{-1} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j) \Bigr) \det(\mathbf{I} + 4\tilde{\Lambda}_0)^{-1/2}.
\]
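This closed form can likewise be verified by simulation; a sketch in R with illustrative values:

## Sketch: Monte Carlo check of E[exp(-(u_i - u_j)^T (u_i - u_j))].
library(MASS)
Lambda0 <- diag(c(0.3, 0.6)); I2 <- diag(2)
u <- c(1, 0) - c(-0.5, 2)                 # u = u_tilde_i - u_tilde_j
d <- mvrnorm(1e6, u, 2 * Lambda0)         # u_i - u_j ~ N(u, 2 * Lambda0)
mean(exp(-rowSums(d^2)))                  # Monte Carlo estimate
exp(-sum(u * solve(I2 + 4 * Lambda0, u))) / sqrt(det(I2 + 4 * Lambda0))  # closed form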

Recall that $\boldsymbol{u}_i$ and $\boldsymbol{v}_a$ are $D \times 1$ column vectors. Define $\mathbf{u} = \tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a$. Then we have $\boldsymbol{u}_i - \boldsymbol{v}_a \sim N(\mathbf{u}, \tilde{\Lambda}_0 + \tilde{\Lambda}_1)$, where $\mathbf{u}$ is a $D \times 1$ vector and $\tilde{\Lambda}_0 + \tilde{\Lambda}_1$ is a $D \times D$ positive semidefinite matrix. Further define $\mathbf{Z} = (\tilde{\Lambda}_0 + \tilde{\Lambda}_1)^{-1/2} \bigl( \boldsymbol{u}_i - \boldsymbol{v}_a - (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a) \bigr)$. Then clearly $\mathbf{Z}$ follows the $D$-dimensional multivariate standard normal distribution, with density $f_Z(\mathbf{z}) = (2\pi)^{-D/2} \exp(-\tfrac{1}{2} \mathbf{z}^{T} \mathbf{z})$. Consequently, we have $\boldsymbol{u}_i - \boldsymbol{v}_a = (\tilde{\Lambda}_0 + \tilde{\Lambda}_1)^{1/2} \mathbf{Z} + \mathbf{u}$.

Therefore, we can reparameterize
\[
\begin{aligned}
E_{q}\bigl[\exp\bigl(-(\boldsymbol{u}_i - \boldsymbol{v}_a)^{T}(\boldsymbol{u}_i - \boldsymbol{v}_a)\bigr)\bigr]
&= E_{q}\Bigl[\exp\Bigl( -\bigl(\mathbf{Z}^{T}(\tilde{\Lambda}_0 + \tilde{\Lambda}_1)^{1/2} + \mathbf{u}^{T}\bigr)\bigl((\tilde{\Lambda}_0 + \tilde{\Lambda}_1)^{1/2}\mathbf{Z} + \mathbf{u}\bigr) \Bigr)\Bigr] \\
&= E_{q}\Bigl[\exp\Bigl( -\mathbf{Z}^{T}(\tilde{\Lambda}_0 + \tilde{\Lambda}_1)\mathbf{Z} - 2\mathbf{Z}^{T}(\tilde{\Lambda}_0 + \tilde{\Lambda}_1)^{1/2}\mathbf{u} - \mathbf{u}^{T}\mathbf{u} \Bigr)\Bigr] \\
&= (2\pi)^{-D/2} \int \exp\Bigl( -\mathbf{Z}^{T}\bigl(\tilde{\Lambda}_0 + \tilde{\Lambda}_1 + \tfrac{1}{2}\mathbf{I}\bigr)\mathbf{Z} - 2\mathbf{Z}^{T}(\tilde{\Lambda}_0 + \tilde{\Lambda}_1)^{1/2}\mathbf{u} - \mathbf{u}^{T}\mathbf{u} \Bigr) \, d\mathbf{Z}.
\end{aligned}
\]
Now define $\mathbf{Q} = -\bigl(\tilde{\Lambda}_0 + \tilde{\Lambda}_1 + \tfrac{1}{2}\mathbf{I}\bigr)^{-1} (\tilde{\Lambda}_0 + \tilde{\Lambda}_1)^{1/2} \mathbf{u}$. Then the above integral becomes
\[
\begin{aligned}
&(2\pi)^{-D/2} \int \exp\Bigl( -(\mathbf{Z} - \mathbf{Q})^{T}\bigl(\tilde{\Lambda}_0 + \tilde{\Lambda}_1 + \tfrac{1}{2}\mathbf{I}\bigr)(\mathbf{Z} - \mathbf{Q}) - \mathbf{u}^{T}\mathbf{u} + \mathbf{u}^{T}\bigl(\tilde{\Lambda}_0 + \tilde{\Lambda}_1 + \tfrac{1}{2}\mathbf{I}\bigr)^{-1}(\tilde{\Lambda}_0 + \tilde{\Lambda}_1)\mathbf{u} \Bigr) \, d\mathbf{Z} \\
&= \exp\Bigl( -\mathbf{u}^{T}\mathbf{u} + \mathbf{u}^{T}\bigl(\tilde{\Lambda}_0 + \tilde{\Lambda}_1 + \tfrac{1}{2}\mathbf{I}\bigr)^{-1}(\tilde{\Lambda}_0 + \tilde{\Lambda}_1)\mathbf{u} \Bigr) \det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1/2} \\
&= \exp\Bigl( -\mathbf{u}^{T}\bigl(\mathbf{I} - (\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1}(2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)\bigr)\mathbf{u} \Bigr) \det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1/2} \\
&= \exp\Bigl( -\mathbf{u}^{T}(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1}\mathbf{u} \Bigr) \det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1/2}.
\end{aligned}
\]
The last line again follows from Henderson and Searle (1981), with $A = \mathbf{I}$ and $B = 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1$, which gives:
\[
E_{q}\bigl[\exp\bigl(-(\boldsymbol{u}_i - \boldsymbol{v}_a)^{T}(\boldsymbol{u}_i - \boldsymbol{v}_a)\bigr)\bigr]
= \exp\Bigl( -(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T} (\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a) \Bigr) \det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1/2}.
\]

Finally, the Kullback–Leibler divergence between the variational posterior and the true posterior satisfies
\[
\begin{aligned}
&\mathrm{KL}[q(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA}) \,\|\, f(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA})] \\
&\le \frac{1}{2} \bigl( D N \log(\lambda_0^2) - N \log(\det(\tilde{\Lambda}_0)) \bigr) + \frac{N \operatorname{tr}(\tilde{\Lambda}_0)}{2\lambda_0^2} + \frac{\sum_{i=1}^{N} \tilde{\boldsymbol{u}}_i^{T} \tilde{\boldsymbol{u}}_i}{2\lambda_0^2} - \frac{1}{2} N D \\
&\quad+ \frac{1}{2} \bigl( D M \log(\lambda_1^2) - M \log(\det(\tilde{\Lambda}_1)) \bigr) + \frac{M \operatorname{tr}(\tilde{\Lambda}_1)}{2\lambda_1^2} + \frac{\sum_{a=1}^{M} \tilde{\boldsymbol{v}}_a^{T} \tilde{\boldsymbol{v}}_a}{2\lambda_1^2} - \frac{1}{2} M D \\
&\quad- \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \Bigl[ \tilde{\alpha}_1 - \operatorname{tr}(\tilde{\Lambda}_0) - \operatorname{tr}(\tilde{\Lambda}_1) - (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a) \Bigr] \\
&\quad- \sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N} y_{ij} \Bigl[ \tilde{\alpha}_0 - 2\operatorname{tr}(\tilde{\Lambda}_0) - (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j) \Bigr] \\
&\quad+ \sum_{i=1}^{N}\sum_{a=1}^{M} \log\biggl( 1 + \frac{\exp(\tilde{\alpha}_1)}{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}} \exp\Bigl( -(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T} (\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a) \Bigr) \biggr) \\
&\quad+ \sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N} \log\biggl( 1 + \frac{\exp(\tilde{\alpha}_0)}{\det(\mathbf{I} + 4\tilde{\Lambda}_0)^{1/2}} \exp\Bigl( -(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T} (\mathbf{I} + 4\tilde{\Lambda}_0)^{-1} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j) \Bigr) \biggr) + \text{Const}.
\end{aligned}
\]

7.2 Derivation of the EM Algorithm

E-step: Estimate $\tilde{\boldsymbol{u}}_i$, $\tilde{\boldsymbol{v}}_a$, $\tilde{\Lambda}_0$, and $\tilde{\Lambda}_1$ by minimizing the KL divergence. For brevity, write
\[
q_{ij} = (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T} (\mathbf{I} + 4\tilde{\Lambda}_0)^{-1} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j),
\qquad
q_{ia} = (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T} (\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a).
\]
The terms of the KL divergence that involve $\tilde{\boldsymbol{u}}_i$ are
\[
\begin{aligned}
\mathrm{KL}_{\tilde{\boldsymbol{u}}_i}[q(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA}) \,\|\, f(\boldsymbol{U}, \boldsymbol{V} \mid \boldsymbol{Y}_I, \boldsymbol{Y}_{IA})]
&= \frac{\sum_{i=1}^{N} \tilde{\boldsymbol{u}}_i^{T} \tilde{\boldsymbol{u}}_i}{2\lambda_0^2}
- \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \Bigl[ \tilde{\alpha}_1 - \operatorname{tr}(\tilde{\Lambda}_0) - \operatorname{tr}(\tilde{\Lambda}_1) - (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a) \Bigr] \\
&\quad- \sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N} y_{ij} \Bigl[ \tilde{\alpha}_0 - 2\operatorname{tr}(\tilde{\Lambda}_0) - (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j) \Bigr] \\
&\quad+ \sum_{i=1}^{N}\sum_{a=1}^{M} \log\biggl( 1 + \frac{\exp(\tilde{\alpha}_1)}{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}} \exp(-q_{ia}) \biggr) \\
&\quad+ \sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N} \log\biggl( 1 + \frac{\exp(\tilde{\alpha}_0)}{\det(\mathbf{I} + 4\tilde{\Lambda}_0)^{1/2}} \exp(-q_{ij}) \biggr) + \mathrm{Const}_{\tilde{\boldsymbol{u}}_i},
\end{aligned}
\]
where $\mathrm{Const}_{\tilde{\boldsymbol{u}}_i}$ collects all terms that do not depend on $\tilde{\boldsymbol{u}}_i$.
To find the closed form updates of $\tilde{\boldsymbol{u}}_i$, we use second-order Taylor expansions of
\[
\boldsymbol{F}_{ia} = \sum_{i=1}^{N}\sum_{a=1}^{M} \log\biggl( 1 + \frac{\exp(\tilde{\alpha}_1)}{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}} \exp(-q_{ia}) \biggr) \qquad (13)
\]
\[
\boldsymbol{F}_{i} = \sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N} \log\biggl( 1 + \frac{\exp(\tilde{\alpha}_0)}{\det(\mathbf{I} + 4\tilde{\Lambda}_0)^{1/2}} \exp(-q_{ij}) \biggr) \qquad (14)
\]

The gradients of $\boldsymbol{F}_i$ and $\boldsymbol{F}_{ia}$ with respect to $\tilde{\boldsymbol{u}}_i$ are
\[
\boldsymbol{G}_i(\tilde{\boldsymbol{u}}_i) = -2(\mathbf{I} + 4\tilde{\Lambda}_0)^{-1} \sum_{j=1, j\neq i}^{N} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j) \biggl[ 1 + \frac{\det(\mathbf{I} + 4\tilde{\Lambda}_0)^{1/2}}{\exp(\tilde{\alpha}_0)} \exp(q_{ij}) \biggr]^{-1}
\]
\[
\boldsymbol{G}_{ia}(\tilde{\boldsymbol{u}}_i) = -2(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} \sum_{a=1}^{M} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a) \biggl[ 1 + \frac{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}}{\exp(\tilde{\alpha}_1)} \exp(q_{ia}) \biggr]^{-1}
\]
The second-order partial derivatives (Hessian matrices) of $\boldsymbol{F}_i$ and $\boldsymbol{F}_{ia}$ with respect to $\tilde{\boldsymbol{u}}_i$ are
\[
\boldsymbol{H}_i(\tilde{\boldsymbol{u}}_i) = -2(\mathbf{I} + 4\tilde{\Lambda}_0)^{-1} \sum_{j=1, j\neq i}^{N} \biggl[ 1 + \frac{\det(\mathbf{I} + 4\tilde{\Lambda}_0)^{1/2}}{\exp(\tilde{\alpha}_0)} \exp(q_{ij}) \biggr]^{-1} \left[ \mathbf{I} - \frac{2(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T}(\mathbf{I} + 4\tilde{\Lambda}_0)^{-1}}{1 + \dfrac{\exp(\tilde{\alpha}_0)}{\det(\mathbf{I} + 4\tilde{\Lambda}_0)^{1/2}} \exp(-q_{ij})} \right]
\]
\[
\boldsymbol{H}_{ia}(\tilde{\boldsymbol{u}}_i) = -2(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} \sum_{a=1}^{M} \biggl[ 1 + \frac{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}}{\exp(\tilde{\alpha}_1)} \exp(q_{ia}) \biggr]^{-1} \left[ \mathbf{I} - \frac{2(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T}(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1}}{1 + \dfrac{\exp(\tilde{\alpha}_1)}{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}} \exp(-q_{ia})} \right]
\]

With these expansions, the part of the KL divergence that depends on $\tilde{\boldsymbol{u}}_i$ is approximately
\[
\begin{aligned}
\mathrm{KL}_{\tilde{\boldsymbol{u}}_i}
&= \tilde{\boldsymbol{u}}_i^{T} \biggl[ \Bigl( \frac{1}{2\lambda_0^2} + \sum_{a=1}^{M} y_{ia} + \sum_{j=1, j\neq i}^{N} (y_{ij} + y_{ji}) \Bigr) \mathbf{I} + \frac{1}{2} \boldsymbol{H}_i(\tilde{\boldsymbol{u}}_i) + \frac{1}{2} \boldsymbol{H}_{ia}(\tilde{\boldsymbol{u}}_i) \biggr] \tilde{\boldsymbol{u}}_i \\
&\quad- 2 \tilde{\boldsymbol{u}}_i^{T} \biggl[ \sum_{a=1}^{M} y_{ia} \tilde{\boldsymbol{v}}_a + \sum_{j=1, j\neq i}^{N} (y_{ij} + y_{ji}) \tilde{\boldsymbol{u}}_j - \frac{1}{2} \boldsymbol{G}_i(\tilde{\boldsymbol{u}}_i) - \frac{1}{2} \boldsymbol{G}_{ia}(\tilde{\boldsymbol{u}}_i) + \frac{1}{2} \bigl( \boldsymbol{H}_i(\tilde{\boldsymbol{u}}_i) + \boldsymbol{H}_{ia}(\tilde{\boldsymbol{u}}_i) \bigr) \tilde{\boldsymbol{u}}_i \biggr] + \mathrm{Const}_{\tilde{\boldsymbol{u}}_i}.
\end{aligned}
\]
With the Taylor expansions of the log functions, we can obtain the closed form update rule of $\tilde{\boldsymbol{u}}_i$ by setting the partial derivative of the KL divergence equal to 0. Finally, we have
\[
\begin{aligned}
\tilde{\boldsymbol{u}}_i
&= \biggl[ \Bigl( \frac{1}{2\lambda_0^2} + \sum_{j=1, j\neq i}^{N} (y_{ij} + y_{ji}) + \sum_{a=1}^{M} y_{ia} \Bigr) \boldsymbol{I} + \frac{1}{2} \boldsymbol{H}_i(\tilde{\boldsymbol{u}}_i) + \frac{1}{2} \boldsymbol{H}_{ia}(\tilde{\boldsymbol{u}}_i) \biggr]^{-1} \\
&\qquad \biggl[ \sum_{j=1, j\neq i}^{N} (y_{ij} + y_{ji}) \tilde{\boldsymbol{u}}_j + \sum_{a=1}^{M} y_{ia} \tilde{\boldsymbol{v}}_a - \frac{1}{2} \boldsymbol{G}_i(\tilde{\boldsymbol{u}}_i) - \frac{1}{2} \boldsymbol{G}_{ia}(\tilde{\boldsymbol{u}}_i) + \frac{1}{2} \bigl( \boldsymbol{H}_i(\tilde{\boldsymbol{u}}_i) + \boldsymbol{H}_{ia}(\tilde{\boldsymbol{u}}_i) \bigr) \tilde{\boldsymbol{u}}_i \biggr],
\end{aligned}
\]
where the gradients and Hessians on the right-hand side are evaluated at the current value of $\tilde{\boldsymbol{u}}_i$.

Similarly, we can obtain the closed form update rule for $\tilde{\boldsymbol{v}}_a$ by taking the second-order Taylor expansion of $\boldsymbol{F}_{ia}$ (see Equation 13). The gradient and Hessian matrix of $\boldsymbol{F}_{ia}$ with respect to $\tilde{\boldsymbol{v}}_a$ are
\[
\boldsymbol{G}_{ia}(\tilde{\boldsymbol{v}}_a) = -2(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} \sum_{i=1}^{N} (\tilde{\boldsymbol{v}}_a - \tilde{\boldsymbol{u}}_i) \biggl[ 1 + \frac{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}}{\exp(\tilde{\alpha}_1)} \exp(q_{ia}) \biggr]^{-1}
\]
\[
\boldsymbol{H}_{ia}(\tilde{\boldsymbol{v}}_a) = -2(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} \sum_{i=1}^{N} \biggl[ 1 + \frac{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}}{\exp(\tilde{\alpha}_1)} \exp(q_{ia}) \biggr]^{-1} \left[ \mathbf{I} - \frac{2(\tilde{\boldsymbol{v}}_a - \tilde{\boldsymbol{u}}_i)(\tilde{\boldsymbol{v}}_a - \tilde{\boldsymbol{u}}_i)^{T}(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1}}{1 + \dfrac{\exp(\tilde{\alpha}_1)}{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}} \exp(-q_{ia})} \right]
\]
The part of the KL divergence that depends on $\tilde{\boldsymbol{v}}_a$ is approximately
\[
\mathrm{KL}_{\tilde{\boldsymbol{v}}_a}
= \tilde{\boldsymbol{v}}_a^{T} \biggl[ \Bigl( \frac{1}{2\lambda_1^2} + \sum_{i=1}^{N} y_{ia} \Bigr) \mathbf{I} + \frac{1}{2} \boldsymbol{H}_{ia}(\tilde{\boldsymbol{v}}_a) \biggr] \tilde{\boldsymbol{v}}_a
- 2 \tilde{\boldsymbol{v}}_a^{T} \biggl[ \sum_{i=1}^{N} y_{ia} \tilde{\boldsymbol{u}}_i - \frac{1}{2} \boldsymbol{G}_{ia}(\tilde{\boldsymbol{v}}_a) + \frac{1}{2} \boldsymbol{H}_{ia}(\tilde{\boldsymbol{v}}_a) \tilde{\boldsymbol{v}}_a \biggr] + \mathrm{Const}_{\tilde{\boldsymbol{v}}_a}.
\]
With the Taylor expansions of the log functions, we can obtain the closed form update rule of $\tilde{\boldsymbol{v}}_a$ by setting the partial derivative of the KL divergence equal to 0. Then, we have
\[
\tilde{\boldsymbol{v}}_a
= \biggl[ \Bigl( \frac{1}{2\lambda_1^2} + \sum_{i=1}^{N} y_{ia} \Bigr) \boldsymbol{I} + \frac{1}{2} \boldsymbol{H}_{ia}(\tilde{\boldsymbol{v}}_a) \biggr]^{-1}
\biggl[ \sum_{i=1}^{N} y_{ia} \tilde{\boldsymbol{u}}_i - \frac{1}{2} \boldsymbol{G}_{ia}(\tilde{\boldsymbol{v}}_a) + \frac{1}{2} \boldsymbol{H}_{ia}(\tilde{\boldsymbol{v}}_a) \tilde{\boldsymbol{v}}_a \biggr],
\]
with the gradient and Hessian evaluated at the current value of $\tilde{\boldsymbol{v}}_a$.

To find the closed form updates of $\tilde{\Lambda}_0$ and $\tilde{\Lambda}_1$, we use first-order Taylor expansions of $\boldsymbol{F}_i$ and $\boldsymbol{F}_{ia}$. The gradients of $\boldsymbol{F}_i$ and $\boldsymbol{F}_{ia}$ with respect to $\tilde{\Lambda}_0$ are:
\[
\boldsymbol{G}_i(\tilde{\Lambda}_0) = \sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N} \biggl[ 1 + \frac{\det(\mathbf{I} + 4\tilde{\Lambda}_0)^{1/2}}{\exp(\tilde{\alpha}_0)} \exp(q_{ij}) \biggr]^{-1}
\biggl( 4(\mathbf{I} + 4\tilde{\Lambda}_0)^{-1} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{u}}_j)^{T} (\mathbf{I} + 4\tilde{\Lambda}_0)^{-1} - \frac{1}{2}\mathbf{I} \biggr)
\]
\[
\boldsymbol{G}_{ia}(\tilde{\Lambda}_0) = \sum_{i=1}^{N}\sum_{a=1}^{M} \biggl[ 1 + \frac{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}}{\exp(\tilde{\alpha}_1)} \exp(q_{ia}) \biggr]^{-1}
\biggl( 2(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T} (\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} - \frac{1}{2}\mathbf{I} \biggr)
\]
The gradient of $\boldsymbol{F}_{ia}$ with respect to $\tilde{\Lambda}_1$ is:
\[
\boldsymbol{G}_{ia}(\tilde{\Lambda}_1) = \sum_{i=1}^{N}\sum_{a=1}^{M} \biggl[ 1 + \frac{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}}{\exp(\tilde{\alpha}_1)} \exp(q_{ia}) \biggr]^{-1}
\biggl( 2(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} (\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)(\tilde{\boldsymbol{u}}_i - \tilde{\boldsymbol{v}}_a)^{T} (\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{-1} - \frac{1}{2}\mathbf{I} \biggr)
\]
The parts of the KL divergence that depend on $\tilde{\Lambda}_0$ and $\tilde{\Lambda}_1$ are approximately
\[
\mathrm{KL}_{\Lambda_0} = \operatorname{tr}(\tilde{\Lambda}_0) \biggl( \frac{N}{2\lambda_0^2} + \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} + 2 \sum_{i=1}^{N}\sum_{j=1}^{N} y_{ij} \biggr) - \frac{N}{2} \log(\det(\tilde{\Lambda}_0)) + \operatorname{tr}\bigl(\boldsymbol{G}_i(\tilde{\Lambda}_0)\, \tilde{\Lambda}_0\bigr) + \operatorname{tr}\bigl(\boldsymbol{G}_{ia}(\tilde{\Lambda}_0)\, \tilde{\Lambda}_0\bigr) + \mathrm{Const}_{\tilde{\Lambda}_0}
\]
\[
\mathrm{KL}_{\Lambda_1} = \operatorname{tr}(\tilde{\Lambda}_1) \biggl( \frac{M}{2\lambda_1^2} + \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \biggr) - \frac{M}{2} \log(\det(\tilde{\Lambda}_1)) + \operatorname{tr}\bigl(\boldsymbol{G}_{ia}(\tilde{\Lambda}_1)\, \tilde{\Lambda}_1\bigr) + \mathrm{Const}_{\tilde{\Lambda}_1}
\]

With the Taylor expansions of the log functions, we can obtain the closed form update rules of $\tilde{\Lambda}_0$ and $\tilde{\Lambda}_1$ by setting the partial derivatives of the KL divergence equal to 0. Then, we have
\[
\tilde{\Lambda}_0 = \frac{N}{2} \biggl[ \Bigl( \frac{N}{2\lambda_0^2} + 2 \sum_{i=1}^{N}\sum_{j=1}^{N} y_{ij} + \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \Bigr) \boldsymbol{I} + \boldsymbol{G}_i(\tilde{\Lambda}_0) + \boldsymbol{G}_{ia}(\tilde{\Lambda}_0) \biggr]^{-1}
\]
\[
\tilde{\Lambda}_1 = \frac{M}{2} \biggl[ \Bigl( \frac{M}{2\lambda_1^2} + \sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} \Bigr) \boldsymbol{I} + \boldsymbol{G}_{ia}(\tilde{\Lambda}_1) \biggr]^{-1}
\]
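These covariance updates are again single linear solves; a sketch (grad_Lambda0_i, grad_Lambda0_ia, and grad_Lambda1_ia are assumed implementations of the gradients above; all names here are hypothetical):

## Sketch of the closed-form updates for Lambda0 and Lambda1, with Y the
## N x N network, X the N x M attribute matrix, and D the latent dimension.
Lambda0 <- (N / 2) * solve((N / (2 * lambda0^2) + 2 * sum(Y) + sum(X)) * diag(D) +
                             grad_Lambda0_i() + grad_Lambda0_ia())
Lambda1 <- (M / 2) * solve((M / (2 * lambda1^2) + sum(X)) * diag(D) +
                             grad_Lambda1_ia())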

M-step: Estimate $\tilde{\alpha}_0$ and $\tilde{\alpha}_1$ by minimizing the KL divergence. To find the closed form updates of $\tilde{\alpha}_0$ and $\tilde{\alpha}_1$, we use second-order Taylor expansions of the log functions and set the partial derivatives of the KL divergence with respect to $\tilde{\alpha}_0$ and $\tilde{\alpha}_1$ to zero. Then we have
\[
\tilde{\alpha}_0 = \frac{\sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N} y_{ij} - g_i(\tilde{\alpha}_0) + \tilde{\alpha}_0\, h_i(\tilde{\alpha}_0)}{h_i(\tilde{\alpha}_0)}
\]
\[
\tilde{\alpha}_1 = \frac{\sum_{i=1}^{N}\sum_{a=1}^{M} y_{ia} - g_{ia}(\tilde{\alpha}_1) + \tilde{\alpha}_1\, h_{ia}(\tilde{\alpha}_1)}{h_{ia}(\tilde{\alpha}_1)},
\]
with the right-hand sides evaluated at the current values of $\tilde{\alpha}_0$ and $\tilde{\alpha}_1$,
where
\[
g_i(\tilde{\alpha}_0) = \sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N} \biggl[ 1 + \frac{\det(\mathbf{I} + 4\tilde{\Lambda}_0)^{1/2}}{\exp(\tilde{\alpha}_0)} \exp(q_{ij}) \biggr]^{-1}
\]
\[
h_i(\tilde{\alpha}_0) = \sum_{i=1}^{N}\sum_{j=1, j\neq i}^{N} \biggl[ 1 + \frac{\det(\mathbf{I} + 4\tilde{\Lambda}_0)^{1/2}}{\exp(\tilde{\alpha}_0)} \exp(q_{ij}) \biggr]^{-1} \biggl[ 1 + \frac{\exp(\tilde{\alpha}_0)}{\det(\mathbf{I} + 4\tilde{\Lambda}_0)^{1/2}} \exp(-q_{ij}) \biggr]^{-1}
\]
\[
g_{ia}(\tilde{\alpha}_1) = \sum_{i=1}^{N}\sum_{a=1}^{M} \biggl[ 1 + \frac{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}}{\exp(\tilde{\alpha}_1)} \exp(q_{ia}) \biggr]^{-1}
\]
\[
h_{ia}(\tilde{\alpha}_1) = \sum_{i=1}^{N}\sum_{a=1}^{M} \biggl[ 1 + \frac{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}}{\exp(\tilde{\alpha}_1)} \exp(q_{ia}) \biggr]^{-1} \biggl[ 1 + \frac{\exp(\tilde{\alpha}_1)}{\det(\mathbf{I} + 2\tilde{\Lambda}_0 + 2\tilde{\Lambda}_1)^{1/2}} \exp(-q_{ia}) \biggr]^{-1},
\]
with $q_{ij}$ and $q_{ia}$ as defined above.
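The resulting M-step is a scalar Newton-type update; a sketch (g_i, h_i, g_ia, and h_ia are assumed implementations of the sums above; all names here are hypothetical):

## Sketch of one M-step update for alpha0 and alpha1; sum(Y) and sum(X)
## equal the double sums of y_ij and y_ia when the diagonal of Y is zero.
alpha0 <- (sum(Y) - g_i(alpha0) + alpha0 * h_i(alpha0)) / h_i(alpha0)
alpha1 <- (sum(X) - g_ia(alpha1) + alpha1 * h_ia(alpha1)) / h_ia(alpha1)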
